Innovating Low-Resource Speech Synthesis & Combating Ambiguity at ACL 2022

From May 22 to 27, students and researchers from all over the world gathered in Dublin, or stayed in the comfort of their homes, to attend the 60th Annual Meeting of the Association for Computational Linguistics. The ACL is the most renowned organization in natural language processing, and this year more than 2,000 authors contributed 604 long papers and 98 short papers to the conference, which also featured a number of exciting events including keynote speeches and a panel on supporting linguistic diversity.

Given the organization’s history as a machine translation-focused group (when it was founded in 1962, it was called the Association for Machine Translation and Computational Linguistics), a good portion of the accepted papers dealt with machine translation. Among the accepted papers, many interesting titles and ideas caught our attention, such as Antoine Nzeyimana and Andre Niyongabo Rubungo’s “KinyaBERT: a Morphology-aware Kinyarwanda Language Model,” which was selected as the Best Linguistic Insight Paper, and “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity” by Yao Lu et al.

Every year, ACL, alongside WMT (the Conference on Machine Translation), introduces some of the best, most innovative, and most creative research on MT and NLP the world has to offer. We find it very helpful to examine some of the highlighted papers; reading through them, we see the breadth of topics and values that researchers in natural language processing are pioneering. In other words, the topics covered here are a good gauge of where natural language processing is headed.

Without further ado, here are some of the papers we would like to share with you.


“DiBiMT: A Novel Benchmark for Measuring Word Sense Disambiguation Biases in Machine Translation” (Best Resource Paper)

Authors: Niccolò Campolungo, Federico Martelli, Francesco Saina, Roberto Navigli

One of the most common errors we face when using commercial machine translation is ambiguity. Machines have a much harder time than humans distinguishing between the multiple meanings of a polysemous word. The authors give an example: “He poured a shot of whiskey.” The proper Italian translation, taking into account the fact that the word “shot” here means “a small quantity,” is “Versò un goccio di whiskey.” However, it wouldn’t be strange to see a machine translation such as “Versò uno sparo di whiskey,” in which “sparo” refers to “gunshot.”

According to the authors, research has been carried out in the last few decades to investigate this issue, such as Gonzales et al.’s 2017 development of ContraWSD, a dataset of German-to-English and German-to-French instances of lexical ambiguity complete with contrastive examples of incorrect translations. Other studies such as Liu et al. in 2018 have improved on language models “via context-aware word embeddings,” culminating in Emelin et al.’s 2020 research that introduces a “statistical method for the identification of disambiguation errors in neural MT (NMT)” and demonstrates that “models capture data biases within the training corpora.”

However, the authors point out problems with previous research: “i) they are not based on entirely manually-curated benchmarks; ii) they rely heavily on automatically-generated resources to determine the correctness of a translation; and iii) they do not cover multiple language combinations.” To combat these shortcomings, the authors introduce “DiBiMT,” which, to their knowledge, is the “first fully manually-curated evaluation benchmark aimed at investigating the impact of semantic biases in MT in five language combinations, covering both nouns and verbs.” In other words, the authors provide a framework for evaluating how well MT systems handle lexical ambiguity, allowing other researchers to better understand the phenomenon of ambiguity and bias in machine translation and come up with ways to combat it.

To build DiBiMT, the authors needed “i) a set of unambiguous and grammatically-correct sentences containing a polysemous target word” and “ii) a set of correct and incorrect translations of each target word into the languages to be covered.” To select the sentence sets they were to use, the authors carried out a number of steps, starting with defining the item structure and notation, followed by dataset annotation. With a fully prepared DiBiMT, the authors evaluated 7 machine translation systems, from commercial ones such as Google Translate and DeepL to non-commercial ones such as MBart50 and M2M100, and reviewed the results.
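To make the two ingredients above more concrete, here is a minimal Python sketch of how a benchmark item with sets of correct and incorrect target-word translations could be used to label a system’s output. This is our own illustration, not the authors’ code: the names BenchmarkItem, classify, and accuracy are hypothetical, the simple token matching stands in for whatever matching the real benchmark performs, and scoring accuracy as good / (good + bad) is an assumption on our part.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """Hypothetical item: a sentence, its ambiguous word, and curated translations."""
    sentence: str
    target_word: str
    good: frozenset[str]  # acceptable translations of the target word
    bad: frozenset[str]   # translations that reflect the wrong sense


def classify(item: BenchmarkItem, translation: str) -> str:
    """Label a translation as 'good', 'bad', or 'unknown' by token lookup."""
    tokens = set(re.findall(r"\w+", translation.lower()))
    if tokens & item.good:
        return "good"
    if tokens & item.bad:
        return "bad"
    return "unknown"  # neither a curated good nor a curated bad word appears


def accuracy(labels: list[str]) -> float:
    """Assumed scoring: share of 'good' among translations labeled good or bad."""
    good = labels.count("good")
    bad = labels.count("bad")
    return good / (good + bad) if (good + bad) else 0.0


item = BenchmarkItem(
    sentence="He poured a shot of whiskey.",
    target_word="shot",
    good=frozenset({"goccio"}),
    bad=frozenset({"sparo"}),
)

outputs = [
    "Versò un goccio di whiskey.",       # correct sense
    "Versò uno sparo di whiskey.",       # "gunshot" sense
    "Versò un bicchierino di whiskey.",  # not in either curated set
]
labels = [classify(item, text) for text in outputs]
print(labels, accuracy(labels))
```

Keeping an explicit “unknown” label matters in a sketch like this: a translation that uses a word outside both curated sets cannot be judged automatically, so it is left out of the score instead of being counted as an error.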

And the results were interesting, to say the least. DeepL outperformed every other system, while the rest performed particularly poorly, earning extremely low scores of between 20% and 33%. Most shocking of all, Google Translate performed the worst across all languages. Alongside general accuracy, DiBiMT also measures semantic biases using four novel metrics, on which DeepL also vastly outperformed its competitors. Another interesting result is that DiBiMT’s findings support the existing literature’s view that verbs are harder for MT systems to translate than nouns.

Campolungo et al.’s research points out a very important flaw in the current state of MT: natural language is all too often ambiguous, and machines don’t have enough understanding of context, if any, to resolve that ambiguity. The resulting errors can range anywhere from simple miscommunications (a gunshot of whiskey instead of a shot, for example) to straight-up offensive translations (“se taper le cul par terre,” meaning to laugh out loud, translated as “banging ass on the floor”). All jokes aside, ambiguity hinders MT greatly, and this research is one step in the right direction: eliminating bias and ambiguity to make for more accurate, context-aware translation models. After all, Sun Tzu says that you have to know your enemy to defeat them.


“Requirements and Motivations for Low-Resource Speech Synthesis for Language Revitalization” (Best Special Theme Paper Award)

Authors: Aidan Pine, Dan Wells, Nathan Thanyehténhas Brinklow, Patrick Littell, Korin Richmond

This paper grabbed our attention, as it focuses on building speech synthesis systems for Kanien’kéha, Gitksan, and SENĆOŦEN, three Indigenous languages spoken in Canada, and the motivation and development behind such systems. The paper, while maintaining its technical integrity, reads almost like a sociology paper, given its in-depth insight into the sociological reasons why such speech synthesis systems are necessary.

The authors start out with an overview of the current state of Indigenous languages spoken in Canada. While there are about 70 Indigenous languages spoken in the country, most of these languages have fewer than 500 fluent speakers as a result of the “residential school system and other policies of cultural suppression.” This isn’t to lament the death of Indigenous languages; on the contrary, the authors note that such “‘doom and gloom’ rhetoric that often follows endangered languages over-represents vulnerability and under-represents the enduring strength of Indigenous communities who have refused to stop speaking their languages despite over a century of colonial policies against their use.”

In other words, Pine et al. are less concerned with the death of these Indigenous languages, as they have survived thus far through the will and perseverance of their speakers; furthermore, there has been growing interest in the revival of Indigenous languages. The authors also note that such language revitalization efforts “extend far beyond memorizing verb paradigms to broader goals of nationhood and self-determination,” and that programs endorsing language revitalization can have “immediate and important impacts on factors including community health and wellness.” Given that the UN has declared 2022-2032 to be an International Decade of Indigenous Languages, and the amazing news that the number of Indigenous language speakers in Canada increased by 8% between 1996 and 2016, these concerns regarding language revitalization are more relevant than ever.

The authors argue for the necessity of speech synthesis for these Indigenous languages, as it would complement previously existing pedagogical grammar models (like the Kanien’kéha verb conjugator, Kawennón:nis) and “add supplementary audio to other text-based pedagogical tools.” However, speech synthesis itself isn’t exactly a much-developed field; for Indigenous languages in Canada, the problems are compounded by the fact that most Canadian Indigenous languages are low-resource, as they lack data and speakers. 

With these issues in mind, the research delves into a more technical aspect of language revitalization. Exactly how much data is needed to build such a system so that it meets certain pedagogical standards? What would the evaluation of such a system look like? And, “how is the resulting system best integrated into the classroom?” To answer these questions, the paper takes a number of steps to gauge the current state of speech synthesis models and investigate the feasibility of speech synthesis given the amount of data necessary to support the authors’ goals. The paper also delineates the steps taken to build what are now “the first neural speech synthesis systems for Indigenous languages spoken in Canada.”

The authors then carried out listening test evaluations of the speech synthesis models, to which participants reacted positively. The evaluations “showed encouraging results for the naturalness and acceptability of voices for two languages, Kanien’kéha and Gitksan, despite limited training data availability.” In conclusion, the findings of the research “show great promise for future work in low-resource TTS for language revitalization, especially as they come from systems trained from scratch on such limited data, rather than pre-training on a high-resource language and subsequent fine-tuning on limited target language data.”