Last week, Meta announced the first breakthrough in its No Language Left Behind project, its effort to “develop high-quality machine translation capabilities for most of the world’s languages.” That breakthrough, explained in full detail in this blog post and this paper, is a single AI model called NLLB-200, a state-of-the-art model capable of translating between 200 different languages, many of which are low-resource languages. Compared to the fewer than 25 African languages covered by widely used translation tools, NLLB-200 supports 55 African languages with “high-quality results.”
That has been Meta’s goal: to center previously marginalized low-resource languages by incorporating them into the general trend towards MT improvement. In the introductory video posted on Meta’s NLLB website, research engineer Vedanuj Goswami speaks of his grandmother, who used to write poetry in Assamese, a low-resource language with little online presence that is nonetheless spoken by over 23 million people in Northeast India. “So my grandmother writes poems in Assamese, which is a low-resource language,” Goswami says, “and I envision a future in which I can easily translate her poems into a high-resource language, so that the world can appreciate her poems.”
We have covered Meta’s frequent forays into advancing machine translation technology for quite some time, and we are surprised each time at how faithful Meta stays to its mission of uniting the world through the elimination of language barriers. More than anything, we are struck by how human-centered and intrinsically meaningful their work is; the researchers at Meta have a profound understanding of the historical and political reasons that have affected the currently biased state of machine translation and focus on democratizing and leveling the playing field for speakers of all languages. Meta’s research paper on NLLB-200 claims the following: “those who communicate in English, French, German, or Russian—languages which have long enjoyed institutional investments and data availability—stand to gain substantially more from the maturation of machine translation than those who speak Catalan, Assamese, Ligurian, or Kinyarwanda… Many of these languages escape researchers’ gaze for a confluence of reasons, including constraints conjured up by past investments (or lack thereof), research norms, organizational priorities, and Western-centrism to name a few. Without an effort to course correct, much of the internet could continue to be inaccessible to speakers of these languages.”
While numerous technical difficulties plague MT development for low-resource languages (data scarcity and costly data procurement, for example), Meta is determined to overcome these challenges by utilizing multilingual systems. NLLB-200 is one big step in the journey to a universal machine translation model. However, it is important to identify the specific challenges that low-resource languages face, especially as they are integrated into a multilingual framework such as NLLB-200.
Modern MT systems rely heavily on large amounts of data; these data-hungry systems require “millions of sentences carefully matched between languages.” To make up for the lack of such parallel datasets, current models mine data from the web, which often yields poor-quality data because the source text differs from language to language. Furthermore, the researchers note that such data is “often full of incorrect or inconsistent spellings and is missing accent marks and other diacritical marks.”
How it works
Structurally, the Meta researchers take a four-step approach in developing No Language Left Behind. First, the researchers analyze and identify problems in low-resource translation from the viewpoint of the speakers of those languages. Second, the researchers learn to create training data for the low-resource languages. Third, they use this data to develop cutting-edge translation models. Lastly, the researchers evaluate their efforts and results. The process is summed up neatly in the following diagram:
That is not all, however; NLLB-200 was not created in a vacuum, but rather builds on years of prior research. The researchers note the importance of FLORES-200, a “many-to-many multilingual dataset that allows [researchers] to measure translation quality through any of the 40,602 total translation directions.” According to the Meta blog post, FLORES-200 has the power to evaluate translation systems across various media, including “health pamphlets, films, books, and online content.” With FLORES-200, the model undergoes extensive scrutiny, so that the standards against which it is evaluated are rigorous and of high quality.
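The 40,602 figure is consistent with simple counting: in a many-to-many benchmark, every ordered pair of distinct languages is one translation direction. The short sketch below shows that arithmetic. The count of 202 language varieties is an assumption inferred from the quoted figure, not a number stated in this article.

```python
def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target."""
    return num_languages * (num_languages - 1)


# Assumption: 202 language varieties, inferred from the quoted 40,602 figure.
assumed_languages = 202
print(translation_directions(assumed_languages))  # 40602
```

Because the count grows roughly with the square of the number of languages, each language added to such a benchmark adds hundreds of new directions to evaluate.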
There’s also LASER3, which helps researchers “mine web data to create parallel datasets for low-resource languages.” LASER3 is an upgraded version of LASER, previously used for “zero-shot transfer in natural language processing (NLP)”; LASER3 utilizes a Transformer model “trained in a self-supervised manner with a masked language modeling objective.” In applying LASER3 to NLLB, the researchers were able to produce massive amounts of sentence pairs for all languages, including low-resource ones.
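To give a rough sense of what mining parallel data with sentence embeddings can look like, here is a minimal sketch. It is not LASER3 and not Meta's pipeline: the vectors are hand-made toy values standing in for encoder output, and the matching rule (mutual nearest neighbor by cosine similarity, plus a threshold of 0.8) is a simplification chosen for illustration. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbors
    whose similarity is at least `threshold` (an illustrative value)."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sim):
        # Best target sentence for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source sentence for that target sentence.
        column = [sim[k][j] for k in range(len(sim))]
        back = max(range(len(column)), key=column.__getitem__)
        if back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings: in practice these would come from a multilingual encoder.
src_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_vectors = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

for src_i, tgt_j, score in mine_pairs(src_vectors, tgt_vectors):
    print(f"source {src_i} <-> target {tgt_j} (similarity {score:.3f})")
```

In this toy data, the third source sentence has no close counterpart, so it is left unpaired rather than forced into a poor match.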
Developing and utilizing such evaluation methods and data creation techniques allowed the researchers to develop “mixture-of-experts networks” which allow for smoother integration of low-resource languages into the multilingual models. Such networks allowed the researchers to “iterate, scale, and experiment with a single multilingual model much more easily than with hundreds or even thousands of different bilingual models.”
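The general idea behind a mixture-of-experts layer can be sketched in a few lines: a small gating function scores each expert for a given input, and only the top-scoring experts are run and blended. The sketch below is a generic illustration under stated assumptions, not NLLB-200's architecture. The top-2 routing, the gate weights, and the three toy experts are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by gate weight."""
    # One gating logit per expert: dot product of its gate row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:
        expert_out = experts[i](x)
        for d in range(len(x)):
            output[d] += (probs[i] / norm) * expert_out[d]
    return output, chosen


# Hypothetical toy experts; real experts would be feed-forward sublayers.
toy_experts = [
    lambda v: [2 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1 for a in v],
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

mixed, used = moe_forward([3.0, 1.0], toy_gate, toy_experts)
print("experts used:", used)
print("mixed output:", mixed)
```

Since only the selected experts run for each input, the total parameter count can grow without every input paying for every parameter, which is one reason this design suits a single model shared by many languages.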
At the same time, the researchers note that they were always wary of data toxicity and quality. Using LID-200 models, the researchers filtered the data gleaned from the internet and eliminated noise from it. Then, the researchers composed toxicity lists, which were used to “assess and filter potential and hallucinated toxicity.” All this serves to clean up and improve the data and its quality, “reducing the risk of what is known as hallucinated toxicity, where the system mistakenly introduces toxic content during the translation process.” Example “hallucinations” can be found below.
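As a rough illustration of how a word list can flag hallucinated toxicity, the sketch below marks a sentence pair when the translation contains a listed term but the source does not. This is a simplified stand-in, not Meta's filter: the function names, the whitespace tokenization, and the placeholder word lists are all assumptions made for the example.

```python
def toxic_terms(sentence, toxicity_list):
    """Return the listed terms that appear in the sentence (naive tokenization)."""
    tokens = {token.strip(".,!?") for token in sentence.lower().split()}
    return tokens & toxicity_list


def has_hallucinated_toxicity(source, translation, src_list, tgt_list):
    """True when the translation has a listed term but the source has none."""
    return bool(toxic_terms(translation, tgt_list)) and not toxic_terms(source, src_list)


# Placeholder lists; real lists would be curated per language.
source_list = {"srcbadword"}
target_list = {"badword"}

clean = has_hallucinated_toxicity(
    "a harmless sentence", "a harmless translation", source_list, target_list
)
flagged = has_hallucinated_toxicity(
    "a harmless sentence", "a badword translation!", source_list, target_list
)
print(clean, flagged)  # False True
```

Note that a pair is not flagged when the source itself contains a listed term, since toxicity that is faithfully carried over is a different case from toxicity the system invents.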
The researchers proudly announce that “in total, NLLB-200’s BLEU scores improve on the previous state of the art by an average of 44 percent across all 10k directions of the FLORES-101 benchmark.” What is more exciting, however, is that for some of the African and Indian languages covered by NLLB-200, the increase in score is more than 70 percent compared to modern translation systems. Shown below is the comparison between NLLB-200 and existing state-of-the-art models.
Lastly, the researchers conducted human evaluations on the various languages covered by the model, assessing the effectiveness of NLLB-200. By analyzing the results of both human and machine evaluation, the researchers were able to “reflect on the creation process, analyzing the risks and benefits of our research from a societal standpoint.” To that end, the researchers have open-sourced all the data, models, and benchmarks relevant to NLLB-200 to aid further research carried out by individuals and third parties.
All this makes for an efficient multilingual system that leverages new techniques of data generation and language integration to power low-resource language translation in ways never attempted before. The integrated diagram of the entire process, taken from the paper, is shown below. Below that is a timeline of the developments that preceded NLLB-200.
Meta has partnered with the Wikimedia Foundation, renowned for hosting Wikipedia and other sources of free information, to improve the status of low-resource languages on Wikipedia. The researchers note that there are large disparities in available information that don’t reflect the true population of low-resource language speakers. For example, “there are around 3,260 Wikipedia articles in Lingala, a language spoken by 45 million people… contrast that with a language like Swedish, which has 10 million speakers in Sweden and Finland and more than 2.5 million articles.”
This will be the first major application of NLLB-200, although Meta’s decision to open-source their entire project means other groups are welcome to utilize the model for better communication purposes. In fact, Meta AI is also providing “up to $200,000 of grants to nonprofit organizations for real world applications for NLLB-200,” in hopes that their “slew of research tools” will “enable other researchers to extend this work to more languages and build more inclusive technologies.”
Meta’s dream of a linguistically connected world doesn’t just stop at facilitating communication between speakers of different languages, nor does it only pertain to social media, commerce, and other conventionally important fields in today’s rather capitalistic societies. While written media today are dominated by a handful of languages, the researchers believe that “NLLB will help preserve language as it was intended to be shared rather than always requiring an intermediary language that often gets the sentiment/content wrong.”
The word “preserve” catches our attention. Not only will NLLB-200 translate low-resource languages into high-resource ones, but the reverse process, in which high-resource languages are translated into low-resource ones, will also create written data in the low-resource languages. And by fostering the use of low-resource languages (a result of easing the pressure to conform to high-resource languages), the world will manage to stay as linguistically diverse as possible, all the while giving speakers of low-resource languages the power to record their languages in writing (or perhaps in speech-format data), effectively “preserving” their linguistic existence in some shape or form.
The researchers also dream of the NLLB project helping to “advance other NLP tasks, beyond translation.” They give examples of “building assistants that work well in languages such as Javanese and Uzbek or creating systems to take Bollywood movies and add accurate subtitles in Swahili or Oromo.” For speakers of high-resource languages, it is hard to fathom a world in which a piece of writing or visual media has not yet been translated into their language.
Many people are under the assumption that adopting a lingua franca (English, in this era) is the most effective, democratic way to communicate with others. After all, what is more democratic than a shared language? However, the Meta researchers’ efforts aim to deconstruct this notion of a lingua franca. With the technological and scientific developments of this age, there is no reason for people to succumb to the power of a more influential language. In this sense, the researchers announce their mission statement: “the ability to build technologies that work well in hundreds or even thousands of languages will truly help to democratize access to new, immersive experiences.” In 2022, democratization is not assimilation. It is accepting that the future is many, not one.