Meta’s Open-Source LLM, 24 New Languages for Google Translate, and Amazon NLP: Language Industry Updates for May 2022

We here at Sprok DTS hope everyone is having a beautiful spring. We are back with some exciting news in the language industry, which continues to forge ahead with brilliant new developments in machine translation despite all the hardships plaguing the world. From Amazon’s ambitious plans to increase the number of languages supported by Alexa to Google’s coverage of 24 new languages across Asia, Africa, and South America, machine translation researchers are working harder than ever to bridge the gap between nations and cultures, bringing us closer in these desperate times.


Developing Zero-Resource Machine Translation for Wider Coverage in Google Translate

Research scientists Isaac Caswell and Ankur Bapna recently announced that Google’s front-end translation service, Google Translate, now covers an additional 24 under-resourced languages. Included are Asian, South American, and African languages that are spoken by large populations but lack substantial data for proper model training, such as Assamese (spoken by 25 million people in Northeast India), Quechua (spoken by 10 million people in Peru, Bolivia, Ecuador, and neighboring countries), and Luganda (spoken by 20 million people in Uganda and Rwanda). This brings the total number of supported languages to an impressive 133.

Previously, Google Translate’s language offerings were predominantly European; for example, the service supported Frisian, Maltese, Icelandic, and Corsican, all of which have fewer than 1 million native speakers, but not Bhojpuri (nearly 51 million speakers) or Oromo (nearly 24 million). This latest update is an effort to represent the world’s languages more proportionately and to envision a more linguistically diverse world in which native speakers in South Asia, Africa, and South America can better enjoy the benefits of machine translation.

While the update was long overdue, it was no easy task for the researchers. The 24 added languages are what research scientists call “long-tail languages”: languages with datasets so scarce that they require machine learning techniques that can “generalize beyond the languages for which ample training data is available” (Bapna et al., 4). Whereas the languages previously offered by Google Translate were trained, as translation models usually are, on parallel text datasets, these 24 languages had to be learned from monolingual text.

We’ve covered this kind of “zero-shot,” monolingual model training before on this blog; training on bilingual parallel data is impractical for lesser-spoken languages because so little of it exists, and these 24 languages are a testament to how well zero-shot machine translation can work when given ample time and resources. In the case of Google Translate, researchers “train[ed] a single giant translation model on all available data for over 1000 languages” (Caswell and Bapna).

A graph comparing the amount of parallel data and monolingual data available for major languages.

Other parts of the process posed more technical challenges. Finding high-quality, effective data was difficult for these under-resourced languages, as “many publicly available datasets crawled from the web often contain more noise than usable data” (Caswell and Bapna). To compensate, the researchers “trained a Transformer-based, semi-supervised LangID model… to better generalize over noisy web data.” In doing so, they were able to end up with high-quality content and data to train the model.
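To make the filtering idea concrete, here is a minimal, hypothetical sketch of LangID-based corpus cleaning. A toy stub stands in for Google’s trained Transformer classifier, and the function names and confidence threshold are illustrative, not taken from the paper:

```python
# Hypothetical sketch of LangID-based corpus filtering (not Google's actual
# pipeline). A real system would use a trained, Transformer-based LangID
# classifier; a toy stub stands in here so the filtering logic is runnable.

def stub_langid(sentence):
    """Stand-in for a trained LangID model: returns (language, confidence)."""
    # Toy heuristic: Assamese is written in the Bengali script block
    # (U+0980 to U+09FF), so any such character marks the sentence as "as".
    if any("\u0980" <= ch <= "\u09FF" for ch in sentence):
        return ("as", 0.9)
    return ("en", 0.6)

def filter_monolingual(sentences, target_lang, min_confidence=0.8):
    """Keep only sentences confidently assigned to the target language."""
    kept = []
    for s in sentences:
        lang, confidence = stub_langid(s)
        if lang == target_lang and confidence >= min_confidence:
            kept.append(s)
    return kept

# A noisy "web crawl": one in-language sentence mixed with noise.
crawl = ["অসমীয়া বাক্য", "english boilerplate", "404 page not found"]
print(filter_monolingual(crawl, "as"))  # keeps only the Assamese-script line
```

The point of the semi-supervised model in the real pipeline is precisely to replace such brittle heuristics with a classifier that generalizes over noisy web text.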

Caswell and Bapna note that communicating with native speakers was critical to developing machine translation for these under-resourced languages. The researchers “collaborated with over 100 people at Google and other institutions who spoke these languages” (Caswell and Bapna), receiving help with tasks like developing filters to remove out-of-language content and transliterating between the various scripts used by a language.

There is still a long way to go for these languages—and multilingual, zero-shot language models. Caswell and Bapna stress that “the quality of translations produced by these models still lags far behind that of the higher-resource languages supported by Google Translate,” and advise users to practice caution when interpreting translation outputs.

A graph plotting translation quality (RTTLangIDChrF) against the number of monolingual sentences available (horizontal axis).


Amazon’s MASSIVE Dataset for a Multilingual Alexa

While Google’s artificial intelligence research is admirable, the company doesn’t offer a virtual assistant as well known as Apple’s Siri or Amazon’s Alexa. Late last month, Jack G. M. FitzGerald, a senior applied scientist in Alexa AI’s Natural Understanding Group, made an exciting announcement: Amazon is releasing its very own dataset, MASSIVE, which is “composed of one million labeled utterances spanning 51 languages, along with open-source code, which provides examples of how to perform massively multilingual NLU modeling.”

In contrast to Google’s focus on natural language processing (Google Translate is, above all, a translating processor of language), Amazon focuses more on natural language understanding (NLU): Alexa’s users are always in conversation with the assistant, which must first understand the intent and purpose of an utterance before providing an apt answer. FitzGerald explains this in simpler terms: “given the utterance ‘What is the temperature in New York?’, an NLU model might classify the intent as ‘weather_query’ and recognize relevant entities as ‘weather_descriptor: temperature’ and ‘place_name: new york’” (FitzGerald).
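FitzGerald’s example can be pictured as a small data structure. The `NLUResult` class below is a hypothetical illustration of what an NLU model’s output might look like, not Alexa’s actual API; only the intent and slot values come from the quote above:

```python
# Hypothetical illustration of an NLU model's output for FitzGerald's example.
# The NLUResult class is invented for this sketch; only the intent and slot
# values come from the quoted example.

from dataclasses import dataclass, field

@dataclass
class NLUResult:
    utterance: str
    intent: str
    slots: dict = field(default_factory=dict)

result = NLUResult(
    utterance="What is the temperature in New York?",
    intent="weather_query",
    slots={"weather_descriptor": "temperature", "place_name": "new york"},
)
print(result.intent)               # weather_query
print(result.slots["place_name"])  # new york
```

A downstream component would then dispatch on the intent (here, a weather lookup) and fill in the request using the recognized slots.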

MASSIVE is Amazon’s answer to developing a more robust multilingual paradigm for Alexa. MASSIVE is short for Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation (what a mouthful) and contains “one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots” (FitzGerald). With this, Alexa would be able to carry out intent-identification and response-evaluation processes like the one above in 51 languages. Furthermore, the creation of such a dataset and paradigm means that Alexa, and other virtual assistants, can continue to build on preexisting datasets to cover more languages, including under-resourced ones that have previously been overlooked for lack of sample data and text.

Amazon has released and open-sourced the dataset and accompanying code on GitHub, which means researchers and scholars can learn from, study, and improve the dataset. Alongside the release, Amazon is also hosting the Massively Multilingual NLU 2022 competition, inviting competitors to participate in two tasks. In the first, competitors train and test a single model on all 51 languages using the full MASSIVE dataset. In the second, they fine-tune a pretrained model on English-labeled data only and test it on the 50 non-English languages. FitzGerald hopes that wide participation in these tasks will help develop and improve zero-shot learning for machine translation and natural language understanding.


Meta Democratizes Access to Large-Scale Language Models

Earlier this month, Meta announced that it is open-sourcing its Open Pretrained Transformer (OPT-175B), a “language model with 175 billion parameters trained on publicly available data sets.” In a Meta AI blog post, research engineer Susan Zhang, research scientist Mona Diab, and research director Luke Zettlemoyer explain that large language models have been fundamental to recent developments in NLP and AI research, allowing machines to “generate creative text, solve basic math problems, answer reading comprehension questions, and more.” However, these large language models are heavily guarded secrets, accessible to only a few high-resource labs. This is a problem, Zhang et al. claim, because secretive, isolationist policies on model access largely hinder progress.

With Meta’s release of OPT-175B, researchers and scholars in the larger NLP and AI community can pry into the inner workings of a large language model, the largest ever to be publicly released, and examine “both the pretrained models and the code needed to train and use them” (Zhang et al.). Meta does note that the model comes with a noncommercial license, restricting its use to academic, government, civil society, and industry research around the world, so as to prevent misuse.

While OPT-175B signifies a new chapter in community engagement in NLP research, it is also a trial, testing how the AI community will manage and uphold the tenets of integrity and fair use as such important information is shared online. In the words of Zhang et al.: “We believe the entire AI community — academic researchers, civil society, policymakers, and industry — must work together to develop clear guidelines around responsible AI in general and responsible large language models in particular, given their centrality in many downstream language applications.”

The concerns are not only political and moral in nature; AI research can be power-hungry, costly, and harmful to the environment. Meta had this in mind: OPT-175B was developed with only a fraction of the carbon footprint of comparable models. Community engagement with models of this size is a new frontier, and companies and individual researchers alike are advised to be mindful of their intentions and purposes as they put OPT-175B’s powers to use.