Sprok DTS Blog

Pioneering Low-Resource Language Translation with NeuralSpace

A couple of weeks ago, we posted a news update about London-based NeuralSpace raising 1.7 million USD in a seed round. NeuralSpace is a SaaS (Software as a Service) platform offering groundbreaking applications for 87 languages, many of which are low-resource languages. It allows customers to train, deploy, and use their own AI models for language processing applications without prior knowledge of machine learning. This is useful for companies that need straightforward NLP solutions for their websites or products, especially enterprises working in low-resource languages.

NeuralSpace’s CEO Felix Laumann recently sat down with SlatorPod’s hosts—Esther Bond and Florian Faes—to speak about the history of his company and share some insights about the services it offers. In a landscape that’s unforgiving for low-resource languages, it’s important to hear the stories of pioneers who shy away from mainstream high-resource languages (English, FIGS) and cater directly to a market for low-resource languages.

The Birth of NeuralSpace

With a strong background in computer engineering, mechanical engineering, and statistics, Laumann has always been fascinated by the mathematical models that power language technology. Laumann also mentions his experiences and interests in foreign languages, noting the beautiful differences between them and the important details that often go unseen in low-resource languages. Specifically, he mentions the way Indonesian has five different words for the word “ocean,” and how beautifully, markedly different the Tamil and Tibetan scripts are from the English alphabet.

Given all this, Laumann partnered with some colleagues and founded NeuralSpace three years ago. Driven by a shared passion for low-resource languages, NeuralSpace’s initial members focused on providing NLP solutions specifically for those languages, despite the small amount of data available.

Laumann notes that the data collection process is abysmal for low-resource languages. A low-resource language, by definition, is a language without a substantial dataset available for analysis and training, comprising less than 1% of the internet’s written material. Data is mostly taken from the internet, the most freely available source of written material, but NeuralSpace also works with data acquisition companies that help collect spoken and written language data for low-resource languages. In some cases, NeuralSpace works directly with linguists to create its own data, asking them to translate sentences line by line. What’s important is that the dataset covers everyday words and phrases or, in Laumann’s words, kitchen conversations.

It’s a noble cause, and it raises the question of how the company came into such a substantial amount of funding. According to Laumann, raising capital was a relatively straightforward process; NeuralSpace was part of an accelerator program that facilitated fundraising. Laumann recalls pitching his company’s ideas and mission at various online and offline events around London, where he is based.

NeuralSpace, Now & Later

Three years after its founding, NeuralSpace is now a popular SaaS option for chatbot or conversational AI development companies, although the actual use cases vary widely. Given NeuralSpace’s unique position in the market as a provider of low-resource language solutions, customers look to NeuralSpace as an efficient, effective alternative or complement to frontend translation engines such as Google Translate.

Where does NeuralSpace see itself in the future? Laumann talks in depth about his aspirations for the company. His current interests lie primarily in voice-to-voice live translation for low-resource language pairs, which is something Google Translate has yet to do. It’s an arduous task, says Laumann: voice-to-voice live translation requires speech-to-text, text-to-text translation, and text-to-speech, each near-perfect and all working in tandem. Yet voice-to-voice translation is a highly sought-after function in the modern global language market, and any company that cracks it is bound to rise to the top. Building on this, Laumann hopes that NeuralSpace can cover any fundamental NLP project or problem, becoming a one-stop solution for any NLP needs a company might have. To do this, he hopes to make NeuralSpace’s NLP functionalities more customizable, offering a wide array of solutions that can be fitted together to provide optimal NLP interfaces for companies with varying needs.
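To see why each stage has to be near-perfect, consider a cascaded pipeline in which speech recognition, text translation, and speech synthesis run one after another. The sketch below is a back-of-the-envelope calculation, not NeuralSpace’s method: the per-stage accuracies are hypothetical figures, and it assumes that errors at each stage are independent.

```python
import math


def end_to_end_accuracy(stage_accuracies: list[float]) -> float:
    """Chance that an utterance passes through every stage without error,
    assuming the stages fail independently of one another."""
    return math.prod(stage_accuracies)


# Hypothetical per-stage accuracies for a three-stage cascade:
# speech-to-text, text-to-text translation, text-to-speech.
good_stages = [0.95, 0.95, 0.95]
near_perfect_stages = [0.99, 0.99, 0.99]

print(f"{end_to_end_accuracy(good_stages):.3f}")          # 0.857
print(f"{end_to_end_accuracy(near_perfect_stages):.3f}")  # 0.970
```

Under these assumptions, three stages that are each right 95% of the time get only about 86% of utterances through cleanly, which is why small per-stage errors matter so much in a cascade.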

NeuralSpace’s mission boils down to a simple cause: to “democratize NLP and make sure any developer can create software with advanced language processing in any language and not just English.” This mission highlights two important, persistent problems in NLP. The first is the obscurity of NLP concepts like BERT, lemmatization, or tokenization, and the resulting lack of machine learning knowledge, which inhibits training, deploying, and scaling NLP models. The second is the lack of NLP solutions in low-resource languages spoken in major parts of the world.

How Does NeuralSpace Do It?

None of this answers a question that, at this point, lies at the heart of NeuralSpace. How on earth does NeuralSpace deal with the challenges that low-resource languages face? How does the company cope with the lack of annotated or unlabeled datasets, or the myriad of dialects present in some wider language families? There are clear, surefire processes that NeuralSpace employs to deal with these problems systematically and fundamentally.

First is transfer learning, or leveraging prior knowledge to solve new tasks—the way humans do. NeuralSpace’s applications are based on language models that are highly adaptable, even in low-resource settings. These language models do not require annotated data and generate language abilities via unsupervised learning (as compared to supervised learning with bilateral and bilingual corpora, which is the predominant method for high-resource languages). Despite the possibilities that unsupervised, adaptable models offer, they aren’t exactly useful for specific tasks like “classifying user intents off-the-shelf,” so NeuralSpace fine-tunes these models to the point where they can solve user-specific tasks with limited amounts of data. In the process, the models learn to solve tasks accurately despite data scarcity—all through transfer learning.

Laumann gives an example of this in a blog post about low-resource language models:

…let’s take the case of an e-commerce chatbot. The chatbot is supposed to answer queries and resolve customer issues around delivery time, refunds and product specifications. To form a conversation, the chatbot must first understand the intent of the customer, then a few entities. For example, “Where is my Razer Blade 14 that I ordered on the 4th of December?”, whose intent is classified as “check order status” with the entities “laptop”: “Razer Blade 14” and “date”: “4th of December”. Thus, we will need a simple intent classification model.
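The structured result described in the quote can be written down as a small record. This is only an illustrative sketch: the class name ParsedUtterance and the helper entity_spans are hypothetical and are not part of any NeuralSpace API. The intent and entity values are copied by hand from the example above rather than predicted by a model.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedUtterance:
    """Structured result a chatbot needs before it can act on a message."""
    text: str
    intent: str
    entities: dict[str, str] = field(default_factory=dict)


# The example from the quoted blog post, written out by hand.
parsed = ParsedUtterance(
    text="Where is my Razer Blade 14 that I ordered on the 4th of December?",
    intent="check order status",
    entities={"laptop": "Razer Blade 14", "date": "4th of December"},
)


def entity_spans(p: ParsedUtterance) -> dict[str, tuple[int, int]]:
    """Locate each entity value in the original text as (start, end) offsets."""
    spans = {}
    for label, value in p.entities.items():
        start = p.text.find(value)
        if start != -1:  # skip values that do not appear verbatim
            spans[label] = (start, start + len(value))
    return spans


for label, (start, end) in entity_spans(parsed).items():
    print(label, "->", parsed.text[start:end])
```

An intent classification model fills in the intent field, and an entity recognizer fills in the entities; the downstream chatbot logic only needs this record.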

The problem, however, is that there are hundreds of intents and corresponding actions that follow such intents. Laumann confesses that it’s “expensive to annotate huge training datasets and time-taking to train a well-performing model from scratch.” But with transfer learning and fine-tuning, NeuralSpace’s models can give accurate solutions with fewer inputs, thereby saving data annotation costs and gaining model performance. The best part, says Laumann, is that “developers do not even need to think about transfer learning” because NeuralSpace’s optimization algorithm (AutoNLP) takes care of everything all on its own.
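The idea of reusing prior knowledge with very little labeled data can be illustrated with a toy example. The sketch below is not NeuralSpace’s AutoNLP, whose internals the source does not describe. It also swaps fine-tuning for a simpler form of transfer: a frozen feature extractor plus a tiny trainable head (one centroid per intent). The PRETRAINED word vectors are hand-made, hypothetical values standing in for representations that a real language model would learn from unlabeled text.

```python
import math
import re

# Hypothetical "pre-trained" word vectors (toy values, not from a real model).
PRETRAINED = {
    "where":    (0.9, 0.1, 0.0),
    "track":    (0.9, 0.0, 0.1),
    "order":    (0.7, 0.2, 0.1),
    "package":  (0.8, 0.1, 0.1),
    "delivery": (0.8, 0.0, 0.2),
    "refund":   (0.1, 0.9, 0.0),
    "money":    (0.1, 0.8, 0.1),
    "return":   (0.2, 0.8, 0.0),
    "back":     (0.2, 0.7, 0.1),
}
DIM = 3


def embed(text: str) -> tuple[float, ...]:
    """Frozen feature extractor: average the pre-trained vectors of known words."""
    words = re.findall(r"[a-z]+", text.lower())
    vectors = [PRETRAINED[w] for w in words if w in PRETRAINED]
    if not vectors:
        return (0.0,) * DIM
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def fit_head(labeled: dict[str, list[str]]) -> dict[str, tuple[float, ...]]:
    """Train the small task-specific part: one centroid per intent."""
    centroids = {}
    for intent, examples in labeled.items():
        vectors = [embed(e) for e in examples]
        centroids[intent] = tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
    return centroids


def predict(text: str, centroids: dict[str, tuple[float, ...]]) -> str:
    v = embed(text)
    return max(centroids, key=lambda intent: cosine(v, centroids[intent]))


# Only two labeled examples per intent.
labeled = {
    "check order status": ["Where is my order?", "Track my package"],
    "request refund": ["I want a refund", "Give my money back"],
}
head = fit_head(labeled)
print(predict("When will the delivery arrive?", head))  # check order status
print(predict("Can I return this laptop?", head))       # request refund
```

Neither “delivery” nor “return” appears in the labeled examples; the predictions work only because the frozen vectors already place those words near related ones. That is the sense in which transferred knowledge reduces the amount of annotation needed.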

NeuralSpace’s second answer to low-resource language processing is multilingual learning, in which a single model is trained on multiple languages. Multilingual learning works on the assumption that “the model will learn representations that are very similar for similar words and sentences of different languages.” For example, NeuralSpace’s language model can transfer knowledge from a high-resource language (e.g. English) to a low-resource language (e.g. Swahili) via transfer learning, utilizing similarities between the languages. According to Laumann, this approach is much easier to scale, requires less storage, and allows the model to be upgraded to better architectures much more easily.

NeuralSpace’s multilingual models result in higher performance across languages and also allow NLP models to make generalizations and inferences about languages they have not been fine-tuned on. Laumann gives the example of Tamil, for which predictions can be made with high accuracy using previously trained English, Hindi, and Marathi data. Finally, multilingual models are simply cheaper to host: only one model is necessary for numerous languages, instead of one model for every single language.

The third and last answer to low-resource language processing is data augmentation, which is a “data pre-processing strategy that automatically creates new data without collecting it explicitly.” By synthesizing various iterations of the same sentence, a model can diversify its own training data in a cheap, fast, and unsupervised manner. NeuralSpace’s own data augmentation application, called NeuralAug, allows models to switch out words, word order, and translations to enrich the dataset of any given low-resource language, enhancing the overall robustness of the model.
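A minimal sketch of two of the operations mentioned here, switching out words and changing word order, might look like the following. It is not NeuralAug, whose interface the source does not describe; the SYNONYMS table and all function names are hypothetical, and a real system would draw substitutions from a lexicon or a model rather than a hand-written dictionary.

```python
import random

# Hypothetical synonym table used only for this illustration.
SYNONYMS = {
    "package": ["parcel", "shipment"],
    "arrive": ["come"],
    "refund": ["reimbursement"],
}


def synonym_swap(tokens: list[str], rng: random.Random) -> list[str]:
    """Replace each word that has synonyms with one of them, half of the time."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        out.append(rng.choice(options) if options and rng.random() < 0.5 else tok)
    return out


def neighbor_swap(tokens: list[str], rng: random.Random) -> list[str]:
    """Swap one randomly chosen pair of adjacent words."""
    out = list(tokens)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def augment(sentence: str, n: int, seed: int = 0) -> list[str]:
    """Return up to n distinct variants of the sentence, excluding the original."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    tokens = sentence.split()
    variants: set[str] = set()
    for _ in range(n * 10):  # bounded number of attempts
        candidate = synonym_swap(tokens, rng)
        if rng.random() < 0.5:
            candidate = neighbor_swap(candidate, rng)
        text = " ".join(candidate)
        if text != sentence:
            variants.add(text)
        if len(variants) == n:
            break
    return sorted(variants)


for variant in augment("when will my package arrive", n=3):
    print(variant)
```

Each variant keeps the same words or close substitutes, so it can inherit the label of the original sentence; this is what lets augmentation grow a training set without new annotation. Word-order swaps can produce ungrammatical sentences, so in practice the rate of each operation would need tuning per language.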

Conclusion

These are fundamental, crucial steps that NeuralSpace takes to ensure that its low-resource language offerings are effective and accurate. This isn’t to say that low-resource language processing is nearing perfection; much work has yet to be done to render it more commercially applicable. However, the existence of NeuralSpace in the language industry landscape, as a voice of reason calling for the democratization of NLP, is meaningful in that it sheds light on the inherent biases of English- and FIGS-centric language models and offers a more level playing field for low-resource languages.

In that sense, NeuralSpace is true to Laumann’s old fascination with the beauty of languages, including those that are rarely read or heard in a cyberspace dominated by major languages. To appreciate the beauty of language is to celebrate the diversity of spoken and written languages, using them not only in local and ceremonial settings but also in hard, cold industry settings. NeuralSpace’s efforts to implement low-resource language applications in business are but the start of the rise of low-resource languages; English is counting down its years as the reigning lingua franca of the internet.

Sprok DTS is dedicated to providing fast, accurate, and professional localization and translation for all your language needs. Our services come in 72 languages, including high-resource ones such as English, Spanish, and Japanese, and low-resource ones, such as Kyrgyz, Galician, and Lao. Powered by the world’s leading neural machine translation technology, our localization experts and translators ensure speed and accuracy on all your projects. Ask for a free quote today on our website.

References
https://medium.com/neuralspace/challenges-in-using-nlp-for-low-resource-languages-and-how-neuralspace-solves-them-54a01356a71b
https://www.neuralspace.ai
https://slator.com/api-platform-neuralspace-raises-usd-1-7m-in-seed-round/
https://slator.com/neuralspace-ceo-felix-laumann-on-democratizing-nlp/