In a groundbreaking development, Meta has unveiled AI models capable of recognizing and generating speech in over 1,000 languages—a remarkable tenfold increase compared to existing capabilities. This significant advancement marks a crucial step towards preserving endangered languages, according to the company.
Meta has made these models publicly available through the widely used code-hosting service GitHub. By releasing them as open source, the company aims to help developers working across diverse languages build new speech applications, from messaging services that understand everyone to virtual-reality systems usable in any language.
Approximately 7,000 languages are spoken worldwide, yet existing speech recognition models effectively cover only around 100 of them. The limitation stems from the substantial amounts of labeled training data such models typically require, which exist for only a handful of languages like English, Spanish, and Chinese. To address this challenge, Meta's researchers took a different approach: they retrained an AI model the company developed in 2020, enabling it to learn speech patterns directly from audio without relying on extensive labeled data or transcripts.
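The key idea behind learning from unlabeled audio is that structure in the raw signal itself can supply the training target, so no transcript is needed. The toy sketch below illustrates only that principle; it is not Meta's actual pretraining method, which uses a far more sophisticated objective. The sine-wave "audio" and the mean-of-neighbors "model" are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "audio": a sequence of frames (here, scalar samples of a sine wave).
signal = np.sin(np.linspace(0, 8 * np.pi, 200))

def masked_prediction_loss(signal, mask_frac=0.2):
    """Self-supervised objective: hide some frames and predict each hidden
    frame from its neighbors. No transcript or label is ever consulted."""
    # Pick interior frames to mask (indices 1 .. len-2).
    idx = rng.choice(len(signal) - 2,
                     size=int(mask_frac * len(signal)),
                     replace=False) + 1
    # A trivial "model": predict a masked frame as the mean of its neighbors.
    preds = (signal[idx - 1] + signal[idx + 1]) / 2
    # Mean squared error between predictions and the hidden frames.
    return float(np.mean((preds - signal[idx]) ** 2))

loss = masked_prediction_loss(signal)  # small, because the signal is smooth
```

Because natural audio, like the sine wave here, is highly structured, a model can be rewarded for capturing that structure without any human-provided labels.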
The researchers used two distinct datasets for training. The first comprised audio recordings of the New Testament and their corresponding text, sourced from the internet, spanning 1,107 languages. The second consisted of unlabeled New Testament audio recordings covering 3,809 languages. After cleaning both the speech audio and the text data, the team used an algorithm to align the audio recordings with the corresponding text, then trained a second model on the newly aligned data and repeated the process. Through this iterative method, the researchers trained the model to learn new languages more efficiently, even in the absence of accompanying text.
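The alignment step can be pictured with a toy dynamic-programming example. The sketch below monotonically assigns each "audio frame" (here just a number standing in for acoustic features) to one "text unit," minimizing the total mismatch. Meta's actual alignment algorithm is far more elaborate, but the underlying idea of finding the lowest-cost monotonic pairing is the same.

```python
import numpy as np

def align(frames, targets):
    """Monotonically assign each frame to one target, minimizing total
    distance. A toy stand-in for audio-to-text forced alignment."""
    n, m = len(frames), len(targets)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(frames[i - 1] - targets[j - 1])
            # Either stay on the current target or advance to the next one.
            cost[i, j] = d + min(cost[i - 1, j], cost[i - 1, j - 1])
    # Backtrack to recover which target each frame was assigned to.
    path, i, j = [], n, m
    while i > 0:
        path.append(j - 1)
        if j > 0 and cost[i - 1, j - 1] <= cost[i - 1, j]:
            j -= 1
        i -= 1
    return path[::-1]

print(align([1, 1, 5, 5, 9], [1, 5, 9]))  # → [0, 0, 1, 1, 2]
```

In the example, the first two frames map to the first target, the next two to the second, and the last frame to the third, exactly as the frame values suggest.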
Michael Auli, a research scientist at Meta involved in the project, explains, "We can utilize what the model has learned to quickly build speech systems with minimal data. While we have extensive datasets for English and a few other languages, we lack this wealth of resources for languages spoken by smaller populations, perhaps only a thousand individuals."
The researchers report that their models can convert between speech and text in over 1,000 languages and can identify which of more than 4,000 languages is being spoken. In a comparison with competing models such as OpenAI's Whisper, Meta's models achieved a markedly lower error rate despite covering 11 times as many languages.
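Error rates for speech recognition are conventionally measured as word error rate (WER): the minimum number of word insertions, deletions, and substitutions needed to turn the system's transcript into the reference, divided by the reference length. A minimal sketch of the metric (not the evaluation code used in the study):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))       # → 0.0
print(word_error_rate("the cat sat", "the bat sat down"))  # 2 errors / 3 words
```

A lower WER at a much larger language count is what makes the comparison notable: error rates normally rise as coverage broadens.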
Nevertheless, the research team acknowledges potential challenges. The model can mistranscribe certain words or phrases, which may result in inaccurate or potentially offensive output. The team also admits that its speech recognition models produce slightly more biased language than comparable models, though the difference amounts to only 0.7%.
While the scope of this research is impressive, the use of religious texts to train AI models can be controversial, notes Chris Emezue, a researcher at Masakhane, an organization focused on natural language processing for African languages, who was not involved in the project.
Meta's advance in AI speech models opens new possibilities for language preservation and inclusivity. By enabling speech technology in a far wider range of languages, it could improve accessibility and help ensure that smaller languages are not left behind.