When you ask a language model to translate a sentence from Swahili to Dutch, or answer a question in Bengali, it's not magic. It's transfer learning: a technique that lets models use what they've learned in English to understand languages with far less data. But here's the problem: models still struggle badly with languages spoken by hundreds of millions of people. Why? And more importantly, can we fix it?
How Multilingual Models Actually Work
Large language models like XLM-RoBERTa and mT5 don't learn each language from scratch. Instead, they're trained on massive datasets containing text from over 100 languages at once. The model sees English sentences next to Spanish ones, then Hindi, then Arabic, and so on. Over time, it starts noticing patterns that repeat across languages, like how questions follow characteristic word orders, or how nouns and verbs relate in different contexts.

This is where transfer learning kicks in. If the model learns "the cat is sleeping" in English and "el gato está durmiendo" in Spanish, it starts to understand that "cat" and "gato" are similar concepts, even if the words look nothing alike. The middle layers of the transformer begin to encode meaning, not just words. That's the breakthrough: language-agnostic representations. These layers don't care whether the input is in Tamil or Turkish; they respond to the underlying idea.

But here's the catch. The model doesn't learn all languages equally. High-resource languages like English, Chinese, and Spanish have millions of pages of text in the training data. Low-resource languages like Yoruba, Quechua, or Khmer might have only a few thousand. That imbalance creates a massive performance gap.
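To make the idea of language-agnostic middle layers concrete, here is a minimal sketch using the Hugging Face transformers library. It mean-pools one middle hidden layer of xlm-roberta-base for an English and a Spanish sentence and compares them with cosine similarity. The layer index is an illustrative assumption, not a value taken from any study cited here.

```python
# Minimal sketch: compare middle-layer representations of two sentences in
# different languages. Assumes `transformers` and `torch` are installed;
# the layer choice (8) is an illustrative assumption.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def middle_layer_embedding(sentence: str, layer: int = 8) -> torch.Tensor:
    """Mean-pool one hidden layer over all tokens to get a sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[layer]          # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

en = middle_layer_embedding("the cat is sleeping")
es = middle_layer_embedding("el gato está durmiendo")
similarity = torch.nn.functional.cosine_similarity(en, es).item()
print(f"cross-lingual cosine similarity: {similarity:.3f}")
```

If the middle layers really do encode meaning rather than surface form, translation pairs like these tend to land closer together than unrelated sentences in the same language, which is the intuition behind cross-lingual transfer.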
The Performance Gap: What the Numbers Show

On benchmarks like XNLI (a test that checks whether a model understands if one sentence logically follows another), top models score 87% accuracy in English. For Swahili? Around 58%. For Urdu? 61%. That's not a small difference; it's a chasm. And it's not because these languages are harder to learn. It's because the model hasn't seen enough of them.

This is called the "curse of multilinguality." Every time you add a new language to the training set, the model has to split its attention. Research from 2024 shows that adding 50% more languages typically reduces performance in each language by 3 to 7 percentage points. Think of it like teaching a student 100 subjects at once: they'll know a little about everything, but not deeply about anything.

The problem gets worse with writing systems. Models trained mostly on Latin script (used in English, Spanish, and French) struggle with Arabic, Cyrillic, or Chinese characters. Why? Because the tokenizer, the part that breaks text into chunks the model can process, was built around Latin letters. In Arabic, a single word can have dozens of forms. In Turkish, a word can be 20 characters long and still be one unit of meaning. The default tokenizer splits these into too many tiny pieces, confusing the model.
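You can see this over-segmentation directly by tokenizing a long agglutinative word. The sketch below uses the xlm-roberta-base tokenizer; the sample words and the exact piece counts it produces are illustrative and will vary by tokenizer and vocabulary.

```python
# Sketch: count how many subword pieces a tokenizer produces for words in
# different languages. Piece counts are illustrative and vary by model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "unbelievable",
    # A long Turkish form built by stacking suffixes onto a single stem.
    "Turkish": "muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine",
    # A short Amharic phrase, as an example of a non-Latin script.
    "Amharic": "እንኳን ደህና መጣህ",
}

for language, word in samples.items():
    pieces = tokenizer.tokenize(word)
    print(f"{language:8s} -> {len(pieces):3d} pieces: {pieces[:8]}...")
```

When one unit of meaning gets shredded into many fragments, the model has to reassemble it from context every time, which is exactly where low-resource languages lose ground.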
What's Working: New Techniques to Close the Gap

Researchers aren't just accepting this gap. They're building smarter ways to transfer knowledge. One of the most promising is code-switching curriculum learning (CSCL). Instead of throwing all languages at the model at once, CSCL teaches them in order: start with high-resource languages, then mix in low-resource ones with simple sentences, and gradually increase complexity. It's like teaching a child to read: short words first, then sentences, then stories. In one study, CSCL improved performance on Indonesian (a low-resource language) by 12.7 percentage points on a question-answering task.

Another technique, multi-level knowledge distillation, trains a smaller model to mimic a larger one. The big model learns from lots of data; the small one learns to copy its understanding without needing the same amount of text. This approach boosted low-resource language accuracy on XNLI from 68.2% to 73.5%.

Some teams are even building custom tokenizers. One developer on GitHub spent weeks retraining the tokenizer for Turkish, adding 5,000 new subwords. The result? A 19% jump in accuracy. But now the model doesn't work with standard pipelines. That's the trade-off: better performance, but more work to maintain.
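In practice, a curriculum like CSCL boils down to a staged data-mixing schedule. The sketch below is a simplified, hypothetical illustration of that idea, not the published CSCL algorithm: the stage names, mixing ratios, and the load_corpus helper are all assumptions.

```python
# Hypothetical sketch of a staged, curriculum-style data mix.
# `load_corpus` is a stand-in for however you actually load sentences.
import random

def load_corpus(language: str, difficulty: str) -> list[str]:
    """Placeholder: return training sentences for a language at a difficulty tier."""
    return [f"<{language} {difficulty} sentence {i}>" for i in range(100)]

# Each stage says which languages to draw from and with what probability.
CURRICULUM = [
    {"name": "stage 1: high-resource only",
     "mix": {"en": 0.6, "es": 0.4}, "difficulty": "simple"},
    {"name": "stage 2: introduce low-resource, simple sentences",
     "mix": {"en": 0.4, "es": 0.3, "id": 0.2, "jv": 0.1}, "difficulty": "simple"},
    {"name": "stage 3: full mix, code-switched and complex text",
     "mix": {"en": 0.25, "es": 0.25, "id": 0.3, "jv": 0.2}, "difficulty": "complex"},
]

def sample_batch(stage: dict, batch_size: int = 32) -> list[str]:
    """Draw a batch whose language proportions follow the stage's mix."""
    languages = list(stage["mix"].keys())
    weights = list(stage["mix"].values())
    batch = []
    for _ in range(batch_size):
        lang = random.choices(languages, weights=weights, k=1)[0]
        batch.append(random.choice(load_corpus(lang, stage["difficulty"])))
    return batch

for stage in CURRICULUM:
    print(stage["name"], "->", sample_batch(stage, batch_size=4))
```

The point of the staging is that low-resource languages arrive once the model already has stable cross-lingual representations to attach them to, instead of competing for capacity from the first step.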
Real-World Failures and Wins
A bank in Europe tried using XLM-RoBERTa for customer service in 12 languages. For English and Spanish, customer satisfaction was 82%. For Vietnamese? 58%. For Tagalog? Just 47%. The model kept misunderstanding phrases like "I want to close my account" because it had never seen that exact wording in Tagalog. It didn't know that "papalit" (to close) and "bawasan" (to reduce) were used interchangeably in financial contexts.

But there are wins. A startup in Indonesia used CSCL to train a chatbot for farmers asking about crop prices. What took three months with standard fine-tuning took three weeks with CSCL. Accuracy jumped from 68% to 82%. The model could now handle code-switched inputs, like mixing Indonesian with Javanese, because it had been trained on exactly those patterns.

On Reddit, developers complain about "tokenization nightmares." One user wrote: "I spent 40 hours trying to get mT5 to work with Amharic. The tokenizer kept splitting every third letter. It was useless." That's not a bug; it's a design flaw. Most models assume all languages use spaces. They don't.
The Hidden Bias Problem

There's another issue nobody talks about enough: safety. When a model learns from English data, it also learns English biases. If English text contains more toxic language about certain groups, the model can copy that pattern into other languages, even if those languages didn't have that bias to begin with.

A Stanford NLP team found that outputs in low-resource languages were 23% more likely to contain harmful content than outputs in English. Why? Because the model was trained on data where toxic English phrases were paired with translations in Swahili or Bengali. The model didn't learn "this is bad"; it learned "this phrase in English goes with this phrase in Swahili." So it started generating toxic Swahili responses.

CSCL helps a little: it reduces the effect by forcing the model to see balanced examples. But it doesn't fix the root problem. We still don't have good ways to detect or remove bias across languages. That's a major blind spot.
What's Next: The Future of Multilingual AI
The industry is shifting. Meta's XLM-RoBERTa 2.0, released in March 2024, handles script conversion better, improving zero-shot performance for unseen writing systems by over 11%. Google's new dynamic script embedding reduces the gap between Latin and non-Latin scripts by nearly 19%. These are big steps.

But the real change is architectural. Experts predict that by 2027, monolithic models (single models trained on everything) will be replaced by modular systems. Imagine a core model that understands meaning, plus small, language-specific adapters. You plug in a Tamil adapter, and it handles Tamil grammar, script, and idioms. You plug in a Swahili one, and it adapts. This approach uses less compute and avoids the curse of multilinguality.

The EU AI Act, effective in 2025, will force companies to prove their systems treat all languages fairly. That's pushing Microsoft, Google, and Meta to spend millions on low-resource language research. Microsoft alone invested $47 million in 2023.
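One way to experiment with the adapter idea today is parameter-efficient fine-tuning, where a small set of trainable weights is attached, per language, to a frozen base model. The sketch below uses the peft library's LoRA support as a stand-in for the modular systems described above; it is an assumption about how such a setup might look, not a description of any vendor's architecture, and the hyperparameters are placeholders.

```python
# Sketch: attach a small, language-specific LoRA adapter to a multilingual
# base model. Hyperparameters are illustrative placeholders.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

# One adapter config per target language; only these weights get trained.
swahili_adapter = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # low-rank dimension of the adapter
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt
)

model = get_peft_model(base, swahili_adapter)
model.print_trainable_parameters()  # only a tiny fraction of the base model
```

Because the base model stays frozen, you can in principle keep one adapter per language and swap them in as needed, which is the spirit of the modular future described here, even if production systems will look more elaborate.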
Can You Build This Yourself?

Yes, but it's not easy. If you're trying to support a low-resource language, here's what you need:

- At least 5,000 labeled examples, and more if you want good accuracy.
- A custom tokenizer trained on your language's text; don't use the default one (see the sketch after this list).
- Code-switching data: real examples of how people mix languages in daily use.
- At least 100 hours of development time to tweak the pipeline.
- GPU power: CSCL training needs about 1.5x more time than standard fine-tuning.
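As a starting point for the custom tokenizer item above, one common pattern is to extend an existing vocabulary with frequent subwords mined from your own corpus and then resize the model's embeddings. The sketch below shows that pattern with the Hugging Face API; the example subwords are invented placeholders, and a production setup would mine them from real text or retrain the SentencePiece model entirely.

```python
# Sketch: extend a pretrained tokenizer with extra subwords from a
# target-language corpus, then resize the model's embedding matrix to match.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder subwords; in practice these come from frequency analysis of
# your own corpus rather than being hand-picked.
new_subwords = ["ların", "leştirici", "mişsiniz"]

added = tokenizer.add_tokens(new_subwords)
print(f"added {added} new tokens")

# New rows in the embedding matrix are randomly initialized; the model
# still needs further pretraining or fine-tuning on target-language text.
model.resize_token_embeddings(len(tokenizer))
```

This is the trade-off mentioned earlier: the extended vocabulary can segment your language far better, but the checkpoint no longer drops into standard pipelines without carrying the custom tokenizer along with it.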
Why This Matters
Over 5,000 languages are spoken today. Only about 100 have decent AI support. That's not just a technical problem. It's a social one. If your grandmother speaks only Quechua, and the only digital services available are in Spanish or English, she's locked out. If a farmer in Nepal can't ask a chatbot about weather patterns in his own language, he's at a disadvantage. Multilingual LLMs have the potential to change that. But only if we stop treating them like tools for English speakers with translation plugins. They need to be built for the full spectrum of human language, from the most spoken to the most forgotten.
Why do multilingual models perform worse in low-resource languages?

They're trained on far less text. High-resource languages like English have billions of words in the training data; low-resource languages may have only a few million. The model doesn't see enough examples to learn grammar, idioms, or context. Even if the model is multilingual, it still needs exposure to each language to understand it well.
Is XLM-RoBERTa better than mT5 for low-resource languages?
Yes, generally. XLM-RoBERTa has a smaller performance gap between high- and low-resource languages, about 12 percentage points, while mT5's gap is closer to 28 points. That's because XLM-RoBERTa uses a more balanced training approach and a better tokenizer. It also handles code-switching more naturally. For most low-resource use cases, XLM-RoBERTa is the better starting point.
Can I use a pre-trained model without training it myself?
You can, but performance will be poor for low-resource languages. Pre-trained models are optimized for general use, not specific languages. Without fine-tuning on your target language's data, the model won't understand local expressions, slang, or grammar. For example, a model might correctly translate "I'm hungry" in Spanish but fail on "Nakauwi na ako" in Tagalog because it's never seen that phrase.
What's the biggest technical hurdle in multilingual LLMs?
Tokenization. Most models use SentencePiece, which works well for languages with spaces. But for agglutinative languages like Turkish or polysynthetic ones like Inuktitut, words are long and complex. The tokenizer splits them into too many pieces, breaking meaning. Custom tokenizers are needed, but they break compatibility with standard pipelines. This is the #1 pain point for developers.
Do multilingual models understand languages they've never seen before?
Not well. Zero-shot performance (handling a language not seen in training) drops by 35-45% compared to languages the model has seen. Models can guess based on script or word structure, but they often get grammar, tone, and intent wrong. For example, a model might generate grammatically correct Arabic from a Chinese prompt but miss cultural context entirely. True cross-lingual transfer remains a major challenge.
Are there ethical risks in using multilingual LLMs?
Yes. Models can inherit and amplify biases from high-resource languages. Toxic or sexist content in English training data can appear in translations for low-resource languages. Worse, since those languages have less data for safety filtering, the harmful outputs are harder to detect. This creates a double injustice: low-resource communities get worse AI, and it's more likely to hurt them.
Jess Ciro
24 January, 2026 - 17:29 PM
This is all just corporate theater. They train models on English and call it 'multilingual' like it's some kind of miracle. Meanwhile, my grandma's Yoruba dialect gets zero love. They don't care about us. They care about metrics. And guess what? The next AI winter is coming, and it'll bury this whole 'multilingual' scam under a pile of untrained tokenizers.
saravana kumar
25 January, 2026 - 06:40 AM
The tokenization issue is not a bug-it is a systemic failure of Western-centric AI design. SentencePiece assumes space-delimited languages. This is colonialism in algorithmic form. For languages like Tamil, where morphemes stack like bricks, the default tokenizer fractures meaning. Custom tokenizers are not optional. They are existential. And yet, no one funds them. Because profit does not care about grammar.
Tamil selvan
25 January, 2026 - 19:05 PM
I appreciate the depth of this analysis. The point about code-switching curriculum learning is particularly well-articulated. It is not merely a technical improvement-it is a pedagogical revolution. By introducing low-resource languages gradually, we respect their complexity rather than overwhelming them. This approach mirrors how human beings learn languages: with patience, context, and incremental exposure. Thank you for highlighting this path forward.
Kate Tran
27 January, 2026 - 16:24 PM
i just tried to use xlm-roberta for igbo and it kept turning 'nwa' into 'n w a'. like... why. i spent 3 days. now i just talk to my phone in english and hope it understands.
amber hopman
28 January, 2026 - 06:08 AM
I think the bias issue is way more serious than people admit. If a model learns that 'she is emotional' in English gets paired with 'she is irrational' in Bengali translations, then it's not just misinterpreting, it's reinforcing patriarchal stereotypes across cultures. We need bias audits in every language, not just English. And we need native speakers leading them, not translators.