Non-English Evaluation: Testing Large Language Models Across Languages

Most of us assume that if an AI model is smart in English, it’s smart everywhere. You ask it a question in Spanish or Japanese, and you expect the same high-quality answer. But here is the hard truth: that assumption is usually wrong. When large language models (LLMs) switch from English to other languages, their performance often takes a massive hit. This isn't just about bad grammar; it's about deeper issues with cultural nuance, medical accuracy, and even coding logic.

This gap creates real risks for businesses and developers deploying AI globally. If you are building a customer service bot for Brazil or a medical assistant in China, relying on English-centric tests is like driving blindfolded. Non-English evaluation has moved from a nice-to-have research topic to a critical necessity for anyone serious about global AI deployment. It involves systematic testing of models on tasks conducted in languages other than English to quantify performance gaps and improve multilingual capabilities.

The Reality Check: Why Performance Drops Outside English

You might wonder why this happens. After all, these models are trained on vast amounts of internet data. The problem lies in training data imbalance. English dominates the high-quality web content used to pretrain LLMs. When a model sees millions of examples of nuanced English conversation but only thousands for Vietnamese or for a specific dialect of Arabic, it simply doesn't learn those languages as well.

Language intelligence company LILT reports consistent quality degradation when enterprises use LLMs for translation and content generation outside English. They identify four main culprits:

Data Imbalance: English makes up the majority of curated data.
Tokenization Issues: Languages with non-Latin scripts or complex morphology (like agglutinative languages) get broken into more tokens, increasing sequence length and making learning harder.
Domain Mismatch: Non-English data is often narrower or more informal, lacking specialized vocabulary.
Evaluation Scarcity: There aren't enough good test sets to measure progress in low-resource languages.
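
The tokenization point can be made concrete with a rough proxy. Byte-level tokenizers start from UTF-8 bytes, and non-Latin scripts need more bytes per character before any merges are learned. The sketch below is a minimal illustration, assuming that bytes per character is an acceptable stand-in for token cost. It is not a real tokenizer, and the helper name `utf8_bytes_per_char` and the sample greetings are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative greetings only; not drawn from any benchmark.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    ratio = utf8_bytes_per_char(text)
    print(f"{language}: {ratio:.1f} bytes per character")
```

A trained tokenizer merges frequent byte sequences, so real token counts will differ. Measuring tokens per word with the deployed model's own tokenizer is the more faithful check.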

As researchers argue in papers like "Why Not Transform Chat Large Language Models to Non-English?", limited performance is a direct consequence of this imbalance. Without fixing the data foundation, fine-tuning alone won't save you.

Moving Beyond Translation: The Menlo Framework

For years, we judged non-English output by translating it back to English and checking for errors. That approach fails because it misses culture. A sentence can be grammatically perfect French and still sound rude or awkward to a native speaker, and what reads naturally in Paris may not read naturally in Quebec.

This is where the Menlo framework changes the game. Developed to evaluate "native-like" quality, Menlo tests across 47 language varieties using 6,423 human-annotated prompt-response preference pairs. Instead of asking "Is this correct?", it asks "Does this sound like a local wrote it?"

Menlo uses a four-dimensional rubric:

Language Quality: Coherence and fluency.
Cultural Alignment: Does it fit the specific locale's nuances?
Local Factuality: Is the information grounded in local context?
Style & Helpfulness: Tone and utility.

They found that zero-shot LLM judges (using one AI to grade another without task-specific training) struggle here. However, when they switched to pairwise evaluation, which asks the judge to choose between two responses, and applied reinforcement learning (RL), the automatic scores matched human preferences much better. This suggests that we need sophisticated, locale-specific metrics, not just generic ones.
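
One way to check whether an automatic judge tracks human preferences is to compute agreement on the same preference pairs, broken down by locale. The following is a minimal sketch under stated assumptions: the `PreferencePair` record, the `agreement_by_locale` helper, and the sample data are hypothetical and are not Menlo's actual data format or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    locale: str        # e.g. "fr-CA"
    human_choice: str  # "A" or "B": the response the annotator preferred
    judge_choice: str  # "A" or "B": the response the LLM judge preferred


def agreement_by_locale(pairs):
    """Fraction of pairs per locale where the judge picked the human's choice."""
    totals, matches = {}, {}
    for pair in pairs:
        totals[pair.locale] = totals.get(pair.locale, 0) + 1
        agreed = pair.human_choice == pair.judge_choice
        matches[pair.locale] = matches.get(pair.locale, 0) + int(agreed)
    return {locale: matches[locale] / totals[locale] for locale in totals}


# Made-up judgments for illustration.
pairs = [
    PreferencePair("fr-CA", "A", "A"),
    PreferencePair("fr-CA", "B", "A"),
    PreferencePair("ja-JP", "A", "A"),
]
print(agreement_by_locale(pairs))
```

Reporting agreement per locale rather than as one global number keeps a weak locale from hiding behind strong ones.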

Comparison of Evaluation Approaches
Approach	Focus	Strengths	Weaknesses
Translation-Based	Grammar/Accuracy	Easy to automate	Misses cultural nuance
Menlo Framework	Native-likeness	Captures tone/culture	Requires heavy annotation
Standardized Exams	Professional Competence	Clear pass/fail thresholds	Limited to specific domains

High-Stakes Testing: Medical Licensing Exams

In casual chat, a bad joke is annoying. In medicine, a misunderstanding can be fatal. Evaluating LLMs in healthcare requires rigorous standards. Researchers have turned to national licensing exams to test this. For example, studies using the Chinese National Medical Licensing Examination (NMLE) reveal stark differences in performance.

A model might score above the passing threshold on the USMLE (the United States Medical Licensing Examination) but fall well short on the NMLE. Why? Because medical reasoning isn't universal. It depends on local guidelines, drug names, and abbreviations. One study compared performance across multiple languages and used t-tests to show that these gaps are statistically significant. This highlights a major safety risk: an AI that seems competent in English may be dangerously unreliable in Chinese or other clinical contexts.
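
A significance check of this kind can be sketched with the standard library. The source does not say which t-test variant the study used, so the example assumes Welch's unequal-variance test on independent runs; a paired test may fit better when the same questions are translated across languages. The function name `welch_t` and the scores are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    if len(sample_a) < 2 or len(sample_b) < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean.
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Made-up accuracies from repeated runs; not results from any study.
english_exam = [0.80, 0.90, 1.00]
chinese_exam = [0.50, 0.60, 0.70]
t_stat, dof = welch_t(english_exam, chinese_exam)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a p-value needs the t-distribution. SciPy's `scipy.stats.ttest_ind` with `equal_var=False` computes the same statistic together with a p-value.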

The Developer's Dilemma: Code Generation in Native Languages

Developers don't always think in English. Many describe bugs or request features in their native tongue. The "LLM of Babel" study from Delft University of Technology tested exactly this scenario. The researchers asked models to generate code from prompts written in non-English languages.

The results were sobering. Even strong models deteriorated significantly. The researchers created an error taxonomy for these failures:

Incorrect Functional Behavior: The code does not behave as specified.
Partial Fulfillment: Only part of the task is done.
Misinterpretation: Technical terms are misunderstood due to language barriers.
Language Mixing: Comments or variable names mix languages inappropriately.

To measure this automatically, they used cosine similarity metrics to compare generated code with expected outputs. This shows that non-English evaluation must cover cross-task settings like "non-English prompt → code," not just text-to-text conversations.
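
Cosine similarity itself is simple to sketch. The source does not say how the study represented code, so the example below assumes a plain bag-of-tokens count vector as a stand-in; an embedding model would be a common alternative. The helpers `token_counts` and `cosine_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter

# Identifiers (including non-ASCII ones), integers, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[^\W\d]\w*|\d+|[^\w\s]")


def token_counts(code: str) -> Counter:
    """Count how often each token appears in a code snippet."""
    return Counter(TOKEN_PATTERN.findall(code))


def cosine_similarity(generated: str, expected: str) -> float:
    """Cosine of the angle between the two token-count vectors (0.0 to 1.0)."""
    a, b = token_counts(generated), token_counts(expected)
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    sq_a = sum(count * count for count in a.values())
    sq_b = sum(count * count for count in b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


expected = "def add(a, b):\n    return a + b"
generated = "def add(x, y):\n    return x + y"
print(f"similarity = {cosine_similarity(generated, expected):.2f}")
```

Token overlap rewards surface similarity, so two snippets can score high while behaving differently. Pairing it with executed test cases covers the functional side.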

Best Practices for Global AI Deployment

If you are planning to deploy LLMs internationally, you cannot rely on default settings. Here is how to build a robust evaluation strategy:

1. Use Native Speakers for Annotation
Don't hire generalists. You need annotators who understand the specific dialect and cultural context. As Menlo demonstrates, detailed rubrics tailored to each variety are essential.

2. Adopt Pairwise Comparisons
Instead of asking humans to rate a response from 1 to 5, give them two options and ask which is better. This method yields higher consistency and provides cleaner data for training RL-based judges.
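
Pairwise judgments can be aggregated into a per-system win rate. This is a minimal sketch, assuming each judgment is recorded as a `(system_a, system_b, winner)` tuple; the `win_rates` helper and the system names are hypothetical.

```python
from collections import Counter


def win_rates(judgments):
    """Share of comparisons each system won, out of those it took part in."""
    wins, games = Counter(), Counter()
    for system_a, system_b, winner in judgments:
        if winner not in (system_a, system_b):
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
        games[system_a] += 1
        games[system_b] += 1
        wins[winner] += 1
    return {system: wins[system] / games[system] for system in games}


# Made-up annotator decisions for illustration.
judgments = [
    ("model_1", "model_2", "model_1"),
    ("model_1", "model_2", "model_1"),
    ("model_1", "model_2", "model_2"),
    ("model_1", "model_3", "model_3"),
]
for system, rate in win_rates(judgments).items():
    print(f"{system}: {rate:.2f}")
```

Randomizing which response is shown first is a sensible companion step, since annotators and LLM judges can both favor a position.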

3. Leverage Domain-Specific Benchmarks
For specialized fields like law or medicine, use existing standardized exams or create similar professional tests. These provide clear pass/fail criteria that matter for safety and compliance.

4. Implement Error Taxonomies
For technical tasks, categorize errors specifically. Knowing whether a model failed due to translation artifacts or logical errors helps you fix the root cause rather than just patching symptoms.
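
An error taxonomy becomes actionable once failures are tallied by category. The sketch below reuses the four categories listed earlier in this article; the `ErrorCategory` enum, the `error_shares` helper, and the sample labels are hypothetical and not taken from the study.

```python
from collections import Counter
from enum import Enum


class ErrorCategory(Enum):
    INCORRECT_FUNCTIONAL_BEHAVIOR = "incorrect functional behavior"
    PARTIAL_FULFILLMENT = "partial fulfillment"
    MISINTERPRETATION = "misinterpretation"
    LANGUAGE_MIXING = "language mixing"


def error_shares(labels):
    """Share of labeled failures per category; categories never seen get 0.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: counts[category] / total for category in ErrorCategory}


# Made-up annotations of failed generations.
labels = [
    ErrorCategory.LANGUAGE_MIXING,
    ErrorCategory.LANGUAGE_MIXING,
    ErrorCategory.MISINTERPRETATION,
    ErrorCategory.PARTIAL_FULFILLMENT,
]
for category, share in error_shares(labels).items():
    print(f"{category.value}: {share:.0%}")
```

Tracking these shares per language shows whether a fix moved the category it targeted or only shifted failures elsewhere.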

5. Combine Human and Automatic Evaluation
While RL-trained LLM judges are promising, they still overestimate improvements compared to humans. Keep humans in the loop for calibration, especially when launching in new markets.

Future Directions: Toward Truly Multilingual AI

The field is moving fast. We are seeing a shift from ad-hoc checks to structured infrastructure. Menlo’s work suggests that generative reward models could help push models toward native-like quality at scale. Meanwhile, medical evaluations will likely expand to more languages, tracking whether English improvements actually transfer elsewhere.

Researchers are also exploring ways to transform chat models to be truly non-English-native, rather than just English models with a language layer. But as the "Why Not Transform Chat Large Language Models to Non-English?" paper notes, adaptation must be accompanied by rigorous evaluation. Without proper benchmarks, we risk creating models that feel improved but are actually drifting away from alignment.

Ultimately, non-English evaluation is not just a technical challenge; it's a business imperative. Enterprises that ignore these gaps face mistranslations, legal risks, and frustrated users. By adopting frameworks like Menlo, leveraging professional exams, and understanding developer workflows, we can build AI that respects and serves every language community equally.

Why do LLMs perform worse in non-English languages?

The primary reason is training data imbalance. English dominates the high-quality datasets used for pretraining. Additionally, tokenization issues in non-Latin scripts and a lack of culturally grounded evaluation metrics contribute to performance drops.

What is the Menlo framework?

Menlo is a multilingual evaluation framework designed to assess "native-like" quality across 47 language varieties. It uses human-annotated preference pairs and a four-dimensional rubric focusing on language quality, cultural alignment, local factuality, and style.

How are LLMs evaluated in medical contexts?

Researchers use standardized national medical licensing exams, such as the Chinese NMLE, to test models. This reveals significant performance gaps compared to English exams, highlighting risks in cross-language clinical deployment.

Can LLM judges replace human evaluators?

Not entirely. While RL-trained LLM judges show promise and benefit from pairwise evaluation, they still underperform humans on complex multilingual tasks and tend to overestimate improvements. Human oversight remains crucial for calibration.

What challenges exist in non-English code generation?

Studies like "LLM of Babel" show that models struggle when coding tasks are described in non-English natural language. Common errors include misinterpreting technical terms, partial fulfillment of requirements, and inappropriate mixing of languages in code comments.