Summary
DeepEval currently has no support for evaluating RAG systems in non-English languages. All built-in metrics (`GEval`, `FaithfulnessMetric`, `ContextualPrecisionMetric`, `ContextualRecallMetric`, `HallucinationMetric`, etc.) use hardcoded English prompts with no mechanism to adapt them to other languages.
This is a significant gap for practitioners building RAG systems in Arabic, Chinese, Uzbek, Hindi, and other non-English languages — which represent a large and growing share of real-world RAG deployments.
Problem Description
When evaluating a multilingual RAG pipeline with DeepEval, all LLM judge prompts are sent in English regardless of the language of the input/output being evaluated. For example:
```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Uzbek-language RAG output — but DeepEval judges it with English prompts
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",  # Uzbek: "What is the capital of Uzbekistan?"
    actual_output="Toshkent O'zbekistonning poytaxtidir.",  # "Tashkent is the capital of Uzbekistan."
    retrieval_context=["Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri."]
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
# Result: unreliable — the LLM judge is evaluating Uzbek content using English reasoning prompts
```
This produces unreliable scores because:
- The judge LLM is asked to reason about non-English content using English metacognitive prompts
- Non-Latin script inputs can confuse decomposition/entailment chains designed for English token patterns
- There is no way to pass a `language` parameter to adapt judge prompt templates
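To make the mismatch concrete, the judge call effectively wraps English metacognitive instructions around Uzbek content. The template below is illustrative only; it is not DeepEval's actual internal prompt:

```python
# Illustrative sketch only: NOT DeepEval's real internal template.
# It shows the mixed-language prompt the judge model has to cope with.
judge_prompt = (
    "Break the answer into individual claims and state whether each claim "
    "is supported by the context.\n\n"
    "Context: Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri.\n"
    "Answer: Toshkent O'zbekistonning poytaxtidir."
)
# The judge must code-switch mid-prompt: reason in English about Uzbek text.
print(judge_prompt)
```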
Proposed Solution
Add a `language` parameter to DeepEval metric constructors and test cases, similar to how RAGAS implements `adapt_prompts()`:
```python
from deepeval.metrics import FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Option A: Language parameter on metric
metric = FaithfulnessMetric(
    threshold=0.7,
    language="uzbek"  # adapts all internal judge prompts to the target language
)

# Option B: Language-aware GEval with translated criteria
metric = GEval(
    name="Faithfulness-UZ",
    criteria="Berilgan kontekst asosida javob to'g'ri va ishonchli ekanligini baholang",  # Uzbek: "Evaluate whether the answer is correct and faithful given the provided context"
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    language="uzbek"
)

# Option C: Multilingual test case flag
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",
    actual_output="Toshkent O'zbekistonning poytaxtidir.",
    retrieval_context=[...],
    language="uz"  # ISO 639-1 code
)
```
Motivation / Real-World Context
I work on multilingual RAG systems for educational and public-sector applications across Central Asian contexts (English + Uzbek), and have run benchmarking experiments on a 400-item English/Uzbek retrieval dataset where DeepEval's English-only judge prompts made the Uzbek-subset scores unreliable.
This issue affects anyone building RAG for Arabic, Chinese, Japanese, Korean, Turkish, Central Asian, or any other non-English deployment.
Affected Metrics (Priority Order)
| Metric | Current Behaviour | Priority |
|---|---|---|
| `FaithfulnessMetric` | English judge prompts only | High |
| `ContextualPrecisionMetric` | English judge prompts only | High |
| `ContextualRecallMetric` | English judge prompts only | High |
| `HallucinationMetric` | English judge prompts only | High |
| `GEval` | Accepts custom criteria (partial workaround) | Medium |
| `AnswerRelevancyMetric` | English judge prompts only | Medium |
| `SummarizationMetric` | English judge prompts only | Low |
| Dataset/testset generation | No multilingual support | High |
Workaround (Current)
The only current workaround is `GEval` with manually translated criteria strings. This is brittle and requires prompt-engineering expertise in the target language for every evaluation dimension.
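Concretely, the workaround means maintaining a hand-translated criteria string per dimension and per language. A minimal sketch (the `UZBEK_CRITERIA` table and `geval_kwargs` helper are hypothetical; only the faithfulness string comes from the example above):

```python
# Hypothetical helper for the GEval workaround: one hand-translated
# criteria string per evaluation dimension. Only "faithfulness" below
# is taken from the example earlier in this issue.
UZBEK_CRITERIA = {
    "faithfulness": (
        "Berilgan kontekst asosida javob to'g'ri va ishonchli "
        "ekanligini baholang"
    ),
    # every additional dimension (relevancy, precision, ...) needs its
    # own expert translation, which is what makes this approach brittle
}

def geval_kwargs(dimension: str, language_tag: str = "UZ") -> dict:
    """Build constructor kwargs for a language-specific GEval metric."""
    return {
        "name": f"{dimension.capitalize()}-{language_tag}",
        "criteria": UZBEK_CRITERIA[dimension],
    }

# Usage (requires deepeval):
# metric = GEval(**geval_kwargs("faithfulness"),
#                evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT,
#                                   LLMTestCaseParams.RETRIEVAL_CONTEXT])
```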
Related Issues / Prior Art
- RAGAS implements `adapt_prompts()` for exactly this purpose (with open bugs: NoiseSensitivity prompts cannot be adapted to other languages in v0.4+, a regression from v0.3; vibrantlabsai/ragas#2569). DeepEval has no equivalent.
- RAGAS's `adapt_prompts()` mechanism is a useful reference: the general pattern is to convert hardcoded English prompt templates to `PydanticPrompt` subclasses with `instruction` + `examples` attributes, then allow LLM-based translation at runtime.
Happy to Contribute
I'm willing to open a PR implementing this. The starting point would be:
- Adding a `language` parameter to the `DeepEvalBaseLLM` judge call chain
- Adding translated `FaithfulnessMetric` prompt templates for 3–5 languages as a proof of concept
- Adding a `multilingual_evaluate()` helper that wraps the standard `evaluate()` with language-aware metric construction
Would welcome maintainer input on the preferred architecture before opening a PR.
— Dr Rajan Tripathi, AUT AI Innovation Lab