Feature Request: Multilingual RAG Evaluation Support (Non-Latin Scripts, Low-Resource Languages) #2578

@rajantripathi

Description

Summary

DeepEval currently has no support for evaluating RAG systems in non-English languages. All built-in metrics (GEval, Faithfulness, ContextualPrecision, ContextualRecall, HallucinationMetric, etc.) use hardcoded English prompts with no mechanism to adapt them to other languages.

This is a significant gap for practitioners building RAG systems in Arabic, Chinese, Uzbek, Hindi, and other non-Latin script languages — which represent a large and growing share of real-world RAG deployments.


Problem Description

When evaluating a multilingual RAG pipeline with DeepEval, all LLM judge prompts are sent in English regardless of the language of the input/output being evaluated. For example:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Uzbek-language RAG output — but DeepEval judges it with English prompts
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",  # Uzbek: "What is the capital of Uzbekistan?"
    actual_output="Toshkent O'zbekistonning poytaxtidir.",  # "Tashkent is the capital of Uzbekistan."
    retrieval_context=["Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri."]
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
# Result: unreliable — the LLM judge is evaluating Uzbek content using English reasoning prompts
```

This produces unreliable scores because:

  1. The judge LLM is asked to reason about non-English content using English metacognitive prompts
  2. Non-Latin script inputs can confuse decomposition/entailment chains designed for English token patterns
  3. There is no way to pass a language parameter to adapt judge prompt templates

Proposed Solution

Add a language parameter to DeepEval metric constructors and test cases, similar to how RAGAS implements adapt_prompts():

```python
from deepeval.metrics import FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Option A: Language parameter on metric
metric = FaithfulnessMetric(
    threshold=0.7,
    language="uzbek"  # adapts all internal judge prompts to the target language
)

# Option B: Language-aware GEval with translated criteria
metric = GEval(
    name="Faithfulness-UZ",
    criteria="Berilgan kontekst asosida javob to'g'ri va ishonchli ekanligini baholang",  # Uzbek: "Assess whether the answer is accurate and reliable based on the given context"
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    language="uzbek"
)

# Option C: Multilingual test case flag
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",
    actual_output="Toshkent O'zbekistonning poytaxtidir.",
    retrieval_context=[...],
    language="uz"  # ISO 639-1 code
)
```
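Internally, Option A could be backed by a per-metric judge-prompt registry keyed by ISO 639-1 code, falling back to English when no translation exists. A minimal sketch of that idea (the names `FAITHFULNESS_TEMPLATES` and `get_judge_prompt` are hypothetical, not part of DeepEval; the Uzbek string reuses the criteria text from Option B above):

```python
# Sketch of a language-aware judge-prompt registry (hypothetical names).
# Each metric would own a template table keyed by ISO 639-1 code.
FAITHFULNESS_TEMPLATES = {
    "en": "Given the retrieval context, list the claims in the answer and judge whether each one is supported.",
    "uz": "Berilgan kontekst asosida javob to'g'ri va ishonchli ekanligini baholang.",
}

def get_judge_prompt(templates: dict, language: str = "en") -> str:
    """Return the judge prompt for `language`, falling back to English."""
    return templates.get(language, templates["en"])
```

A registry like this keeps English as the default path, so existing behaviour is unchanged when `language` is omitted, and adding a language is just adding a dictionary entry.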

Motivation / Real-World Context

I work on multilingual RAG systems for educational and public-sector applications across Central Asian contexts (English + Uzbek). In benchmarking experiments on a 400-item English/Uzbek retrieval dataset, I found that:

This issue affects anyone building RAG systems in Arabic, Chinese, Japanese, Korean, Turkish, Central Asian languages, or any other non-English deployment.


Affected Metrics (Priority Order)

| Metric | Current Behaviour | Priority |
| --- | --- | --- |
| `FaithfulnessMetric` | English judge prompts only | High |
| `ContextualPrecisionMetric` | English judge prompts only | High |
| `ContextualRecallMetric` | English judge prompts only | High |
| `HallucinationMetric` | English judge prompts only | High |
| `GEval` | Accepts custom criteria (partial workaround) | Medium |
| `AnswerRelevancyMetric` | English judge prompts only | Medium |
| `SummarizationMetric` | English judge prompts only | Low |
| Dataset/testset generation | No multilingual support | High |

Workaround (Current)

The only current workaround is GEval with manually translated criteria strings. This is brittle and requires prompt engineering expertise in the target language for every evaluation dimension.


Related Issues / Prior Art


Happy to Contribute

I'm willing to open a PR implementing this. A starting point would be:

  1. Adding a language parameter to the DeepEvalBaseLLM judge call chain
  2. Adding translated FaithfulnessMetric prompt templates for 3–5 languages as a proof of concept
  3. Adding a multilingual_evaluate() helper that wraps the standard evaluate() with language-aware metric construction
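The third step could be sketched roughly as follows (purely an assumption about the shape of the API: the `multilingual_evaluate` signature and the `metric_factories` parameter are hypothetical, and the hand-off to DeepEval's existing `evaluate()` is left as a comment so the sketch stays self-contained):

```python
from typing import Callable, Iterable

def multilingual_evaluate(
    test_cases: list,
    metric_factories: Iterable[Callable[[str], object]],
    language: str = "en",
) -> list:
    """Build each metric for the target language, then run the standard evaluation.

    Each factory maps an ISO 639-1 code to a configured metric, e.g.
    lambda lang: FaithfulnessMetric(threshold=0.7, language=lang).
    """
    metrics = [factory(language) for factory in metric_factories]
    # return evaluate(test_cases, metrics=metrics)  # DeepEval's existing entry point
    return metrics  # returned directly here only to keep the sketch runnable
```

Keeping language-aware construction in a thin wrapper would avoid touching the core `evaluate()` signature at all.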

Would welcome maintainer input on the preferred architecture before opening a PR.

— Dr Rajan Tripathi, AUT AI Innovation Lab
