Feature Request: Multilingual RAG Evaluation Support (Non-Latin Scripts, Low-Resource Languages) #2578

@rajantripathi

Description

Summary

DeepEval currently has no support for evaluating RAG systems in non-English languages. All built-in metrics (GEval, Faithfulness, ContextualPrecision, ContextualRecall, HallucinationMetric, etc.) use hardcoded English prompts with no mechanism to adapt them to other languages.

This is a significant gap for practitioners building RAG systems in Arabic, Chinese, Uzbek, Hindi, and other non-Latin script languages — which represent a large and growing share of real-world RAG deployments.


Problem Description

When evaluating a multilingual RAG pipeline with DeepEval, all LLM judge prompts are sent in English regardless of the language of the input/output being evaluated. For example:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Uzbek-language RAG output — but DeepEval judges it with English prompts
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",  # Uzbek: "What is the capital of Uzbekistan?"
    actual_output="Toshkent O'zbekistonning poytaxtidir.",  # "Tashkent is the capital of Uzbekistan."
    retrieval_context=["Toshkent — O'zbekiston Respublikasining poytaxti va eng yirik shahri."]
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
# Result: unreliable — the LLM judge is evaluating Uzbek content using English reasoning prompts
```

This produces unreliable scores because:

  1. The judge LLM is asked to reason about non-English content using English metacognitive prompts
  2. Non-Latin script inputs can confuse decomposition/entailment chains designed for English token patterns
  3. There is no way to pass a language parameter to adapt judge prompt templates

Proposed Solution

Add a language parameter to DeepEval metric constructors and test cases, similar to how RAGAS implements adapt_prompts():

```python
from deepeval.metrics import FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Option A: Language parameter on metric
metric = FaithfulnessMetric(
    threshold=0.7,
    language="uzbek"  # adapts all internal judge prompts to the target language
)

# Option B: Language-aware GEval with translated criteria
metric = GEval(
    name="Faithfulness-UZ",
    criteria="Berilgan kontekst asosida javob to'g'ri va ishonchli ekanligini baholang",  # Uzbek: "Assess whether the answer is accurate and reliable based on the given context"
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    language="uzbek"
)

# Option C: Multilingual test case flag
test_case = LLMTestCase(
    input="O'zbekistonning poytaxti qaysi shahar?",
    actual_output="Toshkent O'zbekistonning poytaxtidir.",
    retrieval_context=[...],
    language="uz"  # ISO 639-1 code
)
```
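Internally, Option A could be backed by a per-metric judge-prompt registry keyed by ISO 639-1 code, falling back to English when no translation exists. A minimal sketch of that idea (the names `FAITHFULNESS_TEMPLATES` and `get_judge_prompt` are hypothetical, not part of DeepEval; the Uzbek string reuses the criteria text from Option B above):

```python
# Sketch of a language-aware judge-prompt registry (hypothetical names).
# Each metric would own a template table keyed by ISO 639-1 code.
FAITHFULNESS_TEMPLATES = {
    "en": "Given the retrieval context, list the claims in the answer and judge whether each one is supported.",
    "uz": "Berilgan kontekst asosida javob to'g'ri va ishonchli ekanligini baholang.",
}

def get_judge_prompt(templates: dict, language: str = "en") -> str:
    """Return the judge prompt for `language`, falling back to English."""
    return templates.get(language, templates["en"])
```

A registry like this keeps English as the default path, so existing behaviour is unchanged when `language` is omitted, and adding a language is just adding a dictionary entry.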

Motivation / Real-World Context

I work on multilingual RAG systems for educational and public-sector applications across Central Asian contexts (English + Uzbek). In benchmarking experiments on a 400-item English/Uzbek retrieval dataset, I found that:

This issue affects anyone building RAG systems in Arabic, Chinese, Japanese, Korean, Turkish, Central Asian languages, or any other non-English deployment.


Affected Metrics (Priority Order)

| Metric | Current Behaviour | Priority |
| --- | --- | --- |
| `FaithfulnessMetric` | English judge prompts only | High |
| `ContextualPrecisionMetric` | English judge prompts only | High |
| `ContextualRecallMetric` | English judge prompts only | High |
| `HallucinationMetric` | English judge prompts only | High |
| `GEval` | Accepts custom criteria (partial workaround) | Medium |
| `AnswerRelevancyMetric` | English judge prompts only | Medium |
| `SummarizationMetric` | English judge prompts only | Low |
| Dataset/testset generation | No multilingual support | High |

Workaround (Current)

The only current workaround is GEval with manually translated criteria strings. This is brittle and requires prompt engineering expertise in the target language for every evaluation dimension.


Related Issues / Prior Art


Happy to Contribute

I'm willing to open a PR implementing this. A starting point would be:

  1. Adding a language parameter to the DeepEvalBaseLLM judge call chain
  2. Adding translated FaithfulnessMetric prompt templates for 3–5 languages as a proof of concept
  3. Adding a multilingual_evaluate() helper that wraps the standard evaluate() with language-aware metric construction
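The third step could be sketched roughly as follows (purely an assumption about the shape of the API: the `multilingual_evaluate` signature and the `metric_factories` parameter are hypothetical, and the hand-off to DeepEval's existing `evaluate()` is left as a comment so the sketch stays self-contained):

```python
from typing import Callable, Iterable

def multilingual_evaluate(
    test_cases: list,
    metric_factories: Iterable[Callable[[str], object]],
    language: str = "en",
) -> list:
    """Build each metric for the target language, then run the standard evaluation.

    Each factory maps an ISO 639-1 code to a configured metric, e.g.
    lambda lang: FaithfulnessMetric(threshold=0.7, language=lang).
    """
    metrics = [factory(language) for factory in metric_factories]
    # return evaluate(test_cases, metrics=metrics)  # DeepEval's existing entry point
    return metrics  # returned directly here only to keep the sketch runnable
```

Keeping language-aware construction in a thin wrapper would avoid touching the core `evaluate()` signature at all.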

Would welcome maintainer input on the preferred architecture before opening a PR.

— Dr Rajan Tripathi, AUT AI Innovation Lab
