Issue Summary
In DeepEval version 3.2.1, ConversationRelevancyMetric works correctly for multi-turn conversational evaluation.
However, in DeepEval 3.7.x and above, this metric is no longer available, and the documentation does not provide any explanation or migration path.
I attempted to switch to TurnRelevancyMetric, but it produces incorrect and inflated scores, especially for negative feedback evaluation datasets.
What I Expected
- Clear information on whether ConversationRelevancyMetric was deprecated or replaced.
- A recommended metric for multi-turn conversation relevancy (task coverage + relevance).
- Comparable scoring behavior between the multi-turn and windowed-turn metrics.
- A migration guide or official note in the documentation.
What Actually Happened
- ImportError:
  ImportError: cannot import name 'ConversationRelevancyMetric' from 'deepeval.metrics'
- TurnRelevancyMetric gives very high scores (0.90+) for negative samples, even when responses are clearly irrelevant.
- No documentation explains:
- why the metric was removed,
- why TurnRelevancyMetric behaves differently,
- how to replicate ConversationRelevancyMetric behavior in newer versions.
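As a stopgap for code that has to run against both versions, a tolerant lookup can avoid the hard ImportError. This is only a sketch: `first_available` is a hypothetical helper of my own, not part of DeepEval, and it simply returns whichever of the named classes the installed package exposes. It does not make the two metrics score alike.

```python
import importlib


def first_available(module_name, candidates):
    """Return (name, attribute) for the first candidate the module exposes.

    Returns (None, None) if the module cannot be imported or exposes
    none of the candidate names.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, None
    for name in candidates:
        attr = getattr(module, name, None)
        if attr is not None:
            return name, attr
    return None, None


# Prefer the old metric when it exists (e.g. 3.2.1), otherwise fall back
# to the newer one. Prints "Using metric: None" if deepeval is not installed.
name, metric_cls = first_available(
    "deepeval.metrics",
    ["ConversationRelevancyMetric", "TurnRelevancyMetric"],
)
print("Using metric:", name)
```

Logging which class was picked makes it obvious in evaluation reports when scores come from different metrics.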
Minimal Reproduction
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import TurnRelevancyMetric

# A single exchange where the assistant's reply is deliberately off-topic.
turns = [
    Turn(role="user", content="What is X?"),
    Turn(role="assistant", content="This is unrelated and incorrect."),
]
test_case = ConversationalTestCase(turns=turns)

metric = TurnRelevancyMetric()
result = metric.measure(test_case)
print(result)

This produces a relevance score that is unexpectedly high for irrelevant answers.
Additional Context
- Positive feedback datasets → Both metrics behave similarly.
- Negative feedback datasets → TurnRelevancyMetric produces inflated scores, while ConversationRelevancyMetric (3.2.1) gives more realistic values.
- I need guidance from maintainers on which metric should be used for multi-turn conversation evaluation going forward.
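To make the "inflated scores" observation easier to compare across versions, the per-label means and their gap can be computed from whatever scores each metric returns. The sketch below is library-independent. `score_gap` is a hypothetical helper, and the numbers in the example are placeholders for illustration, not measured results.

```python
from statistics import mean


def score_gap(scored_samples):
    """Compute the mean score per label and the positive-minus-negative gap.

    scored_samples: iterable of (label, score) pairs, where label is
    either "positive" or "negative". Both labels must appear at least once.
    """
    by_label = {"positive": [], "negative": []}
    for label, score in scored_samples:
        by_label[label].append(score)
    pos = mean(by_label["positive"])
    neg = mean(by_label["negative"])
    return {"positive_mean": pos, "negative_mean": neg, "gap": pos - neg}


# Placeholder scores for illustration only; substitute real metric outputs.
example = [
    ("positive", 1.0),
    ("positive", 0.5),
    ("negative", 0.25),
    ("negative", 0.25),
]
print(score_gap(example))
```

A gap near zero would mean the metric does not separate the negative dataset from the positive one, which is the behavior described above for TurnRelevancyMetric.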
Key Questions
- Why does TurnRelevancyMetric score negative feedback so highly?
- Is there a proper replacement for ConversationRelevancyMetric in version 3.7.x and above?
- Do you recommend implementing a custom metric to replicate the older behavior?
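For the custom-metric option, here is a rough, library-independent sketch of what I have in mind. It assumes the goal is a score equal to the share of assistant turns judged relevant, each judged against a sliding window of preceding turns. Whether this matches the internals of ConversationRelevancyMetric in 3.2.1 is an assumption on my part. `SimpleTurn`, `windowed_relevancy`, and `keyword_judge` are hypothetical names, and the keyword judge is a toy stand-in for an LLM-based verdict.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SimpleTurn:
    # Hypothetical stand-in for a conversation turn; not DeepEval's Turn class.
    role: str
    content: str


def windowed_relevancy(
    turns: Sequence[SimpleTurn],
    is_relevant: Callable[[List[SimpleTurn], SimpleTurn], bool],
    window_size: int = 3,
) -> float:
    """Fraction of assistant turns judged relevant given the preceding window."""
    verdicts = []
    for i, turn in enumerate(turns):
        if turn.role != "assistant":
            continue
        # Context is up to `window_size` turns immediately before this one.
        context = list(turns[max(0, i - window_size):i])
        verdicts.append(bool(is_relevant(context, turn)))
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def keyword_judge(context, turn):
    # Toy judge: the reply counts as relevant if it shares a word longer
    # than 3 characters with the most recent user message in the window.
    user_msgs = [t.content for t in context if t.role == "user"]
    if not user_msgs:
        return False
    asked = {w.strip("?.,!").lower() for w in user_msgs[-1].split()}
    replied = {w.strip("?.,!").lower() for w in turn.content.split()}
    return any(len(w) > 3 for w in asked & replied)


conversation = [
    SimpleTurn("user", "What is the capital of France?"),
    SimpleTurn("assistant", "The capital of France is Paris."),
    SimpleTurn("user", "How tall is the Eiffel Tower?"),
    SimpleTurn("assistant", "I like pizza."),
]
print(windowed_relevancy(conversation, keyword_judge))
```

Keeping the judge as a plain callable means the aggregation can be tested on negative samples independently of any model, which would help isolate whether the inflation comes from the verdicts or from how they are combined.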