ConversationRelevancyMetric Removed in DeepEval 3.7.x, Need Official Replacement & Clarification #2321

@sacjha-star

Description

Issue Summary

In DeepEval version 3.2.1, ConversationRelevancyMetric works correctly for multi-turn conversational evaluation.
However, in DeepEval 3.7.x and above, this metric is no longer available, and the documentation does not provide any explanation or migration path.

I attempted to switch to TurnRelevancyMetric, but it produces incorrect and inflated scores, especially for negative feedback evaluation datasets.

What I Expected

  • Clear information on whether ConversationRelevancyMetric was deprecated or replaced.
  • A recommended metric for multi-turn conversation relevancy (task coverage + relevance).
  • Consistent scoring behavior between multi-turn and windowed-turn metrics.
  • A migration guide or official note in the documentation.

What Actually Happened

  1. ImportError on import (a version-guarded import workaround is sketched after this list):

ImportError: cannot import name 'ConversationRelevancyMetric' from 'deepeval.metrics'

  2. TurnRelevancyMetric gives very high scores (0.90+) for negative samples, even when responses are clearly irrelevant.

  3. No documentation explains:

    • why the metric was removed,
    • why TurnRelevancyMetric behaves differently,
    • how to replicate ConversationRelevancyMetric behavior in newer versions.
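
As a stopgap, I am guarding the import so the same test code runs on both versions. This is only a sketch of my workaround, not a fix: it silently swaps in a metric that scores differently.

try:
    # Present in deepeval==3.2.1
    from deepeval.metrics import ConversationRelevancyMetric as RelevancyMetric
except ImportError:
    # deepeval>=3.7 no longer exports ConversationRelevancyMetric
    from deepeval.metrics import TurnRelevancyMetric as RelevancyMetric

metric = RelevancyMetric()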

Minimal Reproduction

from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import TurnRelevancyMetric

# A single exchange where the assistant's reply is deliberately irrelevant.
turns = [
    Turn(role="user", content="What is X?"),
    Turn(role="assistant", content="This is unrelated and incorrect."),
]

test_case = ConversationalTestCase(turns=turns)
metric = TurnRelevancyMetric()
score = metric.measure(test_case)  # uses the configured evaluation model (e.g. OPENAI_API_KEY)
print(score, metric.reason)

This produces an unexpectedly high relevancy score (0.90+) for a clearly irrelevant answer.

Additional Context

  • Positive-feedback datasets → both metrics behave similarly.
  • Negative-feedback datasets → TurnRelevancyMetric produces inflated scores, while ConversationRelevancyMetric (3.2.1) gives more realistic values (see the comparison sketch after this list).
  • I need guidance from the maintainers on which metric to use for multi-turn conversation evaluation going forward.
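
For reference, this is roughly how I compared the two behaviors. It assumes deepeval's evaluate() accepts conversational test cases together with conversational metrics; on 3.2.1 I ran the same cases through ConversationRelevancyMetric instead.

from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import TurnRelevancyMetric

# A clearly irrelevant (negative) exchange and a clearly relevant (positive) one.
negative = ConversationalTestCase(turns=[
    Turn(role="user", content="What is X?"),
    Turn(role="assistant", content="This is unrelated and incorrect."),
])
positive = ConversationalTestCase(turns=[
    Turn(role="user", content="What is X?"),
    Turn(role="assistant", content="X is the variable we defined earlier; it stores the user ID."),
])

# Expectation: negative scores low, positive scores high.
# Observed on 3.7.x: the negative case also scores 0.90+.
evaluate(test_cases=[negative, positive], metrics=[TurnRelevancyMetric()])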

Key Questions

  • Why does TurnRelevancyMetric score negative feedback so highly?
  • Is there an official replacement for ConversationRelevancyMetric in version 3.7.x and above?
  • Do you recommend implementing a custom metric to replicate the older behavior (e.g., something like the sketch below)?
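
If a custom metric is the answer, is something along these lines the intended path? This sketch assumes ConversationalGEval is the supported way to define custom conversational metrics in current versions; the metric name and criteria wording are mine, not taken from the old implementation.

from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import ConversationalGEval

# My guess at criteria approximating the old ConversationRelevancyMetric's intent.
relevancy = ConversationalGEval(
    name="Conversation Relevancy",
    criteria=(
        "Evaluate whether each assistant turn directly addresses the "
        "preceding user turns; penalize off-topic or unrelated answers."
    ),
)

test_case = ConversationalTestCase(turns=[
    Turn(role="user", content="What is X?"),
    Turn(role="assistant", content="This is unrelated and incorrect."),
])
relevancy.measure(test_case)
print(relevancy.score, relevancy.reason)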
