Feat: adding llm judge reference free quality metric #1802
PR: Add Reference-Free Quality Metric for LLM Text Evaluation
Summary
This PR introduces QualityLLMEval, a new reference-free quality descriptor: it scores generated text on its own, without requiring a gold reference answer.
Prompt/Criteria
The judge prompt and evaluation criteria are inspired by [1]; a rough illustration is sketched below.
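The exact prompt is defined in the diff; for intuition, a reference-free quality criterion in the spirit of [1] might be phrased roughly as follows. This is an illustrative sketch only, not the prompt shipped in this PR:

```python
# Illustrative only: a reference-free quality criterion in the spirit of [1].
# The actual prompt/criteria shipped in this PR are defined in the diff.
QUALITY_CRITERIA = """You are evaluating the quality of a piece of generated text.
No reference answer is available; judge the text on its own.
Consider grammar, fluency, coherence, and overall writing quality.

TEXT:
{text}

Rate the text from 1 (very poor) to 5 (excellent).
Answer with a single integer."""
```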
Validation
Validation follows the methodology of [1], using data from [2]: the OpenMEVA story-continuation dataset.
Correlations with human judgements:
Note: [2] uses a task-specific prompt (story continuation), so the comparison is skewed in its favor. That GPT-5 with the current task-agnostic metric nonetheless matches that performance supports the metric's validity.
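For context, segment-level correlations of this kind can be computed with scipy. The snippet below is a minimal sketch assuming two aligned per-sample score lists (judge outputs and OpenMEVA human ratings); the values are placeholders, not actual results:

```python
# Minimal sketch: correlate LLM-judge scores with human quality ratings.
# The lists below are hypothetical placeholders, not OpenMEVA results.
from scipy.stats import kendalltau, pearsonr, spearmanr

judge_scores = [4, 2, 5, 3, 1]            # per-sample metric outputs
human_scores = [4.2, 2.5, 4.8, 3.1, 1.4]  # per-sample human ratings

for name, corr in (("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)):
    stat, p_value = corr(judge_scores, human_scores)
    print(f"{name}: {stat:.3f} (p={p_value:.3g})")
```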
Example of usage
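A minimal sketch, assuming QualityLLMEval plugs into the TextEvals preset the same way as the existing *LLMEval descriptors; exact imports and parameters may differ from the final API:

```python
# Usage sketch, assuming QualityLLMEval follows the existing LLM-judge
# descriptor interface; exact imports and arguments may differ.
import pandas as pd

from evidently.descriptors import QualityLLMEval  # added in this PR
from evidently.metric_preset import TextEvals
from evidently.report import Report

data = pd.DataFrame({"response": [
    "The knight rode on, certain the storm would break before nightfall.",
    "and then and then the the end end",
]})

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[QualityLLMEval()]),
])
report.run(reference_data=None, current_data=data)
report.show()
```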
[1] Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, and Pierre-Yves Oudeyer. 2025. Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32290–32309, Suzhou, China. Association for Computational Linguistics.
[2] Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. 2023. Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pages 361–374, Nusa Dua, Bali. Association for Computational Linguistics.
Closes #1801