PR: Add Reference-Free Quality Metric for LLM Text Evaluation

Summary

This PR introduces QualityLLMEval, a new reference-free quality descriptor: an LLM judge rates the quality of a text without requiring a reference text.

Prompt/Criteria

Inspired by [1]:

 "A LQ indicates that the post is of very low quality, semantically meaningless, and contains broken-off or repetitive text."
 "A HQ indicates the post is of very high quality, addressing a complex topic with advanced vocabulary, phrasing, and style."

Validation

Following the methodology in [1], with data from [2]: the metric was validated on the OpenMEVA story-continuation dataset.

Correlations with human judgements:

Judge / Model    Pearson (r)    Spearman (ρ)
GPT-5-mini       0.440          0.420
GPT-5            0.519          0.517
[1]              0.516          0.522
[2]              0.535          0.508

Note: [2] uses a task-specific prompt (story continuation), so the comparison is skewed in its favor. That GPT-5 with the present general-purpose metric matches that performance validates the metric.
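For reference, the reported numbers come from correlating the metric's scores with OpenMEVA's human quality judgements. A minimal sketch with scipy, where scores and human_ratings are hypothetical stand-ins for the aligned metric outputs and annotations:

from scipy.stats import pearsonr, spearmanr

# Hypothetical aligned arrays: one metric score and one human rating per
# sample; the actual validation uses the full OpenMEVA annotations.
scores = [0.00, 0.05, 0.95]
human_ratings = [1.0, 1.5, 4.5]

r, _ = pearsonr(scores, human_ratings)     # linear correlation
rho, _ = spearmanr(scores, human_ratings)  # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")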

Usage example

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import QualityLLMEval

# Three example texts of increasing quality: gibberish, broken and
# repetitive phrasing, and a coherent encyclopedic paragraph.
eval_df = pd.DataFrame(
    data=[
        ["sakldjflksefess"],
        ["I know that but I dont now about that if I know"],
        ["The Greensburg tornado struck on May 4, 2007, in Kiowa County, Kansas, United States, heavily damaging the town of Greensburg. It tracked 28.8 miles (46.3 km) through the area, killing 12 people and injuring 63. The tornado was the first to be rated EF5 on the Enhanced Fujita scale. The tornado heavily damaged Greensburg; 662 structures in the town sustained some form of damage and 95 percent of the town was damaged or destroyed."],
    ],
    columns=["text"],
)

# Attach the descriptor to the "text" column; include_score=True adds a
# numeric quality score next to the judge's label.
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[QualityLLMEval("text", alias="Quality", include_score=True, model="gpt-5-mini")],
)

eval_dataset_df = eval_dataset.as_dataframe()
print(eval_dataset_df)

# Quality scores: 0.00, 0.05, 0.95

[1] Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, and Pierre-Yves Oudeyer. 2025. Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32290–32309, Suzhou, China. Association for Computational Linguistics.

[2] Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. 2023. Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pages 361–374, Nusa Dua, Bali. Association for Computational Linguistics.


Closes #1801
