Feat: adding llm judge reference free quality metric #1802
PR: Add Reference-Free Quality Metric for LLM Text Evaluation
Summary
This PR introduces QualityLLMEval, a new reference-free quality descriptor: it scores generated text on its own, without requiring a gold reference answer.
Prompt/Criteria
The judge prompt and evaluation criteria are inspired by [1]; a rough illustration is sketched below.
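The exact prompt is defined in the diff; for intuition, a reference-free quality criterion in the spirit of [1] might be phrased roughly as follows. This is an illustrative sketch only, not the prompt shipped in this PR:

```python
# Illustrative only: a reference-free quality criterion in the spirit of [1].
# The actual prompt/criteria shipped in this PR are defined in the diff.
QUALITY_CRITERIA = """You are evaluating the quality of a piece of generated text.
No reference answer is available; judge the text on its own.
Consider grammar, fluency, coherence, and overall writing quality.

TEXT:
{text}

Rate the text from 1 (very poor) to 5 (excellent).
Answer with a single integer."""
```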
Validation
Validation follows the methodology of [1], using data from [2]: the OpenMEVA story-continuation dataset.
Correlations with human judgements:
Note: [2] uses a task-specific prompt (story continuation), so the comparison is skewed in its favor. That GPT-5 with the current task-agnostic metric nonetheless matches that performance supports the metric's validity.
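For context, segment-level correlations of this kind can be computed with scipy. The snippet below is a minimal sketch assuming two aligned per-sample score lists (judge outputs and OpenMEVA human ratings); the values are placeholders, not actual results:

```python
# Minimal sketch: correlate LLM-judge scores with human quality ratings.
# The lists below are hypothetical placeholders, not OpenMEVA results.
from scipy.stats import kendalltau, pearsonr, spearmanr

judge_scores = [4, 2, 5, 3, 1]            # per-sample metric outputs
human_scores = [4.2, 2.5, 4.8, 3.1, 1.4]  # per-sample human ratings

for name, corr in (("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)):
    stat, p_value = corr(judge_scores, human_scores)
    print(f"{name}: {stat:.3f} (p={p_value:.3g})")
```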
Example of usage
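A minimal sketch, assuming QualityLLMEval plugs into the TextEvals preset the same way as the existing *LLMEval descriptors; exact imports and parameters may differ from the final API:

```python
# Usage sketch, assuming QualityLLMEval follows the existing LLM-judge
# descriptor interface; exact imports and arguments may differ.
import pandas as pd

from evidently.descriptors import QualityLLMEval  # added in this PR
from evidently.metric_preset import TextEvals
from evidently.report import Report

data = pd.DataFrame({"response": [
    "The knight rode on, certain the storm would break before nightfall.",
    "and then and then the the end end",
]})

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[QualityLLMEval()]),
])
report.run(reference_data=None, current_data=data)
report.show()
```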
[1] Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, and Pierre-Yves Oudeyer. 2025. Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32290–32309, Suzhou, China. Association for Computational Linguistics.
[2] Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. 2023. Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pages 361–374, Nusa Dua, Bali. Association for Computational Linguistics.
Closes #1801