@ziqing-huang
Contributor

This PR adds Uncheatable Eval to the task list. Static benchmarks risk pretraining contamination. Uncheatable Eval instead evaluates LLMs on fresh, real-time internet data, so the model cannot have seen the test samples during pretraining. This makes the evaluation leak-proof and gives a more accurate measure of true generalization, especially for base models. Integrating it into LM Eval Harness lets us run these contamination-resistant tests through a standard, unified interface.

@CLAassistant

CLAassistant commented Dec 2, 2025

CLA assistant check
All committers have signed the CLA.
