Add Uncheatable Eval #3442

ziqing-huang · 2025-12-02T07:53:03Z

This PR adds Uncheatable Eval to the task list. Unlike static benchmarks, which risk pretraining contamination, Uncheatable Eval evaluates LLMs on fresh, real-time internet data, so the model has never seen the test samples. This makes the evaluation leak-proof and gives a more accurate measure of true generalization, especially for base models. Integrating it into LM Eval Harness lets us run these contamination-resistant tests through a standard, unified interface.
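To make the contamination argument concrete, here is a minimal sketch of the underlying idea: only text published after a model's training cutoff is eligible for evaluation. This is not the code in this PR. The `Sample` class, its `published` field, the `fresh_samples` helper, and the cutoff date are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    """Hypothetical evaluation sample (not a type from this PR)."""
    text: str
    published: date  # assumed field: when the text first appeared online


def fresh_samples(samples, training_cutoff):
    """Keep only samples published strictly after the training cutoff."""
    return [s for s in samples if s.published > training_cutoff]


# Illustrative data only; the dates and texts are made up.
corpus = [
    Sample("old article", date(2024, 1, 15)),
    Sample("recent paper", date(2025, 11, 20)),
    Sample("recent code commit", date(2025, 11, 28)),
]
cutoff = date(2025, 6, 30)  # assumed training cutoff, for illustration

eval_set = fresh_samples(corpus, cutoff)
print([s.text for s in eval_set])
```

Anything dated on or before the cutoff is dropped, so a model trained up to that date cannot have memorized the remaining samples from its pretraining data.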

CLAassistant · 2025-12-02T07:53:25Z

All committers have signed the CLA.

ziqing-huang requested a review from baberabb as a code owner December 2, 2025 07:53

Add uncheatable eval

2e37292

ziqing-huang force-pushed the uncheatable branch from 57648ba to 2e37292 Compare December 2, 2025 07:56

lint

eaa1036

ziqing-huang mentioned this pull request Dec 2, 2025

Run all lm-evaluation-harness evals marin-community/marin#1602

Open


2 participants
