**NewtonBench** is the first benchmark designed to rigorously evaluate LLMs' ability to discover scientific laws through **interactive experimentation** rather than static function fitting. Our benchmark resolves the fundamental trilemma between scientific relevance, scalability, and memorization resistance through **metaphysical shifts**—systematic alterations of canonical physical laws.
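To make the idea concrete, here is a minimal sketch of a metaphysical shift, assuming a gravitation-style task; the function names, constant, and the specific exponent change are illustrative, not NewtonBench's actual task definitions.

```python
# Illustrative "metaphysical shift": the canonical inverse-square law of
# gravitation is systematically altered (here, to an inverse-cube form), so
# the task stays conceptually grounded but cannot be answered from memory.
# Hypothetical example only; not one of NewtonBench's actual 324 tasks.

G = 6.674e-11  # gravitational constant (SI units)

def canonical_gravity(m1: float, m2: float, r: float) -> float:
    """Newton's law of universal gravitation: F = G * m1 * m2 / r**2."""
    return G * m1 * m2 / r**2

def shifted_gravity(m1: float, m2: float, r: float) -> float:
    """A shifted variant with an altered exponent on r; an agent must
    rediscover this modified form through experimentation."""
    return G * m1 * m2 / r**3
```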
- **Extreme noise sensitivity**: even a noise level of 0.0001 causes a 13-15% drop in accuracy
### 🏆 **Why It Matters**
NewtonBench reveals that while LLMs are beginning to develop scientific reasoning skills, **robust, generalizable discovery in complex environments remains the core challenge** for automated science.
We introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. NewtonBench is designed to rigorously evaluate the scientific reasoning capabilities of Large Language Models (LLMs) by moving beyond memorization toward true discovery.
It combines two core innovations: a metaphysical shift, which systematically modifies canonical physical laws to create conceptually grounded yet novel problems, and an interactive, system-oriented environment, where agents must design experiments and interpret feedback within confounded systems. The benchmark provides two independent dimensions of difficulty: the complexity of the target law and the complexity of the model systems.
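The sketch below shows what such an environment could look like (a hypothetical interface, not NewtonBench's actual API): the hidden law is entangled with a confounding term, the agent only chooses experiment inputs, and feedback comes back with multiplicative noise.

```python
import random

class LawDiscoveryEnv:
    """Hypothetical interactive environment: the agent never sees the law,
    only noisy observations from a confounded system it probes."""

    def __init__(self, noise_level: float = 1e-4, seed: int = 0):
        self.noise_level = noise_level  # relative noise on feedback
        self.rng = random.Random(seed)

    def _hidden_law(self, m1: float, m2: float, r: float) -> float:
        # Shifted gravity (illustrative) plus a confounding drag-like term,
        # so a single measurement cannot isolate the target law.
        return 6.674e-11 * m1 * m2 / r**3 + 0.01 * r

    def run_experiment(self, m1: float, m2: float, r: float) -> float:
        """Agent-chosen inputs in, one noisy scalar observation out."""
        value = self._hidden_law(m1, m2, r)
        return value * (1 + self.rng.gauss(0.0, self.noise_level))

# An agent designs experiments by varying one input at a time:
env = LawDiscoveryEnv()
observations = [(r, env.run_experiment(1e3, 1e3, r)) for r in (1.0, 2.0, 4.0, 8.0)]
```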
By optionally integrating a code execution interface, NewtonBench isolates reasoning from computational constraints, revealing the genuine frontiers of LLMs’ ability to discover scientific laws in complex, interactive settings.
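As an illustration (a hypothetical workflow using SciPy's standard `curve_fit`; the observation values are made up for this sketch), the agent can reason about the functional form and delegate the numeric parameter fit to executed code:

```python
import numpy as np
from scipy.optimize import curve_fit

def hypothesis(r, k, p):
    """Agent's candidate form: a power law in r with unknown k and p."""
    return k * r ** (-p)

# Observations collected from interactive experiments (fabricated values,
# consistent with an inverse-cube law, for illustration only).
r_vals = np.array([1.0, 2.0, 4.0, 8.0])
obs = np.array([6.7e-5, 8.4e-6, 1.05e-6, 1.3e-7])

# Executed code handles the least-squares fit, so the agent's in-context
# reasoning is spent on hypothesis design rather than arithmetic.
(k_fit, p_fit), _ = curve_fit(hypothesis, r_vals, obs, p0=(1.0, 2.0))
print(f"fitted k = {k_fit:.3e}, exponent p = {p_fit:.2f}")  # p near 3 points to the shifted law
```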