Commit d454f70: update readme.md

1 parent 060f10b
File tree

2 files changed: +55 −17 lines

README.md

Lines changed: 55 additions & 17 deletions
@@ -1,12 +1,59 @@
-# NewtonBench: A Benchmark for Generalizable Scientific Law Discovery in LLM Agents
+# NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
 
-[![GitHub Repo stars](https://img.shields.io/github/stars/your-username/NewtonBench?style=social)](https://github.com/your-username/NewtonBench)
+[![GitHub Repo stars](https://img.shields.io/github/stars/HKUST-KnowComp/NewtonBench?style=social)](https://github.com/HKUST-KnowComp/NewtonBench)
 [![arXiv](https://img.shields.io/badge/arXiv-XXXX.XXXXX-b31b1b.svg)](https://arxiv.org/abs/XXXX.XXXXX)
 
+<div align="center">
+
+### 🔭 **Can LLMs Rediscover Newton's Laws?**
+
+**324 Scientific Law Discovery Tasks • 12 Physics Domains • Interactive Model Systems**
+
+*✨ Moving beyond memorization toward true scientific discovery in complex, interactive environments ✨*
+
+</div>
+
+---
+
+## 🚀 **TL;DR**
+
+**NewtonBench** is the first benchmark designed to rigorously evaluate LLMs' ability to discover scientific laws through **interactive experimentation** rather than static function fitting. It resolves the fundamental trilemma among scientific relevance, scalability, and memorization resistance through **metaphysical shifts**: systematic alterations of canonical physical laws.
+
+### 🎯 **Key Features**
+- **324 tasks** across 12 physics domains (gravitation, Coulomb's law, Fourier's law, etc.)
+- **Interactive model systems** that require active experimentation and hypothesis testing
+- **Two independent difficulty dimensions**: law complexity (easy/medium/hard) × system complexity (vanilla/simple/complex)
+- **Code-assisted evaluation** to isolate reasoning from computational constraints
+- **Memorization resistance** through metaphysical shifts of canonical laws
+
+### 🔬 **What We Discovered**
+- **Frontier models** (GPT-5, Gemini-2.5-pro) show **clear but fragile** discovery capabilities
+- **Performance degrades precipitously** as system complexity and noise increase
+- **Paradoxical tool effect**: code assistance helps weaker models but hinders stronger ones
+- **Extreme noise sensitivity**: even a 0.0001 noise level causes a 13–15% accuracy drop
+
+### 🏆 **Why It Matters**
+NewtonBench shows that while LLMs are beginning to develop scientific reasoning skills, **robust, generalizable discovery in complex environments remains the core challenge** for automated science.
+
+---
+
+<div align="center">
+<figure>
+<img src="./images/main_dark.png" alt="Framework" style="max-width: 100%; height: auto;">
+<br>
+<figcaption><em>Quick Overview of NewtonBench.</em></figcaption>
+</figure>
+</div>
+
+## 🔥 News
+* **10 Oct, 2025**: The paper is released on arXiv!
+
 ## 📋 Table of Contents
 
-- [📖 Introduction](#-introduction)
-- [🔄 Updates](#-updates)
+- [🔥 News](#-news)
 - [🚀 Get Started](#-get-started)
   - [1. Clone the Repository](#1-clone-the-repository)
   - [2. Create and Activate a Conda Environment](#2-create-and-activate-a-conda-environment)
@@ -19,20 +66,11 @@
   - [Method 1: Using `models.txt`](#method-1-using-modelstxt)
   - [Method 2: Specifying a Single Model](#method-2-specifying-a-single-model)
   - [Controlling Parallelism](#controlling-parallelism)
-- [Analyzing Results](#analyzing-results)
-- [📚 Citation](#-citation)
-
-## 📖 Introduction
-We introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. NewtonBench is designed to rigorously evaluate the scientific reasoning capabilities of Large Language Models (LLMs) by moving beyond memorization toward true discovery.
-
-It combines two core innovations: a metaphysical shift, which systematically modifies canonical physical laws to create conceptually grounded yet novel problems, and an interactive, system-oriented environment, where agents must design experiments and interpret feedback within confounded systems. The benchmark provides two independent dimensions of difficulty: the complexity of the target law, and the complexity of the model systems.
+- [📈 Analyzing Results](#-analyzing-results)
+- [🌟 Citation](#-citation)
 
-By optionally integrating a code execution interface, NewtonBench isolates reasoning from computational constraints, revealing the genuine frontiers of LLMs' ability to discover scientific laws in complex, interactive settings.
 
-![Project Illustration](images/design_illustration.jpg)
 
-## 🔄 Updates
-* **8 Oct, 2025**: The paper is released on arXiv.
 
 
 ## 🚀 Get Started

@@ -188,7 +226,7 @@ The `--parallel` argument controls the number of concurrent processes. A higher
 python run_master.py --parallel 8
 ```
 
-### Analyzing Results
+### 📈 Analyzing Results
 
 After running experiments, you can use the `result_analysis/summarize_results.py` script to process and aggregate the results into a summary CSV file.

@@ -208,7 +246,7 @@ You can also generate the summary for a single model by specifying its name. For
 python result_analysis/summarize_results.py --model_name gpt41mini
 ```
 
-## 📚 Citation
+## 🌟 Citation
 
 If you use NewtonBench in your research, please cite our paper:

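As background for the "metaphysical shift" idea in the updated README: a canonical law is perturbed into a conceptually grounded but novel variant that an agent must rediscover by probing the system. A minimal sketch of that idea, where the altered exponent 2.5 and the function names are purely illustrative assumptions and not an actual NewtonBench task:

```python
# Hypothetical illustration of a "metaphysical shift" (not an actual
# NewtonBench task): Newton's inverse-square law is perturbed into a
# variant with an altered distance exponent.

G = 6.674e-11  # gravitational constant (SI units)

def canonical_gravity(m1: float, m2: float, r: float) -> float:
    """Newton's law of universal gravitation: F = G * m1 * m2 / r**2."""
    return G * m1 * m2 / r**2

def shifted_gravity(m1: float, m2: float, r: float, exponent: float = 2.5) -> float:
    """A shifted variant: same form, but with a modified distance exponent."""
    return G * m1 * m2 / r**exponent

# An agent querying the system at several radii sees data inconsistent
# with the memorized inverse-square form and must infer the new exponent.
for r in (1.0, 2.0, 4.0):
    print(r, canonical_gravity(1.0, 1.0, r), shifted_gravity(1.0, 1.0, r))
```

Because the shifted law agrees with the canonical one at r = 1 but diverges elsewhere, recalling the textbook formula is not enough; the agent has to vary r and fit the exponent from its own measurements.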
images/main_dark.png

2.66 MB
