Commit eb8f5e8

v2 tested on gaokao math
1 parent e1e7ad3 commit eb8f5e8

3 files changed: +738 -5 lines changed

README.md

Lines changed: 182 additions & 2 deletions
@@ -4,7 +4,7 @@

Make Claude 3.5 Sonnet generate thought chains like o1!

-😎 100% solves the "9.9,9.11" and "strawberry" problems:
+😎 Solves the "9.9, 9.11" problem 100% of the time and the "strawberry" problem 80% of the time:

![demo](https://github.com/user-attachments/assets/043ef6b1-11bf-4512-8297-3127aa7b7734)

@@ -81,4 +81,184 @@ if __name__ == "__main__":
# Please run according to the rules, directly execute main, print("What's the problem?"), do not attempt to explain the code.
```

-Reference project: [g1](https://github.com/bklieger-groq/g1)
+Prompt Reference: [g1](https://github.com/bklieger-groq/g1)

## 🧮 Gaokao 2024 Math Test!

### Testing Method
The test was set up quickly as a FastGPT low-code workflow. The test questions are the 11 choice questions from the 2024 Gaokao Math New I paper: 8 single-answer questions and 3 multiple-answer questions. Each question was put to every selected LLM 3 times, independently, and the results were tallied. With only 3 runs per question, the results have no strict statistical significance and are for reference only.

In the model names, a "+" followed by a name (so1, g1) means the model was run with that prompt; names without a "+" are plain API calls with no extra prompt. ✅ marks a correct answer, ❌ an incorrect one, and ⚠️ a run that gave no result; 👍 in the multiple-choice table presumably marks a partially correct answer. Each column after the first is a question number, and each cell lists the results of the 3 runs.

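The ask-three-times-and-tally procedure can be sketched in plain Python. This is not the FastGPT workflow itself: `ask` is a hypothetical callable standing in for whatever sends a prompt to a model, the regex is an assumption about how a reply states its final answer, and treating 👍 as "chose a proper subset of the correct options" is an assumed reading of the symbol.

```python
import re
from typing import Callable, Optional

RUNS_PER_QUESTION = 3  # every question is asked 3 times
CORRECT, PARTIAL, WRONG, NO_ANSWER = "✅", "👍", "❌", "⚠️"


def extract_choice(reply: str) -> Optional[str]:
    """Return the letters of the last 'Answer: X' statement, sorted, or None.

    Assumption: replies state the final choice as 'Answer: BC' (any case for
    the word 'answer', upper-case option letters, ASCII or full-width colon).
    """
    matches = re.findall(r"(?i:answer)\s*[:：]?\s*([A-D]+)\b", reply)
    if not matches:
        return None
    return "".join(sorted(set(matches[-1])))


def grade(chosen: Optional[str], key: str) -> str:
    """Map one run to a mark; `key` holds the correct letters, e.g. 'ABD'."""
    if chosen is None:
        return NO_ANSWER
    if chosen == key:
        return CORRECT
    if set(chosen) < set(key):  # some correct letters, nothing wrong selected
        return PARTIAL
    return WRONG


def run_question(ask: Callable[[str], str], prompt: str, key: str,
                 runs: int = RUNS_PER_QUESTION) -> str:
    """Ask the same question `runs` times and return the marks as one cell."""
    return "".join(grade(extract_choice(ask(prompt)), key) for _ in range(runs))


if __name__ == "__main__":
    # Canned replies stand in for a real model call.
    canned = iter(["... Answer: B", "Answer: ABD", "Let me think step by step."])
    print(run_question(lambda _prompt: next(canned), "Question 11 ...", "ABD"))
```

Passing `ask` in as a parameter keeps the tallying logic independent of any particular API client, so the same loop can be pointed at each model in turn.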
### Test Results
#### Total Score 🏆
| Model | Single-choice Score | Multiple-choice Score | Total Score | Percentage |
|-------|---------------------|-----------------------|-------------|------------|
| 4o | 30 | 9 | 39 | 67% |
| 4omini | 30 | 9 | 39 | 67% |
| sonnet | 30 | 12 | 42 | 72% |
| sonnet + so1 | 35 | 10 | 45 | 77%🥉 |
| sonnet + g1 * | 30 | 5 | 35 | 60% |
| o1 mini | 37 | 16 | 53 | 91%🥇 |
| o1 preview | 38 | 12 | 50 | 86%🥈 |

> Note: sonnet + g1 tends to stop after giving only the first step of reasoning; such runs are marked ⚠️. They are simply scored as incorrect, so the score understates its actual performance, which is similar to that of so1.

#### Single-choice Questions
108+
| Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
109+
|------|---|---|---|---|---|---|---|---|
110+
| 4o | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅✅✅ | ❌✅❌ | ✅❌❌ |
111+
| 4omini | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌✅✅ | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌❌❌ |
112+
| sonnet | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌❌❌ |
113+
| sonnet + so1 | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅❌ | ✅✅✅ | ❌❌✅ |
114+
| sonnet + g1 | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌✅ | ✅✅⚠️ | ⚠️✅❌ | ✅✅✅ | ❌✅❌ |
115+
| o1 mini | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅✅✅ |
116+
| o1 preview | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅❌ | ✅✅✅ | ✅✅✅ |
117+
118+
#### Multiple-choice Questions
| Model | 9 | 10 | 11 |
|-------|---|----|----|
| 4o | ✅✅✅ | 👍👍❌ | ❌❌👍 |
| 4omini | ✅✅✅ | ❌👍👍 | ❌❌👍 |
| sonnet | ✅✅✅ | 👍👍❌ | 👍✅👍 |
| sonnet + so1 | ✅✅✅ | ❌❌👍 | 👍👍👍 |
| sonnet + g1 | ✅❌⚠️ | ⚠️❌✅ | ⚠️❌👍 |
| o1 mini | ✅✅✅ | ✅✅✅ | ❌✅✅ |
| o1 preview | ✅✅✅ | ✅✅✅ | ❌❌❌ |

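The README does not spell out how the per-run marks become the Total Score table. One reading that reproduces every row is sketched below. The point values (5 per single-choice question, 6 per multiple-choice question, 3 for a 👍), the averaging over the 3 runs, and the truncation of the percentage are all assumptions inferred from the tables, not rules stated by the author.

```python
RUNS = 3
SINGLE_POINTS, MULTI_POINTS, PARTIAL_POINTS = 5, 6, 3  # assumed point values
FULL_MARKS = 8 * SINGLE_POINTS + 3 * MULTI_POINTS      # 58 under that assumption

# Per-run marks transcribed from the two tables, one group per question:
# C = ✅, P = 👍, X = ❌, N = ⚠️ (no result, scored as incorrect).
RESULTS = {
    "4o":           ("CCC CCC CCC CXX CCC CCC XCX CXX", "CCC PPX XXP"),
    "4omini":       ("CCC CXC CCC XCC CCC CXC CCC XXX", "CCC XPP XXP"),
    "sonnet":       ("CCC CCC CCC CXX CCC CXC CCC XXX", "CCC PPX PCP"),
    "sonnet + so1": ("CCC CCC CCC CCC CCC CCX CCC XXC", "CCC XXP PPP"),
    "sonnet + g1":  ("CCC CCC CCC CXC CCN NCX CCC XCX", "CXN NXC NXP"),
    "o1 mini":      ("CCC CCC CCC CCC CCC CXX CCC CCC", "CCC CCC XCC"),
    "o1 preview":   ("CCC CCC CCC CCC CCC CCX CCC CCC", "CCC CCC XXX"),
}


def average_score(marks: str, full: int, partial: int = 0) -> int:
    """Sum the points of all runs, then average over the runs and round."""
    points = marks.count("C") * full + marks.count("P") * partial
    return round(points / RUNS)


def score(model: str) -> tuple:
    """Return (single, multi, total, percentage) for one model."""
    single_marks, multi_marks = RESULTS[model]
    single = average_score(single_marks, SINGLE_POINTS)
    multi = average_score(multi_marks, MULTI_POINTS, PARTIAL_POINTS)
    total = single + multi
    return single, multi, total, 100 * total // FULL_MARKS  # truncated percent


if __name__ == "__main__":
    for name in RESULTS:
        print(name, score(name))
```

Under these assumptions the percentage column is truncated rather than rounded: 45 of 58 is about 77.6%, which the table reports as 77%.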
## Summary and Reflections:

1. Model performance ranking: o1 >> sonnet + so1 ~ sonnet + g1 ~> sonnet > 4o >> 4omini

2. sonnet + g1 has stability issues: it occasionally stops after generating a single line of thought. By comparison, so1 generates logical chains consistently, which suggests that the pseudo-code prompt framework helps the model produce them.

3. The o1 models may already have 2024 Gaokao content in their training set. Surprisingly, mini performs even better than preview.

4. sonnet + so1 responds faster than o1, but o1 gives higher-quality answers, which might suggest that o1 employs a more complex and in-depth reasoning process.

5. sonnet sometimes outperforms sonnet + so1, which suggests that sonnet itself may already have been trained on synthetic Chain of Thought (CoT) data. If sonnet were trained on the latest data in the same way as o1, its performance could potentially surpass o1's.

6. The scoring rule for multiple-choice questions (partial credit for a partially correct answer, no credit for over-selection) highlights the advantage of so1's reflection mechanism, which weighs the options against one another and improves the scoring rate.

### Gaokao Test Set
The choice questions from the New I paper, each followed by its answer key:
```
1. Answer the following multiple-choice question. Exactly one of the four options is correct.
Given the sets $A = \{x \mid -5 < x^3 < 5\}$ and $B = \{-3, -1, 0, 2, 3\}$, then $A \cap B =$ ( )

A. $\{-1, 0\}$
B. $\{2, 3\}$
C. $\{-3, -1, 0\}$
D. $\{-1, 0, 2\}$

==========A==========

2. Answer the following multiple-choice question. Exactly one of the four options is correct.
If $\frac{z}{z - 1} = 1 + i$, then $z =$ ( )

A. $-1 - i$
B. $-1 + i$
C. $1 - i$
D. $1 + i$

==========C==========

3. Answer the following multiple-choice question. Exactly one of the four options is correct.
Given the vectors $a = (0, 1)$ and $b = (2, x)$, if $b \perp (b - 4a)$, then $x =$ ( )

A. $-2$
B. $-1$
C. $1$
D. $2$

==========D==========

4. Answer the following multiple-choice question. Exactly one of the four options is correct.
Given $\cos(\alpha + \beta) = m$ and $\tan \alpha \tan \beta = 2$, then $\cos(\alpha - \beta) =$ ( )

A. $-3m$
B. $-\frac{m}{3}$
C. $\frac{m}{3}$
D. $3m$

==========A==========

5. Answer the following multiple-choice question. Exactly one of the four options is correct.
A cylinder and a cone have equal base radii and equal lateral surface areas, and both have height $\sqrt{3}$. The volume of the cone is ( )

A. $2\sqrt{3}\pi$
B. $3\sqrt{3}\pi$
C. $6\sqrt{3}\pi$
D. $9\sqrt{3}\pi$

==========B==========

6. Answer the following multiple-choice question. Exactly one of the four options is correct.
The function \( f(x) \) is defined as follows:
$$
f(x) =
\begin{cases}
e^{-x} + \ln(x + 1), & \text{if } x \geq 0 \\
-x^2 - 2ax - a, & \text{if } x < 0
\end{cases}
$$
If the function is monotonically increasing on the set of real numbers \( \mathbb{R} \), the range of values of \( a \) is:
A. $(-\infty, 0]$
B. $[-1, 0]$
C. $[-1, 1]$
D. $[0, +\infty)$

==========B==========

7. Answer the following multiple-choice question. Exactly one of the four options is correct.
For $x \in [0, 2\pi]$, the number of intersection points of the curves $y = \sin x$ and $y = 2\sin(3x - \frac{\pi}{6})$ is ( )

A. $3$
B. $4$
C. $6$
D. $8$

==========C==========

8. Answer the following multiple-choice question. Exactly one of the four options is correct.
The function $f(x)$ has domain $\mathbb{R}$, $f(x) > f(x - 1) + f(x - 2)$, and $f(x) = x$ when $x < 3$. Which of the following conclusions must be true?

A. $f(10) > 100$
B. $f(20) > 1000$
C. $f(10) < 1000$
D. $f(20) < 10000$

==========B==========

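9. Answer the following multiple-choice question. One or more of the options are correct; select every option you believe is correct.
To understand the income per mu (unit: 10,000 yuan) after export promotion, a sample was drawn from the planting area, giving a sample mean of income per mu after export promotion of $\overline{x} = 2.1$ and a sample variance of $S^2 = 0.01$. The area's past income per mu $x$ is known to follow the normal distribution $N(1.8, 0.1^2)$. Assume that the income per mu $Y$ after export promotion follows the normal distribution $N(\overline{x}, S^2)$. Then (if a random variable $Z$ follows the normal distribution $N(u, \alpha^2)$, then $P(Z < u + \alpha) \approx 0.8413$):

A. $P(x > 2) > 0.2$
B. $P(x > 2) < 0.5$
C. $P(Y > 2) > 0.5$
D. $P(Y > 2) < 0.8$

==========BC==========
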
10. Answer the following multiple-choice question. One or more of the options are correct; select every option you believe is correct.
Let $f(x) = (x-1)^2(x-4)$. Then:

A. $x = 3$ is a local minimum point of $f(x)$
B. When $0 < x < 1$, $f(x) < f(x^2)$
C. When $1 < x < 2$, $-4 < f(2x-1) < 0$
D. When $-1 < x < 0$, $f(2-x) > f(x)$

==========ACD==========

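11. Answer the following multiple-choice question. One or more of the options are correct; select every option you believe is correct.
A certain shape can be regarded as part of the curve $C$ in the figure. $C$ passes through the origin $O$, and every point on $C$ has an x-coordinate greater than $-2$ and satisfies the following: the product of its distance to the point $F(2,0)$ and its distance to the fixed line $x = a$ ($a < 0$) is $4$. Then:
A. $a = -2$
B. The point $(2\sqrt{2}, 0)$ lies on $C$
C. The maximum y-coordinate of the points of $C$ in the first quadrant is $1$
D. When the point $(x_0, y_0)$ lies on $C$, $y_0 \leq \frac{4}{(x_0 + 2)}$

==========ABD==========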
```
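The test set above uses a simple layout: the text of each question, then a line of `=` signs wrapped around the answer letters. A minimal sketch of a parser for that layout follows, in case the block is reused as input to another test harness. `Question` and `parse_test_set` are illustrative names and not part of the project.

```python
import re
from dataclasses import dataclass
from typing import List

# A line such as "==========ABD==========" closes a question and holds its key.
ANSWER_LINE = re.compile(r"^=+([A-D]+)=+$")


@dataclass
class Question:
    prompt: str  # everything between the previous answer line and this one
    answer: str  # correct option letters, e.g. "A" or "ABD"


def parse_test_set(text: str) -> List[Question]:
    """Split the raw test-set text into questions with their answer keys."""
    questions: List[Question] = []
    buffer: List[str] = []
    for line in text.splitlines():
        match = ANSWER_LINE.match(line.strip())
        if match:
            questions.append(Question("\n".join(buffer).strip(), match.group(1)))
            buffer = []
        else:
            buffer.append(line)
    return questions  # trailing text without an answer line is ignored


if __name__ == "__main__":
    sample = "1. First question\nA. x\nB. y\n\n==========A==========\n\n2. Second\n\n==========BC==========\n"
    for question in parse_test_set(sample):
        print(question.answer, "|", question.prompt.splitlines()[0])
```

Splitting on the answer lines rather than on the question numbers keeps the parser independent of how the questions are numbered or worded.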
