|
4 | 4 |
|
5 | 5 | Make Claude 3.5 Sonnet generate thought chains like o1! |
6 | 6 |
|
7 | | -😎 100% solves the "9.9,9.11" and "strawberry" problems: |
| 7 | +😎 Solves the "9.9,9.11" problem 100% of the time and the "strawberry" problem 80% of the time: |
8 | 8 |
|
9 | 9 |  |
10 | 10 |
|
@@ -81,4 +81,184 @@ if __name__ == "__main__": |
81 | 81 | # Please run according to the rules, directly execute main, print("What's the problem?"), do not attempt to explain the code. |
82 | 82 | ``` |
83 | 83 |
|
84 | | -Reference project: [g1](https://github.com/bklieger-groq/g1) |
| 84 | +Prompt Reference: [g1](https://github.com/bklieger-groq/g1) |
| 85 | + |
| 86 | +## 🧮 Gaokao 2024 Math Test! |
| 87 | + |
| 88 | +### Testing Method |
| 89 | +We set up the test quickly with a FastGPT low-code workflow, using the multiple-choice questions from the 2024 Gaokao Math New I paper. Each question was put to every selected LLM 3 times independently, and the results were summarized. The results are for reference only and are not statistically rigorous. |
| 90 | + |
| 91 | +In the model names, a trailing "+ name" means that the named prompt was added, while the other entries are plain API calls without a prompt. ✅ and ❌ mark correct and incorrect answers, 👍 marks a partially correct multiple-choice answer, and ⚠️ means no result was given. The columns from the second one onwards are question numbers. |
| 92 | + |
| 93 | +### Test Results |
| 94 | +#### Total Score 🏆 |
| 95 | +| Model | Single-choice Score | Multiple-choice Score | Total Score | Percentage | |
| 96 | +|-------|---------------------|------------------------|-------------|------------| |
| 97 | +| 4o | 30 | 9 | 39 | 67% | |
| 98 | +| 4omini | 30 | 9 | 39 | 67% | |
| 99 | +| sonnet | 30 | 12 | 42 | 72% | |
| 100 | +| sonnet + so1 | 35 | 10 | 45 | 77%🥉 | |
| 101 | +| sonnet + g1 * | 30 | 5 | 35 | 60% | |
| 102 | +| o1 mini | 37 | 16 | 53 | 91%🥇 | |
| 103 | +| o1 preview | 38 | 12 | 50 | 86%🥈| |
| 104 | + |
| 105 | +> Note: sonnet+g1 tends to stop after giving only the first step of reasoning, marked as ⚠️. In scoring, it is simply counted as incorrect, but its actual performance is similar to so1. |
| 106 | +
|
| 107 | +#### Single-choice Questions |
| 108 | +| Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| 109 | +|------|---|---|---|---|---|---|---|---| |
| 110 | +| 4o | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅✅✅ | ❌✅❌ | ✅❌❌ | |
| 111 | +| 4omini | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌✅✅ | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌❌❌ | |
| 112 | +| sonnet | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅❌✅ | ✅✅✅ | ❌❌❌ | |
| 113 | +| sonnet + so1 | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅❌ | ✅✅✅ | ❌❌✅ | |
| 114 | +| sonnet + g1 | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌✅ | ✅✅⚠️ | ⚠️✅❌ | ✅✅✅ | ❌✅❌ | |
| 115 | +| o1 mini | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅❌❌ | ✅✅✅ | ✅✅✅ | |
| 116 | +| o1 preview | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅✅ | ✅✅❌ | ✅✅✅ | ✅✅✅ | |
| 117 | + |
| 118 | +#### Multiple-choice Questions |
| 119 | +| Model | 9 | 10 | 11 | |
| 120 | +|------|---|----|----| |
| 121 | +| 4o | ✅✅✅ | 👍👍❌ | ❌❌👍 | |
| 122 | +| 4omini | ✅✅✅ | ❌👍👍 | ❌❌👍 | |
| 123 | +| sonnet | ✅✅✅ | 👍👍❌ | 👍✅👍 | |
| 124 | +| sonnet + so1 | ✅✅✅ | ❌❌👍 | 👍👍👍 | |
| 125 | +| sonnet + g1 | ✅❌⚠️ | ⚠️❌✅ | ⚠️❌👍 | |
| 126 | +| o1 mini | ✅✅✅ | ✅✅✅ | ❌✅✅ | |
| 127 | +| o1 preview | ✅✅✅ | ✅✅✅ | ❌❌❌ | |
| 128 | + |
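The totals in the tables above appear consistent with the following scheme, which is inferred from the published numbers rather than stated in the README: 5 points per single-choice question, 6 points per multiple-choice question, half credit for 👍, no credit for ❌ or ⚠️, and each question averaged over its three attempts. A minimal sketch under those assumptions, using the sonnet + so1 rows (the names `CREDIT` and `section_score` are hypothetical):

```python
from fractions import Fraction

# Assumed point values, inferred from the published totals.
SINGLE_POINTS = 5   # per single-choice question
MULTI_POINTS = 6    # per multiple-choice question

WARNING = "\u26a0"  # the ⚠ sign without its optional variation selector
CREDIT = {
    "✅": Fraction(1),      # fully correct
    "👍": Fraction(1, 2),   # partially correct (assumed half credit)
    "❌": Fraction(0),      # incorrect
    WARNING: Fraction(0),   # no result, counted as incorrect
}

def section_score(cells, points):
    """Average each question over its attempts, then sum over questions."""
    total = Fraction(0)
    for cell in cells:
        marks = cell.replace("\ufe0f", "")  # drop emoji variation selectors
        total += points * sum(CREDIT[m] for m in marks) / len(marks)
    return total

# Marks copied from the "sonnet + so1" rows of the two tables.
single = ["✅✅✅", "✅✅✅", "✅✅✅", "✅✅✅", "✅✅✅", "✅✅❌", "✅✅✅", "❌❌✅"]
multi = ["✅✅✅", "❌❌👍", "👍👍👍"]

single_score = section_score(single, SINGLE_POINTS)
multi_score = section_score(multi, MULTI_POINTS)
total = single_score + multi_score
max_total = len(single) * SINGLE_POINTS + len(multi) * MULTI_POINTS

print(f"single={single_score} multi={multi_score} total={total}")
print(f"share of maximum: {float(total / max_total):.1%}")
```

Under these assumptions the sketch gives 35 and 10 for sonnet + so1, matching the table. Exact fractions are used so that thirds of a point do not pick up floating-point error before rounding.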
| 129 | +## Summary and Reflections: |
| 130 | + |
| 131 | +1. Model performance ranking: o1 >> sonnet + so1 ~ sonnet + g1 ~> sonnet > 4o >> 4omini |
| 132 | + |
| 133 | +2. sonnet + g1 has stability issues, occasionally stopping after generating a single line of thought. In comparison, so1 can consistently generate logical chains, indicating that the pseudo-code prompt framework has a positive effect on generating logical chains. |
| 134 | + |
| 135 | +3. The o1 models' training data may already include 2024 Gaokao content. Surprisingly, o1 mini even outperforms o1 preview. |
| 136 | + |
| 137 | +4. sonnet + so1 responds faster than o1, but o1 provides higher quality answers. |
| 138 | + This might suggest that o1 employs a more complex and in-depth reasoning process. |
| 139 | + |
| 140 | +5. sonnet sometimes outperforms sonnet + so1, which suggests that sonnet itself may already have been trained on synthetic Chain of Thought (CoT) data. |
| 141 | + If sonnet were trained on up-to-date data similar to o1's, its performance could potentially surpass o1. |
| 142 | + |
| 143 | +6. The scoring mechanism for multiple-choice questions (partial credit for partially correct answers, no credit for over-selection) highlights the advantage of so1's reflection mechanism, |
| 144 | + which can effectively balance multiple options and improve the scoring rate. |
| 145 | + |
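The multiple-choice rule described in point 6 (partial credit for a partially correct answer, no credit for over-selection) can be sketched as a small grading function. This is an illustration only: the function name `grade` is hypothetical, and the README does not show how the FastGPT workflow actually graded answers.

```python
NO_RESULT = "\u26a0\ufe0f"  # the ⚠️ mark used in the tables for "no result"

def grade(selected, key):
    """Return the table mark for one multiple-choice attempt.

    selected: option letters the model chose, e.g. "AC"
    key:      option letters of the reference answer, e.g. "ACD"
    """
    chosen, correct = set(selected), set(key)
    if not chosen:
        return NO_RESULT   # the model gave no answer
    if not chosen <= correct:
        return "❌"        # at least one wrong option was selected
    if chosen == correct:
        return "✅"        # exactly the reference answer
    return "👍"            # a proper subset of the reference answer

# Question 10 has the reference answer ACD.
print(grade("ACD", "ACD"))   # ✅
print(grade("AC", "ACD"))    # 👍
print(grade("ABCD", "ACD"))  # ❌ (B is not in the key)
```

Comparing sets rather than strings makes the result independent of the order in which the model lists its options.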
| 146 | +### Gaokao Test Set |
| 147 | +New I paper multiple-choice questions: |
| 148 | +``` |
| 149 | +1. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 150 | +Given the sets $A = \{x \mid -5 < x^3 < 5\}$ and $B = \{-3, -1, 0, 2, 3\}$, then $A \cap B =$ ( ) |
| 151 | +
|
| 152 | +A. $\{-1, 0\}$ |
| 153 | +B. $\{2, 3\}$ |
| 154 | +C. $\{-3, -1, 0\}$ |
| 155 | +D. $\{-1, 0, 2\}$ |
| 156 | +
|
| 157 | +==========A========== |
| 158 | +
|
| 159 | +2. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 160 | +If $\frac{z}{z - 1} = 1 + i$, then $z =$ ( ) |
| 161 | +
|
| 162 | +A. $-1 - i$ |
| 163 | +B. $-1 + i$ |
| 164 | +C. $1 - i$ |
| 165 | +D. $1 + i$ |
| 166 | +
|
| 167 | +==========C========== |
| 168 | +
|
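| 169 | +3. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 170 | +Given the vectors $a = (0, 1)$ and $b = (2, x)$, if $b \perp (b - 4a)$, then $x =$ ( ) |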
| 171 | +
|
| 172 | +A. $-2$ |
| 173 | +B. $-1$ |
| 174 | +C. $1$ |
| 175 | +D. $2$ |
| 176 | +
|
| 177 | +==========D========== |
| 178 | +
|
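| 179 | +4. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 180 | +Given $\cos(\alpha + \beta) = m$ and $\tan \alpha \tan \beta = 2$, then $\cos(\alpha - \beta) =$ ( ) |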
| 181 | +
|
| 182 | +A. $-3m$ |
| 183 | +B. $-\frac{m}{3}$ |
| 184 | +C. $\frac{m}{3}$ |
| 185 | +D. $3m$ |
| 186 | +
|
| 187 | +==========A========== |
| 188 | +
|
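| 189 | +5. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 190 | +A cylinder and a cone have equal base radii and equal lateral surface areas, and both have height $\sqrt{3}$. The volume of the cone is ( ) |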
| 191 | +
|
| 192 | +A. $2\sqrt{3}\pi$ |
| 193 | +B. $3\sqrt{3}\pi$ |
| 194 | +C. $6\sqrt{3}\pi$ |
| 195 | +D. $9\sqrt{3}\pi$ |
| 196 | +
|
| 197 | +==========B========== |
| 198 | +
|
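| 199 | +6. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 200 | +The function \( f(x) \) is defined as follows: |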
| 201 | +$$ |
| 202 | +f(x) = |
| 203 | +\begin{cases} |
| 204 | +e^{-x} + \ln(x + 1), & \text{if } x \geq 0 \\ |
| 205 | +-x^2 - 2ax - a, & \text{if } x < 0 |
| 206 | +\end{cases} |
| 207 | +$$ |
| 208 | +If the function is monotonically increasing on the set of real numbers \( \mathbb{R} \), then the range of values of \( a \) is: |
| 209 | +A. $(-\infty, 0]$ |
| 210 | +B. $[-1, 0]$ |
| 211 | +C. $[-1, 1]$ |
| 212 | +D. $[0, +\infty)$ |
| 213 | +
|
| 214 | +==========B========== |
| 215 | +
|
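| 216 | +7. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 217 | +For $x \in [0, 2\pi]$, the number of intersection points of the curves $y = \sin x$ and $y = 2\sin(3x - \frac{\pi}{6})$ is ( ) |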
| 218 | +
|
| 219 | +A. $3$ |
| 220 | +B. $4$ |
| 221 | +C. $6$ |
| 222 | +D. $8$ |
| 223 | +
|
| 224 | +==========C========== |
| 225 | +
|
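| 226 | +8. Please complete the following multiple-choice question. Of the four options, only one meets the requirements of the question. |
| 227 | +The function $f(x)$ has domain $\mathbb{R}$, $f(x) > f(x - 1) + f(x - 2)$, and $f(x) = x$ when $x < 3$. Which of the following conclusions must be correct? |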
| 228 | +
|
| 229 | +A. $f(10) > 100$ |
| 230 | +B. $f(20) > 1000$ |
| 231 | +C. $f(10) < 1000$ |
| 232 | +D. $f(20) < 10000$ |
| 233 | +
|
| 234 | +==========B========== |
| 235 | +
|
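| 236 | +9. Please complete the following multiple-choice question. One or more of the given options meet the requirements of the question; select all the options you believe are correct. |
| 237 | +To understand the per-mu income (unit: 10,000 yuan) after export promotion, a sample was drawn from the planting area, giving a sample mean of $\overline{x} = 2.1$ and a sample variance of $S^2 = 0.01$ for the per-mu income after export promotion. The past per-mu income $x$ of this planting area is known to follow the normal distribution $N(1.8, 0.1^2)$. Assume that the per-mu income $Y$ after export promotion follows the normal distribution $N(\overline{x}, S^2)$. Then (if a random variable $Z$ follows the normal distribution $N(u, \alpha^2)$, then $P(Z < u + \alpha) \approx 0.8413$): |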
| 238 | +
|
| 239 | +A. $P(x > 2) > 0.2$ |
| 240 | +B. $P(x > 2) < 0.5$ |
| 241 | +C. $P(Y > 2) > 0.5$ |
| 242 | +D. $P(Y > 2) < 0.8$ |
| 243 | +
|
| 244 | +==========BC========== |
| 245 | +
|
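| 246 | +10. Please complete the following multiple-choice question. One or more of the given options meet the requirements of the question; select all the options you believe are correct. |
| 247 | +Let the function $f(x) = (x-1)^2(x-4)$. Then: |
| 248 | +
|
| 249 | +A. $x = 3$ is a local minimum point of $f(x)$ |
| 250 | +B. When $0 < x < 1$, $f(x) < f(x^2)$ |
| 251 | +C. When $1 < x < 2$, $-4 < f(2x-1) < 0$ |
| 252 | +D. When $-1 < x < 0$, $f(2-x) > f(x)$ |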
| 253 | +
|
| 254 | +==========ACD========== |
| 255 | +
|
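| 256 | +11. Please complete the following multiple-choice question. One or more of the given options meet the requirements of the question; select all the options you believe are correct. |
| 257 | +A certain shape can be regarded as part of the curve $C$ in the figure. It is known that $C$ passes through the origin $O$, and that every point on $C$ has an x-coordinate greater than $-2$ and satisfies the condition that the product of its distance to the point $F(2,0)$ and its distance to the fixed line $x = a$ ($a < 0$) is $4$. Then: |
| 258 | +A. $a = -2$ |
| 259 | +B. The point $(2\sqrt{2}, 0)$ lies on $C$ |
| 260 | +C. The maximum y-coordinate of the points of $C$ in the first quadrant is $1$ |
| 261 | +D. When the point $(x_0, y_0)$ lies on $C$, $y_0 \leq \frac{4}{(x_0 + 2)}$ |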
| 262 | +
|
| 263 | +==========ABD========== |
| 264 | +``` |