```python
self.state = "understand and analyze the problem"  # Initialize step 1
self.input = input

def step_by_step_reasoning(self):
    """You will explain each step of the reasoning process and provide a conclusion"""

def title(state, input):
    """Generate the topic you need to reason about for this step based on the current state and input"""
    return title

def reasoning(state, input):
    """**Conduct careful and detailed reasoning, noting your limitations as an LLM and what you can and cannot do. Use at least 3 different methods to reason. When you say you are examining, actually execute the examination process. Use best practices. Include exploration of alternative answers, carefully check where you might be wrong, and where errors might occur if the reasoning is incorrect. Fully explore all possible answers. Perform at least 5 steps of reasoning, the more detailed reasoning steps the better.**"""
    return reasoning_process

def decide_next_step(state, input, current_step):
    """Dynamically decide the next step based on the state, input, and current step"""
    if can_conclude(state, input):
        return "conclusion"
    else:
        return generate_next_step(state, input)

def can_conclude(state, input):
    """Determine if a conclusion can be drawn"""
    return True or False

def generate_next_step(state, input):
    """Generate the next step of reasoning based on the input and current reasoning step"""
```
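The framework above is a prompt skeleton rather than executable code. As a rough illustration of the same control flow, here is a minimal runnable sketch; the `llm` callable, its prompt strings, and the step cap are all assumptions, not part of the framework:

```python
from typing import Callable, Dict, List

def step_by_step_reasoning(llm: Callable[[str], str], problem: str,
                           max_steps: int = 8) -> List[Dict[str, str]]:
    """Drive the title -> reasoning -> decide-next-step loop sketched above."""
    state = "understand and analyze the problem"
    steps: List[Dict[str, str]] = []
    for step_no in range(1, max_steps + 1):
        # Generate the topic for this step from the current state and input.
        step_title = llm(f"State: {state}. Input: {problem}. "
                         f"Give a short title for this step.")
        # Conduct the detailed reasoning for this step.
        content = llm(f"Step '{step_title}': reason carefully about: {problem}")
        steps.append({"title": step_title, "content": content})
        # Dynamically decide the next step; stop once a conclusion can be drawn.
        state = llm("Given the reasoning so far, what is the next step? "
                    "Answer 'conclusion' if one can be drawn.")
        if state.strip().lower() == "conclusion":
            steps.append({"title": "conclusion",
                          "content": llm(f"Conclude: {problem}")})
            break
    return steps
```

With a real chat-completion client plugged in as `llm`, each loop iteration produces one titled reasoning step, mirroring how g1-style prompts render their chains of thought.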
Using FastGPT's low-code workflow for quick setup, we took the multiple-choice questions from the 2024 Gaokao Mathematics New Curriculum Standard I paper as the test set. Each question was independently asked 3 times of every selected LLM, and the results were summarized. The results are for reference only and carry no strict statistical significance.
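The repeated-query setup described above can be sketched as a small harness; the `ask` callable and the answer key are hypothetical stand-ins for the FastGPT workflow calls:

```python
from typing import Callable, Dict, List, Optional

def run_trials(models: List[str], questions: List[str],
               ask: Callable[[str, str], Optional[str]],
               answer_key: Dict[str, str], trials: int = 3) -> Dict[str, Dict[str, List[str]]]:
    """Ask every question `trials` times per model and tally marks:
    correct, incorrect, or no result given (matching the table legend)."""
    results: Dict[str, Dict[str, List[str]]] = {}
    for model in models:
        marks: Dict[str, List[str]] = {}
        for qid in questions:
            row = []
            for _ in range(trials):
                answer = ask(model, qid)          # one independent attempt
                if answer is None:
                    row.append("no-result")       # rendered as a warning mark
                elif answer == answer_key[qid]:
                    row.append("correct")
                else:
                    row.append("incorrect")
            marks[qid] = row
        results[model] = marks
    return results
```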
In the model names, a trailing "+" indicates a prompted model, while the rest are unprompted APIs. ✅/❌ indicates a correct/incorrect answer, ⚠️ indicates no result was given, and the columns from the second one onward represent question numbers.

### Test Results

#### Total Score 🏆

| Model | Single-choice Score | Multiple-choice Score | Total Score | Percentage |
| --- | --- | --- | --- | --- |
> Note: sonnet+g1 tends to stop after giving only the first step of reasoning, marked as ⚠️. In scoring, it is simply counted as incorrect, but its actual performance is similar to so1.
2. sonnet + g1 has stability issues, occasionally stopping after generating a single line of thought. By comparison, so1 consistently generates logical chains, indicating that the pseudo-code prompt framework has a positive effect on generating logical chains.

3. The o1 model may have already included 2024 Gaokao content in its training set? Surprisingly, mini's performance is even better than preview...

4. sonnet + so1 responds faster than o1, but o1 provides higher-quality answers. This might suggest that o1 employs a more complex and in-depth reasoning process.

5. sonnet sometimes outperforms sonnet + so1, indicating that sonnet itself may have already been trained on Chain of Thought (CoT) synthetic data. If sonnet were trained on the latest o1-style data, its performance could potentially surpass o1.

6. The scoring mechanism for multiple-choice questions (partial credit for partially correct answers, no credit for over-selection) highlights the advantage of so1's reflection mechanism, which can effectively balance multiple options and improve the scoring rate.
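The multiple-choice scoring rule mentioned in point 6 can be made concrete as below. The exact point values (6 full, 3 partial) follow the common Gaokao multi-select convention but are an assumption here, since the text does not state them:

```python
def score_multiple_choice(selected: set, correct: set,
                          full: int = 6, partial: int = 3) -> int:
    """Score one multi-select question: full marks for exactly the correct
    set, partial marks for a nonempty proper subset with no wrong option,
    and zero for over-selection, any wrong option, or an empty answer."""
    if not selected or selected - correct:
        return 0        # empty answer or any wrong option: no credit
    if selected == correct:
        return full     # exactly right
    return partial      # proper subset of the correct options
```

Under this rule, a cautious model that keeps only the options it is sure of still earns partial credit, which is why so1's reflection step, weighing each option before answering, improves the scoring rate.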