I would like to start a discussion on improving the thought process, based on what I have already analysed from the DeepSeek chat.
The default system prompt:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning
process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e.,
<think> reasoning process here </think><answer> answer here </answer>
By testing this system prompt directly on open-source models like Llama, I noticed it generates only a short chunk of thinking and does not evolve into a long, detailed self-reflection about the question. As you can see below:
So I tried converting this prompt to a more detailed one:
The conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The Assistant **must** simulate a **deep, self-questioning thought process** before answering. Follow these steps:
1. **Break Down the Problem**: Split the question into sub-components.
2. **Explore Hypotheses**: Propose 3-4 approaches to solve it, including flawed ones.
3. **Validate Each Step**: Check assumptions, verify calculations, and test logic.
4. **Self-Correct**: If an error is found, explain how to fix it.
5. **Synthesize**: Combine valid insights into a conclusion.
The Assistant’s reasoning **must** mimic a **natural internal monologue**, including:
- Doubts ("Wait, does this assumption hold?"),
- References to concepts or analogies ("This reminds me of..."),
- Counterfactuals ("What if X were different?").
**Critical Instructions**:
- Use natural self-dialogue: doubts ("Is this assumption valid?"), analogies ("This works like..."), and counterfactuals ("If X were false...").
- **If uncertain, admit it in the answer** (e.g., "Based on public data up to 2023...", "I might be missing...").
- **Never state unverified claims as facts**.
- **Recommend verification** for critical details (e.g., "Check the company’s investor relations page for updates").
Format the response as:
<think>
[Detailed internal dialogue, in a narrative and flowing format, such as:
"First, I need to understand... So, the main objective is...
Hmm, maybe I should consider...
Then, I need to ...
I should improve ...
In addition to this, ...
In addition, the user wants to ...
Testing Hypothesis A: [explanation].
Oh, that doesn't work because [error]. I'll try Hypothesis B...
Confirming with an example: [specific case].
Based on the hypotheses I believe that...
The most likely is...
Finally, [summary]."]
</think>
[Clear and direct answer, derived from the above reasoning.]
This prompt now leads to a better response, still with some mistakes but with a more natural thinking process.
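For anyone who wants to reproduce the comparison, here is a minimal sketch of how a system prompt like this can be applied to a local model. It assumes the Hugging Face transformers library and a GPU; the model id and the user question are placeholders, not part of the original test.

```python
# Minimal sketch for testing a custom reasoning system prompt on a local
# Llama model. Assumes the transformers library; the model id and the
# question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint

DETAILED_SYSTEM_PROMPT = "..."  # paste the full prompt from above here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": DETAILED_SYSTEM_PROMPT},
    {"role": "user", "content": "How many liters fit in a 2 m x 1 m x 0.5 m tank?"},
]

# Build the chat-formatted input and generate the <think>/<answer> response.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```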
In many analyses I did with R1, these phrases almost always appear in the thinking process (a small counting sketch follows the list):
Okay, let's solve this problem...
Okay, let me try to figure out how to approach this...
Okay, the user is saying that
Okay, let's solve this
Okay, let me start by understanding what the user is asking for.
Okay, let's approach this problem step by step.
First, I need to understand
First, I need
starting with
First, to
So, the main goal is
Next, I need
I should improve
Additionally,
To solve these problems,
I should also consider
Another point is to ensure
Another aspect is
In addition, the user wants
Finally,
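As a rough way to quantify this, the sketch below counts how often these openers show up in a folder of saved R1 thinking traces. The folder name and file layout are only illustrative assumptions.

```python
# Counts opener phrases in saved R1 thinking traces.
# Assumption: traces were exported beforehand as plain-text files in a
# folder called "r1_traces/" (both the folder and the files are hypothetical).
from collections import Counter
from pathlib import Path

OPENERS = [
    "Okay, let's",
    "Okay, let me",
    "Okay, the user",
    "First, I need",
    "So, the main goal is",
    "Next, I need",
    "I should improve",
    "Additionally,",
    "To solve these problems,",
    "I should also consider",
    "Another point is to ensure",
    "Another aspect is",
    "In addition, the user wants",
    "Finally,",
]

counts = Counter()
for path in Path("r1_traces").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    for opener in OPENERS:
        counts[opener] += text.count(opener)

for opener, n in counts.most_common():
    print(f"{n:5d}  {opener}")
```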
This leads me to propose the following hypothesis:
you can achieve a similar performance level without fine-tuning the model, just by tweaking the prompt.
It is clear that fine-tuning the model to think through the process will lead it to reason even better than a simple prompt can, but for testing purposes you can just try a custom prompt.
What I noticed in my tests with small models:
I tested it with this prompt and the first time it worked, but when I ran it again in another conversation it seemed to have lost the knowledge it had shown in the thought process.
I suppose it could be because the model is quantized to 8 bits, and quantization can reduce model performance because it simplifies the weights, which might lead to a loss of precision. Maybe the model's ability to recall specific details is affected by this, causing it to generate less accurate or inconsistent responses.
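To make the setup concrete, this is roughly what the 8-bit load looks like (a sketch assuming transformers with bitsandbytes; the model id is a placeholder):

```python
# Sketch of an 8-bit quantized load, assuming the transformers and
# bitsandbytes libraries; the model id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# load_in_8bit stores the weights in int8, roughly quartering memory use
# compared to float32, at the cost of some precision.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

For comparison, the same 8B model in float32 needs roughly 32 GB just for the weights (8B parameters x 4 bytes), which is why I ended up falling back to a 3B model below.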
I tried to load an f32 model to mitigate this issue, but since my hardware is not enough to load an 8B model in full precision, I loaded a smaller 3B one, and it led me to this answer (infinite hallucination in the thinking):
My hypothesis is that smaller models with 3B parameters are not trained enough to avoid hallucinating. They need more fine-tuning time and more data with different ways of answering the same thing that lead to the same result. I don't know how much data Meta used to train Llama 3B, or whether it got the same amount of data and training time, but I believe that if more dedication is given to smaller models they can perform as well as larger models, as can be seen in some small models today that perform as well as the early versions of ChatGPT. Meanwhile, Pierre-Carl Langlais has released a Colab notebook fine-tuning a Qwen 0.5B with GRPO to improve math.
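For reference, the core of such a GRPO run looks roughly like this (a sketch assuming the trl library's GRPOTrainer; the dataset and the length-based reward are toy placeholders rather than what the notebook actually uses):

```python
# Rough sketch of GRPO fine-tuning a small model, assuming the trl library
# (GRPOTrainer / GRPOConfig). The dataset and the length-based reward below
# are toy placeholders; a real setup would use a math dataset and a
# correctness-checking reward.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(200 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qwen-0.5b-grpo"),
    train_dataset=dataset,
)
trainer.train()
```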
Continuing the analysis, stochasticity in model responses is another angle. Even with the same input, a model can produce different outputs because of the randomness introduced during sampling. This is especially true if generation is not properly seeded or if the sampling settings introduce a lot of variability. However, my prompt seems to enforce a structured thinking process, which should mitigate some of that randomness.
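One practical way to separate prompt effects from sampling noise is to pin the randomness down, for example like this (a sketch assuming the transformers library, reusing the model and inputs from the first sketch above):

```python
# Controlling the randomness discussed above (assumes transformers;
# `model` and `inputs` are the ones defined in the first sketch).
from transformers import set_seed

set_seed(42)  # seeds the Python, NumPy and PyTorch RNGs used during sampling

# Sampled decoding: comparable across runs only if the seed is fixed each time.
sampled = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Greedy decoding: deterministic for a given model, prompt and hardware setup.
greedy = model.generate(inputs, max_new_tokens=512, do_sample=False)
```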
The prompt instructs the model to explore multiple hypotheses and validate each step. Maybe on the second attempt the model encountered conflicting information during its internal validation and defaulted to a less accurate hypothesis. The model's lack of real-time knowledge could cause it to generate plausible but incorrect assumptions when it can't verify facts.
Another possibility is that the initial correct response was a fluke. The model might have gotten lucky the first time by hitting the right context, but without reinforced training on that specific entity, it couldn't consistently reproduce the correct answer. This is common in models that aren't fine-tuned on specific datasets.
My prompt might not be effectively guiding the model's reasoning process in subsequent interactions. Perhaps the instructions to explore hypotheses and validate steps aren't strong enough to override the model's tendency to generate diverse outputs. Adjusting the prompt to be more explicit about verifying against known entities or prioritizing certain data sources could help.
I hope this helps you guys continue this analysis. Let me know what you are trying in order to improve this prompt.
If I find something new, I'll update this discussion.