@ehartford has solved this. Do you mind sharing how you did so?
I prepared a dataset to train a Qwen3-32B model with supervised fine-tuning (SFT) using full-parameter training. During dataset preparation, I realized the `<think>` token is masked in the loss calculation. Looking for the reason in the source code, I found the code that determines the labels of the tokens: the `find_turn` function in `src/axolotl/prompt_strategies/chat_template.py`, where `dummy_ids` and `full_ids` are the tokenized contents of `turns_with_empty` and `turns_with_content`, respectively.
Here `turns_with_empty` is the conversation rendered with the assistant content left empty, and `turns_with_content` is the rendering with the actual response.
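Both renderings can be reproduced directly from the tokenizer's chat template; a minimal sketch (the exact strings come from Qwen3's own template, and the toy conversation here is just for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
user_turn = [{"role": "user", "content": "Hello"}]

# dummy version: assistant content left empty
turns_with_empty = tokenizer.apply_chat_template(
    user_turn + [{"role": "assistant", "content": ""}], tokenize=False
)
# real version: assistant content filled in
turns_with_content = tokenizer.apply_chat_template(
    user_turn + [{"role": "assistant", "content": "Hi there!"}], tokenize=False
)

print(turns_with_empty)
print(turns_with_content)
```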
As the renderings show, Qwen3's chat template adds an empty `<think> </think>` to the dummy message. If we follow the code implementation in `find_turn`, the `<think>` token is therefore assigned the label `-100`. As a result, when we train a model with SFT, the `<think>` token is not included in the loss calculation.

After fine-tuning my model with SFT, the fine-tuned model no longer outputs the `<think>` token. Weirdly, this only happens in the full-parameter training setting. If I train the model with LoRA or QLoRA, the model can still output the `<think>` token, even though `<think>` is masked there as well. I would like to know if this is a feature or a bug.
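For what it's worth, the masking is easy to confirm on a preprocessed sample (a minimal check; `tokenized_dataset` here stands in for whatever axolotl produced during preprocessing, and it assumes the usual `input_ids`/`labels` columns):

```python
# Check which label the <think> token receives in one preprocessed sample.
think_id = tokenizer.convert_tokens_to_ids("<think>")
sample = tokenized_dataset[0]  # hypothetical handle to the preprocessed dataset

for tok_id, label in zip(sample["input_ids"], sample["labels"]):
    if tok_id == think_id:
        print("label for <think>:", label)  # -100 means excluded from the loss
```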