SFT distillation: bug fixes, VLM support, and pretokenization optimization#2053

Open

eligotts wants to merge 26 commits into main from eli/sft-distillation

Conversation

@eligotts commented Mar 19, 2026

Summary

Builds on top of #1905 (SFT distillation with teacher endpoint). This PR adds bug fixes and VLM multimodal support discovered during end-to-end testing of the SFT distillation pipeline across multiple environments.

Tested configurations:

  • Wiki-search — text-only, multi-turn tool calling, Claude Sonnet 4.6 teacher → Qwen3-4B student
  • Tic-tac-toe — VLM multimodal, multi-turn tool calling with board images, Claude teacher → Qwen3-VL-4B student
  • Color-codeword — VLM multimodal, multi-turn with dynamically generated images, no tools, Claude teacher → Qwen3-VL-4B student

Bug fixes

  • Handle dict tool_defs in _convert_tools_to_oai_format — Tool definitions arrive as plain dicts after ZMQ msgpack serialization (verifiers' msgpack_encoder calls model_dump() on Tool pydantic objects). The function previously assumed attribute access (.name); it now supports both dict and object forms.

  • Clean spurious fields from VLM messages — Some environments (e.g., tic-tac-toe) include image_url: None on text content items, which causes the Qwen3-VL processor to miscount images vs. image placeholder tokens. _prepare_messages_for_processor now emits clean {"type": "text", "text": ...} dicts.
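The normalization can be sketched like this (function name illustrative; the real code lives in `_prepare_messages_for_processor`). Text items are re-emitted as bare dicts so stray keys such as `image_url: None` cannot skew the processor's image/placeholder counting, while non-text items pass through unchanged:

```python
from typing import Any


def clean_text_items(content: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Re-emit text content items as clean {"type": "text", "text": ...}
    dicts, dropping spurious keys; leave other item types untouched."""
    cleaned: list[dict[str, Any]] = []
    for item in content:
        if item.get("type") == "text":
            cleaned.append({"type": "text", "text": item.get("text", "")})
        else:
            cleaned.append(item)
    return cleaned
```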

VLM processor support in pretokenization

For VLM models, the tokenizer's apply_chat_template produces only 1 <|image_pad|> token per image, but the model expects the count to match actual image patches (e.g., 1900 for a larger image). The processor correctly expands these placeholders but wasn't wired into the pretokenization path.

  • Thread the processor through pretokenize_rollout_trajectory → _tokenize_step_from_messages → render_messages / build_incremental_token_mask
  • Add _prepare_messages_for_processor to convert image_url items to PIL Images and normalize message format for the processor
  • Move pretokenize_rollout_trajectory before build_vlm_image_cache in the orchestrator (the cache build strips image data from messages)
  • When processor is None (non-VLM models), the original tokenizer-only path runs unchanged
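The dispatch in the bullet list above can be sketched with toy stand-ins (all names here are illustrative; the real code threads an optional Hugging Face processor through render_messages). The toy tokenizer reproduces the undercount bug — one placeholder token per image — while the toy processor expands each image to its patch count:

```python
def render_ids(tokenizer, messages, processor=None):
    """Prefer the processor's chat template when available (it expands
    image placeholders to the true patch count for VLM models);
    otherwise fall back to the unchanged tokenizer-only path."""
    template_owner = processor if processor is not None else tokenizer
    return template_owner.apply_chat_template(messages, tokenize=True)


class ToyTokenizer:
    """Emits a single pad token (0) per image — the VLM undercount bug."""
    def apply_chat_template(self, messages, tokenize=True):
        return [0 if part == "<image>" else 1
                for m in messages for part in m["content"]]


class ToyProcessor:
    """Expands each image placeholder to a fixed patch count."""
    PATCHES = 4

    def apply_chat_template(self, messages, tokenize=True):
        ids = []
        for m in messages:
            for part in m["content"]:
                ids.extend([0] * self.PATCHES if part == "<image>" else [1])
        return ids


msgs = [{"role": "user", "content": ["<image>", "describe"]}]
assert render_ids(ToyTokenizer(), msgs) == [0, 1]              # 1 pad token
assert render_ids(ToyTokenizer(), msgs, ToyProcessor()) == [0, 0, 0, 0, 1]
```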

Pretokenization optimization

In the SFT distillation path, each trajectory step's completion is always a single assistant message (structurally enforced by verifiers' parse_response_message). The previous code used build_incremental_token_mask (N tokenizer/processor calls per step) to compute a mask that is trivially [True] * len(completion_ids). Replaced with a single render_messages call + direct mask assignment. For VLM models, this avoids redundant image preprocessing on every incremental prefix. Added an assertion to guard the single-assistant-role assumption.
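Under the single-assistant-completion assumption, the replacement can be sketched as below (toy render function; the real code renders chat templates via render_messages and the names here are illustrative):

```python
def tokenize_step_fast(render, prompt, completion):
    """One render of the prompt and one of prompt+completion replace the
    N incremental calls of build_incremental_token_mask; the completion
    mask is trivially all-True."""
    assert all(m["role"] == "assistant" for m in completion), (
        "fast path requires a single-assistant completion"
    )
    prompt_ids = render(prompt)
    full_ids = render(prompt + completion)
    # Longest common prefix marks the prompt/completion token boundary.
    split = 0
    while (split < min(len(prompt_ids), len(full_ids))
           and prompt_ids[split] == full_ids[split]):
        split += 1
    completion_ids = full_ids[split:]
    return full_ids[:split], completion_ids, [True] * len(completion_ids)


def toy_render(messages):
    """Toy stand-in for render_messages: concatenates integer 'tokens'."""
    return [t for m in messages for t in m["content"]]


prompt = [{"role": "user", "content": [1, 2, 3]}]
completion = [{"role": "assistant", "content": [7, 8]}]
p, c, mask = tokenize_step_fast(toy_render, prompt, completion)
assert p + c == toy_render(prompt + completion)
assert mask == [True, True]
```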

Test plan

  • Unit tests pass (test_sft_trajectories.py)
  • End-to-end wiki-search (text + tools) — 3 steps, loss decreasing
  • End-to-end tic-tac-toe (VLM + tools) — 10 steps, loss decreasing
  • End-to-end color-codeword (VLM, no tools) — 10 steps, loss decreasing
  • Verified old vs new mask path produces identical token IDs and masks
  • Confirmed non-VLM and regular RL paths are unaffected (processor=None, tokens already populated by vLLM)

🤖 Generated with Claude Code


Note

Medium Risk
Changes orchestrator rollouts, sampling args, and trainer loss selection, including disabling weight broadcast/policy updates in a new external-rollout SFT mode; mistakes could break training/inference behavior or tokenization, especially for multimodal/tool-call trajectories.

Overview
Enables hard SFT distillation from an external OpenAI-compatible teacher endpoint via new orchestrator.teacher_rollout_model, with config validation that enforces trainer.loss.type = "sft", use_token_client = false, and no local [inference].

Updates the orchestrator to route rollout generation through the external model when configured, disabling policy weight updates/weight broadcast and adjusting checkpoint-step handling accordingly.

Adds an sft loss variant (SFTLossConfig + masked-NLL implementation) and introduces a shared chat_template utility plus rollout pretokenization/reconstruction (including VLM image message handling and tool-call argument deserialization), with new unit tests and example/docs/config updates.
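The masked-NLL loss the overview refers to reduces to a simple per-token computation, sketched here in pure Python for clarity (the trainer's implementation is tensor-based; this only illustrates the math):

```python
import math


def masked_nll(logits, targets, mask):
    """Mean negative log-likelihood of the target token over positions
    where mask is True. logits: per-position lists of raw scores."""
    total, kept = 0.0, 0
    for row, tgt, keep in zip(logits, targets, mask):
        if not keep:
            continue
        m = max(row)  # log-sum-exp with max-shift for numerical stability
        logz = m + math.log(sum(math.exp(x - m) for x in row))
        total += logz - row[tgt]
        kept += 1
    return total / max(kept, 1)


# Uniform 2-way logits: NLL of any target at a masked position is ln 2.
loss = masked_nll([[0.0, 0.0], [0.0, 0.0]], [0, 1], [True, False])
assert abs(loss - math.log(2)) < 1e-12
```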

Written by Cursor Bugbot for commit c95d9e1.

willccbb and others added 19 commits March 3, 2026 22:53
…d CustomLossConfig

Co-authored-by: will brown <willccbb@users.noreply.github.com>
# Conflicts:
#	CHANGELOG.md
#	src/prime_rl/orchestrator/orchestrator.py
#	src/prime_rl/orchestrator/scheduler.py
#	src/prime_rl/orchestrator/trajectories.py
#	src/prime_rl/trainer/rl/train.py
#	tests/unit/test_configs.py
full_ids was tokenized via _render_messages then immediately overwritten
by build_incremental_token_mask, wasting a tokenization pass per step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Duplicate of should_add_generation_prompt in utils/chat_template.py,
never called anywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicitly disable top-k and min-p filtering on all vLLM requests,
not just the token client path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…inference

Redundant with the Pydantic model validator definition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When external teacher rollouts have tool_defs, convert them to OAI
format and pass through to build_incremental_token_mask so that
tokenization includes tool definition tokens from the chat template.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches the SFT data pipeline behavior. Without this, parallel tool
call responses cause an assertion failure in incremental tokenization
when the chat template re-renders consecutive tool messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tool definitions arrive as plain dicts after ZMQ msgpack serialization
(verifiers' msgpack_encoder calls model_dump() on Tool pydantic objects).
Use duck-typed accessor to support both dict and object forms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For VLM models, the tokenizer alone produces only 1 <|image_pad|> token
per image, but the model expects the count to match actual image patches.
Thread the processor through pretokenize → render_messages →
build_incremental_token_mask so that processor.apply_chat_template
(which expands image placeholders correctly) is used when available.

Also move pretokenize before build_vlm_image_cache in the orchestrator,
since the cache build strips image data from messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In the SFT distillation path, completion is always a single assistant
message (enforced by verifiers' parse_response_message), so the
completion mask is trivially all-True. Replace N incremental
tokenizer/processor calls from build_incremental_token_mask with a
single render_messages call. For VLM models this avoids redundant
image preprocessing on every incremental prefix.

Added assertion to guard the single-assistant-role assumption.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Five trivial wrapper functions add unnecessary indirection
    • Removed the five pass-through wrappers in trajectories and updated internal/test call sites to use the imported chat_template functions directly.

Create PR

Or push these changes by commenting:

@cursor push 50cd9ca098
Preview (50cd9ca098)
diff --git a/src/prime_rl/orchestrator/trajectories.py b/src/prime_rl/orchestrator/trajectories.py
--- a/src/prime_rl/orchestrator/trajectories.py
+++ b/src/prime_rl/orchestrator/trajectories.py
@@ -45,39 +45,6 @@
     zero_entry = [[0] * topk for _ in range(num_layers)]
     return routed_experts + [zero_entry for _ in range(deficit)]
 
-
-def _common_prefix_len(a: list[int], b: list[int]) -> int:
-    return common_prefix_len(a, b)
-
-
-def _normalize_messages(messages: Any, default_role: str) -> list[dict[str, Any]]:
-    return normalize_messages(messages, default_role)
-
-
-def _deserialize_tool_calls(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
-    return deserialize_tool_calls(messages)
-
-
-def _strip_message_content(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
-    return strip_message_content(messages)
-
-
-def _render_messages(
-    tokenizer: PreTrainedTokenizer,
-    messages: list[dict[str, Any]],
-    add_generation_prompt: bool = False,
-    tools: list[dict[str, Any]] | None = None,
-    processor=None,
-) -> list[int]:
-    return render_messages(
-        tokenizer,
-        messages,
-        add_generation_prompt=add_generation_prompt,
-        tools=tools,
-        processor=processor,
-    )
-
-
 def _prepare_messages_for_processor(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
     """Convert messages to the format expected by the VLM processor.
 
@@ -124,11 +91,11 @@
     tools: list[dict[str, Any]] | None = None,
     processor=None,
 ) -> dict[str, Any]:
-    prompt = _normalize_messages(step.get("prompt"), default_role="user")
-    completion = _normalize_messages(step.get("completion"), default_role="assistant")
+    prompt = normalize_messages(step.get("prompt"), default_role="user")
+    completion = normalize_messages(step.get("completion"), default_role="assistant")
 
-    prompt = _strip_message_content(_deserialize_tool_calls(prompt))
-    completion = _strip_message_content(_deserialize_tool_calls(completion))
+    prompt = strip_message_content(deserialize_tool_calls(prompt))
+    completion = strip_message_content(deserialize_tool_calls(completion))
 
     assert all(m.get("role") == "assistant" for m in completion), (
         "Expected all completion messages to be assistant role for SFT distillation, "
@@ -141,21 +108,21 @@
 
     all_messages = prompt + completion
     prompt_has_assistant_completion = len(completion) > 0 and completion[0].get("role") == "assistant"
-    prompt_ids = _render_messages(
+    prompt_ids = render_messages(
         tokenizer,
         prompt,
         add_generation_prompt=prompt_has_assistant_completion,
         tools=tools,
         processor=processor,
     )
-    full_ids = _render_messages(
+    full_ids = render_messages(
         tokenizer,
         all_messages,
         tools=tools,
         processor=processor,
     )
 
-    split_idx = _common_prefix_len(prompt_ids, full_ids)
+    split_idx = common_prefix_len(prompt_ids, full_ids)
 
     completion_ids = full_ids[split_idx:]
     completion_mask = [True] * len(completion_ids)

diff --git a/tests/unit/orchestrator/test_trajectories.py b/tests/unit/orchestrator/test_trajectories.py
--- a/tests/unit/orchestrator/test_trajectories.py
+++ b/tests/unit/orchestrator/test_trajectories.py
@@ -10,13 +10,13 @@
 from prime_rl.orchestrator.trajectories import (
     VLMImageCache,
     _align_routed_experts,
-    _deserialize_tool_calls,
     _extract_images_from_examples,
     _extract_images_from_messages,
     _ImageStore,
     build_vlm_image_cache,
     interleave_rollout,
 )
+from prime_rl.utils.chat_template import deserialize_tool_calls
 
 
 def _pixels(data: list[list[float]]) -> tuple[bytes, list[int]]:
@@ -33,7 +33,7 @@
 def test_deserialize_tool_calls_does_not_inject_missing_key():
     messages = [{"role": "assistant", "content": "hello"}]
 
-    deserialized = _deserialize_tool_calls(messages)
+    deserialized = deserialize_tool_calls(messages)
 
     assert "tool_calls" not in deserialized[0]
 
@@ -52,7 +52,7 @@
         }
     ]
 
-    deserialized = _deserialize_tool_calls(messages)
+    deserialized = deserialize_tool_calls(messages)
 
     assert deserialized[0]["tool_calls"][0]["function"]["arguments"] == {"x": 1}



Five trivial wrapper functions add unnecessary indirection

Low Severity

_common_prefix_len, _normalize_messages, _deserialize_tool_calls, _strip_message_content, and _render_messages are single-line wrappers that purely delegate to identically-named functions already imported from prime_rl.utils.chat_template. The callers inside this file (and the test that imports _deserialize_tool_calls) could use the imported functions directly, removing five layers of unnecessary indirection.


Member

@samsja samsja left a comment


lets remove all the custom config please. maybe keep it one

Comment on lines +60 to +65
if use_token_client:
sampling_args["logprobs"] = True
extra_body["return_token_ids"] = True

if extra_body:
sampling_args["extra_body"] = extra_body
Member


we need logprob in both use token client true and false no ?

Contributor Author


added it but we need to make sure passing logprob = True doesn't break an api request to non-vllm api? and same with unconditionally passing in top_k and min_p

Contributor Author


[Screenshot: 2026-03-19 at 5:36 PM]

fyi

Keep examples/alphabet_sort/sft_distill_hard.toml as the canonical
example config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eligotts and others added 2 commits March 20, 2026 00:11
logprobs=True was only set when use_token_client=True, but non-token-
client paths (e.g. external APIs) also need logprobs for RL training.
return_token_ids remains gated behind use_token_client since it is
vLLM-specific.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When enable_policy_updates is False (SFT distillation), generate_batch
skipped calling checkpoint_ready.set(). This worked by accident because
the event starts set and is only cleared inside maybe_update_policy
(which is never called in this path), but would deadlock if anything
else ever cleared the event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member

@mikasenghaas mikasenghaas left a comment


nice looks p clean overall

Member


are the changes here related to sft distillation on rl trainer?

Contributor Author


i think this came from @willccbb but seems to just be a refactor of helper functions that were duplicated between sft/data.py and orchestrator/trajectories.py into a new shared module utils/chat_template.py - should this be a separate pr?

from prime_rl.utils.vlm import is_vlm_model


def setup_external_rollout_model(config: OrchestratorConfig, logger) -> tuple[Any, str, bool]:
Member


would prefer not starting to have orch utils here

- Revert loss_scale regression: use loss_mask-based scaling unconditionally
- Move setup_external_rollout_model to orchestrator/utils.py
- Clarify log message to say "SFT distillation mode"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sequences

When tokenizing trajectory steps from messages, the prompt-only and
full (prompt+completion) tokenizations can diverge at a boundary
(non-prefix split). Previously, the full prompt_ids was returned even
when split_idx < len(prompt_ids), producing corrupted sequences where
prompt_ids + completion_ids != full_ids. This broke interleaving
(extension property checks fail on mismatched prefixes) and fed
invalid token sequences to the trainer.

Fix: use full_ids[:split_idx] as prompt_ids so the concatenation
always equals full_ids exactly.

Verified across GLM-4, Qwen3, Qwen2.5, Qwen2.5-VL tokenizers and
tested end-to-end SFT distillation on alphabet-sort, wiki-search,
tic-tac-toe, and color-codeword (VLM) environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
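The fix in this commit reduces to taking the prompt side from the full render, which a small sketch makes concrete (function name illustrative):

```python
def split_on_common_prefix(prompt_ids, full_ids):
    """Return (prompt_ids, completion_ids) such that their concatenation
    is exactly full_ids, even when the prompt-only render diverges from
    the full render at the boundary (a non-prefix split)."""
    split = 0
    while (split < min(len(prompt_ids), len(full_ids))
           and prompt_ids[split] == full_ids[split]):
        split += 1
    return full_ids[:split], full_ids[split:]


# The prompt-only render ends with a trailing token (9) that the full
# render replaces; naive reuse of prompt_ids would corrupt the sequence.
p, c = split_on_common_prefix([1, 2, 9], [1, 2, 3, 4])
assert p + c == [1, 2, 3, 4]
assert p == [1, 2]
```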

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


…ation

The previous commit truncating prompt_ids to full_ids[:split_idx] made
the debug log condition always false (split_idx < split_idx). Track the
original prompt length before overwrite so the log fires correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>