Add demonstration-conditioned on-policy self-distillation mode #2
Open
charlieoneill11 wants to merge 2 commits into main
Conversation
Introduces a new `type = "self_distill"` loss config that replaces reward-derived advantages with token-level KL distillation against an EMA teacher, conditioned on dataset demonstrations.

Key changes:
- SelfDistillLossConfig with top-K tail KL divergence (forward/reverse/symmetric)
- EMA teacher weights with FSDP-compatible swap-in/swap-out
- Teacher context construction from dataset examples (CtxT prompt building)
- Transport plumbing: teacher_prompt_ids and generated_mask through the packing pipeline
- Prefix loss masking to suppress demonstration-copying artifacts
- Dedicated self-distill training branch with per-sequence teacher forward
- EMA checkpoint save/load alongside DCP checkpoints
- Config validation: on-policy enforcement, fused LM head auto-disable
- Fix for a latent teacher_tau AttributeError when using CustomLossConfig
- Debug config, docs, and unit tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
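The loss itself is not shown in this conversation. As a rough illustration of what "top-K tail KL divergence (forward/reverse/symmetric)" could look like in PyTorch, here is a minimal sketch: the KL is computed over the teacher's top-K tokens plus a single bucket holding the remaining probability mass, masked to student-generated tokens. The function name `topk_tail_kl`, its signature, and the forward/reverse direction convention are assumptions for illustration, not the PR's actual API.

```python
import torch
import torch.nn.functional as F


def topk_tail_kl(student_logits, teacher_logits, generated_mask, top_k=64, direction="forward"):
    """Per-token KL against an EMA teacher, restricted to the teacher's top-K plus a tail bucket."""
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Teacher chooses which K vocabulary entries to keep per token.
    topk_logp, topk_idx = teacher_logp.topk(top_k, dim=-1)
    student_topk = student_logp.gather(-1, topk_idx)

    def with_tail(logp_topk):
        # Fold all mass outside the top-K into one "tail" bucket so the
        # truncated distribution still sums to one.
        kept = logp_topk.exp().sum(dim=-1, keepdim=True).clamp(max=1 - 1e-6)
        tail = torch.log1p(-kept)
        return torch.cat([logp_topk, tail], dim=-1)

    p = with_tail(topk_logp)      # teacher
    q = with_tail(student_topk)   # student

    def kl(a, b):  # KL(a || b), summed over the K+1 buckets per token
        return (a.exp() * (a - b)).sum(dim=-1)

    if direction == "forward":    # KL(teacher || student), as I read "forward"
        per_token = kl(p, q)
    elif direction == "reverse":  # KL(student || teacher)
        per_token = kl(q, p)
    else:                         # symmetric: average of both directions
        per_token = 0.5 * (kl(p, q) + kl(q, p))

    # Only student-generated tokens contribute (generated_mask from the pipeline).
    mask = generated_mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```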
Summary
- New `type = "self_distill"` loss config that replaces reward-derived advantages + the PPO/GRPO-style loss with token-level KL distillation against an EMA teacher, conditioned on dataset demonstrations
- `SelfDistillLossConfig` supporting top-K tail KL divergence (forward, reverse, symmetric) with configurable `ema_alpha`, `top_k`, and prefix loss masking
- Teacher context construction (`CtxT(x, c)`) from dataset examples via the new `self_distill_context.py` module
- Transport plumbing: `teacher_prompt_ids` and `generated_mask` through `TrainingSample` → packing → `MicroBatch` → `TensorMicroBatch`
- Config validation: enforces on-policy training (`max_async_level = 0`, `max_off_policy_steps = 0`), auto-disables the fused LM head, and rejects multi-run setups
- Fix for a latent `teacher_tau` `AttributeError` when using `CustomLossConfig` (which lacks the `teacher_tau` field) by adding `isinstance(loss, LossConfig)` guards in `rl.py`
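The EMA teacher update is only described, not shown, in this summary. A minimal sketch of what the `ema_alpha` update might look like is below, assuming `ema_alpha` is the decay toward the previous teacher; the FSDP-compatible swap-in/swap-out mentioned above is not reproduced here, and `update_ema_teacher` is a hypothetical name, not the PR's API.

```python
import torch


@torch.no_grad()
def update_ema_teacher(teacher_params, student_params, ema_alpha: float = 0.99):
    """teacher <- ema_alpha * teacher + (1 - ema_alpha) * student, in place."""
    # In the PR, teacher weights are additionally swapped in/out to stay
    # FSDP-compatible; only the weight-update rule is illustrated here.
    for t, s in zip(teacher_params, student_params):
        t.mul_(ema_alpha).add_(s.detach(), alpha=1.0 - ema_alpha)
```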
Test plan
- `uv run pytest tests/unit/train/rl/test_self_distill_core.py tests/unit/train/rl/test_self_distill_config.py tests/unit/orchestrator/test_self_distill_context.py tests/unit/orchestrator/test_batch.py -v`
- `uv run pytest tests/unit -v`
- `uv run trainer @ configs/debug/rl/self_distill.toml`
- Docs: `docs/on_policy_distillation.md`, `docs/bring-your-own-algorithms.md`

🤖 Generated with Claude Code
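The listed test files are not shown here; as an illustration of the kind of property such a unit test might assert, the sketch below checks that the distillation loss is near zero when the student exactly matches the teacher. It is written against the illustrative `topk_tail_kl` defined earlier in this thread, not the PR's real test suite.

```python
import torch


def test_topk_tail_kl_zero_when_student_equals_teacher():
    # `topk_tail_kl` refers to the sketch earlier in this thread, not the PR's code.
    torch.manual_seed(0)
    logits = torch.randn(2, 5, 128)            # (batch, seq, vocab)
    mask = torch.ones(2, 5, dtype=torch.bool)  # treat every token as generated
    loss = topk_tail_kl(logits, logits.clone(), mask, top_k=16, direction="symmetric")
    assert torch.isclose(loss, torch.tensor(0.0), atol=1e-5)
```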