
Add demonstration-conditioned on-policy self-distillation mode #2

Open

charlieoneill11 wants to merge 2 commits into main from feature/self-distill-mode

Conversation

@charlieoneill11
Collaborator

Summary

  • Adds a new type = "self_distill" loss config that replaces reward-derived advantages + PPO/GRPO-style loss with token-level KL distillation against an EMA teacher, conditioned on dataset demonstrations
  • Introduces SelfDistillLossConfig supporting top-K tail KL divergence (forward, reverse, symmetric) with configurable ema_alpha, top_k, and prefix loss masking (a loss sketch follows this list)
  • EMA teacher weights with FSDP-compatible swap-in/swap-out for per-sequence teacher forward passes (sketched after the test plan below)
  • Teacher context construction (CtxT(x,c)) from dataset examples via new self_distill_context.py module
  • Full transport plumbing: teacher_prompt_ids and generated_mask carried through TrainingSample → packing → MicroBatch / TensorMicroBatch
  • Config validation enforces strict on-policy mode (max_async_level = 0, max_off_policy_steps = 0), auto-disables fused LM head, and rejects multi-run
  • EMA checkpoint save/load alongside DCP checkpoints
  • Bug fix: resolves a latent teacher_tau AttributeError when using CustomLossConfig (which lacks a teacher_tau field) by adding isinstance(loss, LossConfig) guards in rl.py
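
As a reference for the loss described above, here is a minimal sketch of a top-K tail KL in PyTorch, assuming the usual construction: the divergence is taken over the teacher's top-K tokens plus one synthetic "tail" bucket holding the remaining probability mass, and averaged only over generated (non-prefix) tokens. The function name, signature, and defaults below are illustrative, not the PR's actual API.

```python
import torch
import torch.nn.functional as F


def topk_tail_kl(
    student_logits: torch.Tensor,   # [T, V] per-token student logits
    teacher_logits: torch.Tensor,   # [T, V] per-token EMA-teacher logits
    loss_mask: torch.Tensor,        # [T] bool, True only on generated tokens (prefix masked out)
    top_k: int = 64,
    direction: str = "forward",     # "forward" | "reverse" | "symmetric"
) -> torch.Tensor:
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Indices of the teacher's top-K tokens at each position.
    topk_idx = teacher_logp.topk(top_k, dim=-1).indices           # [T, K]
    t_top = teacher_logp.gather(-1, topk_idx)                     # [T, K]
    s_top = student_logp.gather(-1, topk_idx)                     # [T, K]

    def with_tail(logp_top: torch.Tensor) -> torch.Tensor:
        # Lump all probability mass outside the top-K into one extra bucket,
        # so both distributions stay normalized over K + 1 entries.
        tail = torch.log1p(-logp_top.exp().sum(-1).clamp(max=1 - 1e-6))  # [T]
        return torch.cat([logp_top, tail.unsqueeze(-1)], dim=-1)         # [T, K+1]

    t = with_tail(t_top)
    s = with_tail(s_top)

    def kl(p_logp: torch.Tensor, q_logp: torch.Tensor) -> torch.Tensor:
        # Per-token KL(p || q) summed over the K + 1 buckets.
        return (p_logp.exp() * (p_logp - q_logp)).sum(-1)

    if direction == "forward":        # KL(teacher || student)
        per_token = kl(t, s)
    elif direction == "reverse":      # KL(student || teacher)
        per_token = kl(s, t)
    else:                             # symmetric: average of both directions
        per_token = 0.5 * (kl(t, s) + kl(s, t))

    # Average only over generated tokens; the demonstration prefix contributes nothing.
    mask = loss_mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

The tail bucket keeps both (K+1)-way distributions normalized, so the KL stays well-defined without materializing the full vocabulary distributions on both sides.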

Test plan

  • Unit tests pass: uv run pytest tests/unit/train/rl/test_self_distill_core.py tests/unit/train/rl/test_self_distill_config.py tests/unit/orchestrator/test_self_distill_context.py tests/unit/orchestrator/test_batch.py -v
  • Existing tests unaffected: uv run pytest tests/unit -v
  • Debug config runs end-to-end on GPU: uv run trainer @ configs/debug/rl/self_distill.toml
  • Verify EMA checkpoint save/load across resume cycles
  • Review docs: docs/on_policy_distillation.md, docs/bring-your-own-algorithms.md
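
The EMA teacher referenced in the summary (and whose checkpoint round-trip the test plan verifies) can be sketched with the standard scheme: keep an exponential moving average of the policy weights and temporarily swap it into the live module for a no-grad, per-sequence teacher forward. The FSDP-sharded swap the PR implements is omitted here; the class and method names are illustrative, with ema_alpha mirroring the config field.

```python
import contextlib
import torch


class EMATeacher:
    def __init__(self, model: torch.nn.Module, ema_alpha: float = 0.99):
        self.ema_alpha = ema_alpha
        # Detached copy of the current parameters, keyed by name (the "teacher" state).
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- alpha * shadow + (1 - alpha) * current policy weights
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.ema_alpha).add_(p.detach(), alpha=1 - self.ema_alpha)

    @contextlib.contextmanager
    def swapped_in(self, model: torch.nn.Module):
        # Swap the EMA weights into the live module, yield, then restore the student weights.
        backup = {n: p.detach().clone() for n, p in model.named_parameters()}
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.copy_(self.shadow[n])
        try:
            yield model
        finally:
            with torch.no_grad():
                for n, p in model.named_parameters():
                    p.copy_(backup[n])


# Usage sketch: per-sequence teacher forward under the EMA weights.
#   with torch.no_grad(), teacher.swapped_in(model):
#       teacher_logits = model(teacher_input_ids).logits
# Checkpointing then reduces to saving/loading `teacher.shadow` alongside the DCP checkpoint.
```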

🤖 Generated with Claude Code

charlieoneill11 and others added 2 commits February 14, 2026 18:07
Introduces a new `type = "self_distill"` loss config that replaces
reward-derived advantages with token-level KL distillation against an
EMA teacher, conditioned on dataset demonstrations.

Key changes:
- SelfDistillLossConfig with top-K tail KL divergence (forward/reverse/symmetric)
- EMA teacher weights with FSDP-compatible swap-in/swap-out
- Teacher context construction from dataset examples (CtxT prompt building; see the sketch below)
- Transport plumbing: teacher_prompt_ids and generated_mask through packing pipeline
- Prefix loss masking to suppress demonstration-copying artifacts
- Dedicated self-distill training branch with per-sequence teacher forward
- EMA checkpoint save/load alongside DCP checkpoints
- Config validation: on-policy enforcement, fused LM head auto-disable
- Fix latent teacher_tau AttributeError when using CustomLossConfig
- Debug config, docs, and unit tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
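
For the CtxT prompt building mentioned above, here is a minimal sketch of how the teacher context and generated_mask could be assembled, assuming the shape described in the summary: the teacher is conditioned on a dataset demonstration plus the original prompt and then scores the tokens the student generated. The template, helper name, and dataclass are hypothetical stand-ins for what self_distill_context.py actually does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherInputs:
    teacher_prompt_ids: List[int]   # CtxT(x, c): demonstration + prompt, tokenized
    input_ids: List[int]            # teacher prompt followed by the student's rollout
    generated_mask: List[bool]      # True only on the generated (scored) positions


def build_teacher_inputs(
    tokenizer,
    prompt: str,
    demonstration: str,
    generated_ids: List[int],
) -> TeacherInputs:
    # Illustrative template: show the demonstration before the task prompt so the
    # EMA teacher's distribution is conditioned on it.
    teacher_context = (
        "Here is a worked example:\n"
        f"{demonstration}\n\n"
        "Now solve the following task:\n"
        f"{prompt}"
    )
    teacher_prompt_ids = tokenizer.encode(teacher_context)

    # The student's rollout tokens are appended unchanged; the mask marks them as the
    # only positions that contribute to the distillation loss (prefix loss masking).
    input_ids = teacher_prompt_ids + generated_ids
    generated_mask = [False] * len(teacher_prompt_ids) + [True] * len(generated_ids)
    return TeacherInputs(teacher_prompt_ids, input_ids, generated_mask)
```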