Add demonstration-conditioned on-policy self-distillation mode #2
Open
charlieoneill11 wants to merge 2 commits into main
Conversation
Introduces a new `type = "self_distill"` loss config that replaces reward-derived advantages with token-level KL distillation against an EMA teacher, conditioned on dataset demonstrations.

Key changes:
- SelfDistillLossConfig with top-K tail KL divergence (forward/reverse/symmetric)
- EMA teacher weights with FSDP-compatible swap-in/swap-out
- Teacher context construction from dataset examples (CtxT prompt building)
- Transport plumbing: teacher_prompt_ids and generated_mask through the packing pipeline
- Prefix loss masking to suppress demonstration-copying artifacts
- Dedicated self-distill training branch with per-sequence teacher forward
- EMA checkpoint save/load alongside DCP checkpoints
- Config validation: on-policy enforcement, fused LM head auto-disable
- Fix for a latent teacher_tau AttributeError when using CustomLossConfig
- Debug config, docs, and unit tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
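The loss itself is not shown in this conversation. As a rough illustration of what "top-K tail KL divergence (forward/reverse/symmetric)" could look like in PyTorch, here is a minimal sketch: the KL is computed over the teacher's top-K tokens plus a single bucket holding the remaining probability mass, masked to student-generated tokens. The function name `topk_tail_kl`, its signature, and the forward/reverse direction convention are assumptions for illustration, not the PR's actual API.

```python
import torch
import torch.nn.functional as F


def topk_tail_kl(student_logits, teacher_logits, generated_mask, top_k=64, direction="forward"):
    """Per-token KL against an EMA teacher, restricted to the teacher's top-K plus a tail bucket."""
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Teacher chooses which K vocabulary entries to keep per token.
    topk_logp, topk_idx = teacher_logp.topk(top_k, dim=-1)
    student_topk = student_logp.gather(-1, topk_idx)

    def with_tail(logp_topk):
        # Fold all mass outside the top-K into one "tail" bucket so the
        # truncated distribution still sums to one.
        kept = logp_topk.exp().sum(dim=-1, keepdim=True).clamp(max=1 - 1e-6)
        tail = torch.log1p(-kept)
        return torch.cat([logp_topk, tail], dim=-1)

    p = with_tail(topk_logp)      # teacher
    q = with_tail(student_topk)   # student

    def kl(a, b):  # KL(a || b), summed over the K+1 buckets per token
        return (a.exp() * (a - b)).sum(dim=-1)

    if direction == "forward":    # KL(teacher || student), as I read "forward"
        per_token = kl(p, q)
    elif direction == "reverse":  # KL(student || teacher)
        per_token = kl(q, p)
    else:                         # symmetric: average of both directions
        per_token = 0.5 * (kl(p, q) + kl(q, p))

    # Only student-generated tokens contribute (generated_mask from the pipeline).
    mask = generated_mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```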
Summary
- New `type = "self_distill"` loss config that replaces reward-derived advantages + the PPO/GRPO-style loss with token-level KL distillation against an EMA teacher, conditioned on dataset demonstrations
- `SelfDistillLossConfig` supporting top-K tail KL divergence (forward, reverse, symmetric) with configurable `ema_alpha`, `top_k`, and prefix loss masking
- Teacher context construction (`CtxT(x, c)`) from dataset examples via the new `self_distill_context.py` module
- Transport plumbing: `teacher_prompt_ids` and `generated_mask` through `TrainingSample` → packing → `MicroBatch` → `TensorMicroBatch`
- Config validation: enforces on-policy training (`max_async_level = 0`, `max_off_policy_steps = 0`), auto-disables the fused LM head, and rejects multi-run setups
- Fix for a latent `teacher_tau` `AttributeError` when using `CustomLossConfig` (which lacks the `teacher_tau` field) by adding `isinstance(loss, LossConfig)` guards in `rl.py`
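The EMA teacher update is only described, not shown, in this summary. A minimal sketch of what the `ema_alpha` update might look like is below, assuming `ema_alpha` is the decay toward the previous teacher; the FSDP-compatible swap-in/swap-out mentioned above is not reproduced here, and `update_ema_teacher` is a hypothetical name, not the PR's API.

```python
import torch


@torch.no_grad()
def update_ema_teacher(teacher_params, student_params, ema_alpha: float = 0.99):
    """teacher <- ema_alpha * teacher + (1 - ema_alpha) * student, in place."""
    # In the PR, teacher weights are additionally swapped in/out to stay
    # FSDP-compatible; only the weight-update rule is illustrated here.
    for t, s in zip(teacher_params, student_params):
        t.mul_(ema_alpha).add_(s.detach(), alpha=1.0 - ema_alpha)
```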
Test plan
- `uv run pytest tests/unit/train/rl/test_self_distill_core.py tests/unit/train/rl/test_self_distill_config.py tests/unit/orchestrator/test_self_distill_context.py tests/unit/orchestrator/test_batch.py -v`
- `uv run pytest tests/unit -v`
- `uv run trainer @ configs/debug/rl/self_distill.toml`
- Docs: `docs/on_policy_distillation.md`, `docs/bring-your-own-algorithms.md`

🤖 Generated with Claude Code
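The listed test files are not shown here; as an illustration of the kind of property such a unit test might assert, the sketch below checks that the distillation loss is near zero when the student exactly matches the teacher. It is written against the illustrative `topk_tail_kl` defined earlier in this thread, not the PR's real test suite.

```python
import torch


def test_topk_tail_kl_zero_when_student_equals_teacher():
    # `topk_tail_kl` refers to the sketch earlier in this thread, not the PR's code.
    torch.manual_seed(0)
    logits = torch.randn(2, 5, 128)            # (batch, seq, vocab)
    mask = torch.ones(2, 5, dtype=torch.bool)  # treat every token as generated
    loss = topk_tail_kl(logits, logits.clone(), mask, top_k=16, direction="symmetric")
    assert torch.isclose(loss, torch.tensor(0.0), atol=1e-5)
```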