This repository contains the algorithm implementation for our paper "Black-Box On-Policy Distillation of Large Language Models".
📄 Paper: arXiv:2511.10643
💾 Data: LMSYS-Chat-GPT-5-Chat-Response
🤖 Models: GAD Models
We use two repos so that different branches can be installed easily for different experiments. Check the GAD Repo for environment setup and scripts for running experiments; check this repo for the algorithm implementation.
Our implementation is based on VeRL. We repurpose the critic module in VeRL as our discriminator.
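As a minimal sketch of that idea (function names here are ours, not VeRL's API): a critic normally emits a per-token value, so reusing it as a discriminator means pooling those per-token outputs into one sequence-level score, counting only response positions.

```python
def sequence_score(token_scores, response_mask):
    """Pool per-token critic outputs into a single sequence-level
    discriminator score, averaging over response tokens only.
    `response_mask` is 1 at response positions, 0 at prompt positions."""
    total = sum(s * m for s, m in zip(token_scores, response_mask))
    n = sum(response_mask)
    return total / max(n, 1)

# Toy example: 2 prompt tokens (masked out), 3 response tokens.
scores = [0.9, -0.2, 0.5, 0.1, 0.6]
mask = [0, 0, 1, 1, 1]
print(sequence_score(scores, mask))  # mean over response tokens, ~0.4
```

Other pooling choices (e.g. taking the value at the final token) would work the same way; the point is that one scalar per rollout is what the BT loss below operates on.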
There are four branches in this repo:
- seqkd: runs the SeqKD baseline
- warmup: warmup stage of our method
- gad: GAD training stage of our method
- eval: uses an already-trained model to perform generation only
For SeqKD and the warmup stage of GAD, the student is supervised-finetuned on teacher responses (corresponding code at sft_seqkd and sft_warmup). We implement both in this VeRL-based repo so that all stages stay closely aligned.
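The supervised-finetuning objective in both cases reduces to token-level cross-entropy on the teacher response, with prompt positions masked out. A minimal sketch (names are ours, for illustration only):

```python
import math

def sft_loss(token_logprobs, response_mask):
    """Negative log-likelihood of the teacher response under the student,
    averaged over response tokens; prompt positions (mask = 0) are ignored."""
    nll = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return nll / max(sum(response_mask), 1)

# Student log-probs at [prompt, prompt, resp, resp, resp] positions.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
print(sft_loss(logprobs, mask))  # average NLL over the 3 response tokens
```

In practice the real code operates on batched tensors, but the masking logic is the same.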
We provide a code walk-through of the gad branch:
- Training Entrance
- Student Rollout: Entrance and Implementation
- Discriminator Update with BT Loss: Entrance and Implementation
- Student Update with Discriminator Score: Entrance and Implementation
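To make the two update steps above concrete, here is a hedged sketch of the Bradley-Terry (BT) objective for the discriminator; function names are ours, and the linked entrance/implementation files contain the actual code. Given scalar discriminator scores for a teacher response and a student rollout on the same prompt, the BT loss pushes the teacher's score above the student's; the student is then updated to maximize the discriminator's score on its own rollouts.

```python
import math

def bt_loss(teacher_score, student_score):
    """Bradley-Terry loss on a (teacher, student) response pair:
    -log sigmoid(s_teacher - s_student), written in a numerically
    direct form as log(1 + exp(-(margin)))."""
    margin = teacher_score - student_score
    return math.log(1.0 + math.exp(-margin))

# Discriminator already prefers the teacher -> small loss.
print(bt_loss(2.0, 0.5))
# Teacher scored below the student -> large loss.
print(bt_loss(0.5, 2.0))
```

The student update then treats the discriminator score of each rollout as a sequence-level reward inside the usual VeRL policy-gradient machinery.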
If you find this work useful, please cite our paper:
```bibtex
@article{ye2025blackboxonpolicydistillationlarge,
  title={Black-Box On-Policy Distillation of Large Language Models},
  author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei},
  journal={arXiv preprint arXiv:2511.10643},
  year={2025},
  url={https://arxiv.org/abs/2511.10643}
}
```

For any questions or issues, please open an issue in this repository.