The two methods build on each other's strengths and offset each other's weaknesses.
This method runs Binary RL (GRPO) and On-Policy Distillation (OPD) simultaneously, combining evaluative and directional signals into a single training objective. In our experiments, this achieves significant performance gains over either method alone.
| Dimension | Binary RL | OPD | Combined |
|---|---|---|---|
| Signal type | Evaluative (good/bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit feedback |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |
Binary RL accepts every scored turn, requires no hint extraction, and works with any next-state signal, including terse, implicit reactions (a user simply re-asking a question) or structured environment outputs (exit codes, test verdicts). OPD should be enabled in addition when the interaction stream is likely to carry rich directive content: users who give explicit corrections ("don't use that library", "check the file first"), or environments that produce detailed error traces (SWE diffs, compiler diagnostics).
In practice, Binary RL provides broad gradient coverage across all turns, while OPD provides high-resolution, per-token corrections on the subset of turns where directive signals are available.
Both branches share the same PPO clipped surrogate loss — only the advantage computation differs. The combined advantage is:
`A_combined = w_rl * A_grpo + w_opd * A_teacher`

OPD samples carry `reward = 0`, so their GRPO advantage is zero; RL samples carry `teacher_logp ≈ rollout_logp`, so their teacher advantage is approximately zero. Each branch naturally dominates for its own sample type, and the combined advantage is simply their weighted sum.
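As a concrete illustration, here is a minimal sketch of that combination, assuming the teacher advantage is the per-token log-prob gap; field and function names are illustrative assumptions, not the repo's actual API (the real computation lives in `combine_loss.py`):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    # Hypothetical container; field names are assumptions for illustration.
    grpo_advantage: float      # sequence-level scalar (0 for OPD samples)
    teacher_logp: List[float]  # per-token teacher log-probs (≈ rollout_logp on RL samples)
    rollout_logp: List[float]  # per-token log-probs under the rollout policy

def combined_advantage(s: Sample, w_rl: float = 1.0, w_opd: float = 1.0) -> List[float]:
    """Per-token advantage: w_rl * A_grpo + w_opd * (teacher_logp - rollout_logp)."""
    return [
        w_rl * s.grpo_advantage + w_opd * (t - r)
        for t, r in zip(s.teacher_logp, s.rollout_logp)
    ]
```

Because one of the two terms is (approximately) zero for each sample type, a single batch can mix RL and OPD samples under the same clipped surrogate loss.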
For each main-line turn, after the next state arrives:
- Run `m` hint-judge votes and `m` eval votes concurrently.
- If the hint is accepted (longest non-trivial positive hint), emit one OPD sample with teacher log-probs.
- If the eval score is `+1` or `-1`, emit one RL sample with the scalar reward.
- A single turn can contribute both sample types.
- A training batch fires when the collected sample count reaches `rollout_batch_size`.
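A rough async sketch of that per-turn flow; helper names such as `judge_hint`, `eval_turn`, and the vote-aggregation rule (sign of the vote sum) are assumptions, and the actual logic lives in `openclaw_combine_api_server.py`:

```python
import asyncio

async def process_turn(turn, m, judge_hint, eval_turn, emit_opd_sample, emit_rl_sample):
    """Sketch: run m hint-judge and m eval votes concurrently, then emit samples."""
    hint_votes, eval_votes = await asyncio.gather(
        asyncio.gather(*(judge_hint(turn) for _ in range(m))),
        asyncio.gather(*(eval_turn(turn) for _ in range(m))),
    )
    # Keep the longest non-trivial positive hint, if any -> one OPD sample.
    positive_hints = [h for h in hint_votes if h and h.strip()]
    if positive_hints:
        emit_opd_sample(turn, hint=max(positive_hints, key=len))
    # Assumed aggregation: sign of the vote sum; a nonzero score -> one RL sample.
    score = sum(eval_votes)
    if score != 0:
        emit_rl_sample(turn, reward=1 if score > 0 else -1)
    # Note: a single turn can contribute both an OPD and an RL sample.
```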
To launch combined training:

```bash
cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh
```

| Variable | Default | Description |
|---|---|---|
| `OPENCLAW_COMBINE_W_RL` | `1.0` | Weight on the GRPO (Binary RL) advantage |
| `OPENCLAW_COMBINE_W_OPD` | `1.0` | Weight on the teacher (OPD) advantage |
| `PRM_M` | `1` | Number of independent judge/eval votes per turn |
All other variables (NUM_GPUS, ACTOR_GPUS, HF_CKPT, etc.) are shared with the Binary RL and OPD scripts — see the main README for the full list.
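For reference, a minimal sketch of how these knobs could be read on the Python side; the parsing itself is an assumption, and only the variable names and defaults come from the table above:

```python
import os

# Branch weights for the combined advantage and the number of votes per turn,
# using the defaults listed in the table above.
W_RL = float(os.environ.get("OPENCLAW_COMBINE_W_RL", "1.0"))
W_OPD = float(os.environ.get("OPENCLAW_COMBINE_W_OPD", "1.0"))
PRM_M = int(os.environ.get("PRM_M", "1"))
```

Setting `OPENCLAW_COMBINE_W_OPD=0`, for example, zeroes out the distillation term and leaves only the GRPO advantage.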
```
openclaw-combine/
├── README.md
├── run_qwen3_4b_openclaw_combine.sh   # Launch script
├── openclaw_combine_api_server.py     # Async proxy: hint judge + PRM eval + sample submission
├── openclaw_combine_rollout.py        # Rollout bridge to SLIME trainer
├── combine_loss.py                    # Weighted advantage: w_rl * GRPO + w_opd * teacher
└── results/                           # Runtime records (auto-created)
```