feat: Add Harbor terminal RL recipe + tinker-cookbook Env adapter #321
tyfeng1997 wants to merge 5 commits into thinking-machines-lab:main
Conversation
Thanks for sending this in, I'll take a look shortly!

The code is still a bit rough around the edges, and I would appreciate your suggestions. I have some free time at the moment and can gradually improve this pull request, including adding more experiments or making other fixes.
@tyfeng1997 thanks so much for this PR. One thing to note is that I was required to increase groups-per-batch to 4. Running the following command with 5 repeats:

```bash
python tinker_cookbook/recipes/terminal_rl/train.py \
  --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
  --env daytona \
  --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --renderer-name qwen3 \
  --max-tokens 4096 \
  --temperature 0.7 \
  --groups-per-batch 4 \
  --group-size 8 \
  --num-batches 25 \
  --learning-rate 4e-5 \
  --lora-rank 32 \
  --save-every 10 \
  --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25_gpb4
```

produced the following result:

Might be useful for others trying to train on harbor tasks! Thanks again for this example, useful for us at @VmaxAI
Thanks a lot for testing this out and sharing your results! I'm really glad to hear this recipe is useful for your work. The insight about increasing groups-per-batch to 4 is very valuable; thanks for the tip!


Summary
This PR adds an end-to-end terminal-interaction RL recipe that wires Harbor’s sandboxed terminal environments (Docker/Daytona + tmux) into tinker-cookbook’s RL training loop via a small adapter.
At a high level:
- The policy's observation is a `tinker.ModelInput` built by a tinker-cookbook renderer over a chat history.
- The policy's action is assistant text following the `terminus-json-plain` schema describing terminal keystrokes (see the sketch below).
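For concreteness, here is a minimal sketch of what one action could look like under a terminus-json-plain-style schema. The field names (`commands`, `keystrokes`, `duration_sec`, `is_task_complete`) are illustrative assumptions, not necessarily the PR's exact schema:

```python
# Illustrative only: an assistant reply carrying a terminus-json-plain-style action.
# Field names here are assumptions; the authoritative schema is whatever
# wrapped_env.py's parse_terminus_json_plain() accepts.
import json

assistant_reply = (
    '{"commands": [{"keystrokes": "ls -la\\n", "duration_sec": 1.0}],'
    ' "is_task_complete": false}'
)

action = json.loads(assistant_reply)
for cmd in action["commands"]:
    # Each entry maps to raw keystrokes sent to the tmux pane inside the sandbox.
    print(repr(cmd["keystrokes"]))
```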
Motivation
We want a minimal, recipe-scoped example that makes “terminal tasks” trainable with existing RL algorithms (e.g., GRPO) without modifying core library code.
Changes
- `tinker_cookbook/recipes/terminal_rl/wrapped_env.py`
  - `HarborTerminalTinkerEnv`: adapts a Harbor async Gym-like terminal env to the tinker-cookbook `RLEnv` interface (see the sketch after this list).
  - `parse_terminus_json_plain()`: parses the assistant JSON (supports fenced code blocks) into executable terminal commands; parse errors end the episode.
  - `HarborSingleTaskEnvGroupBuilder` + `HarborSingleTaskRLDataset(Builder)`: repeatedly yields env-groups for one fixed task, matching GRPO “same task, multiple sandboxes per group” semantics.
- `tinker_cookbook/recipes/terminal_rl/terminal_env.py`
- `tinker_cookbook/recipes/terminal_rl/train.py`
- `tinker_cookbook/recipes/terminal_rl/README.md`
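As a rough illustration of the adapter's control flow (not a copy of the PR's code), here is a sketch assuming an RLEnv-style `initial_observation`/`step` pair, Harbor methods named `reset`/`execute`, and a renderer method named `build_generation_prompt`; those names and the tuple return shape are assumptions for readability:

```python
# Structural sketch of HarborTerminalTinkerEnv's turn loop: render the chat history into
# a tinker.ModelInput, sample, parse the assistant JSON into keystrokes, run them in the
# sandbox, and feed the terminal output back as the next user message.
# Method names on the Harbor env and renderer, and the step() return shape, are assumed.
from dataclasses import dataclass, field

from tinker_cookbook.recipes.terminal_rl.wrapped_env import parse_terminus_json_plain


@dataclass
class TerminalEnvSketch:
    harbor_env: object   # Harbor's async Gym-like terminal env (Docker/Daytona + tmux)
    renderer: object     # anything satisfying the RendererProtocol-style dependency
    messages: list = field(default_factory=list)

    async def initial_observation(self):
        # Reset the sandbox and render the task prompt / first terminal screen.
        first_screen = await self.harbor_env.reset()                  # name assumed
        self.messages = [{"role": "user", "content": first_screen}]
        return self.renderer.build_generation_prompt(self.messages)   # name assumed

    async def step(self, sampled_text: str):
        self.messages.append({"role": "assistant", "content": sampled_text})
        try:
            commands = parse_terminus_json_plain(sampled_text)
        except Exception:
            # Parse errors end the episode, as described above.
            return 0.0, True, None                                    # (reward, done, next_obs)
        output, reward, done = await self.harbor_env.execute(commands)  # name assumed
        self.messages.append({"role": "user", "content": output})
        return reward, done, self.renderer.build_generation_prompt(self.messages)
```

`HarborSingleTaskEnvGroupBuilder` then instantiates group-size copies of such an env for the same task, so GRPO's group-relative advantages compare sandboxes that started from an identical state.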
Design notes
- The code lives under `recipes/terminal_rl` to keep this as an optional recipe/example and avoid changes to core RL infrastructure.
- The adapter depends on `RendererProtocol` instead of concrete renderer types to minimize coupling (see the sketch below).
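A small sketch of what the protocol-based coupling buys, assuming a `build_generation_prompt` method (the method name is an assumption; the PR types against tinker-cookbook's `RendererProtocol`):

```python
# Depending on a structural Protocol instead of a concrete renderer class means any
# renderer with a compatible interface (e.g. the qwen3 renderer selected via
# --renderer-name) can be injected without importing its concrete type here.
from typing import Any, Protocol, Sequence


class RendererLike(Protocol):
    def build_generation_prompt(self, messages: Sequence[dict]) -> Any:  # name assumed
        ...


def build_terminal_env(renderer: RendererLike, task_dir: str) -> None:
    """Sketch: construct the env with whatever renderer the recipe was configured with."""
    # The real recipe would build a HarborTerminalTinkerEnv here; omitted.
    ...
```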
How to run (uv)

```bash
export TINKER_API_KEY=...
python tinker_cookbook/recipes/terminal_rl/train.py \
  --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
  --env docker \
  --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --renderer-name qwen3 \
  --max-tokens 4096 \
  --temperature 0.7 \
  --groups-per-batch 1 \
  --group-size 8 \
  --num-batches 25 \
  --learning-rate 4e-5 \
  --lora-rank 32 \
  --save-every 10 \
  --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25
```

Results: Experimental Validation
To validate the environment integration, we conducted a hyperparameter search on a representative DevOps task.
The training used the `importance_sampling` loss with a group size of 8. We compared three learning rates: `1e-5`, `2e-5`, and `4e-5`.

Note: The primary goal of this experiment was to validate the effectiveness of the environment integration. Due to the high latency of sandboxed execution (approx. 4 hours per run), we restricted training to a single task over 25 optimization steps.
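For reproducibility, a small sketch of how the three runs could be launched by looping over learning rates with the recipe's CLI (flags mirror the "How to run" command above; the subprocess wrapper itself is just an illustration):

```python
# Launch the three learning-rate runs sequentially via the recipe's train.py CLI.
# Only --learning-rate and --log-path vary across runs.
import subprocess

for lr in ["1e-5", "2e-5", "4e-5"]:
    subprocess.run(
        [
            "python", "tinker_cookbook/recipes/terminal_rl/train.py",
            "--task-dir", "tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task",
            "--env", "docker",
            "--model-name", "Qwen/Qwen3-30B-A3B-Instruct-2507",
            "--renderer-name", "qwen3",
            "--max-tokens", "4096",
            "--temperature", "0.7",
            "--groups-per-batch", "1",
            "--group-size", "8",
            "--num-batches", "25",
            "--learning-rate", lr,
            "--lora-rank", "32",
            "--save-every", "10",
            "--log-path", f"./runs/devops_qwen3_lr{lr}_r32_nb25",
        ],
        check=True,
    )
```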
Performance Comparison
Key Findings
- The three runs (A, B, C) used learning rates `1e-5`, `2e-5`, and `4e-5`, respectively.
- `HarborTerminalTinkerEnv` successfully handled the rollout, execution, and verification cycle. The correlation between the increase in Reward and the decrease in Turns in Run C confirms that the environment provides valid, learnable signals for RL.
- `4e-5` outperformed the lower rates, suggesting that larger updates are needed to explore effective command sequences within the limited horizon.

Follow-ups (optional)
- Improve the robustness of `parse_terminus_json_plain()` without changing interfaces.
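To support that follow-up, a hedged sketch of regression tests pinning the two behaviors already described (fenced `json` code blocks are accepted; malformed output is rejected so the env can end the episode). The exact failure mode (exception vs. sentinel) and the returned structure are assumptions:

```python
# Sketch of tests for parse_terminus_json_plain(); assumes a parse failure raises an
# exception (which the env translates into ending the episode). Field names are assumed.
import pytest

from tinker_cookbook.recipes.terminal_rl.wrapped_env import parse_terminus_json_plain


def test_accepts_fenced_json_block():
    reply = '```json\n{"commands": [{"keystrokes": "ls\\n"}]}\n```'
    commands = parse_terminus_json_plain(reply)
    assert commands  # at least one executable command recovered


def test_malformed_output_is_rejected():
    with pytest.raises(Exception):
        parse_terminus_json_plain("sure, let me run ls for you")
```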