feat: Add Harbor terminal RL recipe + tinker-cookbook Env adapter #321

Open

tyfeng1997 wants to merge 5 commits into thinking-machines-lab:main from tyfeng1997:add_harbor_task_env

Conversation

@tyfeng1997

Summary

This PR adds an end-to-end terminal-interaction RL recipe that wires Harbor’s sandboxed terminal environments (Docker/Daytona + tmux) into tinker-cookbook’s RL training loop via a small adapter.

At a high level:

  • Observation: a tinker.ModelInput built by a tinker-cookbook renderer over a chat history.
  • Action: model completion tokens that decode into assistant text.
  • Action format: assistant text is expected to be JSON in a simple terminus-json-plain schema describing terminal keystrokes; an illustrative parsing sketch follows this list.
  • Execution: keystrokes are executed in the sandbox via a Harbor Gym-style async environment.
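
For illustration, here is a minimal sketch of how such an action could be parsed. The fenced-code-block handling mirrors the PR description, but the field names (commands, keystrokes, timeout_sec, task_complete) are assumptions for this example, not the recipe's actual terminus-json-plain schema.

```python
# Illustrative sketch only: the field names below are assumptions for this example,
# not the recipe's actual terminus-json-plain schema.
import json
import re

FENCE = "`" * 3  # avoid writing a literal code fence inside this snippet

ASSISTANT_TEXT = (
    f"{FENCE}json\n"
    '{"commands": [{"keystrokes": "ls -la\\n", "timeout_sec": 5}], "task_complete": false}\n'
    f"{FENCE}"
)

def extract_action(text: str) -> dict:
    """Strip an optional fenced code block, then parse the JSON payload."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)  # a JSONDecodeError here would end the episode

action = extract_action(ASSISTANT_TEXT)
for command in action["commands"]:
    print("keystrokes to send:", repr(command["keystrokes"]))
```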

Motivation

We want a minimal, recipe-scoped example that makes “terminal tasks” trainable with existing RL algorithms (e.g., GRPO) without modifying core library code.

Changes

  • Adds a Harbor → tinker-cookbook environment adapter:
    • tinker_cookbook/recipes/terminal_rl/wrapped_env.py
      • HarborTerminalTinkerEnv: adapts a Harbor async Gym-like terminal env to the tinker-cookbook RL Env interface (a schematic step-cycle sketch follows this list).
      • parse_terminus_json_plain(): parses the assistant JSON (supports fenced code blocks) into executable terminal commands; parse errors end the episode.
      • HarborSingleTaskEnvGroupBuilder + HarborSingleTaskRLDataset(Builder): repeatedly yields env-groups for one fixed task, matching GRPO “same task, multiple sandboxes per group” semantics.
  • Provides the Harbor Gym-style terminal environment wrapper:
    • tinker_cookbook/recipes/terminal_rl/terminal_env.py
  • Provides a runnable training entrypoint:
    • tinker_cookbook/recipes/terminal_rl/train.py
  • Documents setup + usage:
    • tinker_cookbook/recipes/terminal_rl/README.md
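
To make HarborTerminalTinkerEnv's responsibilities concrete, here is a schematic sketch of the step cycle described above: parse the assistant JSON, execute keystrokes in the sandbox, return an observation and reward, and close the sandbox on termination. It is not the PR's code: the Harbor env methods (reset/step/close), the StepOutcome container, and the exact error handling are assumptions for illustration.

```python
# Schematic only: the Harbor env methods (reset/step/close) and the StepOutcome
# container are illustrative assumptions, not the PR's actual classes or signatures.
import asyncio
import json
from dataclasses import dataclass


@dataclass
class StepOutcome:
    observation_text: str  # next terminal screen, to be rendered into a ModelInput
    reward: float
    episode_done: bool


class StubHarborEnv:
    """Stand-in for a Harbor Gym-style async terminal environment."""

    async def reset(self) -> str:
        return "user@sandbox:~$ "

    async def step(self, keystrokes: str) -> tuple[str, float, bool]:
        return f"executed {keystrokes!r}", 0.0, False

    async def close(self) -> None:
        print("sandbox released")


class TerminalAdapterSketch:
    """Adapts the sandbox env to an RL-style step: model text in, observation + reward out."""

    def __init__(self, harbor_env: StubHarborEnv) -> None:
        self.harbor_env = harbor_env

    async def step(self, assistant_text: str) -> StepOutcome:
        try:
            action = json.loads(assistant_text)  # the real parser also strips fenced blocks
            keystrokes = action["commands"][0]["keystrokes"]
        except (ValueError, KeyError, IndexError):
            await self.harbor_env.close()  # parse failures end the episode and free the sandbox
            return StepOutcome("", 0.0, True)
        screen, reward, done = await self.harbor_env.step(keystrokes)
        if done:
            await self.harbor_env.close()  # also release the sandbox on normal completion
        return StepOutcome(screen, reward, done)


async def demo() -> None:
    adapter = TerminalAdapterSketch(StubHarborEnv())
    print(await adapter.step('{"commands": [{"keystrokes": "ls\\n"}]}'))


asyncio.run(demo())
```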

Design notes

  • The code is contained under recipes/terminal_rl to keep this as an optional recipe/example and avoid changes to core RL infrastructure.
  • The adapter relies on a small RendererProtocol instead of concrete renderer types to minimize coupling; a toy protocol sketch follows this list.
  • Resource safety: the sandbox is closed on episode completion and on terminal failures (e.g., invalid JSON / too-long context).
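
As a toy illustration of that decoupling, such a protocol can be expressed with typing.Protocol. The method name below is invented for the example; it is not the recipe's actual RendererProtocol.

```python
# Toy illustration: the method name is invented for this example, not the actual protocol.
from typing import Protocol


class RendererLike(Protocol):
    """Anything that can turn a chat history into prompt tokens."""

    def build_prompt_tokens(self, messages: list[dict]) -> list[int]: ...


class ToyRenderer:
    """Satisfies the protocol structurally; no inheritance is required."""

    def build_prompt_tokens(self, messages: list[dict]) -> list[int]:
        text = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        return [ord(ch) % 256 for ch in text]  # stand-in for real tokenization


def render(renderer: RendererLike, messages: list[dict]) -> list[int]:
    # The adapter depends only on the protocol, so any conforming renderer plugs in.
    return renderer.build_prompt_tokens(messages)


print(len(render(ToyRenderer(), [{"role": "user", "content": "ls -la"}])))
```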

How to run (uv)

export TINKER_API_KEY=...
python tinker_cookbook/recipes/terminal_rl/train.py  \
  --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
  --env docker \
  --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --renderer-name qwen3 \
  --max-tokens 4096 \
  --temperature 0.7 \
  --groups-per-batch 1 \
  --group-size 8 \
  --num-batches 25 \
  --learning-rate 4e-5 \
  --lora-rank 32 \
  --save-every 10 \
  --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25

Results: Experimental Validation

To validate the environment integration, we conducted a hyperparameter search on a representative DevOps task.

The training used the importance_sampling loss with a group size of 8. We compared three learning rates: 1e-5, 2e-5, and 4e-5.

Note: The primary goal of this experiment was to validate the effectiveness of the environment integration. Due to the high latency constraints of sandboxed execution (approx. 4 hours per run), we restricted the training to a single task over 25 optimization steps.

Performance Comparison

[figure: training curves comparing the three learning rates]

Key Findings

| Run | Learning Rate | Steps | Final Reward (Total) | Final Turns/Episode | Observation |
| --- | --- | --- | --- | --- | --- |
| A | 1e-5 | 25 | ~0.40 | ~7.0 | Failed to improve; policy stagnation. |
| B | 2e-5 | 25 | ~0.45 | ~7.0 | Minimal improvement; flat reward curve. |
| C | 4e-5 | 25 | ~0.78 | ~5.0 | Strong learning signal: reward increased significantly and turns per episode dropped. |
  1. Environment Validity: HarborTerminalTinkerEnv successfully handled the rollout, execution, and verification cycle. The correlation between the reward increase and the drop in turns per episode in Run C confirms that the environment provides valid, learnable signals for RL.
  2. Higher LR Requirement: For this sparse-reward terminal task, 4e-5 outperformed lower rates, suggesting that larger updates are needed to explore effective command sequences within the limited horizon.
  3. Efficiency: The best-performing run (Run C) not only maximized reward but also reduced the average turns per episode (from ~7 to ~5), indicating the model learned to solve the task more efficiently.

Follow-ups (optional)

  • If we want stricter schema validation, we can tighten parse_terminus_json_plain() without changing interfaces.
  • Future work could involve scaling to multi-task training sets now that the single-task pipeline is validated.

@tyler-griggs
Contributor

Thanks for sending this in, I'll take a look shortly!

@tyfeng1997
Author

The code is still a bit rough around the edges, and I would appreciate your suggestions. I have some free time at the moment and can gradually improve this pull request, including adding more experiments or other fixes.

@self-supervisor

self-supervisor commented Feb 10, 2026

@tyfeng1997 thanks so much for this PR.

One thing to note: I had to increase --groups-per-batch to 4 to get reliable improvement.

Running the following command with 5 repeats:

python tinker_cookbook/recipes/terminal_rl/train.py \
    --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
    --env daytona \
    --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --renderer-name qwen3 \
    --max-tokens 4096 \
    --temperature 0.7 \
    --groups-per-batch 4 \
    --group-size 8 \
    --num-batches 25 \
    --learning-rate 4e-5 \
    --lora-rank 32 \
    --save-every 10 \
    --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25_gpb4

produced the following result:

[figure: training_plot_gpb4, training curves with groups-per-batch = 4]

Might be useful for others trying to train on Harbor tasks! Thanks again for this example; it's useful for us at @VmaxAI.

@tyfeng1997
Author

Thanks a lot for testing this out and sharing your results! I'm really glad to hear this recipe is useful for your work. The insight about increasing groups-per-batch to 4 is very valuable—thanks for the tip!
