feat: Add Harbor terminal RL recipe + tinker-cookbook Env adapter #321

Open

tyfeng1997 wants to merge 5 commits into thinking-machines-lab:main from tyfeng1997:add_harbor_task_env

Conversation

@tyfeng1997

Summary

This PR adds an end-to-end terminal-interaction RL recipe that wires Harbor’s sandboxed terminal environments (Docker/Daytona + tmux) into tinker-cookbook’s RL training loop via a small adapter.

At a high level:

  • Observation: a tinker.ModelInput built by a tinker-cookbook renderer over a chat history.
  • Action: model completion tokens that decode into assistant text.
  • Action format: assistant text is expected to be JSON in a simple terminus-json-plain schema describing terminal keystrokes; an illustrative parsing sketch follows this list.
  • Execution: keystrokes are executed in the sandbox via a Harbor Gym-style async environment.
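
For illustration, here is a minimal sketch of how such an action could be parsed. The fenced-code-block handling mirrors the PR description, but the field names (commands, keystrokes, timeout_sec, task_complete) are assumptions for this example, not the recipe's actual terminus-json-plain schema.

```python
# Illustrative sketch only: the field names below are assumptions for this example,
# not the recipe's actual terminus-json-plain schema.
import json
import re

FENCE = "`" * 3  # avoid writing a literal code fence inside this snippet

ASSISTANT_TEXT = (
    f"{FENCE}json\n"
    '{"commands": [{"keystrokes": "ls -la\\n", "timeout_sec": 5}], "task_complete": false}\n'
    f"{FENCE}"
)

def extract_action(text: str) -> dict:
    """Strip an optional fenced code block, then parse the JSON payload."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)  # a JSONDecodeError here would end the episode

action = extract_action(ASSISTANT_TEXT)
for command in action["commands"]:
    print("keystrokes to send:", repr(command["keystrokes"]))
```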

Motivation

We want a minimal, recipe-scoped example that makes “terminal tasks” trainable with existing RL algorithms (e.g., GRPO) without modifying core library code.

Changes

  • Adds a Harbor → tinker-cookbook environment adapter:
    • tinker_cookbook/recipes/terminal_rl/wrapped_env.py
      • HarborTerminalTinkerEnv: adapts a Harbor async Gym-like terminal env to the tinker-cookbook RL Env interface (a schematic step-cycle sketch follows this list).
      • parse_terminus_json_plain(): parses the assistant JSON (supports fenced code blocks) into executable terminal commands; parse errors end the episode.
      • HarborSingleTaskEnvGroupBuilder + HarborSingleTaskRLDataset(Builder): repeatedly yields env-groups for one fixed task, matching GRPO “same task, multiple sandboxes per group” semantics.
  • Provides the Harbor Gym-style terminal environment wrapper:
    • tinker_cookbook/recipes/terminal_rl/terminal_env.py
  • Provides a runnable training entrypoint:
    • tinker_cookbook/recipes/terminal_rl/train.py
  • Documents setup + usage:
    • tinker_cookbook/recipes/terminal_rl/README.md
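
To make HarborTerminalTinkerEnv's responsibilities concrete, here is a schematic sketch of the step cycle described above: parse the assistant JSON, execute keystrokes in the sandbox, return an observation and reward, and close the sandbox on termination. It is not the PR's code: the Harbor env methods (reset/step/close), the StepOutcome container, and the exact error handling are assumptions for illustration.

```python
# Schematic only: the Harbor env methods (reset/step/close) and the StepOutcome
# container are illustrative assumptions, not the PR's actual classes or signatures.
import asyncio
import json
from dataclasses import dataclass


@dataclass
class StepOutcome:
    observation_text: str  # next terminal screen, to be rendered into a ModelInput
    reward: float
    episode_done: bool


class StubHarborEnv:
    """Stand-in for a Harbor Gym-style async terminal environment."""

    async def reset(self) -> str:
        return "user@sandbox:~$ "

    async def step(self, keystrokes: str) -> tuple[str, float, bool]:
        return f"executed {keystrokes!r}", 0.0, False

    async def close(self) -> None:
        print("sandbox released")


class TerminalAdapterSketch:
    """Adapts the sandbox env to an RL-style step: model text in, observation + reward out."""

    def __init__(self, harbor_env: StubHarborEnv) -> None:
        self.harbor_env = harbor_env

    async def step(self, assistant_text: str) -> StepOutcome:
        try:
            action = json.loads(assistant_text)  # the real parser also strips fenced blocks
            keystrokes = action["commands"][0]["keystrokes"]
        except (ValueError, KeyError, IndexError):
            await self.harbor_env.close()  # parse failures end the episode and free the sandbox
            return StepOutcome("", 0.0, True)
        screen, reward, done = await self.harbor_env.step(keystrokes)
        if done:
            await self.harbor_env.close()  # also release the sandbox on normal completion
        return StepOutcome(screen, reward, done)


async def demo() -> None:
    adapter = TerminalAdapterSketch(StubHarborEnv())
    print(await adapter.step('{"commands": [{"keystrokes": "ls\\n"}]}'))


asyncio.run(demo())
```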

Design notes

  • The code is contained under recipes/terminal_rl to keep this as an optional recipe/example and avoid changes to core RL infrastructure.
  • The adapter relies on a small RendererProtocol instead of concrete renderer types to minimize coupling; a toy protocol sketch follows this list.
  • Resource safety: the sandbox is closed on episode completion and on terminal failures (e.g., invalid JSON / too-long context).
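
As a toy illustration of that decoupling, such a protocol can be expressed with typing.Protocol. The method name below is invented for the example; it is not the recipe's actual RendererProtocol.

```python
# Toy illustration: the method name is invented for this example, not the actual protocol.
from typing import Protocol


class RendererLike(Protocol):
    """Anything that can turn a chat history into prompt tokens."""

    def build_prompt_tokens(self, messages: list[dict]) -> list[int]: ...


class ToyRenderer:
    """Satisfies the protocol structurally; no inheritance is required."""

    def build_prompt_tokens(self, messages: list[dict]) -> list[int]:
        text = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        return [ord(ch) % 256 for ch in text]  # stand-in for real tokenization


def render(renderer: RendererLike, messages: list[dict]) -> list[int]:
    # The adapter depends only on the protocol, so any conforming renderer plugs in.
    return renderer.build_prompt_tokens(messages)


print(len(render(ToyRenderer(), [{"role": "user", "content": "ls -la"}])))
```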

How to run (uv)

export TINKER_API_KEY=...
python tinker_cookbook/recipes/terminal_rl/train.py  \
  --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
  --env docker \
  --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --renderer-name qwen3 \
  --max-tokens 4096 \
  --temperature 0.7 \
  --groups-per-batch 1 \
  --group-size 8 \
  --num-batches 25 \
  --learning-rate 4e-5 \
  --lora-rank 32 \
  --save-every 10 \
  --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25

Results: Experimental Validation

To validate the environment integration, we conducted a hyperparameter search on a representative DevOps task.

The training used the importance_sampling loss with a group size of 8. We compared three learning rates: 1e-5, 2e-5, and 4e-5.

Note: The primary goal of this experiment was to validate the effectiveness of the environment integration. Due to the high latency constraints of sandboxed execution (approx. 4 hours per run), we restricted the training to a single task over 25 optimization steps.

Performance Comparison

[figure: training curves comparing the three learning rates]

Key Findings

| Run | Learning Rate | Steps | Final Reward (Total) | Final Turns/Episode | Observation |
| --- | --- | --- | --- | --- | --- |
| A | 1e-5 | 25 | ~0.40 | ~7.0 | Failed to improve; policy stagnation. |
| B | 2e-5 | 25 | ~0.45 | ~7.0 | Minimal improvement; flat reward curve. |
| C | 4e-5 | 25 | ~0.78 | ~5.0 | Strong learning signal: reward increased significantly and turns per episode dropped. |
  1. Environment Validity: HarborTerminalTinkerEnv successfully handled the rollout, execution, and verification cycle. The correlation between the reward increase and the drop in turns per episode in Run C confirms that the environment provides valid, learnable signals for RL.
  2. Higher LR Requirement: For this sparse-reward terminal task, 4e-5 outperformed lower rates, suggesting that larger updates are needed to explore effective command sequences within the limited horizon.
  3. Efficiency: The best-performing run (Run C) not only maximized reward but also reduced the average turns per episode (from ~7 to ~5), indicating the model learned to solve the task more efficiently.

Follow-ups (optional)

  • If we want stricter schema validation, we can tighten parse_terminus_json_plain() without changing interfaces.
  • Future work could involve scaling to multi-task training sets now that the single-task pipeline is validated.

@tyler-griggs
Contributor

Thanks for sending this in, I'll take a look shortly!

@tyfeng1997
Author

The code is still a bit rough around the edges, and I would appreciate your suggestions. I have some free time at the moment and can gradually improve this pull request, including adding more experiments or other fixes.

@self-supervisor

self-supervisor commented Feb 10, 2026

@tyfeng1997 thanks so much for this PR.

One thing to note: I had to increase --groups-per-batch to 4 to get reliable improvement.

Running the following command with 5 repeats:

python tinker_cookbook/recipes/terminal_rl/train.py \
    --task-dir tinker_cookbook/recipes/terminal_rl/harbor_envs/devops_task \
    --env daytona \
    --model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --renderer-name qwen3 \
    --max-tokens 4096 \
    --temperature 0.7 \
    --groups-per-batch 4 \
    --group-size 8 \
    --num-batches 25 \
    --learning-rate 4e-5 \
    --lora-rank 32 \
    --save-every 10 \
    --log-path ./runs/devops_qwen3_lr4e-5_r32_nb25_gpb4

produced the following result:

[figure: training_plot_gpb4, training curves with groups-per-batch = 4]

Might be useful for others trying to train on Harbor tasks! Thanks again for this example; it's useful for us at @VmaxAI.

@tyfeng1997
Author

Thanks a lot for testing this out and sharing your results! I'm really glad to hear this recipe is useful for your work. The insight about increasing groups-per-batch to 4 is very valuable—thanks for the tip!
