Conversation


@EtashGuha EtashGuha commented Dec 19, 2025

Please ignore, WIP for terminal-bench Async RL support

CharlieFRuan and others added 20 commits November 21, 2025 18:48
We print out the results when a `trial.run()` fails
…await in trainer (NovaSky-AI#2)

Before this PR, when running TBench RL jobs, we see a lot of errors like
the following in the log

```bash
future: <Task finished name='Task-1046' coro=<LoggingWorker._worker_loop() done, defined at /home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py:61> exception=RuntimeError('<Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop')>
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py", line 69, in _worker_loop
    task = await self._queue.get()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/queues.py", line 155, in get
    getter = self._get_loop().create_future()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/mixins.py", line 20, in _get_loop
    raise RuntimeError(f'{self!r} is bound to a different event loop')
RuntimeError: <Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop
```

This is because `RayPPOTrainer.train()` was calling `asyncio.run()`
repeatedly inside its training loop. LiteLLM (used by sandboxes)
initializes a global `LoggingWorker` with an `asyncio.Queue` bound to
the event loop where it was first created. When `asyncio.run()` created
a new event loop in subsequent iterations, LiteLLM tried to access that
queue from the new loop, causing the error.

This PR fixes this by unifying the event loop management:
- Refactored `RayPPOTrainer.train()` to be an async method, replacing
`asyncio.run()` calls with `await`.
- Updated `ConditionalWeightsManager` in `weights_manager.py` to support
async context manager usage (`__aenter__`/`__aexit__`).
- Modified `BasePPOExp.run()` in `main_base.py` to call
`asyncio.run(trainer.train())` once at the top level. This ensures the
entire training process runs within a single persistent event loop.

Tested and confirmed that the training loop now progresses successfully
without those errors.
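
A rough sketch of the before/after control flow (names are simplified and not the actual SkyRL code):

```python
import asyncio

# Before: a fresh event loop per call, which invalidates objects (like
# LiteLLM's logging queue) that were bound to the first loop.
def train_before(trainer):
    for batch in trainer.dataloader:
        asyncio.run(trainer.generate_and_train_step(batch))  # new loop each time

# After: train() is async and awaits each step, and BasePPOExp.run() calls
# asyncio.run(trainer.train()) once, so one persistent loop covers the run.
async def train_after(trainer):
    for batch in trainer.dataloader:
        await trainer.generate_and_train_step(batch)
```
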
Before this PR, `session_id` is always None because Terminus 2 by
default does not pass it in. So we do `engine_idx = random.randint(0,
len(self.engines) - 1)` which is really bad for prefix cache hit rate.

We can actually pass in a session ID to `AgentConfig` and it will be
passed to all requests.

Verified that the following will print out logs like `CHARLIE:
session_id: 954320202c254bd8bbca083d34457b94` (multiple times too,
meaning the session_id is consistent across a trial, i.e. trajectory):

```python
    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        session_id = request_payload["json"].pop("session_id", None)
        print(f"CHARLIE: session_id: {session_id}")
        ...
```
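
For illustration, a minimal sketch of how a stable session_id could be used to pin a trajectory to one engine instead of picking randomly (illustrative names, not the actual router code):

```python
import random

def pick_engine_idx(num_engines: int, session_id: str | None) -> int:
    """Route every request of a trajectory to the same engine so its prefix
    cache is reused, instead of a random engine per request."""
    if session_id is None:
        # Old behavior: random engine per request -> poor prefix cache hit rate.
        return random.randint(0, num_engines - 1)
    # Stable within this process: all requests carrying the same session_id
    # land on the same engine for the whole trial/trajectory.
    return hash(session_id) % num_engines
```
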
Extended On Policy Distillation for the `TerminalBenchExp` class. 

Uses the `OnPolicyDistillationTrainer` implemented in the main [SkyRL
branch](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/on_policy_distillation)
…ky-AI#5)

Before this PR, we disabled systems logging altogether because it was
not well-supported on our TACC environment.

This PR removes those constraints, namely `_disable_stats` and
`_disable_meta`.

However, by default, only a single node's system metrics will be logged,
since WANDB only runs on a single node (the head node).

This PR spins up a Ray actor `WandbNodeLogger` on each node, dedicated
to logging systems metrics like GPU utilization.

We follow
https://docs.wandb.ai/models/track/log/distributed-training#track-all-processes-to-a-single-run

We use `mode="shared"` to aggregate system metrics from all nodes into
the same Wandb run. However, with this approach, the system metrics
panels do not appear in the Wandb UI automatically; we have to create
the panels manually. The alternative is to create a run for each node
and group them by group_name. We prefer to keep all nodes' metrics in
the same run.
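
A hypothetical sketch of such a per-node actor (not the actual `WandbNodeLogger`; it assumes the shared-mode settings, `mode`, `x_primary`, `x_label`, described in the linked Wandb doc):

```python
import ray
import wandb

@ray.remote(num_cpus=0)
class NodeSystemMetricsLogger:
    """One actor per node; attaches to the head node's run in shared mode so
    the node's system metrics (GPU util, memory, ...) land in the same run."""

    def __init__(self, run_id: str, project: str, node_label: str):
        self.run = wandb.init(
            id=run_id,
            project=project,
            # mode="shared" lets multiple processes/nodes write to one run;
            # x_primary=False marks this writer as secondary to the head node.
            settings=wandb.Settings(mode="shared", x_primary=False, x_label=node_label),
        )

    def finish(self):
        self.run.finish()
```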

Tested on TACC with 4 nodes. Example panel:
<img width="657" height="344" alt="image"
src="https://github.com/user-attachments/assets/44a2f083-ad1f-4d27-af58-4a6da1c8b4b1"
/>

---------

Co-authored-by: Copilot <[email protected]>
…ovaSky-AI#7)

Fixes `epoch` counter after resuming from checkpoint

---------

Cherry picked from NovaSky-AI#589

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…es (NovaSky-AI#404) (NovaSky-AI#9)

Fixes chat templating in the Mini-swe-agent and the terminal bench
examples.

Previously, we were naively calling `.apply_chat_template` to encode
response messages turn by turn - but this can append system messages for
each turn depending on the model. (h/t to @CharlieFRuan)

For example, with Qwen 3 8B the templating works fine, but for Qwen
2.5 1.5B Instruct, the code adds a system prompt message while
tokenizing every turn.

We use the fixed-base approach, similar to what we do in the
SkyRLGymGenerator.
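
As a reference for the idea, here is a sketch of fixed-base encoding (illustrative helper name, not the codebase's implementation):

```python
def encode_turn_with_fixed_base(tokenizer, history, new_message):
    """Encode only the tokens contributed by new_message by diffing against a
    fixed base (the chat so far), so the chat template cannot re-inject a
    system prompt for every turn."""
    base_ids = tokenizer.apply_chat_template(history, tokenize=True)
    full_ids = tokenizer.apply_chat_template(history + [new_message], tokenize=True)
    return full_ids[len(base_ids):]
```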

---------

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…I#710)

Before this PR, in `TerminalBenchGenerator` and `MiniSweAgentGenerator`,
we naively generate the loss mask for a chat history as all zeros for
user messages and all ones for assistant messages.

However, this is incorrect. For instance, the assistant's generation
prompt should be masked with zero, and the `\n` token that comes after
the EOS for Qwen models should be masked with zero.

This PR fixes this by implementing
`get_response_ids_and_loss_mask_from_messages()`, which uses the
fixed-base helper `encode_messages_subset()` to convert a chat history
(excluding the initial prompt, which may differ by model) into response
IDs, a loss mask, and optionally rollout logprobs.
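
A rough illustration of the intended masking (a sketch under assumed message structure, not the actual implementation):

```python
def response_ids_and_loss_mask(tokenizer, prompt_messages, response_messages):
    """Sketch: build response ids plus a loss mask in which only tokens the
    assistant actually generated get 1; template-added tokens (the generation
    prompt, user/tool turns) get 0."""
    response_ids, loss_mask = [], []
    history = list(prompt_messages)
    for msg in response_messages:
        before = tokenizer.apply_chat_template(history, tokenize=True)
        after = tokenizer.apply_chat_template(history + [msg], tokenize=True)
        turn_ids = after[len(before):]
        response_ids += turn_ids
        if msg["role"] == "assistant":
            # The generation prompt (e.g. "<|im_start|>assistant\n") comes from
            # the template, not the model, so it is masked out. A fuller version
            # would also mask template tokens after EOS (e.g. Qwen's trailing "\n").
            with_gen = tokenizer.apply_chat_template(
                history, tokenize=True, add_generation_prompt=True
            )
            gen_len = len(with_gen) - len(before)
            loss_mask += [0] * gen_len + [1] * (len(turn_ids) - gen_len)
        else:
            loss_mask += [0] * len(turn_ids)
        history.append(msg)
    return response_ids, loss_mask
```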

In addition, TerminalBenchGenerator had a bug:

```python
prompt_ids = self.tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
...
```

which is incorrect because the response messages that follow already add
the generation prompt.

We also add extensive unit tests. The unit tests were implemented by Opus
4.5 and reviewed and iterated on by me.

Before this PR, we used an old sandboxes branch to train, namely
laude-institute/harbor@f534a5e.

However, there are important fixes on Harbor (the new "sandboxes"), and
it is easier to benefit from new features of Harbor (e.g. enabling
step-wise training, getting rollout logprobs, etc.).

This PR bumps our TerminalBenchGenerator to use Harbor. The main changes
include:
1. Pass in `override_memory_mb`, `override_storage_mb`, and
`override_cpus`
2. Collect more metrics to log on WANDB:
- `generate/trajectories_summarized`: number of trajectories that are
summarized
- `generate/trajectories_truncated`: number of trajectories that are
truncated due to our training input length limit
3. API change
- To access reward: `results.verifier_result.reward` ->
`results.verifier_result.rewards["reward"]`

In addition, we log more detailed input/output in `trainer.py` for
debugging.

Specifically, we rely on getting the trajectory via
`results.agent_result.metadata['all_messages']`, which is deprecated on
Harbor and relies on this PR:
laude-institute/harbor#166

That Harbor PR also enables change 1 and the logging of
`trajectories_summarized`.


### Testing
We verify by running `run_nl2bash.sh` for 5 steps.

<img width="700" height="267" alt="image"
src="https://github.com/user-attachments/assets/b9930d28-6b0a-4f72-ae10-75dcaa6a7d76"
/>

<img width="2148" height="300" alt="image"
src="https://github.com/user-attachments/assets/35c4629a-3e99-44ca-864e-e96b3f6e2b14"
/>

There are three curves:
- Blue: with Harbor (this PR) at HEAD of this SkyRL branch `dcagent`
  - The only difference with brown is these two commits (already merged to MAIN):
    - mlfoundations@1fe6c5a
    - mlfoundations@f22fbc4
  - Before these two commits, we **ignore all thinking tokens** in the trajectory
  - After these two commits, we **take all thinking tokens** in the trajectory. We also make sure the loss mask is constructed more rigorously
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_with_thinking_harbor.log`
- Brown: with Harbor (this PR), but without the two commits above
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_no_thinking_harbor.log`
- Light blue: with sandboxes (the branch/commit linked above), without the two commits above, i.e. at this commit of SkyRL: mlfoundations@e44bc9d

For Harbor, we tested at the commit of this PR,
laude-institute/harbor#166, i.e. commit
laude-institute/harbor@783b967,
and manually changed Terminus 2's `fallback_context_limit` to `32768`.

We make several observations about these three curves:
- The reward curves are roughly the same (at least for the first 5 steps)
- Blue has a much larger `avg_num_tokens` and `min_num_tokens` because
we now collect all thinking tokens
- Light blue for some reason always has 32K max num tokens (even
**without any thinking tokens**) -- likely a bug in the agent harness!
Blue, with thinking tokens, never reaches 32K

This fixes the issue described in
NovaSky-AI#428.

We assume all requests that come in after sleep() are "unexpected" and
abort them while printing a warning to the user.
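
A minimal sketch of the guard (method and attribute names are illustrative, not the actual SkyRL engine code):

```python
import logging

logger = logging.getLogger(__name__)

class InferenceEngineSketch:
    def __init__(self):
        self._sleeping = False

    async def sleep(self):
        # Offload weights / KV cache; from now on, incoming requests are unexpected.
        self._sleeping = True

    async def wake_up(self):
        self._sleeping = False

    async def generate(self, request):
        if self._sleeping:
            # Requests arriving after sleep() are treated as stragglers:
            # warn the user and abort instead of serving them on a slept engine.
            logger.warning("Received request after sleep(); aborting it.")
            return None
        ...
```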

I tested this fix with TerminalBenchGenerator (with async_engine=True)
and it was able to successfully abort unwanted requests.

<img width="905" height="147" alt="Screenshot 2025-10-10 at 10 29 35 PM"
src="https://github.com/user-attachments/assets/b520b72f-a416-4312-bf47-3a59dacf1e1e"
/>

Running examples/gsm8k/run_gsm8k.sh with async_engine=False also works.

---------

Co-authored-by: Eric Tang <[email protected]>
…ing params (NovaSky-AI#10)

This PR does two things. Both are hacky and should not be upstreamed to
main SkyRL.

### Hardcoded sampling params

We hardcode sampling params to be RL-friendly. The appropriate way is to
allow customizing sampling params in Harbor's instantiation, reading
from `cfg.generator.sampling_params`, though we need to make sure those
params are compatible with LiteLLM.
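
A hypothetical sketch of that "appropriate way": read `cfg.generator.sampling_params` and forward only keys an OpenAI-compatible (LiteLLM) chat endpoint understands (names here are illustrative, not existing code):

```python
# Engine-only params like top_k or min_p may be rejected by a plain
# OpenAI-style endpoint, so they are dropped rather than hardcoded.
LITELLM_COMPATIBLE_KEYS = {"temperature", "top_p", "max_tokens", "stop", "seed"}

def sampling_params_for_litellm(sampling_params: dict) -> dict:
    """Filter configured sampling params to those LiteLLM can pass through."""
    return {k: v for k, v in sampling_params.items() if k in LITELLM_COMPATIBLE_KEYS}
```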

### Custom chat template for Qwen3 accumulate thinking tokens

Another hacky thing we do is to allow users to pass
`+generator.engine_init_kwargs.custom_chat_template_chat_completion_path=/path/to/jinja2`

With this, the vLLM engine will apply the custom chat template for
`/chat/completions` requests. We add a unit test where we use a custom
chat template that accumulates Qwen3 thinking tokens.

The chat template
`skyrl-train/tests/gpu/gpu_ci/qwen3_acc_thinking.jinja2` is adapted from
skyrl-agent's
https://github.com/NovaSky-AI/SkyRL/blob/4127fd174f03543ce563dc87907db041a98ef059/skyrl-agent/skyrl_agent/functional/templates/qwen3_acc_thinking.jinja2.
It is identical to the official Qwen3 chat template except for
<img width="2048" height="585" alt="image"
src="https://github.com/user-attachments/assets/bc61add3-0341-4143-acdd-278cdcea1c26"
/>
Allow the user to pass in the `enable_summarize` flag added in Harbor:
laude-institute/harbor#94

Note that to count this not as a failed instance (which would be skipped
for training) but as a reward of 0 that we still train on, we need to
make the following change in Harbor so that we still have the verifier
results:

<img width="1215" height="279" alt="image"
src="https://github.com/user-attachments/assets/b585e477-3b83-4a11-9e0d-23bb21d9d23f"
/>
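
For illustration, the generator-side effect of that change might look like the following (attribute names follow the earlier `results.verifier_result.rewards["reward"]` usage; otherwise this is a sketch, not actual code):

```python
def reward_for_trial(results):
    verifier = getattr(results, "verifier_result", None)
    if verifier is None:
        # Without the Harbor change: no verifier result, so the trial is
        # dropped from training as a failed instance.
        return None
    # With the change: a verifier result is always present, so a reward of
    # 0.0 is kept and the trajectory is still trained on.
    return verifier.rewards.get("reward", 0.0)
```
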
…rep as well (NovaSky-AI#12)

### Previous issue

This is a follow-up PR to mlfoundations#10

In that PR, we only allowed passing in a custom chat template to the vLLM
engine during rollout.

However, we did not pass in the custom chat template when converting the
chat history to generator output.

Even though we encode each message one by one, hence preserving thinking
tokens during that conversion, the default Qwen3 chat template will add
an empty `<think>\n\n</think>\n\n` if none is present, making the data
off-policy.

An example can be found below. The first screenshot shows what we print
out in `trainer.py` (the raw tokens passed in to training), and the
second shows the chat history. We can see that there are extra
`<think>\n\n</think>\n\n` tokens that the model did not generate.

<img width="1852" height="1082" alt="image"
src="https://github.com/user-attachments/assets/15e6311f-7efa-4251-ab4c-f1390ee85371"
/>

<img width="2048" height="198" alt="image"
src="https://github.com/user-attachments/assets/703e00be-3e5a-4252-89e9-f605e08af476"
/>

This is because of the default Qwen3 chat template: if `loop.last`, it
adds `\n<think>\n\n</think>\n\n`:
<img width="2048" height="281" alt="image"
src="https://github.com/user-attachments/assets/1448cd29-bf47-4efd-ba37-1cecf66cad31"
/>

### This PR

This PR fixes the issue in two ways.

First, we pass in `custom_chat_template` to
`get_response_ids_and_loss_mask_from_messages()` and the helpers it
calls, so that the conversion uses the same chat template as vLLM does
during rollout.
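
A minimal sketch of threading the template through (it assumes HF's `apply_chat_template(chat_template=...)` override; the helper name is illustrative):

```python
def encode_with_template(tokenizer, messages, custom_chat_template=None):
    # Passing an explicit template string overrides the model's default
    # template, so offline re-encoding matches what vLLM applied at rollout
    # time (and avoids the injected empty <think> blocks).
    return tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        chat_template=custom_chat_template,
    )
```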

In addition, we add our own chat template for Qwen3 thinking
accumulation, where we discard the `reasoning_content` logic and simply
keep `content` as-is (below is the only diff; left is the official Qwen3
jinja2, right is ours):
<img width="2010" height="573" alt="image"
src="https://github.com/user-attachments/assets/779fef81-04f8-457b-ae76-99ce7b64b506"
/>

### Test

We test this manually with
https://gist.github.com/CharlieFRuan/128e6fb0103064328688d9b2a7d440d7,
where we encode the same chat history we saw in the issue described
above. The output is now what we expect, namely

<img width="2048" height="462" alt="image"
src="https://github.com/user-attachments/assets/49f4ec1a-6cab-45db-b7c7-c77d8ca577e6"
/>
@EtashGuha EtashGuha marked this pull request as draft December 19, 2025 17:31

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several changes, including refactoring for an async trainer, adding new experiment setups, and updating dependencies. However, there are multiple critical issues that need to be addressed. Several files contain unresolved merge conflicts which must be fixed. Additionally, the new terminal_bench_generator.py file has critical bugs, including a syntax error and usage of undefined variables, which will cause runtime failures. There are also some medium to high severity issues like leftover debug code and hardcoded sampling parameters that override user settings. I've provided specific comments and suggestions to fix these issues.

Comment on lines 42 to 51
<<<<<<< HEAD
trials/
=======

# SQLite database files
*.db

# uv lock files
uv.lock
uv.lock
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). These must be resolved before merging.

Comment on lines +18 to +23
<<<<<<< HEAD
uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# Run from the `skyrl-train` directory
bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# Run from the `skyrl-train` directory
bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
uv run examples/algorithms/dapo/prepare_dapo_data.sh

Comment on lines +5 to +9
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
# uv run examples/algorithms/dapo/prepare_dapo_data.sh

Comment on lines +5 to +9
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
# uv run examples/algorithms/dapo/prepare_dapo_data.sh


# Process response messages (everything after the first message)
response_messages = chat_history[1:]
rollout_details = getattr(results.agent_result, "rollout_details", None),

critical

There is a trailing comma at the end of this line. This will cause rollout_details to be a tuple (details, ) instead of the details object itself, leading to an AttributeError on the subsequent lines where its attributes are accessed.

Suggested change
rollout_details = getattr(results.agent_result, "rollout_details", None),
rollout_details = getattr(results.agent_result, "rollout_details", None)


# Truncate to maximum allowed length
response_ids = response_ids[:max_response_tokens]
loss_mask = loss_mask[:max_response_tokens]

critical

The variable loss_mask is used here but it has not been defined in this scope. It seems to be a remnant from some commented-out code. You need to define loss_mask before using it. Based on the surrounding code, it seems you might want to create a list of 1s with the same length as the (truncated) response_ids.

Suggested change
loss_mask = loss_mask[:max_response_tokens]
loss_mask = [1] * len(response_ids)

loss_mask=loss_mask,
prompt_ids=prompt_ids,
trajectory_id=trajectory_id,
rollout_logprobs=rollout_logprobs,

critical

The TerminalBenchAgentOutput dataclass does not have a rollout_logprobs field. Passing it to the constructor will raise a TypeError at runtime. Please either add the field to the dataclass or remove it from this call.

Comment on lines +562 to +567
body.update({
"temperature": 1.0,
"top_p": 1.0,
"top_k": -1,
"min_p": 0.0,
})

high

These lines hardcode sampling parameters, overriding any parameters sent in the request (e.g., from Harbor). This is a significant and potentially unexpected change in behavior. The TODO comment acknowledges this is a hack. This should be addressed by allowing the calling service to pass its own sampling parameters instead of overwriting them here.


async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
session_id = request_payload["json"].pop("session_id", None)
# print(f"CHARLIE session_id: {session_id}")

medium

This appears to be a leftover debug print statement. It should be removed before merging.
