Conversation


@EtashGuha EtashGuha commented Dec 19, 2025

Please ignore, WIP for terminal-bench Async RL support

CharlieFRuan and others added 20 commits November 21, 2025 18:48
We print out the results when a `trial.run()` fails
…await in trainer (NovaSky-AI#2)

Before this PR, when running TBench RL jobs, we see a lot of errors like
the following in the log

```bash
future: <Task finished name='Task-1046' coro=<LoggingWorker._worker_loop() done, defined at /home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py:61> exception=RuntimeError('<Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop')>
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py", line 69, in _worker_loop
    task = await self._queue.get()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/queues.py", line 155, in get
    getter = self._get_loop().create_future()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/mixins.py", line 20, in _get_loop
    raise RuntimeError(f'{self!r} is bound to a different event loop')
RuntimeError: <Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop
```

This is because `RayPPOTrainer.train()` was calling `asyncio.run()`
repeatedly inside its training loop. LiteLLM (used by sandboxes)
initializes a global `LoggingWorker` with an `asyncio.Queue` bound to
the event loop where it was first created. When `asyncio.run()` created
a new event loop in subsequent iterations, LiteLLM tried to access that
queue from the new loop, causing the error.

This PR fixes this by unifying the event loop management:
- Refactored `RayPPOTrainer.train()` to be an async method, replacing
`asyncio.run()` calls with `await`.
- Updated `ConditionalWeightsManager` in `weights_manager.py` to support
async context manager usage (`__aenter__`/`__aexit__`).
- Modified `BasePPOExp.run()` in `main_base.py` to call
`asyncio.run(trainer.train())` once at the top level. This ensures the
entire training process runs within a single persistent event loop.

Tested and confirmed that the training loop now progresses successfully
without those errors.
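
A rough sketch of the before/after control flow (names are simplified and not the actual SkyRL code):

```python
import asyncio

# Before: a fresh event loop per call, which invalidates objects (like
# LiteLLM's logging queue) that were bound to the first loop.
def train_before(trainer):
    for batch in trainer.dataloader:
        asyncio.run(trainer.generate_and_train_step(batch))  # new loop each time

# After: train() is async and awaits each step, and BasePPOExp.run() calls
# asyncio.run(trainer.train()) once, so one persistent loop covers the run.
async def train_after(trainer):
    for batch in trainer.dataloader:
        await trainer.generate_and_train_step(batch)
```
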
Before this PR, `session_id` is always None because Terminus 2 by
default does not pass it in. So we do `engine_idx = random.randint(0,
len(self.engines) - 1)` which is really bad for prefix cache hit rate.

We can actually pass in a session ID to `AgentConfig` and it will be
passed to all requests.

Verified that the following will print out logs like `CHARLIE:
session_id: 954320202c254bd8bbca083d34457b94` (multiple times too,
meaning the session_id is consistent across a trial, i.e. trajectory):

```python
    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        session_id = request_payload["json"].pop("session_id", None)
        print(f"CHARLIE: session_id: {session_id}")
        ...
```
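
For illustration, a minimal sketch of how a stable session_id could be used to pin a trajectory to one engine instead of picking randomly (illustrative names, not the actual router code):

```python
import random

def pick_engine_idx(num_engines: int, session_id: str | None) -> int:
    """Route every request of a trajectory to the same engine so its prefix
    cache is reused, instead of a random engine per request."""
    if session_id is None:
        # Old behavior: random engine per request -> poor prefix cache hit rate.
        return random.randint(0, num_engines - 1)
    # Stable within this process: all requests carrying the same session_id
    # land on the same engine for the whole trial/trajectory.
    return hash(session_id) % num_engines
```
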
Extended On Policy Distillation for the `TerminalBenchExp` class. 

Uses the `OnPolicyDistillationTrainer` implemented in the main [SkyRL
branch](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/on_policy_distillation)
…ky-AI#5)

Before this PR, we disabled systems logging altogether because it was
not well-supported on our TACC environment.

This PR removes those constraints, namely `_disable_stats` and
`_disable_meta`.

However, by default, only a single node's system metrics will be logged,
since WANDB only runs on a single node (the head node).

This PR spins up a Ray actor `WandbNodeLogger` on each node, dedicated
to logging systems metrics like GPU utilization.

We follow
https://docs.wandb.ai/models/track/log/distributed-training#track-all-processes-to-a-single-run

We use `mode="shared"` to aggregate system metrics from all nodes into
the same Wandb run. However, with this approach, the system metrics
panels do not appear in the Wandb UI automatically; we have to create
the panels manually. The alternative is to create a run for each node
and group them by group_name. We prefer to keep all nodes' metrics in
the same run.
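
A hypothetical sketch of such a per-node actor (not the actual `WandbNodeLogger`; it assumes the shared-mode settings, `mode`, `x_primary`, `x_label`, described in the linked Wandb doc):

```python
import ray
import wandb

@ray.remote(num_cpus=0)
class NodeSystemMetricsLogger:
    """One actor per node; attaches to the head node's run in shared mode so
    the node's system metrics (GPU util, memory, ...) land in the same run."""

    def __init__(self, run_id: str, project: str, node_label: str):
        self.run = wandb.init(
            id=run_id,
            project=project,
            # mode="shared" lets multiple processes/nodes write to one run;
            # x_primary=False marks this writer as secondary to the head node.
            settings=wandb.Settings(mode="shared", x_primary=False, x_label=node_label),
        )

    def finish(self):
        self.run.finish()
```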

Tested on TACC with 4 nodes. Example panel:
<img width="657" height="344" alt="image"
src="https://github.com/user-attachments/assets/44a2f083-ad1f-4d27-af58-4a6da1c8b4b1"
/>

---------

Co-authored-by: Copilot <[email protected]>
…ovaSky-AI#7)

Fixes `epoch` counter after resuming from checkpoint

---------

Cherry picked from NovaSky-AI#589

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…es (NovaSky-AI#404) (NovaSky-AI#9)

Fixes chat templating in the Mini-swe-agent and the terminal bench
examples.

Previously, we were naively calling `.apply_chat_template` to encode
response messages turn by turn - but this can append system messages for
each turn depending on the model. (h/t to @CharlieFRuan)

For example, with Qwen 3 8B the templating works fine, but for Qwen
2.5 1.5B Instruct, the code adds a system prompt message while
tokenizing every turn.

We use the fixed-base approach, similar to what we do in the
SkyRLGymGenerator.
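
As a reference for the idea, here is a sketch of fixed-base encoding (illustrative helper name, not the codebase's implementation):

```python
def encode_turn_with_fixed_base(tokenizer, history, new_message):
    """Encode only the tokens contributed by new_message by diffing against a
    fixed base (the chat so far), so the chat template cannot re-inject a
    system prompt for every turn."""
    base_ids = tokenizer.apply_chat_template(history, tokenize=True)
    full_ids = tokenizer.apply_chat_template(history + [new_message], tokenize=True)
    return full_ids[len(base_ids):]
```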

---------

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…I#710)

Before this PR, in `TerminalBenchGenerator` and `MiniSweAgentGenerator`,
we naively generate the loss mask for a chat history as all zeros for
user messages and all ones for assistant messages.

However, this is incorrect. For instance, the assistant's generation
prompt should be masked with zero, and the `\n` token that comes after
the EOS for Qwen models should be masked with zero.

This PR fixes this by implementing
`get_response_ids_and_loss_mask_from_messages()`, which uses the
fixed-base helper `encode_messages_subset()` to convert a chat history
(excluding the initial prompt, which may differ by model) into response
IDs, a loss mask, and optionally rollout logprobs.
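
A rough illustration of the intended masking (a sketch under assumed message structure, not the actual implementation):

```python
def response_ids_and_loss_mask(tokenizer, prompt_messages, response_messages):
    """Sketch: build response ids plus a loss mask in which only tokens the
    assistant actually generated get 1; template-added tokens (the generation
    prompt, user/tool turns) get 0."""
    response_ids, loss_mask = [], []
    history = list(prompt_messages)
    for msg in response_messages:
        before = tokenizer.apply_chat_template(history, tokenize=True)
        after = tokenizer.apply_chat_template(history + [msg], tokenize=True)
        turn_ids = after[len(before):]
        response_ids += turn_ids
        if msg["role"] == "assistant":
            # The generation prompt (e.g. "<|im_start|>assistant\n") comes from
            # the template, not the model, so it is masked out. A fuller version
            # would also mask template tokens after EOS (e.g. Qwen's trailing "\n").
            with_gen = tokenizer.apply_chat_template(
                history, tokenize=True, add_generation_prompt=True
            )
            gen_len = len(with_gen) - len(before)
            loss_mask += [0] * gen_len + [1] * (len(turn_ids) - gen_len)
        else:
            loss_mask += [0] * len(turn_ids)
        history.append(msg)
    return response_ids, loss_mask
```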

In addition, TerminalBenchGenerator had a bug:

```python
prompt_ids = self.tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
...
```

which is incorrect because the response messages that follow already add
the generation prompt.

We also add extensive unit tests. The unit tests were implemented by Opus
4.5 and reviewed and iterated on by me.

Before this PR, we used an old sandboxes branch to train, namely
laude-institute/harbor@f534a5e.

However, there are important fixes on Harbor (the new "sandboxes"), and
it is easier to benefit from new features of Harbor (e.g. enabling
step-wise training, getting rollout logprobs, etc.).

This PR bumps our TerminalBenchGenerator to use Harbor. The main changes
include:
1. Pass in `override_memory_mb`, `override_storage_mb`, and
`override_cpus`
2. Collect more metrics to log on WANDB:
- `generate/trajectories_summarized`: number of trajectories that are
summarized
- `generate/trajectories_truncated`: number of trajectories that are
truncated due to our training input length limit
3. API change
- To access reward: `results.verifier_result.reward` ->
`results.verifier_result.rewards["reward"]`

In addition, we log more detailed input/output in `trainer.py` for
debugging.

Specifically, we rely on getting the trajectory via
`results.agent_result.metadata['all_messages']`, which is deprecated on
Harbor and relies on this PR:
laude-institute/harbor#166

That Harbor PR also enables change 1 and the logging of
`trajectories_summarized`.


### Testing
We verify by running `run_nl2bash.sh` for 5 steps.

<img width="700" height="267" alt="image"
src="https://github.com/user-attachments/assets/b9930d28-6b0a-4f72-ae10-75dcaa6a7d76"
/>

<img width="2148" height="300" alt="image"
src="https://github.com/user-attachments/assets/35c4629a-3e99-44ca-864e-e96b3f6e2b14"
/>

There are three curves:
- Blue: with Harbor (this PR) at HEAD of this SkyRL branch `dcagent`
  - The only difference with brown is these two commits (already merged to MAIN):
    - mlfoundations@1fe6c5a
    - mlfoundations@f22fbc4
  - Before these two commits, we **ignore all thinking tokens** in the trajectory
  - After these two commits, we **take all thinking tokens** in the trajectory. We also make sure the loss mask is constructed more rigorously
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_with_thinking_harbor.log`
- Brown: with Harbor (this PR), but without the two commits above
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_no_thinking_harbor.log`
- Light blue: with sandboxes (the branch/commit linked above), without the two commits above, i.e. at this commit of SkyRL: mlfoundations@e44bc9d

For Harbor, we tested at the commit of this PR,
laude-institute/harbor#166, i.e. commit
laude-institute/harbor@783b967,
and manually changed Terminus 2's `fallback_context_limit` to `32768`.

We make several observations about these three curves:
- The reward curves are roughly the same (at least for the first 5 steps)
- Blue has a much larger `avg_num_tokens` and `min_num_tokens` because
we now collect all thinking tokens
- Light blue for some reason always has 32K max num tokens (even
**without any thinking tokens**) -- likely a bug in the agent harness!
Blue, with thinking tokens, never reaches 32K

This fixes the issue described in
NovaSky-AI#428.

We assume all requests that come in after sleep() are "unexpected" and
abort them while printing a warning to the user.
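
A minimal sketch of the guard (method and attribute names are illustrative, not the actual SkyRL engine code):

```python
import logging

logger = logging.getLogger(__name__)

class InferenceEngineSketch:
    def __init__(self):
        self._sleeping = False

    async def sleep(self):
        # Offload weights / KV cache; from now on, incoming requests are unexpected.
        self._sleeping = True

    async def wake_up(self):
        self._sleeping = False

    async def generate(self, request):
        if self._sleeping:
            # Requests arriving after sleep() are treated as stragglers:
            # warn the user and abort instead of serving them on a slept engine.
            logger.warning("Received request after sleep(); aborting it.")
            return None
        ...
```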

I tested this fix with TerminalBenchGenerator (with async_engine=True)
and it was able to successfully abort unwanted requests.

<img width="905" height="147" alt="Screenshot 2025-10-10 at 10 29 35 PM"
src="https://github.com/user-attachments/assets/b520b72f-a416-4312-bf47-3a59dacf1e1e"
/>

Running examples/gsm8k/run_gsm8k.sh with async_engine=False also works.

---------

Co-authored-by: Eric Tang <[email protected]>
…ing params (NovaSky-AI#10)

This PR does two things. Both are hacky and should not be upstreamed to
main SkyRL.

### Hardcoded sampling params

We hardcode sampling params to be RL-friendly. The appropriate way is to
allow customizing sampling params in Harbor's instantiation, reading
from `cfg.generator.sampling_params`, though we need to make sure those
params are compatible with LiteLLM.
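
A hypothetical sketch of that "appropriate way": read `cfg.generator.sampling_params` and forward only keys an OpenAI-compatible (LiteLLM) chat endpoint understands (names here are illustrative, not existing code):

```python
# Engine-only params like top_k or min_p may be rejected by a plain
# OpenAI-style endpoint, so they are dropped rather than hardcoded.
LITELLM_COMPATIBLE_KEYS = {"temperature", "top_p", "max_tokens", "stop", "seed"}

def sampling_params_for_litellm(sampling_params: dict) -> dict:
    """Filter configured sampling params to those LiteLLM can pass through."""
    return {k: v for k, v in sampling_params.items() if k in LITELLM_COMPATIBLE_KEYS}
```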

### Custom chat template for Qwen3 accumulate thinking tokens

Another hacky thing we do is to allow users to pass
`+generator.engine_init_kwargs.custom_chat_template_chat_completion_path=/path/to/jinja2`

With this, the vLLM engine will apply the custom chat template for
`/chat/completions` requests. We add a unit test where we use a custom
chat template that accumulates Qwen3 thinking tokens.

The chat template
`skyrl-train/tests/gpu/gpu_ci/qwen3_acc_thinking.jinja2` is adapted from
skyrl-agent's
https://github.com/NovaSky-AI/SkyRL/blob/4127fd174f03543ce563dc87907db041a98ef059/skyrl-agent/skyrl_agent/functional/templates/qwen3_acc_thinking.jinja2.
It is identical to the official Qwen3 chat template except for
<img width="2048" height="585" alt="image"
src="https://github.com/user-attachments/assets/bc61add3-0341-4143-acdd-278cdcea1c26"
/>
Allow the user to pass in the `enable_summarize` flag added in Harbor:
laude-institute/harbor#94

Note that to count this not as a failed instance (which would be skipped
for training) but as a reward of 0 that we still train on, we need to
make the following change in Harbor so that we still have the verifier
results:

<img width="1215" height="279" alt="image"
src="https://github.com/user-attachments/assets/b585e477-3b83-4a11-9e0d-23bb21d9d23f"
/>
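
For illustration, the generator-side effect of that change might look like the following (attribute names follow the earlier `results.verifier_result.rewards["reward"]` usage; otherwise this is a sketch, not actual code):

```python
def reward_for_trial(results):
    verifier = getattr(results, "verifier_result", None)
    if verifier is None:
        # Without the Harbor change: no verifier result, so the trial is
        # dropped from training as a failed instance.
        return None
    # With the change: a verifier result is always present, so a reward of
    # 0.0 is kept and the trajectory is still trained on.
    return verifier.rewards.get("reward", 0.0)
```
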
…rep as well (NovaSky-AI#12)

### Previous issue

This is a follow-up PR to mlfoundations#10

In that PR, we only allowed passing in a custom chat template to the vLLM
engine during rollout.

However, we did not pass in the custom chat template when converting the
chat history to generator output.

Even though we encode each message one by one, hence preserving thinking
tokens during that conversion, the default Qwen3 chat template will add
an empty `<think>\n\n</think>\n\n` if none is present, making the data
off-policy.

An example can be found below. The first screenshot shows what we print
out in `trainer.py` (the raw tokens passed in to training), and the
second shows the chat history. We can see that there are extra
`<think>\n\n</think>\n\n` tokens that the model did not generate.

<img width="1852" height="1082" alt="image"
src="https://github.com/user-attachments/assets/15e6311f-7efa-4251-ab4c-f1390ee85371"
/>

<img width="2048" height="198" alt="image"
src="https://github.com/user-attachments/assets/703e00be-3e5a-4252-89e9-f605e08af476"
/>

This is because of the default Qwen3 chat template: if `loop.last`, it
adds `\n<think>\n\n</think>\n\n`:
<img width="2048" height="281" alt="image"
src="https://github.com/user-attachments/assets/1448cd29-bf47-4efd-ba37-1cecf66cad31"
/>

### This PR

This PR fixes the issue in two ways.

First, we pass in `custom_chat_template` to
`get_response_ids_and_loss_mask_from_messages()` and the helpers it
calls, so that the conversion uses the same chat template as vLLM does
during rollout.
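
A minimal sketch of threading the template through (it assumes HF's `apply_chat_template(chat_template=...)` override; the helper name is illustrative):

```python
def encode_with_template(tokenizer, messages, custom_chat_template=None):
    # Passing an explicit template string overrides the model's default
    # template, so offline re-encoding matches what vLLM applied at rollout
    # time (and avoids the injected empty <think> blocks).
    return tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        chat_template=custom_chat_template,
    )
```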

In addition, we add our own chat template for Qwen3 thinking
accumulation, where we discard the `reasoning_content` logic and simply
keep `content` as-is (below is the only diff; left is the official Qwen3
jinja2, right is ours):
<img width="2010" height="573" alt="image"
src="https://github.com/user-attachments/assets/779fef81-04f8-457b-ae76-99ce7b64b506"
/>

### Test

We test this manually with
https://gist.github.com/CharlieFRuan/128e6fb0103064328688d9b2a7d440d7,
where we encode the same chat history we saw in the issue described
above. The output is now what we expect, namely

<img width="2048" height="462" alt="image"
src="https://github.com/user-attachments/assets/49f4ec1a-6cab-45db-b7c7-c77d8ca577e6"
/>
@EtashGuha EtashGuha marked this pull request as draft December 19, 2025 17:31

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several changes, including refactoring for an async trainer, adding new experiment setups, and updating dependencies. However, there are multiple critical issues that need to be addressed. Several files contain unresolved merge conflicts which must be fixed. Additionally, the new terminal_bench_generator.py file has critical bugs, including a syntax error and usage of undefined variables, which will cause runtime failures. There are also some medium to high severity issues like leftover debug code and hardcoded sampling parameters that override user settings. I've provided specific comments and suggestions to fix these issues.

Comment on lines 42 to 51
<<<<<<< HEAD
trials/
=======

# SQLite database files
*.db

# uv lock files
uv.lock
uv.lock
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). These must be resolved before merging.

Comment on lines +18 to +23
<<<<<<< HEAD
uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# Run from the `skyrl-train` directory
bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# Run from the `skyrl-train` directory
bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
uv run examples/algorithms/dapo/prepare_dapo_data.sh

Comment on lines +5 to +9
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
# uv run examples/algorithms/dapo/prepare_dapo_data.sh

Comment on lines +5 to +9
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main

critical

This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.

Suggested change
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
# uv run examples/algorithms/dapo/prepare_dapo_data.sh


# Process response messages (everything after the first message)
response_messages = chat_history[1:]
rollout_details = getattr(results.agent_result, "rollout_details", None),

critical

There is a trailing comma at the end of this line. This will cause rollout_details to be a tuple (details, ) instead of the details object itself, leading to an AttributeError on the subsequent lines where its attributes are accessed.

Suggested change
rollout_details = getattr(results.agent_result, "rollout_details", None),
rollout_details = getattr(results.agent_result, "rollout_details", None)


# Truncate to maximum allowed length
response_ids = response_ids[:max_response_tokens]
loss_mask = loss_mask[:max_response_tokens]

critical

The variable loss_mask is used here but it has not been defined in this scope. It seems to be a remnant from some commented-out code. You need to define loss_mask before using it. Based on the surrounding code, it seems you might want to create a list of 1s with the same length as the (truncated) response_ids.

Suggested change
loss_mask = loss_mask[:max_response_tokens]
loss_mask = [1] * len(response_ids)

loss_mask=loss_mask,
prompt_ids=prompt_ids,
trajectory_id=trajectory_id,
rollout_logprobs=rollout_logprobs,

critical

The TerminalBenchAgentOutput dataclass does not have a rollout_logprobs field. Passing it to the constructor will raise a TypeError at runtime. Please either add the field to the dataclass or remove it from this call.

Comment on lines +562 to +567
body.update({
"temperature": 1.0,
"top_p": 1.0,
"top_k": -1,
"min_p": 0.0,
})

high

These lines hardcode sampling parameters, overriding any parameters sent in the request (e.g., from Harbor). This is a significant and potentially unexpected change in behavior. The TODO comment acknowledges this is a hack. This should be addressed by allowing the calling service to pass its own sampling parameters instead of overwriting them here.


async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
session_id = request_payload["json"].pop("session_id", None)
# print(f"CHARLIE session_id: {session_id}")

medium

This appears to be a leftover debug print statement. It should be removed before merging.
