Dcagent #795
Conversation
We print out the results when a `trial.run()` fails
…await in trainer (NovaSky-AI#2)

Before this PR, when running TBench RL jobs, we see a lot of errors like the following in the log:

```bash
future: <Task finished name='Task-1046' coro=<LoggingWorker._worker_loop() done, defined at /home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py:61> exception=RuntimeError('<Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop')>
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/site-packages/litellm/litellm_core_utils/logging_worker.py", line 69, in _worker_loop
    task = await self._queue.get()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/queues.py", line 155, in get
    getter = self._get_loop().create_future()
  File "/home/ec2-user/miniconda3/envs/dcagent/lib/python3.12/asyncio/mixins.py", line 20, in _get_loop
    raise RuntimeError(f'{self!r} is bound to a different event loop')
RuntimeError: <Queue at 0x7f1a00331160 maxsize=50000> is bound to a different event loop
```

This is because `RayPPOTrainer.train()` was calling `asyncio.run()` repeatedly inside its training loop. LiteLLM (used by sandboxes) initializes a global `LoggingWorker` with an `asyncio.Queue` bound to the event loop where it was first created. When `asyncio.run()` created a new event loop in subsequent iterations, LiteLLM tried to access that queue from the new loop, causing the error.

This PR fixes this by unifying the event loop management:
- Refactored `RayPPOTrainer.train()` to be an async method, replacing `asyncio.run()` calls with `await`.
- Updated `ConditionalWeightsManager` in `weights_manager.py` to support async context manager usage (`__aenter__`/`__aexit__`).
- Modified `BasePPOExp.run()` in `main_base.py` to call `asyncio.run(trainer.train())` once at the top level.

This ensures the entire training process runs within a single persistent event loop. Tested and confirmed that the training loop now progresses successfully without those errors.
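For illustration, here is a minimal, runnable sketch of that pattern. Only `RayPPOTrainer.train()` and `BasePPOExp.run()` are names from this PR; the other attributes and methods (`total_steps`, `generate_and_update`) are placeholders, not SkyRL's actual API:

```python
import asyncio


class RayPPOTrainer:
    def __init__(self, total_steps: int = 3):
        self.total_steps = total_steps

    async def generate_and_update(self, step: int) -> None:
        # Placeholder for rollout generation + policy update.
        await asyncio.sleep(0)
        print(f"finished step {step}")

    async def train(self) -> None:
        for step in range(self.total_steps):
            # Previously: asyncio.run(...) per iteration, which created a new
            # event loop each time and broke LiteLLM's loop-bound queue.
            await self.generate_and_update(step)


class BasePPOExp:
    def run(self) -> None:
        trainer = RayPPOTrainer()
        # A single asyncio.run() at the top level: the whole training loop
        # now runs inside one persistent event loop.
        asyncio.run(trainer.train())


BasePPOExp().run()
```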
Before this PR, `session_id` is always None because Terminus 2 by
default does not pass it in. So we do `engine_idx = random.randint(0,
len(self.engines) - 1)`, which is really bad for prefix cache hit rate.
We can actually pass a session ID to `AgentConfig` and it will be
passed to all requests.
Verified that the following prints logs like `CHARLIE:
session_id: 954320202c254bd8bbca083d34457b94` (multiple times too,
meaning the session_id is consistent across a trial, i.e. trajectory):
```python
async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
session_id = request_payload["json"].pop("session_id", None)
print(f"CHARLIE: session_id: {session_id}")
...
```
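A hedged sketch of why a stable per-trial session ID helps: requests can be routed by hashing the session ID instead of picking a random engine. Only the `random.randint` fallback is from the code above; `pick_engine_idx` and the hash-based routing are illustrative, not the verified implementation:

```python
import random
import uuid
from typing import List


def pick_engine_idx(session_id: str | None, engines: List[object]) -> int:
    if session_id is None:
        # Old behavior: a random engine per request -> poor prefix cache reuse.
        return random.randint(0, len(engines) - 1)
    # With a stable per-trial session_id, every request in a trajectory lands
    # on the same engine, so its prefix/KV cache can actually be reused.
    return hash(session_id) % len(engines)


# Generated once per trial and (per the description above) passed to AgentConfig
# so that all requests in the trajectory carry the same ID.
session_id = uuid.uuid4().hex
```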
Extends On-Policy Distillation to the `TerminalBenchExp` class, using the `OnPolicyDistillationTrainer` implemented in the main [SkyRL branch](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/on_policy_distillation).
…ky-AI#5)

Before this PR, we disabled systems logging altogether because it was not well supported on our TACC environment. This PR removes those constraints, namely `_disable_stats` and `_disable_meta`.

However, by default, only a single node's system metrics will be logged, since WANDB only runs on a single node (the head node). This PR spins up a Ray actor `WandbNodeLogger` on each node, dedicated to logging systems metrics like GPU utilization. We follow https://docs.wandb.ai/models/track/log/distributed-training#track-all-processes-to-a-single-run and use `mode="shared"` to aggregate system metrics from all nodes into the same Wandb run. However, with this approach, the systems metrics panels do not appear in the Wandb UI automatically; we have to create the panels manually. The alternative is to create a run for each node and group them by group_name. We prefer to keep all nodes' metrics in the same run.

Tested on TACC with 4 nodes. Example panel:

<img width="657" height="344" alt="image" src="https://github.com/user-attachments/assets/44a2f083-ad1f-4d27-af58-4a6da1c8b4b1" />

---------
Co-authored-by: Copilot <[email protected]>
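A minimal sketch of what such a per-node actor could look like, assuming W&B's shared mode as described in the linked docs. The actor body and the setting names (`x_primary`, `x_label`) follow our reading of those docs, not the PR's exact code:

```python
import ray
import wandb


@ray.remote
class WandbNodeLogger:
    """Per-node actor that attaches to one shared W&B run for system metrics."""

    def __init__(self, run_id: str, node_label: str):
        # mode="shared" lets multiple processes (here: one per node) write to
        # the same run; the head node would init with x_primary=True instead.
        self.run = wandb.init(
            id=run_id,
            settings=wandb.Settings(mode="shared", x_primary=False, x_label=node_label),
        )
        # W&B's built-in system monitor now reports this node's GPU/CPU stats.

    def finish(self) -> None:
        self.run.finish()
```

One such actor would be scheduled on every Ray node with the head run's id; since all nodes share one run, the per-node panels must be created by hand, as noted above.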
…ovaSky-AI#7)

Fixes the `epoch` counter after resuming from a checkpoint.

---------
Cherry-picked from NovaSky-AI#589
Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…es (NovaSky-AI#404) (NovaSky-AI#9)

Fixes chat templating in the Mini-swe-agent and the terminal bench examples. Previously, we were naively calling `.apply_chat_template` to encode response messages turn by turn, but this can append system messages for each turn depending on the model (h/t to @CharlieFRuan). For example, with Qwen3 8B the templating works fine, but with Qwen 2.5 1.5B Instruct the code adds a system prompt message while tokenizing every turn.

We use the fixed-base approach, similar to what we do in the SkyRLGymGenerator.

---------
Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Sumanth R Hegde <[email protected]>
…I#710)

Before this PR, in `TerminalBenchGenerator` and `MiniSweAgentGenerator`, we naively generate the loss mask for a chat history as all zeros for user messages and all ones for assistant messages. However, this is incorrect: for instance, the assistant's generation prompt should be masked with zero, and the `\n` token that comes after the EOS for Qwen models should be masked with zero. This PR fixes this by implementing `get_response_ids_and_loss_mask_from_messages()`, which uses the fixed-base helper `encode_messages_subset()` to convert a chat history (excluding the initial prompt, which may differ by model) into response IDs, a loss mask, and optionally rollout logprobs.

Besides, TerminalBenchGenerator had a bug of

```python
prompt_ids = self.tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    ...
```

which is incorrect because the messages that follow will add the generation prompt.

We also add extensive unit tests. The unit tests were implemented by Opus 4.5 and reviewed and iterated on by me.
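The fixed-base idea, in a hedged sketch (the real `encode_messages_subset()` also produces the loss mask and handles generation prompts; this signature only shows the core trick):

```python
from typing import Dict, List


def encode_messages_subset(tokenizer, base: List[Dict], new_msgs: List[Dict]) -> List[int]:
    # Tokenize (base + new) in one shot, then strip the tokens of base alone.
    # Chat templates that inject a default system prompt (e.g. Qwen 2.5
    # Instruct) then inject it only once, at the very front, instead of at
    # every turn.
    base_ids = tokenizer.apply_chat_template(base, tokenize=True)
    full_ids = tokenizer.apply_chat_template(base + new_msgs, tokenize=True)
    assert full_ids[: len(base_ids)] == base_ids, "template is not prefix-stable"
    return full_ids[len(base_ids):]


# Example usage (requires downloading a tokenizer):
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
#   response_ids = encode_messages_subset(tok, base=chat[:1], new_msgs=chat[1:])
```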
Before this PR, we used an old sandboxes branch to train, namely laude-institute/harbor@f534a5e. However, there are important fixes on Harbor (the new "sandboxes") and it is easier to benefit from new features of Harbor (e.g. to enable step-wise training, get rollout logprobs, etc.). This PR bumps our TerminalBenchGenerator to use Harbor. The main changes include:

1. Pass in `override_memory_mb`, `override_storage_mb`, and `override_cpus`.
2. Collect more metrics to log on WANDB:
   - `generate/trajectories_summarized`: number of trajectories that are summarized
   - `generate/trajectories_truncated`: number of trajectories that are truncated due to our training input length limit
3. API change: to access the reward, `results.verifier_result.reward` -> `results.verifier_result.rewards["reward"]`

In addition, we log more detailed input/output in `trainer.py` for debugging. Specifically, we rely on getting the trajectory via `results.agent_result.metadata['all_messages']`, which is deprecated on Harbor and relies on this PR: laude-institute/harbor#166. That PR also enables change 1 and the logging of `trajectories_summarized`.

### Testing

We verify by running `run_nl2bash.sh` for 5 steps.

<img width="700" height="267" alt="image" src="https://github.com/user-attachments/assets/b9930d28-6b0a-4f72-ae10-75dcaa6a7d76" />
<img width="2148" height="300" alt="image" src="https://github.com/user-attachments/assets/35c4629a-3e99-44ca-864e-e96b3f6e2b14" />

There are three curves:
- Blue: with Harbor (this PR) at HEAD of this SkyRL branch `dcagent`
  - The only difference with brown is these two commits (already merged to main): mlfoundations@1fe6c5a and mlfoundations@f22fbc4
  - Before these two commits, we **ignore all thinking tokens** in the trajectory
  - After these two commits, we **take all thinking tokens** in the trajectory. We also make sure the loss mask is constructed more rigorously
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_with_thinking_harbor.log`
- Brown: with Harbor (this PR), but without the two commits above
  - Log: `/work/11151/charlieruan/vista/dc-agent/train/out_no_thinking_harbor.log`
- Light blue: with sandboxes (the branch/commit linked above), without the two commits above, i.e. at this commit of SkyRL: mlfoundations@e44bc9d

For Harbor, we tested at the commit of laude-institute/harbor#166, at commit laude-institute/harbor@783b967, and manually changed Terminus 2's `fallback_context_limit` to `32768`.

We make several observations of these three curves:
- The reward curve is roughly the same (at least for the first 5 steps)
- Blue has a much larger `avg_num_tokens` and `min_num_tokens` because we now collect all thinking tokens
- Light blue for some reason always has 32K max num tokens (even **without any thinking tokens**) -- likely a bug in the agent harness! Blue, with thinking tokens, never reaches 32K
This fixes the issue described in NovaSky-AI#428. We assume all requests that come in after sleep() are "unexpected" and abort them while printing a warning to the user.

I tested this fix with TerminalBenchGenerator (with async_engine=True) and it was able to successfully abort unwanted requests.

<img width="905" height="147" alt="Screenshot 2025-10-10 at 10 29 35 PM" src="https://github.com/user-attachments/assets/b520b72f-a416-4312-bf47-3a59dacf1e1e" />

Running examples/gsm8k/run_gsm8k.sh with async_engine=False also works.

---------
Co-authored-by: Eric Tang <[email protected]>
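A hedged sketch of that guard (class and method names are illustrative; the real change lives in the inference engine wrapper):

```python
import asyncio


class EngineWithSleepGuard:
    def __init__(self):
        self._sleeping = False

    async def sleep(self) -> None:
        # Weights are offloaded; from here on, no requests are expected.
        self._sleeping = True

    async def wake_up(self) -> None:
        self._sleeping = False

    async def generate(self, prompt: str) -> str | None:
        if self._sleeping:
            # Treat the request as unexpected: warn and abort instead of
            # hitting an engine whose weights are gone.
            print(f"Warning: aborting request received after sleep(): {prompt[:40]!r}")
            return None
        await asyncio.sleep(0)  # placeholder for the real generation call
        return "<completion>"
```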
…ing params (NovaSky-AI#10)

This PR does two things. Both are hacky and should not be upstreamed to main SkyRL.

### Hardcoded sampling params

We hardcode sampling params to be RL-friendly. The appropriate way is to allow customizing sampling params in Harbor's instantiation, reading in `cfg.generator.sampling_params`, though we would need to make sure those params are compatible with LiteLLM.

### Custom chat template for Qwen3 to accumulate thinking tokens

Another hacky thing we do is allow users to pass `+generator.engine_init_kwargs.custom_chat_template_chat_completion_path=/path/to/jinja2`. With this, the vllm engine will apply the custom chat template for `/chat/completion` requests.

We add a unit test where we use a custom chat template that accumulates Qwen3 thinking tokens. The chat template `skyrl-train/tests/gpu/gpu_ci/qwen3_acc_thinking.jinja2` is adapted from skyrl-agent's https://github.com/NovaSky-AI/SkyRL/blob/4127fd174f03543ce563dc87907db041a98ef059/skyrl-agent/skyrl_agent/functional/templates/qwen3_acc_thinking.jinja2. It is identical to the official Qwen3 chat template except for

<img width="2048" height="585" alt="image" src="https://github.com/user-attachments/assets/bc61add3-0341-4143-acdd-278cdcea1c26" />
Allow the user to pass in the `enable_summarize` flag added in Harbor laude-institute/harbor#94. Note that, in order to not count such a trial as a failed instance (hence skipped for training) but instead assign it reward 0 and still train on it, we need to make the following change in Harbor so that we still have the verifier results:

<img width="1215" height="279" alt="image" src="https://github.com/user-attachments/assets/b585e477-3b83-4a11-9e0d-23bb21d9d23f" />
…rep as well (NovaSky-AI#12)

### Previous issue

This is a follow-up PR to mlfoundations#10. In that PR, we only allowed passing the custom chat template to the vllm engine during rollout. However, we did not pass the custom chat template when converting the chat history to generator output. Even though we encode each message one by one, hence preserving thinking tokens during such conversion, the default Qwen3 chat template will add an empty `<think>\n\n</think>\n\n` if none is present, making it off-policy. An example can be found below. Above is what we print out in `trainer.py` (the raw tokens passed in to training), and below is the chat history. We can see that there are extra `<think>\n\n</think>\n\n` that the model did not generate.

<img width="1852" height="1082" alt="image" src="https://github.com/user-attachments/assets/15e6311f-7efa-4251-ab4c-f1390ee85371" />
<img width="2048" height="198" alt="image" src="https://github.com/user-attachments/assets/703e00be-3e5a-4252-89e9-f605e08af476" />

This is because of the default Qwen3 chat template: if `loop.last`, it adds `\n<think>\n\n</think>\n\n`:

<img width="2048" height="281" alt="image" src="https://github.com/user-attachments/assets/1448cd29-bf47-4efd-ba37-1cecf66cad31" />

### This PR

This PR fixes the issue with two things. First, we pass `custom_chat_template` to `get_response_ids_and_loss_mask_from_messages()` and the helpers it calls, so that the conversion and vllm during rollout use the same chat template. In addition, we add our own chat template for Qwen3 thinking accumulation, where we discard any logic around `reasoning_content` and simply add `content` altogether (below is the only diff; left is the official Qwen3 jinja2, right is our own).

<img width="2010" height="573" alt="image" src="https://github.com/user-attachments/assets/779fef81-04f8-457b-ae76-99ce7b64b506" />

### Test

We test this manually with https://gist.github.com/CharlieFRuan/128e6fb0103064328688d9b2a7d440d7, where we encode the same chat history we saw in the issue described above. The output is now what we expect, namely

<img width="2048" height="462" alt="image" src="https://github.com/user-attachments/assets/49f4ec1a-6cab-45db-b7c7-c77d8ca577e6" />
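For reference, Hugging Face's `apply_chat_template` accepts a `chat_template=` string override, so the token-prep path can reuse exactly the template given to the engine. A small sketch; the template path and chat history below are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
with open("qwen3_acc_thinking.jinja2") as f:
    custom_template = f.read()

messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<think>\nsome thoughts\n</think>\n\nhello"},
]

# Same template at rollout (vllm) and at token-prep: no spurious empty
# <think></think> blocks get inserted, so the re-encoded ids stay on-policy.
ids = tokenizer.apply_chat_template(messages, chat_template=custom_template, tokenize=True)
```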
Code Review
This pull request introduces several changes, including refactoring for an async trainer, adding new experiment setups, and updating dependencies. However, there are multiple critical issues that need to be addressed. Several files contain unresolved merge conflicts which must be fixed. Additionally, the new terminal_bench_generator.py file has critical bugs, including a syntax error and usage of undefined variables, which will cause runtime failures. There are also some medium to high severity issues like leftover debug code and hardcoded sampling parameters that override user settings. I've provided specific comments and suggestions to fix these issues.
```
<<<<<<< HEAD
trials/
=======

# SQLite database files
*.db

# uv lock files
uv.lock
uv.lock
>>>>>>> main
```
This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.
```
<<<<<<< HEAD
uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# Run from the `skyrl-train` directory
bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
```
This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.
```suggestion
uv run examples/algorithms/dapo/prepare_dapo_data.sh
```
```
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
```
This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.
```suggestion
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
```
```
<<<<<<< HEAD
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
=======
# bash examples/algorithms/dapo/prepare_dapo_data.sh
>>>>>>> main
```
This file contains unresolved merge conflict markers (<<<<<<<, =======, >>>>>>>). Please resolve them.
```suggestion
# uv run examples/algorithms/dapo/prepare_dapo_data.sh
```
```python
# Process response messages (everything after the first message)
response_messages = chat_history[1:]
rollout_details = getattr(results.agent_result, "rollout_details", None),
```
There is a trailing comma at the end of this line. This will cause rollout_details to be a tuple (details, ) instead of the details object itself, leading to an AttributeError on the subsequent lines where its attributes are accessed.
```suggestion
rollout_details = getattr(results.agent_result, "rollout_details", None)
```
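The gotcha in isolation:

```python
# A trailing comma after an expression builds a 1-tuple:
details = None,
assert details == (None,)  # a tuple, not None
# so `details.some_attr` would raise AttributeError on the tuple itself.
```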
```python
# Truncate to maximum allowed length
response_ids = response_ids[:max_response_tokens]
loss_mask = loss_mask[:max_response_tokens]
```
The variable loss_mask is used here but it has not been defined in this scope. It seems to be a remnant from some commented-out code. You need to define loss_mask before using it. Based on the surrounding code, it seems you might want to create a list of 1s with the same length as the (truncated) response_ids.
```suggestion
loss_mask = [1] * len(response_ids)
```
```python
loss_mask=loss_mask,
prompt_ids=prompt_ids,
trajectory_id=trajectory_id,
rollout_logprobs=rollout_logprobs,
```
```python
body.update({
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.0,
})
```
These lines hardcode sampling parameters, overriding any parameters sent in the request (e.g., from Harbor). This is a significant and potentially unexpected change in behavior. The TODO comment acknowledges this is a hack. This should be addressed by allowing the calling service to pass its own sampling parameters instead of overwriting them here.
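One possible non-hacky shape, sketched under the assumption that the generator config exposes a `sampling_params` dict (per the commit message earlier in this PR); the function name is hypothetical:

```python
def merged_sampling_params(request_body: dict, cfg_sampling_params: dict) -> dict:
    # Fill in config-provided defaults without clobbering what the caller sent.
    merged = dict(request_body)
    for key, value in cfg_sampling_params.items():
        merged.setdefault(key, value)  # request-provided values win
    return merged
```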
```python
async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
    session_id = request_payload["json"].pop("session_id", None)
    # print(f"CHARLIE session_id: {session_id}")
```
Please ignore, WIP for terminal-bench Async RL support