ars22/pipeline-rl #20
Description
Hello, when training with long rollouts (e.g. more than 32k tokens), I observe the following warning and subsequent error in my training logs:
[2025-12-04 19:17:26,736][pipelinerl.streams][WARNING] - Waiting for weight_update_request/0/0 to be created
[2025-12-05 02:22:24,430][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (0/10), starting from position 27988)
[2025-12-05 02:22:24,442][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (1/10), starting from position 27988)
[2025-12-05 02:22:24,463][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (2/10), starting from position 27988)
[2025-12-05 02:22:24,504][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (3/10), starting from position 27988)
[2025-12-05 02:22:24,584][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (4/10), starting from position 27988)
[2025-12-05 02:22:24,745][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (5/10), starting from position 27988)
[2025-12-05 02:22:25,066][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (6/10), starting from position 27988)
[2025-12-05 02:22:25,707][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (7/10), starting from position 27988)
[2025-12-05 02:22:26,988][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (8/10), starting from position 27988)
[2025-12-05 02:22:29,549][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (9/10), starting from position 27988)
[2025-12-05 02:22:34,670][pipelinerl.streams][ERROR] - Error reading stream weight_update_request/0/0, giving up after 10 retries
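For context, my rough mental model of what the reader is doing when it prints these warnings is sketched below (a simplified tail-style JSONL follower, not the actual pipelinerl.streams code; the function name and backoff are made up). A JSON decode failure is first treated as a line the writer has not finished yet, so the reader reopens the file and retries from the saved position, and only after exhausting the retries does it give up. A line that is corrupted on disk, rather than merely half-written, never becomes valid JSON, so it always ends in this error:

import json
import time

def follow_jsonl(path, max_retries=10):
    """Yield JSON records from a file that another process is still appending to."""
    pos = 0
    retries = 0
    while True:
        with open(path, "rb") as f:
            f.seek(pos)
            line = f.readline()
        if not line.endswith(b"\n"):
            time.sleep(0.1)  # no complete line yet: wait for the writer
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if retries >= max_retries:
                raise RuntimeError(f"Error reading stream {path}, giving up after {max_retries} retries")
            retries += 1
            time.sleep(0.02 * 2 ** retries)  # reopen the file and retry with backoff
            continue
        retries = 0
        pos += len(line)
        yield record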
If I then look at the offending file in output_dir/streams/weight_update_request/0/0/0.jsonl, I indeed see a corrupted entry:
{"kind":"samples_processed","samples_processed":4091,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4092,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4094,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4095,"timestamp":1764875551.5654244}
{"kind":"weight_update_success","version":4096,"timestamp":1764875551.565069}
54244} # <-- CORRUPTED!
{"kind":"samples_processed","samples_processed":4115,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4125,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4139,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4154,"timestamp":1764875551.5654244}The issue appears to be with FileStreamReader trying to open/write to the same file from multiple processes. I would attach a minimal reproducible example, but the issue is stochastic and hard to reproduce - happy to share more details that can help understand the root cause!
Edit: here's a config that reproduces the error, usually around 20-40 steps in
# --------------------
# Config for 4+2 nodes
# --------------------
defaults:
  - base
  - _self_
model_path: Qwen/Qwen3-4B-Thinking-2507
# Uncomment to skip test set rollouts
# eval_every_n_versions: 0
preprocess:
  shared_memory_entry_size: 2000000000
finetune:
  attempts: 16
  train_batch_size: 32
  valid_batch_size: 32
  gradient_accumulation_passes: 16
  seq_length: 49920
  seq_parallel: 2
  learning_rate: 1.0e-6
llm:
  parameters:
    max_tokens: 49152
    temperature: 0.8
# Sampling params taken from the Qwen3 tech report: https://arxiv.org/abs/2505.09388
test_llm:
  parameters:
    max_tokens: 49152
    temperature: 0.8
    top_p: 0.95
    top_k: 20
actor:
  llm_max_rollouts: 16 # Larger values exhaust the KV cache
  shared_memory_entry_size: 2000000000 # Allow up to 2GB per rollout
  rollout_policy: pipelinerl.domains.math.generate_math_rollout
  system_prompt: Please reason step by step, and put your final answer within \boxed{}.
  task_template: |-
    {task}
environment:
  _target_: pipelinerl.domains.math.MathEnvironment
  model_name: ${llm_grader.name}
  sampling_kwargs: ${llm_grader.sampling_kwargs}
dataset_loader: pipelinerl.domains.math.load_datasets
train_dataset_names:
  - hub_id: POLARIS-Project/Polaris-Dataset-53K # Note: custom logic for loading Hub datasets
    split: train
vllm_config:
  use_v1: false
  vllm_kwargs:
    dtype: bfloat16
    gpu-memory-utilization: 0.9
    num-scheduler-steps: 1
    disable-log-requests: ""
    disable-frontend-multiprocessing: ""
    max-num-seqs: ${actor.llm_max_rollouts}
    max-num-batched-tokens: 16384
    enable-prefix-caching: ""
    enable-chunked-prefill: ""
    return-tokens-as-token-ids: ""
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    generation-config: vllm
world:
  replicas: 1
  actor_fraction: 32
  preprocessor_fraction: 0
  finetune_fraction: 16
  env_replicas: 1
  actor_group_port: 9000
  environment_start_port: 7777
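In case it helps with triage: the usual way to make multi-process JSONL appends safe is to open the file with O_APPEND and emit each record as a single write, optionally holding an fcntl lock so a record can never be split across writes. This is only a generic sketch of that pattern (the function name and call site are made up, not pipelinerl code):

import fcntl
import json
import os

def append_record(path: str, obj: dict) -> None:
    data = (json.dumps(obj, separators=(",", ":")) + "\n").encode()
    # O_APPEND makes the kernel position each write at the current end of file,
    # regardless of how many processes hold the file open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # serialize writers that might split a record
        os.write(fd, data)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

append_record("0.jsonl", {"kind": "samples_processed", "samples_processed": 4156})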