Could not decode JSON from weight_update_request/0/0 #113

@lewtun

Description


Hello, when training with long rollouts (e.g. more than 32k tokens), I observe the following warning and subsequent error in my training logs:

[2025-12-04 19:17:26,736][pipelinerl.streams][WARNING] - Waiting for weight_update_request/0/0 to be created
[2025-12-05 02:22:24,430][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (0/10), starting from position 27988)
[2025-12-05 02:22:24,442][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (1/10), starting from position 27988)
[2025-12-05 02:22:24,463][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (2/10), starting from position 27988)
[2025-12-05 02:22:24,504][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (3/10), starting from position 27988)
[2025-12-05 02:22:24,584][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (4/10), starting from position 27988)
[2025-12-05 02:22:24,745][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (5/10), starting from position 27988)
[2025-12-05 02:22:25,066][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (6/10), starting from position 27988)
[2025-12-05 02:22:25,707][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (7/10), starting from position 27988)
[2025-12-05 02:22:26,988][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (8/10), starting from position 27988)
[2025-12-05 02:22:29,549][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (9/10), starting from position 27988)
[2025-12-05 02:22:34,670][pipelinerl.streams][ERROR] - Error reading stream weight_update_request/0/0, giving up after 10 retries
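
For context, the retry behaviour in the log is consistent with a reader that parses the stream line by line, remembers the offset of the last good line, and on a JSON decode failure reopens the file and retries from that offset. The sketch below is not pipelinerl's actual streams code, just a generic illustration of that pattern, assuming a hypothetical read_stream helper; it shows why the retries recover from a half-written trailing line but loop to exhaustion when a line in the middle of the file is permanently corrupt.

import json
import time

def read_stream(path, max_retries=10, initial_delay=0.01):
    """Generic sketch of a retrying JSONL stream reader (not pipelinerl's code).

    Remembers the byte offset of the last successfully parsed line; on a
    JSON decode failure it reopens the file and retries from that offset,
    on the assumption that the writer has not finished the line yet.
    """
    position = 0
    retries = 0
    while True:
        with open(path, "rb") as f:
            f.seek(position)
            line = f.readline()
        if not line:
            time.sleep(initial_delay)  # nothing new yet, poll again
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A half-written trailing line eventually parses once the writer
            # finishes it, so retrying from the same position recovers.
            # A permanently corrupted line never parses, the retries are
            # exhausted, and we give up -- the WARNING/ERROR sequence above.
            if retries >= max_retries:
                raise RuntimeError(
                    f"Error reading stream {path}, giving up after {max_retries} retries"
                )
            time.sleep(initial_delay * 2 ** retries)
            retries += 1
            continue
        retries = 0
        position += len(line)
        yield record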

If I then look at the offending file in output_dir/streams/weight_update_request/0/0/0.jsonl, I indeed see a corrupted entry:

{"kind":"samples_processed","samples_processed":4091,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4092,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4094,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4095,"timestamp":1764875551.5654244}
{"kind":"weight_update_success","version":4096,"timestamp":1764875551.565069}
54244} # <-- CORRUPTED!
{"kind":"samples_processed","samples_processed":4115,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4125,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4139,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4154,"timestamp":1764875551.5654244}

The issue appears to stem from FileStreamReader opening and writing to the same file from multiple processes. I would attach a minimal reproducible example, but the issue is stochastic and hard to reproduce; I'm happy to share any details that would help track down the root cause!
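
If concurrent appends are indeed the cause, one common mitigation is to serialize writers with an advisory lock and emit each record as a single write call. This is only a sketch of that idea with a hypothetical append_record helper, not what pipelinerl currently does.

import fcntl
import json
import os

def append_record(path: str, record: dict) -> None:
    """Append one JSON line while holding an advisory lock so that
    concurrent writers in different processes cannot interleave
    partial lines. Sketch only, not pipelinerl's writer."""
    data = (json.dumps(record) + "\n").encode("utf-8")
    # O_APPEND makes the kernel position each write at end-of-file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # block until we hold the lock
        os.write(fd, data)               # one write() per record
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)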

Edit: here's a config that reproduces the error, usually around 20-40 steps in:

# --------------------
# Config for 4+2 nodes
# --------------------
defaults:
    - base
    - _self_

model_path: Qwen/Qwen3-4B-Thinking-2507
# Uncomment to skip test set rollouts
# eval_every_n_versions: 0

preprocess:
  shared_memory_entry_size: 2000000000

finetune:
  attempts: 16
  train_batch_size: 32
  valid_batch_size: 32
  gradient_accumulation_passes: 16
  seq_length: 49920
  seq_parallel: 2
  learning_rate: 1.0e-6

llm:
  parameters:
    max_tokens: 49152 
    temperature: 0.8
# Sampling params taken from Qwen3 tech report: https://arxiv.org/abs/2505.09388
test_llm:
  parameters: 
    max_tokens: 49152
    temperature: 0.8 
    top_p: 0.95
    top_k: 20

actor:
  llm_max_rollouts: 16 # Larger values exhaust the KV cache
  shared_memory_entry_size: 2000000000 # Allow up to 2GB per rollout
  rollout_policy: pipelinerl.domains.math.generate_math_rollout
  system_prompt: Please reason step by step, and put your final answer within \boxed{}.
  task_template: |-
    {task}
environment:
  _target_: pipelinerl.domains.math.MathEnvironment
  model_name: ${llm_grader.name}
  sampling_kwargs: ${llm_grader.sampling_kwargs}
dataset_loader: pipelinerl.domains.math.load_datasets
train_dataset_names:
  - hub_id: POLARIS-Project/Polaris-Dataset-53K # Note: custom logic for loading Hub datasets
    split: train

vllm_config:
  use_v1: false
  vllm_kwargs:
    dtype: bfloat16
    gpu-memory-utilization: 0.9
    num-scheduler-steps: 1
    disable-log-requests: ""
    disable-frontend-multiprocessing: ""
    max-num-seqs: ${actor.llm_max_rollouts}
    max-num-batched-tokens: 16384
    enable-prefix-caching: ""
    enable-chunked-prefill: ""
    return-tokens-as-token-ids: ""
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    generation-config: vllm

world:
  replicas: 1
  actor_fraction: 32
  preprocessor_fraction: 0
  finetune_fraction: 16
  env_replicas: 1
  actor_group_port: 9000
  environment_start_port: 7777
