ars22/pipeline-rl #20
Description
Hello, when training with long rollouts (e.g. more than 32k tokens), I observe the following warning and subsequent error in my training logs:
[2025-12-04 19:17:26,736][pipelinerl.streams][WARNING] - Waiting for weight_update_request/0/0 to be created
[2025-12-05 02:22:24,430][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (0/10), starting from position 27988)
[2025-12-05 02:22:24,442][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (1/10), starting from position 27988)
[2025-12-05 02:22:24,463][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (2/10), starting from position 27988)
[2025-12-05 02:22:24,504][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (3/10), starting from position 27988)
[2025-12-05 02:22:24,584][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (4/10), starting from position 27988)
[2025-12-05 02:22:24,745][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (5/10), starting from position 27988)
[2025-12-05 02:22:25,066][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (6/10), starting from position 27988)
[2025-12-05 02:22:25,707][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (7/10), starting from position 27988)
[2025-12-05 02:22:26,988][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (8/10), starting from position 27988)
[2025-12-05 02:22:29,549][pipelinerl.streams][WARNING] - Could not decode JSON from weight_update_request/0/0, might have run into end of the file. Will reopen the file and retry (9/10), starting from position 27988)
[2025-12-05 02:22:34,670][pipelinerl.streams][ERROR] - Error reading stream weight_update_request/0/0, giving up after 10 retries
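For context, my rough mental model of what the reader is doing when it prints these warnings is sketched below (a simplified tail-style JSONL follower, not the actual pipelinerl.streams code; the function name and backoff are made up). A JSON decode failure is first treated as a line the writer has not finished yet, so the reader reopens the file and retries from the saved position, and only after exhausting the retries does it give up. A line that is corrupted on disk, rather than merely half-written, never becomes valid JSON, so it always ends in this error:

import json
import time

def follow_jsonl(path, max_retries=10):
    """Yield JSON records from a file that another process is still appending to."""
    pos = 0
    retries = 0
    while True:
        with open(path, "rb") as f:
            f.seek(pos)
            line = f.readline()
        if not line.endswith(b"\n"):
            time.sleep(0.1)  # no complete line yet: wait for the writer
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if retries >= max_retries:
                raise RuntimeError(f"Error reading stream {path}, giving up after {max_retries} retries")
            retries += 1
            time.sleep(0.02 * 2 ** retries)  # reopen the file and retry with backoff
            continue
        retries = 0
        pos += len(line)
        yield record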
If I then look at the offending file in output_dir/streams/weight_update_request/0/0/0.jsonl, I indeed see a corrupted entry:
{"kind":"samples_processed","samples_processed":4091,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4092,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4094,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4095,"timestamp":1764875551.5654244}
{"kind":"weight_update_success","version":4096,"timestamp":1764875551.565069}
54244} # <-- CORRUPTED!
{"kind":"samples_processed","samples_processed":4115,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4125,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4139,"timestamp":1764875551.5654244}
{"kind":"samples_processed","samples_processed":4154,"timestamp":1764875551.5654244}The issue appears to be with FileStreamReader trying to open/write to the same file from multiple processes. I would attach a minimal reproducible example, but the issue is stochastic and hard to reproduce - happy to share more details that can help understand the root cause!
Edit: here's a config that reproduces the error, usually around 20-40 steps in
# --------------------
# Config for 4+2 nodes
# --------------------
defaults:
  - base
  - _self_
model_path: Qwen/Qwen3-4B-Thinking-2507
# Uncomment to skip test set rollouts
# eval_every_n_versions: 0
preprocess:
  shared_memory_entry_size: 2000000000
finetune:
  attempts: 16
  train_batch_size: 32
  valid_batch_size: 32
  gradient_accumulation_passes: 16
  seq_length: 49920
  seq_parallel: 2
  learning_rate: 1.0e-6
llm:
  parameters:
    max_tokens: 49152
    temperature: 0.8
# Sampling params taken from the Qwen3 tech report: https://arxiv.org/abs/2505.09388
test_llm:
  parameters:
    max_tokens: 49152
    temperature: 0.8
    top_p: 0.95
    top_k: 20
actor:
  llm_max_rollouts: 16 # Larger values exhaust the KV cache
  shared_memory_entry_size: 2000000000 # Allow up to 2GB per rollout
  rollout_policy: pipelinerl.domains.math.generate_math_rollout
  system_prompt: Please reason step by step, and put your final answer within \boxed{}.
  task_template: |-
    {task}
environment:
  _target_: pipelinerl.domains.math.MathEnvironment
  model_name: ${llm_grader.name}
  sampling_kwargs: ${llm_grader.sampling_kwargs}
dataset_loader: pipelinerl.domains.math.load_datasets
train_dataset_names:
  - hub_id: POLARIS-Project/Polaris-Dataset-53K # Note: custom logic for loading Hub datasets
    split: train
vllm_config:
  use_v1: false
  vllm_kwargs:
    dtype: bfloat16
    gpu-memory-utilization: 0.9
    num-scheduler-steps: 1
    disable-log-requests: ""
    disable-frontend-multiprocessing: ""
    max-num-seqs: ${actor.llm_max_rollouts}
    max-num-batched-tokens: 16384
    enable-prefix-caching: ""
    enable-chunked-prefill: ""
    return-tokens-as-token-ids: ""
    tensor-parallel-size: 1
    pipeline-parallel-size: 1
    generation-config: vllm
world:
  replicas: 1
  actor_fraction: 32
  preprocessor_fraction: 0
  finetune_fraction: 16
  env_replicas: 1
  actor_group_port: 9000
  environment_start_port: 7777
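In case it helps with triage: the usual way to make multi-process JSONL appends safe is to open the file with O_APPEND and emit each record as a single write, optionally holding an fcntl lock so a record can never be split across writes. This is only a generic sketch of that pattern (the function name and call site are made up, not pipelinerl code):

import fcntl
import json
import os

def append_record(path: str, obj: dict) -> None:
    data = (json.dumps(obj, separators=(",", ":")) + "\n").encode()
    # O_APPEND makes the kernel position each write at the current end of file,
    # regardless of how many processes hold the file open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # serialize writers that might split a record
        os.write(fd, data)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

append_record("0.jsonl", {"kind": "samples_processed", "samples_processed": 4156})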