Training hangs after step 12 due to race condition with max_policy_age: 0 #557

@wukaixingxp

🐛 Describe the bug

My training job sometimes hangs after a few steps. I used Claude Code to debug it and found the following. Please help:

Summary

Training stops and hangs after step 12 when max_policy_age is set to 0 (via off_by_n: 0). The training process stops making progress silently, with no error messages, while the Julia Docker container continues running and processing requests.

Root Cause

A race condition between weight updates and buffer sampling when max_policy_age: 0:
Configuration:

# apps/openenv/llama3_8b_julia.yaml:11,116
off_by_n: 0
max_policy_age: ${off_by_n}  # This is 0!

The Deadly Sequence:

  1. Training completes step 12 → increments training_step to 13
  2. Buffer sampling calls _evict(curr_policy_version=13)
  3. Eviction logic (replay_buffer.py:36) removes ALL episodes where policy_version < 13
  4. Generator is still on version 12 (weight updates take 17-30 seconds)
  5. New rollouts create episodes with generator_version=12
  6. These episodes get evicted immediately on next sample call

Result: Infinite wait for version 13 episodes that never arrive → buffer becomes empty → sample() returns None → training deadlock
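The sequence above can be reproduced with a toy model. This is a minimal sketch, not the project's replay_buffer.py: `ToyReplayBuffer`, `Episode`, and `simulate` are hypothetical names, and the eviction rule (drop episodes whose age exceeds max_policy_age) is an assumption inferred from the behavior described in step 3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Episode:
    policy_version: int  # generator version that produced this episode


@dataclass
class ToyReplayBuffer:
    """Hypothetical stand-in for the real buffer; the eviction rule is assumed."""
    batch_size: int
    max_policy_age: int
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def _evict(self, curr_policy_version: int) -> None:
        # Keep only episodes whose age is within max_policy_age.
        self.episodes = [
            ep for ep in self.episodes
            if curr_policy_version - ep.policy_version <= self.max_policy_age
        ]

    def sample(self, curr_policy_version: int) -> Optional[List[Episode]]:
        self._evict(curr_policy_version)
        if len(self.episodes) < self.batch_size:
            return None  # not enough fresh episodes; the trainer has to wait
        batch = self.episodes[: self.batch_size]
        self.episodes = self.episodes[self.batch_size:]
        return batch


def simulate(max_policy_age: int, rounds: int = 5) -> int:
    """Count successful sample() calls while the generator lags by one version."""
    buf = ToyReplayBuffer(batch_size=8, max_policy_age=max_policy_age)
    training_step = 13      # trainer has already advanced past step 12
    generator_version = 12  # generator is still loading the new weights
    successes = 0
    for _ in range(rounds):
        for _ in range(8):
            buf.add(Episode(policy_version=generator_version))
        if buf.sample(curr_policy_version=training_step) is not None:
            successes += 1
    return successes
```

In this model, `simulate(0)` never gets a batch because every version 12 episode is evicted before sampling, while `simulate(1)` gets a batch every round. Whether the real buffer behaves the same way depends on the actual eviction condition at replay_buffer.py:36.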

Evidence from Logs

Timeline:
21:19:45 - RLTrainer pushes weights for version 12
21:19:57 - Push completes (12s)
21:20:15 - Generator FINISHES updating to v12 (30s total!)
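
The durations in the timeline can be checked from the three timestamps. A small sketch; `seconds_between` is a hypothetical helper, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Time for the trainer to push the weights
push_duration = seconds_between("21:19:45", "21:19:57")
# Time from push start until the generator finishes updating
total_lag = seconds_between("21:19:45", "21:20:15")
```

Here `push_duration` is 12 and `total_lag` is 30, matching the annotations in the log. During that 30 second window the generator is still producing episodes with the old version.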

Metrics:

buffer/sample/avg_data_utilization: 1.0 (only 8 episodes in buffer - the minimum needed)
buffer/evict/sum_episodes_evicted: 8.0 (evicting 8 episodes per step)
Last logged activity: 21:20:15, then silence
Training completed only 12 out of 1500 steps

Versions

Using my own branch.

Labels: question
