feat(task2): CPU bucket cache + selective weight sync (F4, F6-transport, Gate 2.5) by zhenyulincs · Pull Request #8 · rlops/rlix

zhenyulincs · 2026-04-26T04:48:07Z

What does this PR do ?

Implements Task 2 from plans/nemorl-port-plan.md: the training-side CPU bucket cache (Feature 4) and the selective weight sync transport layer (Feature 6) needed for GPU time-sharing between training and inference workers.

After each training step, model weights are packed into CPU-resident BucketRecord objects and held in a versioned cache. ModelUpdateService then transfers these weights to specific inference workers on demand — using CUDA IPC
for same-GPU colocated workers and dynamic NCCL broadcast for cross-GPU workers — without ever reloading the full model to the sender GPU.

All Gate 2.5 integration tests pass on 4× RTX A5000.

Issues

Closes Gate 2.5 requirement from plans/nemorl-port-plan.md (Feature 4 + Feature 6-transport)
Fixes silent VRAM budget violation: a single param larger than bucket_size_bytes was previously accepted without error
Fixes receiver rank mask: was using dist.get_rank() instead of self.rank (wrong identity in multi-node setups)
Fixes version publish ordering: set_weight_version was called after routing activation instead of before (spec lines 602–608)
Fixes port claim release: was released before receiver NCCL teardown instead of after (spec lines 380–389)

Usage

Clone with NeMo submodule

  git clone https://github.com/zhenyulincs/rlix.git --recurse-submodules
  cd rlix

Required — no implicit default

  export RLIX_BUCKET_SIZE_BYTES=$((256 * 1024 * 1024))
  export RLIX_MODEL_UPDATE_TRANSPORT=cpu_serialize  # or cuda_ipc for same-GPU

Unit tests (no GPU)

  python -m pytest tests/test_bucket_cache.py tests/test_bucket_cache_lifecycle.py \
                    tests/test_model_update_service.py tests/test_nemo_rl_pipeline.py

Gate 2.5 integration tests (4× GPU)

  export NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1
  torchrun --nproc-per-node=4 tests/integration/test_gate2_5_selective_sync.py
  torchrun --nproc-per-node=4 tests/integration/test_gate2_5_megatron_tp.py
  HF_HUB_OFFLINE=1 torchrun --nproc-per-node=4 tests/integration/test_gate2_5_qwen_train_sync.py
  torchrun --nproc-per-node=4 tests/integration/test_gate2_5_feature6.py

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

New files: rlix/pipeline/bucket_cache.py, bucket_cache_lifecycle.py, model_update_service.py, coordinator.py
Modified: rlix/pipeline/full_finetune_pipeline.py, rlix/protocol/coordinator.py

…results - Add run_rlix_experiment.py with scenarios A-F (Qwen FT, Dual FT, Multi-LoRA, FT+LoRA, LFM DeepSpeed single/dual) - Add lfm_finetune_pipeline1/2.yaml for LFM2.5-350M DeepSpeed training; includes DS_BUILD_OPS=0 env var (partial fix for sm_120a) - Update all pipeline yamls with VLLM_USE_FLASHINFER_SAMPLER=0 and other RTX 5090 / Blackwell compatibility fixes - Add RLIX_EXPERIMENT.md: architecture docs, benchmark results (v20 run), step-by-step timing, 8 bugs documented with root causes and fixes Results (4x RTX 5090, 2026-04-14 v20 run): A Single FT: 244s 2.4% util PASS B Dual FT: 312s 3.9% util PASS C Single Multi-LoRA: 367s 0.7% util PASS D FT+Multi-LoRA: 434s 1.8% util PASS E LFM Single FT: 105s 0.1% util FAIL (fused_adam JIT, see Bug 8) F LFM Dual FT: 106s 0.0% util FAIL (same)

ZeRO-2 without offload_optimizer causes ROLL to select FusedAdam, which fails to JIT-compile on RTX 5090 (sm_120a / Blackwell). Adding offload_optimizer.device=cpu makes is_offload() return True so ROLL selects DeepSpeedCPUAdam instead. No ROLL source modification needed.

…on strategies build_latest_bucket_cache and promote_active_checkpoint are Megatron-only. DeepSpeed training workers raise RuntimeError for these calls. Catch and skip gracefully so LFM/DeepSpeed pipelines can initialize correctly.

…active clusters

ROLL's ReloadableProcessGroup monkey-patch is incompatible with DeepSpeed's process group initialization. Skip the offload_nccl=True enforcement in the coordinator for deepspeed_* strategy clusters and set offload_nccl=false in LFM pipeline yamls. NCCL buffer overhead is negligible for 350M models.

…or deepspeed

…on-Megatron

…gies DeepSpeed does not implement _build_latest_bucket_cache so the bucket-cache weight sync crashes with CUDA illegal memory access when expanding infer workers. Use skip_load=True to route-only expand for deepspeed backends.

…ight sync DeepSpeed does not implement bucket-cache weight sync. Restricting actor_infer to the same GPUs as actor_train means all infer workers are IPC-accessible and no NCCL cross-GPU sync is needed. Pipeline 1: infer on [0,1]; Pipeline 2: infer on [2,3]. Reverts skip_load workaround.

…atron DeepSpeed weight sync is incompatible with rlix bucket-cache mechanism. E/F now use the same Qwen2.5-0.5B + Megatron config as A/B, providing consistent coverage across all 6 scenarios. NeMo will be added as a dedicated backend later.

… (RTX 5090)

- All 6 scenarios now pass on RTX 5090 Blackwell (sm_120a) after fixes - Update scenarios E/F descriptions from LFM/DeepSpeed to Qwen2.5-0.5B/Megatron - Add final v35/v37 results table showing all 6 scenarios PASS - Preserve v20 historical results as reference - Add Bugs 9-11: NCCL 2.26.2 Blackwell kernels, vLLM torch.dtype serialization, PyTorch 2.7.1 _coalescing_manager UnboundLocalError

Ports 4 modules from nemo-integration and adds BucketCacheLifecycle (new): - bucket_cache.py: thread-safe CPUBucketCache keyed by (param_name, shard_id) with dirty tracking and PP-shard support - bucket_receiver.py: BucketUpdateRequest/Result, PP-shard merge via torch.cat, fail-partial apply_bucket_update semantics - model_update_service_cached.py: owns cache, populates from PP workers, dispatches dirty-bucket sync to inference workers - bucket_cache_lifecycle.py: wraps ROLL's promote_active_checkpoint with _cache_ready_step version tracking; direct worker calls (no .remote()) for testability 64 unit tests across 4 test files; all passing without Ray or GPU. Fixes: worker calls use direct method (not .remote()) to be testable with plain Python fakes — pipeline layer handles Ray .remote() externally.

…es, and bugs

…cationOp refactor gpus_to_allocate/dp_ranks_to_add were renamed to dp_rank_to_gpus_to_add (Dict) in the type definition; update the two affected tests to use the current API.

- external/NeMo: zhenyulincs/RL fork at rlix-task2 branch - tests/integration/test_bucket_cache_gpu.py: 4 test classes using Qwen2.5-0.5B on real GPU: * TestGPUMemoryRelease: verifies >=90% VRAM freed after offload * TestWeightCorrectnessInCache: bit-for-bit match between GPU model and CPUBucketCache contents * TestBucketReceiverPush: weight correctness after apply_bucket_update to CPU and GPU targets * TestFullRoundTrip: end-to-end GPU→cache→offload→push→verify - tests/integration/run_gpu_tests.sh: convenience deploy script

- store() takes keyword args: store(name, shard_id=0, tensor=t) - get_dirty_buckets() returns List[Bucket] not dict - BucketUpdateRequest.sync_id is str not int - Remove manual Bucket construction — pass get_dirty_buckets() directly

Part 1 (test_gate2_5_nccl_destroy.py, torchrun --nproc-per-node=2): - Megatron destroy_model_parallel() VRAM release ≥70% threshold - 5-cycle destroy/re-init stability (no leak, allreduce works after) - Stale process group raises after destroy (no silent corruption) - Requires: megatron-core Part 2 (test_gate2_5_selective_sync.py, torchrun --nproc-per-node=2): - Dynamic NCCL group create/use/destroy per sync cycle - rank 0 → rank 1 bucket broadcast from CPUBucketCache - Bit-exact weight verification on receiver - VRAM stable across 3 sync cycles Part 3 (test_gate2_5_qwen_train_sync.py, torchrun --nproc-per-node=4): - Real Qwen2.5-0.5B forward+backward on GPU 0,1 (TP=2 training) - SHA256 hash snapshot of all weights before sync - CPU bucket cache build on rank 0 - VRAM release ≥60% after model.cpu() + empty_cache - Dynamic NCCL broadcast to GPU 2,3 (inference ranks) - Bit-exact hash verification: training weights == received weights - 2 full steps to verify stability

Root cause of 600s timeout: broadcast_object_list unreliable over pure NCCL backend. New design uses same deterministic seed on both ranks — no Python object broadcast needed. Also fixed new_group ranks to [0, 1] (was incorrectly using [0, 2, 3] on 2-GPU setup). Changes: - Both ranks call make_weights(step=cycle) with same seed - Dynamic group uses ranks=[SENDER, RECEIVER]=[0, 1] - dist.barrier() immediately after init_process_group - SHA256 hash verification on receiver for bit-exact check

PyTorch 2.5+ requires device_id in init_process_group() when using NCCL backend, otherwise dist.barrier() spins at 100% CPU indefinitely. Fix applied to all three gate2.5 test files.

…w_group hang NCCL sub-group creation for [0,2,3] across SYS-topology GPUs (different PCIe root complexes) hangs with P2P disabled. Use world group broadcasts instead: all 4 ranks receive, inference ranks (2,3) retain the data. This validates the full CPU-bucket-cache → GPU broadcast pipeline. Also fixed torch.frombuffer with non-writable bytes by using bytearray().

…rt bugs NCCL on PCIe-SYS topology with P2P+SHM disabled fails with 'invalid usage' for int64 and uint8 broadcasts. Rewrite selective_sync to use only bfloat16: - Bucket count: bfloat16 scalar - Metadata: [name_len, n_elem×4_bytes, hash×16] all as bfloat16 (all values ≤ 255, exactly representable in bfloat16's 7-bit mantissa) - Name bytes: bfloat16 floats (ASCII, all < 128) - Tensor data: bfloat16 (unchanged) n_elements encoded as base-256 (4 bytes × 8 bits = 32 bits) to handle tensors with > 256 elements without float precision loss. Also replaced broadcast_object_list verification with in-protocol hash: hash embedded in metadata, verified locally on inference ranks.

…void NCCL hang on SYS-topology PCIe

… NCCL SYS-topology timeout

…ing, avoiding bfloat16 precision loss

Gate 2.5 test transport migration (gloo→NCCL): - test_gate2_5_selective_sync.py: NCCL group [0,2,3] proper subset of world - test_gate2_5_megatron_tp.py: NCCL groups [0,2] and [1,3] per TP shard - test_gate2_5_qwen_train_sync.py: NCCL group [0,2,3], gloo control-plane only - test_gate2_5_full.py: NCCL groups [0,2,3] phase-A and [1,2,3] phase-B - test_gate2_5_feature6.py: new Feature 6 ordering test (sync→finalize→activate) - test_gate2_5_nccl_destroy.py: stale-PG check downgraded to WARN (platform-specific) Fix: torch.cuda.synchronize() + barrier(nccl_group) before destroy prevents SIGABRT F4/F6 implementation fixes: - _cache_lock spans transport + NCCL teardown (sender group destroyed inside lock) - bucket_size_bytes explicit — RuntimeError if not configured (no 256 MB default) - Host-RAM check uses actual packed model size (in build_latest_bucket_cache) - finalize_weight_update moved from ModelUpdateService to pipeline (spec line 624) - sync_base_weights_to_active returns synced ranks; pipeline finalizes only those - is_lora: bool = False added to update_parameter_in_bucket + broadcast_parameter - VllmGeneration pass-through methods now await sub-worker futures (phase barriers) - Port claim released after full receiver NCCL teardown (spec lines 380-389) - Receiver-side destroy_collective_group added (Phase 4 in ModelUpdateService) - Trajectory collector registered as named Ray actor in grpo.py; pipeline resolves lazily - promote_active_checkpoint keyword arg fixed: version= not checkpoint_version= - model_update_transport param propagated to update_parameter_in_bucket call Docs: IMPLEMENTATION.md, DESIGN_F4_F6.md, GATE2_5_TRANSPORT_REVIEW.md, TASK2_REVIEW.md All 6 Gate 2.5 tests pass on 4x RTX A5000 (Vast.ai), all NCCL transport.

…5 test coverage table

…dex review fixes F6.3 CUDA IPC implementation: - sender (megatron_policy_worker.py): branches on model_update_transport; cuda_ipc path gets CUDA IPC handle via get_handle_from_tensor(), cpu_serialize sends CPU uint8 bucket directly - receiver (vllm_backend.py): uses self.rank (not dist.get_rank()) for IPC mask; cuda_ipc path reconstructs GPU tensor via rebuild_cuda_tensor (zero-copy, no GPU→CPU→GPU roundtrip); cpu_serialize uses pin_memory DMA F4.4 bucket-size guard: - build_latest_bucket_cache: fail-fast when single tensor > bucket_size_bytes (prevents silent VRAM budget violation matching ROLL send_recv_utils.py pattern) F6.6 trajectory collector ordering: - _expand_workers: set_weight_version called BEFORE expand_sampler (routing activation) — spec lines 602-608; previously published after routing was live Codex review fixes (IMPL_REVIEW_CUDA_IPC.md, IMPL_REVIEW_ROUND2.md, FINAL_REVIEW.md): - rank mask: self.rank instead of dist.get_rank() - cuda_ipc: zero-copy reconstruction, no roundtrip - oversized tensor: RuntimeError before append - ordering: version publish before routing activation New tests (all PASS on 4x RTX A5000): - test_gate2_5_cuda_ipc.py: real update_parameter_in_bucket call + cross-process IPC - test_gate2_5_bucket_size_guard.py: real _rlix_get_bucket_size_bytes + guard check - test_gate2_5_trajectory_collector.py: real source ordering + publish contracts Analysis doc: rlix/ROLL_VS_NEMO_ANALYSIS.md — ROLL vs NeMo port differences

…rdering as implemented

…che, Codex fixes)

….6 ordering fix

…tion, architectural constraints

…r F4+F6

…ync (Gate 2.5)

…clutter

…st guide

…the single doc

…ash mismatches on fresh install

…187620155 Add Claude Code GitHub Workflow

zhenyulincs added 30 commits April 13, 2026 21:52

fix(lfm): set offload_nccl=true required by rlix coordinator for GPU-…

4c402ed

…active clusters

fix(lfm): set actor_infer offload_nccl=true, keep actor_train=false f…

ed1a3fe

…or deepspeed

fix(deepspeed): skip promote_active_checkpoint in training loop for n…

818bc27

…on-Megatron

fix(config): restore offload_nccl=true on Megatron pipeline yamls

b4d9157

fix(vllm): disable TorchDynamo compilation to prevent stall on sm_120…

e997dd4

… (RTX 5090)

docs(task2): add TASK2_IMPLEMENTATION.md covering architecture, modul…

b154188

…es, and bugs

fix(tests): update test_gap_ratio assertions to match SchedGuidedAllo…

9fc84d6

…cationOp refactor gpus_to_allocate/dp_ranks_to_add were renamed to dp_rank_to_gpus_to_add (Dict) in the type definition; update the two affected tests to use the current API.

docs(task2): document destroy_model_parallel trap in time-sharing mode

0d5e0b9

fix(tests): import pipeline modules directly to avoid heavy package deps

5a60cbd

fix(tests): use state_dict() to capture tied weights (lm_head in Qwen)

123884b

docs(task2): document 3 bugs found during GPU integration testing

fd1c521

fix(gate2.5): pass device_id to init_process_group for PyTorch 2.5+ NCCL

48859ae

PyTorch 2.5+ requires device_id in init_process_group() when using NCCL backend, otherwise dist.barrier() spins at 100% CPU indefinitely. Fix applied to all three gate2.5 test files.

fix(gate2.5): batch all 291 buckets into 3 fixed-size broadcasts to a…

e4546c1

…void NCCL hang on SYS-topology PCIe

fix(gate2.5): use gloo group for large flat tensor broadcast to avoid…

7baf48e

… NCCL SYS-topology timeout

fix(gate2.5): use float32 hi/lo split at 2^20 for exact n_elems encod…

4827dae

…ing, avoiding bfloat16 precision loss

zhenyulincs added 24 commits April 24, 2026 00:08

docs: update DESIGN_F4_F6.md — reflect gloo→NCCL migration in Gate 2.…

2eae254

…5 test coverage table

docs: update DESIGN_F4_F6.md — mark F6.3 cuda_ipc, F4.4 guard, F6.6 o…

fa97e78

…rdering as implemented

chore: update NeMo submodule to rlix-task2 (F4+F6 CUDA IPC, bucket ca…

81042d0

…che, Codex fixes)

docs: update IMPLEMENTATION.md — cuda_ipc implemented, F4.4 guard, F6…

b09cd86

….6 ordering fix

docs: fix IMPLEMENTATION.md doc accuracy — cuda_ipc inline reconstruc…

1f8b463

…tion, architectural constraints

docs: add TASK2_README.md — setup, architecture, test instructions fo…

551043c

…r F4+F6

feat: merge task2-bucket-cache — F4 CPU bucket cache + F6 selective s…

6cc0aa8

…ync (Gate 2.5)

chore: update NeMo submodule pointer to main (post-merge)

4040093

docs: consolidate all task2 docs into single TASK2.md, remove review …

ac2662a

…clutter

docs: single TASK2.md, remove review docs

5cded83

docs: rewrite TASK2.md in Chinese — full F4/F6 spec→impl mapping + te…

f2155ed

…st guide

docs: Chinese TASK2.md with full spec→impl mapping

4bec114

docs: remove stale TASK2_README.md and TASK2_REVIEW.md — TASK2.md is …

ec1b0b8

…the single doc

docs: remove all review/compliance docs — TASK2.md is the single doc

8802040

docs: fix quick usage example — load bucket_cache.py directly

023fcb5

docs: fix usage example

bc3bc49

fix: update NeMo submodule pointer — nvidia-resiliency-ext hash sync

90995a1

fix: NeMo submodule pointer update (resiliency hash)

dcc64fd

test: add test_env_install.py — catches setup.py/pyproject.toml VCS h…

709362a

…ash mismatches on fresh install

test: add env install test

fe46b54

docs: add env check (step 0) to TASK2.md test protocol

51e6a75

docs+test: env install check in test protocol

8f24ee8

zhenyulincs requested review from Kynnet and TianyeGGBond April 26, 2026 04:48

zhenyulincs self-assigned this Apr 26, 2026

zhenyulincs added 3 commits April 26, 2026 00:13

"Claude PR Assistant workflow"

ab1a833

"Claude Code Review workflow"

2577a0a

Merge pull request #1 from zhenyulincs/add-claude-github-actions-1777…

fe59782

…187620155 Add Claude Code GitHub Workflow

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(task2): CPU bucket cache + selective weight sync (F4, F6-transport, Gate 2.5)#8

feat(task2): CPU bucket cache + selective weight sync (F4, F6-transport, Gate 2.5)#8
zhenyulincs wants to merge 76 commits intorlops:nemofrom
zhenyulincs:main

zhenyulincs commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhenyulincs commented Apr 26, 2026

What does this PR do ?

Issues

Usage

Clone with NeMo submodule

Required — no implicit default

Unit tests (no GPU)

Gate 2.5 integration tests (4× GPU)

Before your PR is "Ready for review"

Additional Information

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant