feat(task2): CPU bucket cache + selective weight sync (F4, F6-transport, Gate 2.5)#8
Open
zhenyulincs wants to merge 76 commits intorlops:nemofrom
Open
feat(task2): CPU bucket cache + selective weight sync (F4, F6-transport, Gate 2.5)#8zhenyulincs wants to merge 76 commits intorlops:nemofrom
zhenyulincs wants to merge 76 commits intorlops:nemofrom
Conversation
…results - Add run_rlix_experiment.py with scenarios A-F (Qwen FT, Dual FT, Multi-LoRA, FT+LoRA, LFM DeepSpeed single/dual) - Add lfm_finetune_pipeline1/2.yaml for LFM2.5-350M DeepSpeed training; includes DS_BUILD_OPS=0 env var (partial fix for sm_120a) - Update all pipeline yamls with VLLM_USE_FLASHINFER_SAMPLER=0 and other RTX 5090 / Blackwell compatibility fixes - Add RLIX_EXPERIMENT.md: architecture docs, benchmark results (v20 run), step-by-step timing, 8 bugs documented with root causes and fixes Results (4x RTX 5090, 2026-04-14 v20 run): A Single FT: 244s 2.4% util PASS B Dual FT: 312s 3.9% util PASS C Single Multi-LoRA: 367s 0.7% util PASS D FT+Multi-LoRA: 434s 1.8% util PASS E LFM Single FT: 105s 0.1% util FAIL (fused_adam JIT, see Bug 8) F LFM Dual FT: 106s 0.0% util FAIL (same)
ZeRO-2 without offload_optimizer causes ROLL to select FusedAdam, which fails to JIT-compile on RTX 5090 (sm_120a / Blackwell). Adding offload_optimizer.device=cpu makes is_offload() return True so ROLL selects DeepSpeedCPUAdam instead. No ROLL source modification needed.
…on strategies build_latest_bucket_cache and promote_active_checkpoint are Megatron-only. DeepSpeed training workers raise RuntimeError for these calls. Catch and skip gracefully so LFM/DeepSpeed pipelines can initialize correctly.
ROLL's ReloadableProcessGroup monkey-patch is incompatible with DeepSpeed's process group initialization. Skip the offload_nccl=True enforcement in the coordinator for deepspeed_* strategy clusters and set offload_nccl=false in LFM pipeline yamls. NCCL buffer overhead is negligible for 350M models.
…gies DeepSpeed does not implement _build_latest_bucket_cache so the bucket-cache weight sync crashes with CUDA illegal memory access when expanding infer workers. Use skip_load=True to route-only expand for deepspeed backends.
…ight sync DeepSpeed does not implement bucket-cache weight sync. Restricting actor_infer to the same GPUs as actor_train means all infer workers are IPC-accessible and no NCCL cross-GPU sync is needed. Pipeline 1: infer on [0,1]; Pipeline 2: infer on [2,3]. Reverts skip_load workaround.
…atron DeepSpeed weight sync is incompatible with rlix bucket-cache mechanism. E/F now use the same Qwen2.5-0.5B + Megatron config as A/B, providing consistent coverage across all 6 scenarios. NeMo will be added as a dedicated backend later.
- All 6 scenarios now pass on RTX 5090 Blackwell (sm_120a) after fixes - Update scenarios E/F descriptions from LFM/DeepSpeed to Qwen2.5-0.5B/Megatron - Add final v35/v37 results table showing all 6 scenarios PASS - Preserve v20 historical results as reference - Add Bugs 9-11: NCCL 2.26.2 Blackwell kernels, vLLM torch.dtype serialization, PyTorch 2.7.1 _coalescing_manager UnboundLocalError
Ports 4 modules from nemo-integration and adds BucketCacheLifecycle (new): - bucket_cache.py: thread-safe CPUBucketCache keyed by (param_name, shard_id) with dirty tracking and PP-shard support - bucket_receiver.py: BucketUpdateRequest/Result, PP-shard merge via torch.cat, fail-partial apply_bucket_update semantics - model_update_service_cached.py: owns cache, populates from PP workers, dispatches dirty-bucket sync to inference workers - bucket_cache_lifecycle.py: wraps ROLL's promote_active_checkpoint with _cache_ready_step version tracking; direct worker calls (no .remote()) for testability 64 unit tests across 4 test files; all passing without Ray or GPU. Fixes: worker calls use direct method (not .remote()) to be testable with plain Python fakes — pipeline layer handles Ray .remote() externally.
…cationOp refactor gpus_to_allocate/dp_ranks_to_add were renamed to dp_rank_to_gpus_to_add (Dict) in the type definition; update the two affected tests to use the current API.
- external/NeMo: zhenyulincs/RL fork at rlix-task2 branch
- tests/integration/test_bucket_cache_gpu.py: 4 test classes using
Qwen2.5-0.5B on real GPU:
* TestGPUMemoryRelease: verifies >=90% VRAM freed after offload
* TestWeightCorrectnessInCache: bit-for-bit match between GPU model
and CPUBucketCache contents
* TestBucketReceiverPush: weight correctness after apply_bucket_update
to CPU and GPU targets
* TestFullRoundTrip: end-to-end GPU→cache→offload→push→verify
- tests/integration/run_gpu_tests.sh: convenience deploy script
- store() takes keyword args: store(name, shard_id=0, tensor=t) - get_dirty_buckets() returns List[Bucket] not dict - BucketUpdateRequest.sync_id is str not int - Remove manual Bucket construction — pass get_dirty_buckets() directly
Part 1 (test_gate2_5_nccl_destroy.py, torchrun --nproc-per-node=2): - Megatron destroy_model_parallel() VRAM release ≥70% threshold - 5-cycle destroy/re-init stability (no leak, allreduce works after) - Stale process group raises after destroy (no silent corruption) - Requires: megatron-core Part 2 (test_gate2_5_selective_sync.py, torchrun --nproc-per-node=2): - Dynamic NCCL group create/use/destroy per sync cycle - rank 0 → rank 1 bucket broadcast from CPUBucketCache - Bit-exact weight verification on receiver - VRAM stable across 3 sync cycles Part 3 (test_gate2_5_qwen_train_sync.py, torchrun --nproc-per-node=4): - Real Qwen2.5-0.5B forward+backward on GPU 0,1 (TP=2 training) - SHA256 hash snapshot of all weights before sync - CPU bucket cache build on rank 0 - VRAM release ≥60% after model.cpu() + empty_cache - Dynamic NCCL broadcast to GPU 2,3 (inference ranks) - Bit-exact hash verification: training weights == received weights - 2 full steps to verify stability
Root cause of 600s timeout: broadcast_object_list unreliable over pure NCCL backend. New design uses same deterministic seed on both ranks — no Python object broadcast needed. Also fixed new_group ranks to [0, 1] (was incorrectly using [0, 2, 3] on 2-GPU setup). Changes: - Both ranks call make_weights(step=cycle) with same seed - Dynamic group uses ranks=[SENDER, RECEIVER]=[0, 1] - dist.barrier() immediately after init_process_group - SHA256 hash verification on receiver for bit-exact check
PyTorch 2.5+ requires device_id in init_process_group() when using NCCL backend, otherwise dist.barrier() spins at 100% CPU indefinitely. Fix applied to all three gate2.5 test files.
…w_group hang NCCL sub-group creation for [0,2,3] across SYS-topology GPUs (different PCIe root complexes) hangs with P2P disabled. Use world group broadcasts instead: all 4 ranks receive, inference ranks (2,3) retain the data. This validates the full CPU-bucket-cache → GPU broadcast pipeline. Also fixed torch.frombuffer with non-writable bytes by using bytearray().
…rt bugs NCCL on PCIe-SYS topology with P2P+SHM disabled fails with 'invalid usage' for int64 and uint8 broadcasts. Rewrite selective_sync to use only bfloat16: - Bucket count: bfloat16 scalar - Metadata: [name_len, n_elem×4_bytes, hash×16] all as bfloat16 (all values ≤ 255, exactly representable in bfloat16's 7-bit mantissa) - Name bytes: bfloat16 floats (ASCII, all < 128) - Tensor data: bfloat16 (unchanged) n_elements encoded as base-256 (4 bytes × 8 bits = 32 bits) to handle tensors with > 256 elements without float precision loss. Also replaced broadcast_object_list verification with in-protocol hash: hash embedded in metadata, verified locally on inference ranks.
…void NCCL hang on SYS-topology PCIe
… NCCL SYS-topology timeout
…ing, avoiding bfloat16 precision loss
Gate 2.5 test transport migration (gloo→NCCL): - test_gate2_5_selective_sync.py: NCCL group [0,2,3] proper subset of world - test_gate2_5_megatron_tp.py: NCCL groups [0,2] and [1,3] per TP shard - test_gate2_5_qwen_train_sync.py: NCCL group [0,2,3], gloo control-plane only - test_gate2_5_full.py: NCCL groups [0,2,3] phase-A and [1,2,3] phase-B - test_gate2_5_feature6.py: new Feature 6 ordering test (sync→finalize→activate) - test_gate2_5_nccl_destroy.py: stale-PG check downgraded to WARN (platform-specific) Fix: torch.cuda.synchronize() + barrier(nccl_group) before destroy prevents SIGABRT F4/F6 implementation fixes: - _cache_lock spans transport + NCCL teardown (sender group destroyed inside lock) - bucket_size_bytes explicit — RuntimeError if not configured (no 256 MB default) - Host-RAM check uses actual packed model size (in build_latest_bucket_cache) - finalize_weight_update moved from ModelUpdateService to pipeline (spec line 624) - sync_base_weights_to_active returns synced ranks; pipeline finalizes only those - is_lora: bool = False added to update_parameter_in_bucket + broadcast_parameter - VllmGeneration pass-through methods now await sub-worker futures (phase barriers) - Port claim released after full receiver NCCL teardown (spec lines 380-389) - Receiver-side destroy_collective_group added (Phase 4 in ModelUpdateService) - Trajectory collector registered as named Ray actor in grpo.py; pipeline resolves lazily - promote_active_checkpoint keyword arg fixed: version= not checkpoint_version= - model_update_transport param propagated to update_parameter_in_bucket call Docs: IMPLEMENTATION.md, DESIGN_F4_F6.md, GATE2_5_TRANSPORT_REVIEW.md, TASK2_REVIEW.md All 6 Gate 2.5 tests pass on 4x RTX A5000 (Vast.ai), all NCCL transport.
…5 test coverage table
…dex review fixes F6.3 CUDA IPC implementation: - sender (megatron_policy_worker.py): branches on model_update_transport; cuda_ipc path gets CUDA IPC handle via get_handle_from_tensor(), cpu_serialize sends CPU uint8 bucket directly - receiver (vllm_backend.py): uses self.rank (not dist.get_rank()) for IPC mask; cuda_ipc path reconstructs GPU tensor via rebuild_cuda_tensor (zero-copy, no GPU→CPU→GPU roundtrip); cpu_serialize uses pin_memory DMA F4.4 bucket-size guard: - build_latest_bucket_cache: fail-fast when single tensor > bucket_size_bytes (prevents silent VRAM budget violation matching ROLL send_recv_utils.py pattern) F6.6 trajectory collector ordering: - _expand_workers: set_weight_version called BEFORE expand_sampler (routing activation) — spec lines 602-608; previously published after routing was live Codex review fixes (IMPL_REVIEW_CUDA_IPC.md, IMPL_REVIEW_ROUND2.md, FINAL_REVIEW.md): - rank mask: self.rank instead of dist.get_rank() - cuda_ipc: zero-copy reconstruction, no roundtrip - oversized tensor: RuntimeError before append - ordering: version publish before routing activation New tests (all PASS on 4x RTX A5000): - test_gate2_5_cuda_ipc.py: real update_parameter_in_bucket call + cross-process IPC - test_gate2_5_bucket_size_guard.py: real _rlix_get_bucket_size_bytes + guard check - test_gate2_5_trajectory_collector.py: real source ordering + publish contracts Analysis doc: rlix/ROLL_VS_NEMO_ANALYSIS.md — ROLL vs NeMo port differences
…rdering as implemented
…che, Codex fixes)
…tion, architectural constraints
…ash mismatches on fresh install
…187620155 Add Claude Code GitHub Workflow
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Implements Task 2 from plans/nemorl-port-plan.md: the training-side CPU bucket cache (Feature 4) and the selective weight sync transport layer (Feature 6) needed for GPU time-sharing between training and inference workers.
After each training step, model weights are packed into CPU-resident BucketRecord objects and held in a versioned cache. ModelUpdateService then transfers these weights to specific inference workers on demand — using CUDA IPC
for same-GPU colocated workers and dynamic NCCL broadcast for cross-GPU workers — without ever reloading the full model to the sender GPU.
All Gate 2.5 integration tests pass on 4× RTX A5000.
Issues
Usage
Clone with NeMo submodule
git clone https://github.com/zhenyulincs/rlix.git --recurse-submodules cd rlixRequired — no implicit default
Unit tests (no GPU)
python -m pytest tests/test_bucket_cache.py tests/test_bucket_cache_lifecycle.py \ tests/test_model_update_service.py tests/test_nemo_rl_pipeline.pyGate 2.5 integration tests (4× GPU)
export NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 torchrun --nproc-per-node=4 tests/integration/test_gate2_5_selective_sync.py torchrun --nproc-per-node=4 tests/integration/test_gate2_5_megatron_tp.py HF_HUB_OFFLINE=1 torchrun --nproc-per-node=4 tests/integration/test_gate2_5_qwen_train_sync.py torchrun --nproc-per-node=4 tests/integration/test_gate2_5_feature6.pyBefore your PR is "Ready for review"
Pre checks:
Additional Information
New files: rlix/pipeline/bucket_cache.py, bucket_cache_lifecycle.py, model_update_service.py, coordinator.py
Modified: rlix/pipeline/full_finetune_pipeline.py, rlix/protocol/coordinator.py