
[BUG] Qwen3-Next-80B-A3B on 8*H100 crashes with --colocate #1217

@huang3eng

Description


Describe the Bug

  1. When running RL training for Qwen3-Next-80B-A3B with Megatron + SGLang on 8 * H100, the job crashes with the error below whenever the colocated training/inference configuration (--colocate) is used. Falling back to running without --colocate works fine. (The same behavior also shows up in multi-node runs.) A standalone probe that exercises the same release path directly is sketched after the log below.

    (SGLangEngine pid=32887) [2025-12-25 10:14:03 TP7 EP7] Scheduler hit an exception: Traceback (most recent call last):
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
    (SGLangEngine pid=32887)     scheduler.event_loop_normal()
    (SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (SGLangEngine pid=32887)     return func(*args, **kwargs)
    (SGLangEngine pid=32887)            ^^^^^^^^^^^^^^^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
    (SGLangEngine pid=32887)     self.process_input_requests(recv_reqs)
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    (SGLangEngine pid=32887)     output = self._request_dispatcher(recv_req)
    (SGLangEngine pid=32887)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
    (SGLangEngine pid=32887)     return fn(obj)
    (SGLangEngine pid=32887)            ^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 132, in release_memory_occupation
    (SGLangEngine pid=32887)     self.flush_cache()
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2249, in flush_cache
    (SGLangEngine pid=32887)     torch.cuda.empty_cache()
    (SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
    (SGLangEngine pid=32887)     torch._C._cuda_emptyCache()
    (SGLangEngine pid=32887) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
    (SGLangEngine pid=32887) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
    (SGLangEngine pid=32887) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (SGLangEngine pid=32887) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (SGLangEngine pid=32887) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    (SGLangEngine pid=32887) 
    (SGLangEngine pid=32887) 
    (SGLangEngine pid=32887) [2025-12-25 10:14:03] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
    [... identical "Scheduler hit an exception" tracebacks (release_memory_occupation -> flush_cache -> torch.cuda.empty_cache -> CUDA illegal memory access) repeated for TP5/EP5, TP0/EP0, TP3/EP3, TP6/EP6, TP4/EP4, TP2/EP2, and TP1/EP1 ...]
    
    Traceback (most recent call last):
      File "/root/slime/train.py", line 106, in <module>
        train(args)
      File "/root/slime/train.py", line 24, in train
        rollout_manager, num_rollout_per_epoch = create_rollout_manager(args, pgs["rollout"])
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/ray/placement_group.py", line 180, in create_rollout_manager
        ray.get(rollout_manager.offload.remote())
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
        return fn(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2967, in get
        values, debugger_breakpoint = worker.get_objects(
                                      ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1015, in get_objects
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ConnectionError): ray::RolloutManager.offload() (pid=30384, ip=33.51.132.139, actor_id=cf24753ca82b0251339797c802000000, repr=<slime.ray.rollout.RolloutManager object at 0x7ed8534e43b0>)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/ray/rollout.py", line 129, in offload
        return ray.get([engine.release_memory_occupation.remote() for engine in self.rollout_engines])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^
                                      ^^^^^^^^^^^^^^^^^^^
    ray.exceptions.RayTaskError(ConnectionError): ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
                   ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
        response = conn.getresponse()
                   ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
        httplib_response = super().getresponse()
                           ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
        response.begin()
      File "/usr/lib/python3.12/http/client.py", line 331, in begin
        version, status, reason = self._read_status()
                                  ^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
        raise RemoteDisconnected("Remote end closed connection without"
    http.client.RemoteDisconnected: Remote end closed connection without response
    
    During handling of the above exception, another exception occurred:
    
    ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 644, in send
        resp = conn.urlopen(
               ^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 841, in urlopen
        retries = retries.increment(
                  ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/util/retry.py", line 474, in increment
        raise reraise(type(error), error, _stacktrace)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/util/util.py", line 38, in reraise
        raise value.with_traceback(tb)
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
        response = self._make_request(
                   ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
        response = conn.getresponse()
                   ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
        httplib_response = super().getresponse()
                           ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
        response.begin()
      File "/usr/lib/python3.12/http/client.py", line 331, in begin
        version, status, reason = self._read_status()
                                  ^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
        raise RemoteDisconnected("Remote end closed connection without"
    urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    
    During handling of the above exception, another exception occurred:
    
    ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 295, in release_memory_occupation
        return self._make_request("release_memory_occupation")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 194, in _make_request
        response = requests.post(url, json=payload or {})
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 115, in post
        return request("post", url, data=data, json=json, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 59, in request
        return session.request(method=method, url=url, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
        resp = self.send(prep, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
        r = adapter.send(request, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    
    ---------------------------------------
    Job 'raysubmit_MRihy95Zs5jJrdkw' failed
    ---------------------------------------
    
    Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
        resp = self.send(prep, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
        r = adapter.send(request, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
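
  To help isolate where the illegal access comes from, the same memory-release path that RolloutManager.offload() drives can be exercised against a standalone SGLang server, outside the RL loop. The sketch below is a hypothetical probe rather than slime code: it assumes a server launched separately with the same TP/EP layout, listening on 127.0.0.1:30000, and that the /generate and /release_memory_occupation HTTP endpoints seen in the traceback are exposed on it.

    # Hypothetical standalone probe (not part of slime): hit the same HTTP
    # endpoints that the failing RolloutManager.offload() ->
    # SGLangEngine.release_memory_occupation() path uses. Assumes an SGLang
    # server was launched separately (e.g. --tp-size 8 --ep-size 8
    # --mem-fraction-static 0.8 --enable-memory-saver) and listens on BASE_URL.
    import requests

    BASE_URL = "http://127.0.0.1:30000"

    def post(endpoint: str, payload: dict | None = None) -> None:
        resp = requests.post(f"{BASE_URL}/{endpoint}", json=payload or {})
        resp.raise_for_status()
        print(endpoint, "->", resp.status_code)

    # Run one small generation first, then release; comparing "release right
    # after startup" vs. "release after a generation batch" shows whether the
    # crash needs prior inference activity.
    post("generate", {"text": "hello", "sampling_params": {"max_new_tokens": 8}})
    post("release_memory_occupation")
    post("resume_memory_occupation")

  If the standalone server crashes the same way on release, the problem reproduces without Megatron in the picture; if it does not, the interaction with the colocated training process is the more likely suspect.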
    
  2. The documentation notes that torch_memory_saver behaves differently depending on whether training and inference are colocated, so I wonder whether this error is caused by torch_memory_saver (see the probe sketched after the screenshot below).

[Screenshot of the documentation]
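
If torch_memory_saver is indeed the difference between the two modes, its pause/resume cycle (which only the colocated offload flow exercises) can be probed in isolation. The snippet below is a minimal sketch based on torch_memory_saver's basic documented usage, not on slime's or SGLang's actual integration; the exact API (tags, hook mode, whether the library must be preloaded) varies between versions, so treat it as an assumption to check against the installed package.

    # Minimal sketch of the pause/resume cycle that colocated mode relies on.
    # Assumption: basic torch_memory_saver usage (singleton with region() /
    # pause() / resume()); depending on the version this may need to run under
    # the library's preload/hook setup described in its README.
    import torch
    from torch_memory_saver import torch_memory_saver

    # Allocate a buffer inside a saver region so it becomes pausable.
    with torch_memory_saver.region():
        buf = torch.full((256 * 1024 * 1024,), 1, dtype=torch.uint8, device="cuda")

    free_before, _ = torch.cuda.mem_get_info()

    # pause() releases the physical memory behind the region; in the log above,
    # the subsequent flush_cache() -> torch.cuda.empty_cache() is where the
    # illegal memory access surfaces.
    torch_memory_saver.pause()
    torch.cuda.empty_cache()
    free_paused, _ = torch.cuda.mem_get_info()

    torch_memory_saver.resume()
    buf.fill_(2)  # touching the tensor after resume should work again
    torch.cuda.synchronize()
    print(f"free MiB before / while paused: {free_before >> 20} / {free_paused >> 20}")

Either outcome narrows things down: if this already faults on the same machine, the library (or its interaction with this driver/CUDA version) is implicated; if it is clean, the problem is more likely specific to how the Qwen3-Next engine state is released.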

Steps to Reproduce the Bug

  1. Model conversion

    PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
       tools/convert_hf_to_torch_dist.py \
       ${MODEL_ARGS[@]} \
       --hf-checkpoint /root/Qwen3-Next-80B-A3B-Thinking/ \
       --save /root/Qwen3-Next-80B-A3B-Thinking_torch_dist/
    
  2. Launch script (single node, 8 GPUs)

    #!/bin/bash
    export BASE_FOLDER=/root
    export MASTER_ADDR=$(hostname -I | awk '{print $1}')
    
    # for rerun the task
    pkill -9 sglang
    sleep 3
    ray stop --force
    pkill -9 ray
    pkill -9 python
    sleep 3
    pkill -9 ray
    pkill -9 python
    
    set -ex
    
    # if base folder not set raise error
    if [ -z "${BASE_FOLDER}" ]; then
      echo "BASE_FOLDER is not set. Please set it to the base directory of your checkpoints."
      exit 1
    fi
    
    if [ -z "${MASTER_ADDR}" ]; then
      echo "MASTER_ADDR is not set. Please set it to the master node address."
      exit 1
    fi
    
    # will prevent ray from buffering stdout/stderr
    export PYTHONUNBUFFERED=16
    
    NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
    if [ "$NVLINK_COUNT" -gt 0 ]; then
        HAS_NVLINK=1
    else
        HAS_NVLINK=0
    fi
    echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
    
    SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
    source "${SCRIPT_DIR}/models/qwen3-next-80B-A3B.sh"
    
    CKPT_ARGS=(
       --hf-checkpoint ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking
       --ref-load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_torch_dist
       --load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
       --save ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
       --save-interval 5
    )
    
    ROLLOUT_ARGS=(
       --prompt-data ${BASE_FOLDER}/dapo-math-17k/dapo-math-17k.jsonl
       --input-key prompt
       --label-key label
       --apply-chat-template
       --rollout-shuffle
       --rm-type deepscaler
       --num-rollout 300
       --rollout-batch-size 16
       --n-samples-per-prompt 4
       --rollout-max-response-len 8192
       --rollout-temperature 0.8
    
       --global-batch-size 64
       --balance-data
    )
    
    EVAL_ARGS=(
       --eval-interval 10
       --eval-prompt-data aime ${BASE_FOLDER}/aime-2024/aime-2024.jsonl
       --n-samples-per-eval-prompt 2
       --eval-max-response-len 16384
       --eval-top-p 0.7
    )
    
    PERF_ARGS=(
       --tensor-model-parallel-size 2
       --sequence-parallel
       --pipeline-model-parallel-size 1
       --context-parallel-size 1
       --expert-model-parallel-size 4
       --expert-tensor-parallel-size 1
    
       --recompute-granularity full
       --recompute-method uniform
       --recompute-num-layers 1
    
       # --micro-batch-size 1
       --use-dynamic-batch-size
       --max-tokens-per-gpu 2048
    )
    
    GRPO_ARGS=(
       --advantage-estimator gspo
       #--use-kl-loss
       --kl-loss-coef 0.00
       --kl-loss-type low_var_kl
       --kl-coef 0.00
       --entropy-coef 0.00
       --eps-clip 4e-4
    )
    
    OPTIMIZER_ARGS=(
       --optimizer adam
       --lr 1e-6
       --lr-decay-style constant
       --weight-decay 0.1
       --adam-beta1 0.9
       --adam-beta2 0.98
    
       --optimizer-cpu-offload
       --overlap-cpu-optimizer-d2h-h2d
       --use-precision-aware-optimizer
    )
    
    WANDB_ARGS=(
    #   --use-wandb
    #    --wandb-project slime-dev
    #    --wandb-group qwen3-next-80B-A3B-test
    #    --wandb-key ${WANDB_KEY}
    )
    
    SGLANG_ARGS=(
       --rollout-num-gpus-per-engine 8
      #  --rollout-num-gpus 2
       --sglang-mem-fraction-static 0.8
       --sglang-ep-size 8
       
       --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 128)
    
       # mtp
    #   --sglang-speculative-algorithm EAGLE
    #   --sglang-speculative-num-steps 2
    #   --sglang-speculative-eagle-topk 1
    #   --sglang-speculative-num-draft-tokens 3
    #   --sglang-enable-draft-weights-cpu-backup
    #
    #   --sglang-max-running-requests 512
    )
    
    MISC_ARGS=(
       # default dropout in megatron is 0.1
       --attention-dropout 0.0
       --hidden-dropout 0.0
       # should be good for model performance
       --accumulate-allreduce-grads-in-fp32
    #   --grad-reduce-in-bf16
       --attention-softmax-in-fp32
       # need to comment this when using model with MLA
       --attention-backend flash
    
       --moe-token-dispatcher-type alltoall
    #   --moe-enable-deepep
    #   --debug-rollout-only
    )
    
    # launch the master node of ray in container
    export no_proxy="127.0.0.1,${MASTER_ADDR}"
    ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
    for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
      if [[ "$WORKER_IP" == "$MLP_WORKER_0_HOST" ]]; then
        continue
      fi
      echo "Starting Ray worker on ${WORKER_IP}"
      ssh root@"${WORKER_IP}" \
        "pkill -9 sglang ; ray stop --force ; pkill -9 python ; ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 --node-ip-address ${WORKER_IP} --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265" &
    done
    wait
    
    # Build the runtime environment JSON with proper variable substitution
    RUNTIME_ENV_JSON="{
      \"env_vars\": {
        \"PYTHONPATH\": \"/root/Megatron-LM/\",
        \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
        \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\",
        \"no_proxy\": \"${no_proxy}\",
        \"MASTER_ADDR\": \"${MASTER_ADDR}\"
      }
    }"
    
    ray job submit --address="http://127.0.0.1:8265" \
       --runtime-env-json="${RUNTIME_ENV_JSON}" \
       -- python3 train.py \
       --actor-num-nodes 1 \
       --actor-num-gpus-per-node 8 \
       --colocate \
       ${MODEL_ARGS[@]} \
       ${CKPT_ARGS[@]} \
       ${ROLLOUT_ARGS[@]} \
       ${OPTIMIZER_ARGS[@]} \
       ${GRPO_ARGS[@]} \
       ${WANDB_ARGS[@]} \
       ${PERF_ARGS[@]} \
       ${EVAL_ARGS[@]} \
       ${SGLANG_ARGS[@]} \
       ${MISC_ARGS[@]}
    
  3. Full error log
    err.log

Expected Behavior

  1. Training runs without the errors above.

Environment Information

  1. 8*H100
  2. slime:nightly-dev-20251222b
