
[BUG] Qwen3-Next-80B-A3B on 8*H100 crashes with --colocate #1217

@huang3eng

Description


Describe the Bug

  1. When running RL training for Qwen3-Next-80B-A3B with Megatron + SGLang on 8 * H100, the job crashes with the error below whenever the colocated training/inference configuration (--colocate) is used. Falling back to running without --colocate works fine. (The same behavior also shows up in multi-node runs.) A standalone probe that exercises the same release path directly is sketched after the log below.

    (SGLangEngine pid=32887) [2025-12-25 10:14:03 TP7 EP7] Scheduler hit an exception: Traceback (most recent call last):
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
    (SGLangEngine pid=32887)     scheduler.event_loop_normal()
    (SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (SGLangEngine pid=32887)     return func(*args, **kwargs)
    (SGLangEngine pid=32887)            ^^^^^^^^^^^^^^^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
    (SGLangEngine pid=32887)     self.process_input_requests(recv_reqs)
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    (SGLangEngine pid=32887)     output = self._request_dispatcher(recv_req)
    (SGLangEngine pid=32887)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
    (SGLangEngine pid=32887)     return fn(obj)
    (SGLangEngine pid=32887)            ^^^^^^^
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 132, in release_memory_occupation
    (SGLangEngine pid=32887)     self.flush_cache()
    (SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2249, in flush_cache
    (SGLangEngine pid=32887)     torch.cuda.empty_cache()
    (SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
    (SGLangEngine pid=32887)     torch._C._cuda_emptyCache()
    (SGLangEngine pid=32887) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
    (SGLangEngine pid=32887) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
    (SGLangEngine pid=32887) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (SGLangEngine pid=32887) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (SGLangEngine pid=32887) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    (SGLangEngine pid=32887) 
    (SGLangEngine pid=32887) 
    (SGLangEngine pid=32887) [2025-12-25 10:14:03] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
    [... identical "Scheduler hit an exception" tracebacks (release_memory_occupation -> flush_cache -> torch.cuda.empty_cache -> CUDA illegal memory access) repeated for TP5/EP5, TP0/EP0, TP3/EP3, TP6/EP6, TP4/EP4, TP2/EP2, and TP1/EP1 ...]
    
    Traceback (most recent call last):
      File "/root/slime/train.py", line 106, in <module>
        train(args)
      File "/root/slime/train.py", line 24, in train
        rollout_manager, num_rollout_per_epoch = create_rollout_manager(args, pgs["rollout"])
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/ray/placement_group.py", line 180, in create_rollout_manager
        ray.get(rollout_manager.offload.remote())
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
        return fn(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2967, in get
        values, debugger_breakpoint = worker.get_objects(
                                      ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1015, in get_objects
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ConnectionError): ray::RolloutManager.offload() (pid=30384, ip=33.51.132.139, actor_id=cf24753ca82b0251339797c802000000, repr=<slime.ray.rollout.RolloutManager object at 0x7ed8534e43b0>)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/ray/rollout.py", line 129, in offload
        return ray.get([engine.release_memory_occupation.remote() for engine in self.rollout_engines])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^
                                      ^^^^^^^^^^^^^^^^^^^
    ray.exceptions.RayTaskError(ConnectionError): ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
                   ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
        response = conn.getresponse()
                   ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
        httplib_response = super().getresponse()
                           ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
        response.begin()
      File "/usr/lib/python3.12/http/client.py", line 331, in begin
        version, status, reason = self._read_status()
                                  ^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
        raise RemoteDisconnected("Remote end closed connection without"
    http.client.RemoteDisconnected: Remote end closed connection without response
    
    During handling of the above exception, another exception occurred:
    
    ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 644, in send
        resp = conn.urlopen(
               ^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 841, in urlopen
        retries = retries.increment(
                  ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/util/retry.py", line 474, in increment
        raise reraise(type(error), error, _stacktrace)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/util/util.py", line 38, in reraise
        raise value.with_traceback(tb)
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
        response = self._make_request(
                   ^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
        response = conn.getresponse()
                   ^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
        httplib_response = super().getresponse()
                           ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
        response.begin()
      File "/usr/lib/python3.12/http/client.py", line 331, in begin
        version, status, reason = self._read_status()
                                  ^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
        raise RemoteDisconnected("Remote end closed connection without"
    urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    
    During handling of the above exception, another exception occurred:
    
    ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 295, in release_memory_occupation
        return self._make_request("release_memory_occupation")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 194, in _make_request
        response = requests.post(url, json=payload or {})
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 115, in post
        return request("post", url, data=data, json=json, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 59, in request
        return session.request(method=method, url=url, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
        resp = self.send(prep, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
        r = adapter.send(request, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    
    ---------------------------------------
    Job 'raysubmit_MRihy95Zs5jJrdkw' failed
    ---------------------------------------
    
    Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
        resp = self.send(prep, **send_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
        r = adapter.send(request, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
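
  To help isolate where the illegal access comes from, the same memory-release path that RolloutManager.offload() drives can be exercised against a standalone SGLang server, outside the RL loop. The sketch below is a hypothetical probe rather than slime code: it assumes a server launched separately with the same TP/EP layout, listening on 127.0.0.1:30000, and that the /generate and /release_memory_occupation HTTP endpoints seen in the traceback are exposed on it.

    # Hypothetical standalone probe (not part of slime): hit the same HTTP
    # endpoints that the failing RolloutManager.offload() ->
    # SGLangEngine.release_memory_occupation() path uses. Assumes an SGLang
    # server was launched separately (e.g. --tp-size 8 --ep-size 8
    # --mem-fraction-static 0.8 --enable-memory-saver) and listens on BASE_URL.
    import requests

    BASE_URL = "http://127.0.0.1:30000"

    def post(endpoint: str, payload: dict | None = None) -> None:
        resp = requests.post(f"{BASE_URL}/{endpoint}", json=payload or {})
        resp.raise_for_status()
        print(endpoint, "->", resp.status_code)

    # Run one small generation first, then release; comparing "release right
    # after startup" vs. "release after a generation batch" shows whether the
    # crash needs prior inference activity.
    post("generate", {"text": "hello", "sampling_params": {"max_new_tokens": 8}})
    post("release_memory_occupation")
    post("resume_memory_occupation")

  If the standalone server crashes the same way on release, the problem reproduces without Megatron in the picture; if it does not, the interaction with the colocated training process is the more likely suspect.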
    
  2. The documentation notes that torch_memory_saver behaves differently depending on whether training and inference are colocated, so I wonder whether this error is caused by torch_memory_saver (see the probe sketched after the screenshot below).

[Screenshot of the documentation]
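
If torch_memory_saver is indeed the difference between the two modes, its pause/resume cycle (which only the colocated offload flow exercises) can be probed in isolation. The snippet below is a minimal sketch based on torch_memory_saver's basic documented usage, not on slime's or SGLang's actual integration; the exact API (tags, hook mode, whether the library must be preloaded) varies between versions, so treat it as an assumption to check against the installed package.

    # Minimal sketch of the pause/resume cycle that colocated mode relies on.
    # Assumption: basic torch_memory_saver usage (singleton with region() /
    # pause() / resume()); depending on the version this may need to run under
    # the library's preload/hook setup described in its README.
    import torch
    from torch_memory_saver import torch_memory_saver

    # Allocate a buffer inside a saver region so it becomes pausable.
    with torch_memory_saver.region():
        buf = torch.full((256 * 1024 * 1024,), 1, dtype=torch.uint8, device="cuda")

    free_before, _ = torch.cuda.mem_get_info()

    # pause() releases the physical memory behind the region; in the log above,
    # the subsequent flush_cache() -> torch.cuda.empty_cache() is where the
    # illegal memory access surfaces.
    torch_memory_saver.pause()
    torch.cuda.empty_cache()
    free_paused, _ = torch.cuda.mem_get_info()

    torch_memory_saver.resume()
    buf.fill_(2)  # touching the tensor after resume should work again
    torch.cuda.synchronize()
    print(f"free MiB before / while paused: {free_before >> 20} / {free_paused >> 20}")

Either outcome narrows things down: if this already faults on the same machine, the library (or its interaction with this driver/CUDA version) is implicated; if it is clean, the problem is more likely specific to how the Qwen3-Next engine state is released.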

Steps to Reproduce the Bug

  1. Model conversion

    PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
       tools/convert_hf_to_torch_dist.py \
       ${MODEL_ARGS[@]} \
       --hf-checkpoint /root/Qwen3-Next-80B-A3B-Thinking/ \
       --save /root/Qwen3-Next-80B-A3B-Thinking_torch_dist/
    
  2. Launch script (single node, 8 GPUs)

    #!/bin/bash
    export BASE_FOLDER=/root
    export MASTER_ADDR=$(hostname -I | awk '{print $1}')
    
    # for rerun the task
    pkill -9 sglang
    sleep 3
    ray stop --force
    pkill -9 ray
    pkill -9 python
    sleep 3
    pkill -9 ray
    pkill -9 python
    
    set -ex
    
    # if base folder not set raise error
    if [ -z "${BASE_FOLDER}" ]; then
      echo "BASE_FOLDER is not set. Please set it to the base directory of your checkpoints."
      exit 1
    fi
    
    if [ -z "${MASTER_ADDR}" ]; then
      echo "MASTER_ADDR is not set. Please set it to the master node address."
      exit 1
    fi
    
    # will prevent ray from buffering stdout/stderr
    export PYTHONUNBUFFERED=16
    
    NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
    if [ "$NVLINK_COUNT" -gt 0 ]; then
        HAS_NVLINK=1
    else
        HAS_NVLINK=0
    fi
    echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
    
    SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
    source "${SCRIPT_DIR}/models/qwen3-next-80B-A3B.sh"
    
    CKPT_ARGS=(
       --hf-checkpoint ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking
       --ref-load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_torch_dist
       --load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
       --save ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
       --save-interval 5
    )
    
    ROLLOUT_ARGS=(
       --prompt-data ${BASE_FOLDER}/dapo-math-17k/dapo-math-17k.jsonl
       --input-key prompt
       --label-key label
       --apply-chat-template
       --rollout-shuffle
       --rm-type deepscaler
       --num-rollout 300
       --rollout-batch-size 16
       --n-samples-per-prompt 4
       --rollout-max-response-len 8192
       --rollout-temperature 0.8
    
       --global-batch-size 64
       --balance-data
    )
    
    EVAL_ARGS=(
       --eval-interval 10
       --eval-prompt-data aime ${BASE_FOLDER}/aime-2024/aime-2024.jsonl
       --n-samples-per-eval-prompt 2
       --eval-max-response-len 16384
       --eval-top-p 0.7
    )
    
    PERF_ARGS=(
       --tensor-model-parallel-size 2
       --sequence-parallel
       --pipeline-model-parallel-size 1
       --context-parallel-size 1
       --expert-model-parallel-size 4
       --expert-tensor-parallel-size 1
    
       --recompute-granularity full
       --recompute-method uniform
       --recompute-num-layers 1
    
       # --micro-batch-size 1
       --use-dynamic-batch-size
       --max-tokens-per-gpu 2048
    )
    
    GRPO_ARGS=(
       --advantage-estimator gspo
       #--use-kl-loss
       --kl-loss-coef 0.00
       --kl-loss-type low_var_kl
       --kl-coef 0.00
       --entropy-coef 0.00
       --eps-clip 4e-4
    )
    
    OPTIMIZER_ARGS=(
       --optimizer adam
       --lr 1e-6
       --lr-decay-style constant
       --weight-decay 0.1
       --adam-beta1 0.9
       --adam-beta2 0.98
    
       --optimizer-cpu-offload
       --overlap-cpu-optimizer-d2h-h2d
       --use-precision-aware-optimizer
    )
    
    WANDB_ARGS=(
    #   --use-wandb
    #    --wandb-project slime-dev
    #    --wandb-group qwen3-next-80B-A3B-test
    #    --wandb-key ${WANDB_KEY}
    )
    
    SGLANG_ARGS=(
       --rollout-num-gpus-per-engine 8
      #  --rollout-num-gpus 2
       --sglang-mem-fraction-static 0.8
       --sglang-ep-size 8
       
       --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 128)
    
       # mtp
    #   --sglang-speculative-algorithm EAGLE
    #   --sglang-speculative-num-steps 2
    #   --sglang-speculative-eagle-topk 1
    #   --sglang-speculative-num-draft-tokens 3
    #   --sglang-enable-draft-weights-cpu-backup
    #
    #   --sglang-max-running-requests 512
    )
    
    MISC_ARGS=(
       # default dropout in megatron is 0.1
       --attention-dropout 0.0
       --hidden-dropout 0.0
       # should be good for model performance
       --accumulate-allreduce-grads-in-fp32
    #   --grad-reduce-in-bf16
       --attention-softmax-in-fp32
       # need to comment this when using model with MLA
       --attention-backend flash
    
       --moe-token-dispatcher-type alltoall
    #   --moe-enable-deepep
    #   --debug-rollout-only
    )
    
    # launch the master node of ray in container
    export no_proxy="127.0.0.1,${MASTER_ADDR}"
    ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
    for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
      if [[ "$WORKER_IP" == "$MLP_WORKER_0_HOST" ]]; then
        continue
      fi
      echo "Starting Ray worker on ${WORKER_IP}"
      ssh root@"${WORKER_IP}" \
        "pkill -9 sglang ; ray stop --force ; pkill -9 python ; ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 --node-ip-address ${WORKER_IP} --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265" &
    done
    wait
    
    # Build the runtime environment JSON with proper variable substitution
    RUNTIME_ENV_JSON="{
      \"env_vars\": {
        \"PYTHONPATH\": \"/root/Megatron-LM/\",
        \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
        \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\",
        \"no_proxy\": \"${no_proxy}\",
        \"MASTER_ADDR\": \"${MASTER_ADDR}\"
      }
    }"
    
    ray job submit --address="http://127.0.0.1:8265" \
       --runtime-env-json="${RUNTIME_ENV_JSON}" \
       -- python3 train.py \
       --actor-num-nodes 1 \
       --actor-num-gpus-per-node 8 \
       --colocate \
       ${MODEL_ARGS[@]} \
       ${CKPT_ARGS[@]} \
       ${ROLLOUT_ARGS[@]} \
       ${OPTIMIZER_ARGS[@]} \
       ${GRPO_ARGS[@]} \
       ${WANDB_ARGS[@]} \
       ${PERF_ARGS[@]} \
       ${EVAL_ARGS[@]} \
       ${SGLANG_ARGS[@]} \
       ${MISC_ARGS[@]}
    
  3. Full error log
    err.log

Expected Behavior

  1. Training runs without the errors above.

Environment Information

  1. 8*H100
  2. slime:nightly-dev-20251222b
