Describe the Bug
Trying to run RL training of Qwen3-Next-80B-A3B with Megatron + SGLang on 8 × H100. With the colocated training/inference configuration (--colocate), the job fails with the error below; as a fallback, dropping --colocate lets it run normally. (The same failure also shows up in multi-node runs.)
(SGLangEngine pid=32887) [2025-12-25 10:14:03 TP7 EP7] Scheduler hit an exception: Traceback (most recent call last):
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2683, in run_scheduler_process
(SGLangEngine pid=32887)     scheduler.event_loop_normal()
(SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(SGLangEngine pid=32887)     return func(*args, **kwargs)
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 974, in event_loop_normal
(SGLangEngine pid=32887)     self.process_input_requests(recv_reqs)
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
(SGLangEngine pid=32887)     output = self._request_dispatcher(recv_req)
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/utils.py", line 507, in __call__
(SGLangEngine pid=32887)     return fn(obj)
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 132, in release_memory_occupation
(SGLangEngine pid=32887)     self.flush_cache()
(SGLangEngine pid=32887)   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2249, in flush_cache
(SGLangEngine pid=32887)     torch.cuda.empty_cache()
(SGLangEngine pid=32887)   File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 224, in empty_cache
(SGLangEngine pid=32887)     torch._C._cuda_emptyCache()
(SGLangEngine pid=32887) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(SGLangEngine pid=32887) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(SGLangEngine pid=32887) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(SGLangEngine pid=32887) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(SGLangEngine pid=32887) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(SGLangEngine pid=32887) [2025-12-25 10:14:03] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
(The remaining ranks, TP0 EP0 through TP6 EP6, report the identical traceback at the same timestamp; the repeated logs are omitted here.)
Traceback (most recent call last):
  File "/root/slime/train.py", line 106, in <module>
    train(args)
  File "/root/slime/train.py", line 24, in train
    rollout_manager, num_rollout_per_epoch = create_rollout_manager(args, pgs["rollout"])
  File "/root/slime/slime/ray/placement_group.py", line 180, in create_rollout_manager
    ray.get(rollout_manager.offload.remote())
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2967, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1015, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ConnectionError): ray::RolloutManager.offload() (pid=30384, ip=33.51.132.139, actor_id=cf24753ca82b0251339797c802000000, repr=<slime.ray.rollout.RolloutManager object at 0x7ed8534e43b0>)
  File "/root/slime/slime/ray/rollout.py", line 129, in offload
    return ray.get([engine.release_memory_occupation.remote() for engine in self.rollout_engines])
ray.exceptions.RayTaskError(ConnectionError): ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
  File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.12/dist-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.12/dist-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.12/dist-packages/urllib3/connection.py", line 571, in getresponse
    httplib_response = super().getresponse()
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ray::SGLangEngine.release_memory_occupation() (pid=32887, ip=33.51.132.139, actor_id=0226147d06f3e1dcf22a844502000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f0366d41400>)
  File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 295, in release_memory_occupation
    return self._make_request("release_memory_occupation")
  File "/root/slime/slime/backends/sglang_utils/sglang_engine.py", line 194, in _make_request
    response = requests.post(url, json=payload or {})
  File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

--------------------------------------- Job 'raysubmit_MRihy95Zs5jJrdkw' failed ---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/requests/adapters.py", line 659, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
The documentation notes that torch_memory_saver behaves differently depending on whether training and inference are colocated, so I wonder whether this error is caused by torch_memory_saver.
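One way to test that hypothesis outside slime (a sketch only; the flags and endpoint below are the public SGLang server ones and have not been verified against this exact setup) is to start a standalone SGLang server with the memory saver enabled and call the same HTTP endpoint that slime's offload path uses:

# Launch a standalone server with torch_memory_saver enabled (same TP/EP/mem-fraction as the failing run)
python3 -m sglang.launch_server \
    --model-path /root/Qwen3-Next-80B-A3B-Thinking \
    --tp-size 8 \
    --ep-size 8 \
    --mem-fraction-static 0.8 \
    --enable-memory-saver \
    --port 30000 &

# Once the server is up, trigger the same call as slime's release_memory_occupation()
curl -X POST http://127.0.0.1:30000/release_memory_occupation \
     -H "Content-Type: application/json" -d '{}'

If the illegal memory access reproduces here, the bug would be on the SGLang / torch_memory_saver side rather than in slime's colocated setup.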
Steps to Reproduce the Bug
Model conversion:
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-Next-80B-A3B-Thinking/ \
    --save /root/Qwen3-Next-80B-A3B-Thinking_torch_dist/
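${MODEL_ARGS[@]} is assumed to come from the same model config file that the launch script below sources, i.e. something like (path relative to the slime repo root, adjust as needed):

# Assumption: MODEL_ARGS is defined in slime's model config script; source it before the conversion command
source scripts/models/qwen3-next-80B-A3B.sh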
Single-node, 8-GPU launch script:
#!/bin/bash
export BASE_FOLDER=/root
export MASTER_ADDR=$(hostname -I | awk '{print $1}')

# for rerun the task
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python

set -ex

# if base folder not set raise error
if [ -z "${BASE_FOLDER}" ]; then
  echo "BASE_FOLDER is not set. Please set it to the base directory of your checkpoints."
  exit 1
fi

if [ -z "${MASTER_ADDR}" ]; then
  echo "MASTER_ADDR is not set. Please set it to the master node address."
  exit 1
fi

# will prevent ray from buffering stdout/stderr
export PYTHONBUFFERED=16

NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
if [ "$NVLINK_COUNT" -gt 0 ]; then
  HAS_NVLINK=1
else
  HAS_NVLINK=0
fi
echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-next-80B-A3B.sh"

CKPT_ARGS=(
   --hf-checkpoint ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking
   --ref-load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_torch_dist
   --load ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
   --save ${BASE_FOLDER}/Qwen3-Next-80B-A3B-Thinking_slime/
   --save-interval 5
)

ROLLOUT_ARGS=(
   --prompt-data ${BASE_FOLDER}/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 300
   --rollout-batch-size 16
   --n-samples-per-prompt 4
   --rollout-max-response-len 8192
   --rollout-temperature 0.8
   --global-batch-size 64
   --balance-data
)

EVAL_ARGS=(
   --eval-interval 10
   --eval-prompt-data aime ${BASE_FOLDER}/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 2
   --eval-max-response-len 16384
   --eval-top-p 0.7
)

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 4
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   # --micro-batch-size 1
   --use-dynamic-batch-size
   --max-tokens-per-gpu 2048
)

GRPO_ARGS=(
   --advantage-estimator gspo
   #--use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --kl-coef 0.00
   --entropy-coef 0.00
   --eps-clip 4e-4
)

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)

WANDB_ARGS=(
   # --use-wandb
   # --wandb-project slime-dev
   # --wandb-group qwen3-next-80B-A3B-test
   # --wandb-key ${WANDB_KEY}
)

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 8
   # --rollout-num-gpus 2
   --sglang-mem-fraction-static 0.8
   --sglang-ep-size 8
   --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 128)

   # mtp
   # --sglang-speculative-algorithm EAGLE
   # --sglang-speculative-num-steps 2
   # --sglang-speculative-eagle-topk 1
   # --sglang-speculative-num-draft-tokens 3
   # --sglang-enable-draft-weights-cpu-backup
   #
   # --sglang-max-running-requests 512
)

MISC_ARGS=(
   # default dropout in megatron is 0.1
   --attention-dropout 0.0
   --hidden-dropout 0.0
   # should be good for model performance
   --accumulate-allreduce-grads-in-fp32
   # --grad-reduce-in-bf16
   --attention-softmax-in-fp32
   # need to comment this when using model with MLA
   --attention-backend flash
   --moe-token-dispatcher-type alltoall
   # --moe-enable-deepep
   # --debug-rollout-only
)

# launch the master node of ray in container
export no_proxy="127.0.0.1,${MASTER_ADDR}"
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265

for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
  if [[ "$WORKER_IP" == "$MLP_WORKER_0_HOST" ]]; then
    continue
  fi
  echo "Starting Ray worker on ${WORKER_IP}"
  ssh root@"${WORKER_IP}" \
    "pkill -9 sglang ; ray stop --force ; pkill -9 python ; ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 --node-ip-address ${WORKER_IP} --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265" &
done
wait

# Build the runtime environment JSON with proper variable substitution
RUNTIME_ENV_JSON="{
  \"env_vars\": {
    \"PYTHONPATH\": \"/root/Megatron-LM/\",
    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\",
    \"no_proxy\": \"${no_proxy}\",
    \"MASTER_ADDR\": \"${MASTER_ADDR}\"
  }
}"

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${WANDB_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}
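For comparison, the run that works is the same submission without --colocate, with dedicated GPUs reserved for the rollout engines. A sketch of that variant (the 4/4 GPU split and the matching --rollout-num-gpus-per-engine adjustment in SGLANG_ARGS are illustrative, not taken from my actual runs):

# Disaggregated variant: drop --colocate and reserve GPUs for rollout via --rollout-num-gpus
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 4 \
   --rollout-num-gpus 4 \
   ${MODEL_ARGS[@]} ${CKPT_ARGS[@]} ${ROLLOUT_ARGS[@]} ${OPTIMIZER_ARGS[@]} ${GRPO_ARGS[@]} \
   ${WANDB_ARGS[@]} ${PERF_ARGS[@]} ${EVAL_ARGS[@]} ${SGLANG_ARGS[@]} ${MISC_ARGS[@]}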
Full error log:
err.log
Expected Behavior
- The colocated run completes without these errors.
Environment Information
- 8*H100
- slime:nightly-dev-20251222b