Closed
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)
Description
Describe the bug
Only on RAPIDS 25.02a with CUDA 12.8, I get this NCCL error when trying to fit KMeans on a Dask cluster: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435. This happened on an H100.
This affects the KMeans MNMG notebook on an ARM SBSA system equipped with an H100. Tested on Python 3.12 and 3.11. An x86-based B100 seems to work with the same docker run commands.
Steps/Code to reproduce bug
from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask_ml.cluster import KMeans as skKMeans
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)
n_samples = 1000000
n_features = 2
n_total_partitions = len(list(client.has_what().keys()))
X_dca, Y_dca = make_blobs(n_samples,
                          n_features,
                          centers=5,
                          n_parts=n_total_partitions,
                          cluster_std=0.1,
                          verbose=True)
kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)
kmeans_cuml.fit(X_dca)
Outputs
2025-02-10 21:48:32,918 - distributed.worker - ERROR - Compute Failed
Key: _func_fit-95283355-8bff-49c7-8be7-e345c680da67
State: executing
Task: <Task '_func_fit-95283355-8bff-49c7-8be7-e345c680da67' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: ' File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n ret = func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'
2025-02-10 21:48:32,920 - distributed.worker - ERROR - Compute Failed
Key: _func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1
State: executing
Task: <Task '_func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: ' File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n ret = func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File <timed exec>:5
File /opt/conda/lib/python3.12/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
85 if GPU_ENABLED:
86 with cupy_using_allocator(rmm_cupy_allocator):
---> 87 return func(*args, **kwargs)
88 return func(*args, **kwargs)
File /opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py:175, in KMeans.fit(self, X, sample_weight)
159 comms.init(workers=data.workers)
161 kmeans_fit = [
162 self.client.submit(
163 KMeans._func_fit,
(...)
172 for idx, wf in enumerate(data.worker_to_parts.items())
173 ]
--> 175 wait_and_raise_from_futures(kmeans_fit)
177 comms.destroy()
179 _results = [res.result() for res in kmeans_fit]
File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:164, in wait_and_raise_from_futures(futures)
159 """
160 Returns the collected futures after all the futures
161 have finished and do not indicate any exceptions.
162 """
163 wait(futures)
--> 164 raise_exception_from_futures(futures)
165 return futures
File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:152, in raise_exception_from_futures(futures)
150 errs = [f.exception() for f in futures if f.exception()]
151 if errs:
--> 152 raise RuntimeError(
153 "%d of %d worker jobs failed: %s"
154 % (len(errs), len(futures), ", ".join(map(str, errs)))
155 )
RuntimeError: 2 of 2 worker jobs failed: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: , NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435:
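The NCCL message above is empty after `line=435:`, so a rerun with NCCL's own logging turned on may show the underlying failure. The following is a minimal sketch, not a confirmed fix. It assumes the Dask worker processes inherit environment variables from the process that launches them, and `enable_nccl_debug` is a hypothetical helper name. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables.

```python
import os


def enable_nccl_debug(subsys="INIT,NET"):
    """Set NCCL logging variables in the current process.

    Hypothetical helper for illustration. Returns the settings applied.
    """
    settings = {
        "NCCL_DEBUG": "INFO",         # NCCL log verbosity
        "NCCL_DEBUG_SUBSYS": subsys,  # restrict logging to these subsystems
    }
    os.environ.update(settings)
    return settings


applied = enable_nccl_debug()
# Assumption: create LocalCUDACluster and Client only after this call,
# so the workers start with the variables already in their environment.
```

With this in place before the cluster is created, the worker logs would be expected to include NCCL's initialization and network messages alongside the `Compute Failed` entries.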
Expected behavior
It should fit the sample data, as it does on x86 and on other CUDA releases.
Environment details (please complete the following information):
- Environment location: [Docker]
- Linux Distro/Architecture: [Ubuntu 24.04 arm64]
- GPU Model/Driver: [H100 and driver 535.161.08]
- CUDA: [12.8]
- Method of cuDF & cuML install: [conda, Docker, or from source]
- If method of install is [Docker], provide docker pull & docker run commands used: docker run --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/notebooks:25.02a-cuda12.8-py3.12 (also tested py3.11)
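Since the failure appears to depend on CPU architecture (ARM SBSA versus x86), it may help to record the architecture and Python version from inside the container. This is a minimal standard-library sketch. `collect_host_info` is a hypothetical helper name, and it does not query the GPU, driver, or CUDA version.

```python
import platform


def collect_host_info():
    """Return basic host details relevant to the fields listed above."""
    return {
        "arch": platform.machine(),           # CPU architecture string
        "python": platform.python_version(),  # Python version string
        "platform": platform.platform(),      # OS / kernel summary string
    }


info = collect_host_info()
for key, value in info.items():
    print(f"{key}: {value}")
```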
Additional context