[BUG] In the latest 25.02 nightlies, KMeans' fit() throws an NCCL error on an ARM workstation's Dask cluster. #6307

@taureandyernv

Describe the bug
Only on RAPIDS 25.02a with CUDA 12.8, I get this NCCL error when trying to fit KMeans on a Dask cluster: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435. This happened on an H100.

This affects the KMeans MNMG notebook on an ARM SBSA machine equipped with an H100. Tested on Python 3.12 and 3.11. An x86-based B100 seems to work with the same docker run commands.

Steps/Code to reproduce bug

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask_ml.cluster import KMeans as skKMeans

# Start a local CUDA cluster and connect a client to it
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

n_samples = 1000000
n_features = 2
# One data partition per worker
n_total_partitions = len(list(client.has_what().keys()))

# Generate distributed sample data
X_dca, Y_dca = make_blobs(n_samples,
                          n_features,
                          centers=5,
                          n_parts=n_total_partitions,
                          cluster_std=0.1,
                          verbose=True)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)

# This call raises the NCCL error
kmeans_cuml.fit(X_dca)
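The exception text in this report ends right after `line=435:` with no NCCL message, so more detail may be needed to triage it. A minimal sketch of one way to get that detail is to set NCCL's logging variables (`NCCL_DEBUG`, and optionally `NCCL_DEBUG_SUBSYS`) before the cluster is created. The helper name `enable_nccl_debug` is hypothetical, and the sketch assumes the Dask workers inherit the environment of the process that starts `LocalCUDACluster`; it was not part of the original reproduction.

```python
import os


def enable_nccl_debug(level="INFO", subsys=None, env=None):
    """Set NCCL logging variables in `env` (defaults to os.environ).

    Hypothetical helper for illustration; NCCL_DEBUG and
    NCCL_DEBUG_SUBSYS are NCCL's own environment variables.
    """
    target = os.environ if env is None else env
    target["NCCL_DEBUG"] = level
    if subsys is not None:
        # Comma-separated subsystem filter, e.g. "INIT,NET"
        target["NCCL_DEBUG_SUBSYS"] = subsys
    return target


# Call this before LocalCUDACluster(...) is created, on the assumption
# that worker processes inherit the parent process environment.
enable_nccl_debug()
```

With this in place, rerunning the reproduction should make NCCL write its own diagnostics to the worker logs, which could then be attached to the issue.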

Outputs

2025-02-10 21:48:32,918 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-95283355-8bff-49c7-8be7-e345c680da67
State:     executing
Task:  <Task '_func_fit-95283355-8bff-49c7-8be7-e345c680da67' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: '  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n    ret = func(*args, **kwargs)\n          ^^^^^^^^^^^^^^^^^^^^^\n  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'

2025-02-10 21:48:32,920 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1
State:     executing
Task:  <Task '_func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: '  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n    ret = func(*args, **kwargs)\n          ^^^^^^^^^^^^^^^^^^^^^\n  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <timed exec>:5

File /opt/conda/lib/python3.12/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
     85 if GPU_ENABLED:
     86     with cupy_using_allocator(rmm_cupy_allocator):
---> 87         return func(*args, **kwargs)
     88 return func(*args, **kwargs)

File /opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py:175, in KMeans.fit(self, X, sample_weight)
    159 comms.init(workers=data.workers)
    161 kmeans_fit = [
    162     self.client.submit(
    163         KMeans._func_fit,
   (...)
    172     for idx, wf in enumerate(data.worker_to_parts.items())
    173 ]
--> 175 wait_and_raise_from_futures(kmeans_fit)
    177 comms.destroy()
    179 _results = [res.result() for res in kmeans_fit]

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:164, in wait_and_raise_from_futures(futures)
    159 """
    160 Returns the collected futures after all the futures
    161 have finished and do not indicate any exceptions.
    162 """
    163 wait(futures)
--> 164 raise_exception_from_futures(futures)
    165 return futures

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:152, in raise_exception_from_futures(futures)
    150 errs = [f.exception() for f in futures if f.exception()]
    151 if errs:
--> 152     raise RuntimeError(
    153         "%d of %d worker jobs failed: %s"
    154         % (len(errs), len(futures), ", ".join(map(str, errs)))
    155     )

RuntimeError: 2 of 2 worker jobs failed: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: , NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: 
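For readers unfamiliar with how the final message is assembled: the traceback shows `raise_exception_from_futures` collecting each future's exception and joining them into one string. The sketch below mirrors that formatting with standard-library futures as stand-ins for Dask futures. `summarize_failures`, `failed_future`, and the error strings are hypothetical names and values used only for illustration.

```python
from concurrent.futures import Future


def summarize_failures(futures):
    """Build the combined message the same way the traceback shows."""
    errs = [f.exception() for f in futures if f.exception()]
    if not errs:
        return None
    return "%d of %d worker jobs failed: %s" % (
        len(errs),
        len(futures),
        ", ".join(map(str, errs)),
    )


def failed_future(message):
    """Stand-in for a worker task that raised RuntimeError(message)."""
    fut = Future()
    fut.set_exception(RuntimeError(message))
    return fut


# Two failed stand-in tasks, analogous to the two workers in this report
futures = [failed_future("worker error A"), failed_future("worker error B")]
summary = summarize_failures(futures)
print(summary)
```

This is why the same NCCL message appears twice in the final `RuntimeError`: each of the two workers failed with its own copy of it.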

Expected behavior
fit() should complete on the sample data, as it does on x86 and on other CUDA releases.

Environment details (please complete the following information):

  • Environment location: [Docker]
  • Linux Distro/Architecture: [Ubuntu 24.04 arm64]
  • GPU Model/Driver: [H100 and driver 535.161.08]
  • CUDA: [12.8]
  • Method of cuDF & cuML install: [Docker]
    • docker run command used: docker run --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/notebooks:25.02a-cuda12.8-py3.12
    • Also tested with py3.11

Additional context

Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)