Harmonypy cannot run on CLL dataset #115

@ghar1821

Harmonypy's KMeans clustering step raises an error when processing the CLL dataset:

/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py:145: RuntimeWarning: invalid value encountered in divide
  self.Z_cos = self.Z_orig / self.Z_orig.max(axis=0)
2025-10-13 11:45:35,217 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 127, in run_harmony
    ho = Harmony(
         ^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 178, in __init__
    self.init_cluster(cluster_fn)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 204, in init_cluster
    self.Y = cluster_fn(self.Z_cos.T, self.K).T
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 198, in _cluster_kmeans
    model.fit(data)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 1454, in fit
    X = validate_data(
        ^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 2944, in validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1107, in check_array
    _assert_all_finite(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
KMeans does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

I think KMeans has numerical stability issues with exact zeros, because the error goes away if we add 1e-20 to every marker value.
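The RuntimeWarning in the log points at the division on harmony.py line 145, so the NaN values may already exist before KMeans is called. The sketch below is only an illustration on a hypothetical toy matrix (not the CLL data). It assumes one column is entirely zero, so its maximum is 0 and the division becomes 0/0. It also shows why adding 1e-20 would make the error go away in that situation.

```python
import numpy as np

# Toy matrix laid out like Z_orig in the traceback: one column per cell.
# The middle column is all zeros (hypothetical data, an assumption).
Z = np.array([[1.0, 0.0, 2.0],
              [4.0, 0.0, 1.0]])

# Same expression as harmony.py line 145 in the log; the all-zero column
# is divided by its own maximum (0), which gives 0/0 = NaN.
with np.errstate(invalid="ignore"):
    Z_cos = Z / Z.max(axis=0)

nan_cols = np.isnan(Z_cos).any(axis=0)
print("columns containing NaN:", nan_cols)

# The workaround from this issue: shift every value by 1e-20.
Z_shifted = Z + 1e-20
Z_cos_shifted = Z_shifted / Z_shifted.max(axis=0)
print("all finite after shift:", np.isfinite(Z_cos_shifted).all())
```

In this toy case the all-zero column becomes a column of ones after the shift, while the other columns are unchanged.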

I'm not sure how this will affect the results. The impact should be minimal, since the small value is added uniformly across all cells and markers.
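One rough way to check the size of the impact is to count how many entries actually change when 1e-20 is added. The helper below, `shift_impact`, is hypothetical and not part of harmonypy. It assumes a float64 matrix with one row per cell and one column per marker. In float64, adding 1e-20 to a value of ordinary magnitude (such as 0.5) rounds back to the same number, so only exact zeros (and extremely small values) are altered.

```python
import numpy as np

def shift_impact(data, eps=1e-20):
    """Count entries changed by adding eps, and rows that are entirely zero.

    Hypothetical helper; assumes rows are cells and columns are markers.
    """
    data = np.asarray(data, dtype=np.float64)
    shifted = data + eps
    # Entries whose stored value differs after the shift.
    n_changed = int(np.count_nonzero(shifted != data))
    # Rows (cells) where every marker is exactly zero.
    n_all_zero_rows = int(np.count_nonzero(~data.any(axis=1)))
    return n_changed, n_all_zero_rows

# Toy data (an assumption, not the CLL dataset): the first cell is all zeros.
cells = np.array([[0.0, 0.0, 0.0],
                  [0.5, 0.0, 2.0],
                  [1.0, 3.0, 0.25]])

n_changed, n_all_zero = shift_impact(cells)
print("entries changed by the shift:", n_changed)
print("all-zero cells:", n_all_zero)
```

Running the same count on the real marker matrix would show whether the shift touches anything other than exact zeros, and how many cells are entirely zero.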

Labels: bug (Something isn't working)