Harmonypy cannot run on CLL dataset #115

Harmonypy's KMeans clustering step raises an error when processing the CLL dataset:

```
/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py:145: RuntimeWarning: invalid value encountered in divide
  self.Z_cos = self.Z_orig / self.Z_orig.max(axis=0)
2025-10-13 11:45:35,217 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 127, in run_harmony
    ho = Harmony(
         ^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 178, in __init__
    self.init_cluster(cluster_fn)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 204, in init_cluster
    self.Y = cluster_fn(self.Z_cos.T, self.K).T
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/harmonypy/harmony.py", line 198, in _cluster_kmeans
    model.fit(data)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 1454, in fit
    X = validate_data(
        ^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 2944, in validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 1107, in check_array
    _assert_all_finite(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/single_cell/lib/python3.12/site-packages/sklearn/utils/validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
KMeans does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
```

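I think the problem comes from exact zeros in the input, because the error goes away if we increase the marker values by 1e-20. Judging from the `RuntimeWarning` at the top of the log, the NaNs likely arise in the normalization at `harmony.py` line 145 (division by `self.Z_orig.max(axis=0)`) rather than inside KMeans itself, which only rejects the NaN input it receives.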
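A minimal sketch of how that division could produce NaNs. The toy matrix is hypothetical, and it assumes harmonypy's internal layout is features as rows and cells as columns, which is what the `cluster_fn(self.Z_cos.T, self.K)` call in the traceback suggests. It is not taken from the CLL data.

```python
import numpy as np

# Hypothetical toy matrix: rows are features (markers), columns are cells.
# The first cell is exactly zero for every feature.
Z = np.array([[0.0, 1.0, 2.0],
              [0.0, 3.0, 4.0]])

# Same form as the expression shown in the warning:
# self.Z_orig / self.Z_orig.max(axis=0)
with np.errstate(invalid="ignore"):  # silence the 0/0 RuntimeWarning here
    Z_cos = Z / Z.max(axis=0)

# Cells (columns) that contain NaN after normalization.
nan_cells = np.isnan(Z_cos).any(axis=0)
print(nan_cells)
```

If this is what happens on the CLL dataset, checking the input for cells whose values are all exactly zero before calling `run_harmony` should confirm it.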

I'm not sure how this will affect the results. The impact should be minimal, since the same small value is added uniformly across all cells and markers.
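To get a rough feel for the effect of the offset, here is a small sketch on a hypothetical toy matrix (features as rows, cells as columns, which is an assumption about the internal layout). It only illustrates float64 arithmetic and says nothing about the downstream Harmony results on the real data.

```python
import numpy as np

eps = 1e-20  # the offset used in the workaround

# Hypothetical toy matrix; the first cell is all zeros.
Z = np.array([[0.0, 1.0, 2.0],
              [0.0, 3.0, 4.0]])

Z_shifted = Z + eps

# Same normalization form as in the warning message.
Z_norm = Z_shifted / Z_shifted.max(axis=0)

# Entries of order 1 are unchanged: 1e-20 is far below float64 precision
# relative to them, so the addition is a no-op for those entries.
unchanged = np.array_equal(Z_shifted[:, 1:], Z[:, 1:])

# The all-zero cell is now finite after normalization.
all_finite = bool(np.isfinite(Z_norm).all())
print(unchanged, all_finite)
print(Z_norm[:, 0])
```

In this toy case only the exact zeros change, and the all-zero cell becomes a constant vector after normalization. Whether that placement is acceptable for such cells, or whether they should be filtered out beforehand, is a separate question.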
