Ligrec reproducibility and bug fix in sparse data #991
Conversation
I'm not a big fan of running notebooks as tests, that's usually unnecessarily computationally expensive if the functions themselves are tested well 🤔
@timtreis I'd always try to run all tutorial notebooks in the CI. Separate job via a tutorial submodule or nbconvert + render on release. Pros:
I tried to write a unit test for it, but I am not sure how much sense it makes because I am not very familiar with the context. In general I noticed more NaN values when I didn't apply the fix, so I wrote a test based on that. When I run the code, the number of NaN p-values in my unit test is:
So I just added an assertion that the number of NaNs is less than 500k. I am open to any suggestions on this.
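Such an assertion could be sketched with pandas alone (the frame and injected NaNs below are illustrative stand-ins; the real test would count NaNs in the p-value output of squidpy's ligrec):

```python
import numpy as np
import pandas as pd

# Stand-in for the p-value frame returned by ligrec (illustrative data)
rng = np.random.default_rng(0)
pvalues = pd.DataFrame(rng.random((10, 4)))
pvalues.iloc[::3, 0] = np.nan  # inject a few NaNs, as thresholding would

# Count all NaN entries and check them against a loose upper bound
n_nan = int(pvalues.isna().sum().sum())
assert n_nan < 500_000
print(n_nan)
```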
The docs say:
Is it possible to figure out the number of NaNs we should see? If not for these data, then maybe for synthetic data?
**The reason for the bug**

I found the exact reason why this happens, and I'd like to write a more explicit solution for it. It is because `c.sum()` is summing bool values, and when those are cast the sum is interpreted as integers rather than floats.

```python
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import scipy.sparse as sp

# Create a simple matrix with some small values
data = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])

# Convert to a sparse DataFrame
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sp.csc_matrix(data),
    columns=["Gene1", "Gene2", "Gene3"],
)
dense_df = pd.DataFrame(data, columns=["Gene1", "Gene2", "Gene3"])

# Let's look at both DataFrames first
print("Sparse DataFrame:")
print(sparse_df)
print("Dense DataFrame:")
print(dense_df)

sparse_gt0 = sparse_df > 0
dense_gt0 = dense_df > 0

print("Sparse sum")
sparse_sums = sparse_gt0.sum()
print(sparse_sums)

print("Dense sum")
dense_sums = dense_gt0.sum()
print(dense_sums)
```

Function to see the expected number of NaNs:

```python
import numpy as np
import pandas as pd
from anndata import AnnData

def compute_expected_nans(adata: AnnData, interactions: pd.DataFrame, cluster_key: str, threshold: float) -> int:
    """Compute expected NaN count in ligrec results based on expression thresholds.

    A value is NaN if either ligand expression in cluster1 or receptor expression
    in cluster2 is below threshold.
    """
    # Convert to dense if sparse and compute expression fractions
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    clusters = adata.obs[cluster_key]
    frac = (
        pd.DataFrame((X > 0).astype(int), index=adata.obs_names, columns=adata.var_names)
        .assign(cluster=clusters)
        .groupby("cluster", observed=True)
        .mean()
    )
    # Count NaNs using boolean operations
    cluster_pairs = pd.MultiIndex.from_product([frac.index, frac.index])
    ligand_mask = frac.loc[cluster_pairs.get_level_values(0), interactions["source"]].values < threshold
    receptor_mask = frac.loc[cluster_pairs.get_level_values(1), interactions["target"]].values < threshold
    return np.sum(ligand_mask | receptor_mask)
```
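For reference, a minimal check of where the sparse and dense paths diverge (a sketch; the exact sparse dtype string depends on the pandas version):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

data = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
sparse_df = pd.DataFrame.sparse.from_spmatrix(sp.csc_matrix(data))
dense_df = pd.DataFrame(data)

# Comparing against 0 yields Sparse[bool, ...] columns for the sparse frame
# but plain bool columns for the dense one, which is where the sums can diverge.
print((sparse_df > 0).dtypes.iloc[0])
print((dense_df > 0).dtypes.iloc[0])
```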
A value in the result becomes NaN if either:
- The ligand's mask is False in the source cluster, OR
- The receptor's mask is False in the target cluster
Sorry, doesn't this imply they both have to be True to be non-NaN? Wouldn't that exclude Gene2→Gene3?
You are right, thanks for checking it out. It turns out I didn't fully understand it, but I think I've got it now and made the test a bit more specific as well. I also fixed the explanation.
ilan-gold left a comment:
Small comment. Once addressed/CI still passes, good to merge from my perspective
src/squidpy/gr/_ligrec.py
Outdated
```diff
  mean = groups.mean().values.T  # (n_genes, n_clusters)
- mask = groups.apply(lambda c: ((c > 0).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
+ mask = groups.apply(lambda c: ((c > 0).astype(int).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
```
Let's link to the issue on why this astype cast is there. And use "int64" to be explicit (which is what int does anyway).
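Outside squidpy, the changed line can be sketched with pandas alone (the `groups`, gene names, and `threshold` below are made up for illustration, and the explicit `"int64"` cast follows the review suggestion):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Gene1": [1.0, 0.0, 0.0, 2.0],
        "Gene2": [0.1, 1.0, 1.0, 0.0],
        "cluster": ["A", "A", "B", "B"],
    }
)
groups = df.groupby("cluster")[["Gene1", "Gene2"]]
threshold = 0.6

# Explicit "int64" cast before summing, mirroring the fixed line
mask = groups.apply(lambda c: ((c > 0).astype("int64").sum() / len(c)) >= threshold).values.T
print(mask.shape)  # (n_genes, n_clusters)
```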
Fixes #871 and #840
When comparing version 1.2.4 and main, I noticed that zeros were integers while non-zeros were floats in the main branch, whereas in the old version they were all floats. I think this caused some rounding problems and integer divisions. I have some ideas on how to expose these bugs in unit tests, but I just wanted to open this PR to show the fix.
Ligrec needs two things, but I separated the other one to simplify this PR. Normally we also need to modify the numba code because it doesn't parallelize due to an assertion exiting the loop early. Those changes are in this branch: main...fix/ligrec
The notebooks are now reproducible in my local tests. We used notebook tests in the CI in moscot, do you think we can also do this in squidpy @timtreis?