
Ligrec Reproducibility and Bug Fix in Sparse Data #991

Merged
selmanozleyen merged 10 commits into main from fix/ligrec-cast
May 20, 2025

Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Apr 22, 2025

Fixes #871 and #840

When comparing version 1.2.4 and main, I noticed that on main the zeros were integers while the non-zero values were floats, whereas in the old version they were all floats. I think this caused rounding problems and integer divisions. I have some ideas on how to expose these bugs in unit tests, but I wanted to open this PR first to show the fix.

Ligrec needs two fixes, but I separated the other one to simplify this PR. Normally we would also need to modify the numba code, because it doesn't parallelize due to an assertion exiting the loop early. Those changes are in this branch: main...fix/ligrec

In my local tests the notebooks are now reproducible. We used notebook tests on the CI in moscot; do you think we can also do this in squidpy, @timtreis?

@selmanozleyen selmanozleyen linked an issue Apr 22, 2025 that may be closed by this pull request
@timtreis
Member

I'm not a big fan of running notebooks as tests; that's usually unnecessarily computationally expensive if the functions themselves are well tested 🤔 Within SpatialData we do that for a few heavier notebooks on a dedicated runner, but the ligrec function seems light enough not to warrant that.

@Zethson
Member

Zethson commented May 2, 2025

@timtreis I'd always try to run all tutorial notebooks in the CI, either as a separate job via a tutorials submodule, or with nbconvert + render on release. Pros:

  1. Ensures that the tutorial notebooks actually run.
  2. Forces you to keep the tutorials and their datasets small
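As a sketch of what such a CI job could drive, here is a minimal pytest-style helper; the `docs/notebooks/` path, the helper names, and the use of nbclient are all assumptions for illustration, not squidpy's actual setup:

```python
# Hypothetical notebook-CI helper: collect tutorial notebooks and execute
# each one, failing if any cell raises. The directory layout is assumed.
import pathlib


def notebook_paths(root="docs/notebooks"):
    # Skip the checkpoint copies that Jupyter leaves behind.
    return sorted(
        p
        for p in pathlib.Path(root).glob("**/*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


def run_notebook(path, timeout=600):
    # nbclient raises CellExecutionError if any cell errors out, which
    # makes a pytest parametrized over notebook_paths() fail that notebook.
    import nbformat
    from nbclient import NotebookClient

    nb = nbformat.read(str(path), as_version=4)
    NotebookClient(nb, timeout=timeout).execute()
```

A CI workflow would then run something like `pytest` parametrized over `notebook_paths()`, on a schedule or per release rather than per push if runtime is a concern.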

@selmanozleyen
Member Author

I tried to write a unit test for it, but I am not sure how much sense it makes because I am not very familiar with the context. In general I noticed more NaN values without the fix, so I wrote a test based on that. When I run the code, the number of NaN p-values in my unit test is:

  • without fix (fails): 572k NaN p-values
  • with fix (passes): 424k NaN p-values

So I just added an assertion that the number of NaNs is less than 500k. I am open to any suggestions on this.
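The heuristic boils down to counting NaN entries in the returned p-value table. A minimal sketch, where the helper name is made up and the table shape simply mirrors the (cluster pair × interaction) layout described above:

```python
import numpy as np
import pandas as pd


def count_nan_pvalues(pvalues: pd.DataFrame) -> int:
    """Number of NaN entries in a (cluster pair x interaction) p-value table."""
    return int(np.isnan(pvalues.to_numpy()).sum())


# Tiny illustration: a 2x2 table containing two NaNs.
pvals = pd.DataFrame([[0.01, np.nan], [np.nan, 0.2]])
assert count_nan_pvalues(pvals) == 2
```

The test would then assert `count_nan_pvalues(result["pvalues"]) < 500_000` against the real ligrec output.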

@selmanozleyen selmanozleyen requested review from flying-sheep and timtreis and removed request for flying-sheep and timtreis May 8, 2025 15:44
@flying-sheep
Member

flying-sheep commented May 9, 2025

The docs say:

NaN p-values mark combinations for which the mean expression of one of the interacting components was 0 or it didn’t pass the threshold percentage of cells being expressed within a given cluster.

Is it possible to figure out the number of NaNs we should see?

If not for these data, then maybe for synthetic data?
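For synthetic data this is indeed computable by hand from the docstring's rule: build a per-cluster expression-fraction table and count the ordered cluster pairs where the ligand or receptor fraction falls below the threshold. A sketch with made-up names and numbers:

```python
import numpy as np
import pandas as pd

# Fraction of cells expressing each gene, per cluster (illustrative numbers).
frac = pd.DataFrame(
    {"GeneA": [0.9, 0.2], "GeneB": [0.6, 0.7]},
    index=["c0", "c1"],
)
interactions = pd.DataFrame({"source": ["GeneA"], "target": ["GeneB"]})
threshold = 0.5

# A (source, target) pair yields NaN if the ligand fraction in the source
# cluster or the receptor fraction in the target cluster is below threshold.
pairs = pd.MultiIndex.from_product([frac.index, frac.index])
ligand_low = frac.loc[pairs.get_level_values(0), interactions["source"]].to_numpy() < threshold
receptor_low = frac.loc[pairs.get_level_values(1), interactions["target"]].to_numpy() < threshold

# GeneA is under-expressed in c1, so the two (c1, *) pairs are expected NaNs.
expected_nans = int(np.sum(ligand_low | receptor_low))
```

With data constructed this way, the test can assert an exact NaN count instead of an empirical upper bound.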

@selmanozleyen
Member Author

selmanozleyen commented May 17, 2025

The reason for the bug

I found the exact reason why this happens, and I'd like to write a more explicit solution for it. Summing a sparse boolean column keeps the boolean dtype, so c.sum() effectively behaves like a logical or and returns True/False instead of a count, whereas the dense sum returns integer counts. The snippet below shows the mismatch:

import pandas as pd
import numpy as np
import scipy.sparse as sp
# Create a simple sparse matrix with some small values
data = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0]
])

# Convert to sparse DataFrame
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sp.csc_matrix(data),
    columns=['Gene1', 'Gene2', 'Gene3']
)

dense_df = pd.DataFrame(data, columns=['Gene1', 'Gene2', 'Gene3'])

# Let's look at both DataFrames first
print("Sparse DataFrame:")
print(sparse_df)
print("Dense DataFrame:")
print(dense_df)

sparse_gt0 = sparse_df > 0

dense_gt0 = dense_df > 0

print("Sparse sum")
sparse_sums = sparse_gt0.sum()
print(sparse_sums)

print("Dense sum")
dense_sums = dense_gt0.sum()
print(dense_sums)
Sparse DataFrame:
   Gene1  Gene2  Gene3
0    1.0    0.1      0
1      0    1.0      0
2      0    1.0      0
Dense DataFrame:
   Gene1  Gene2  Gene3
0    1.0    0.1    0.0
1    0.0    1.0    0.0
2    0.0    1.0    0.0
Sparse sum
Gene1     True
Gene2     True
Gene3    False
dtype: Sparse[bool, False]
Dense sum
Gene1    1
Gene2    3
Gene3    0
dtype: int64
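The mismatch above suggests the fix used in this PR: cast the boolean mask to an integer dtype before summing. A minimal sketch on a single sparse column, using pandas' `SparseArray` directly so it runs without scipy:

```python
import pandas as pd

# One "gene" column stored sparsely: two cells express it, two don't.
col = pd.Series(pd.arrays.SparseArray([1.0, 0.0, 0.1, 0.0], fill_value=0.0))

expressed = col > 0                       # sparse boolean mask
count = expressed.astype("int64").sum()   # cast first, then sum -> a real count
fraction = count / len(col)               # 2 / 4 -> 0.5
```

With the explicit cast the per-column expression fraction is computed from a true count rather than from a boolean.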

A function to compute the expected number of NaNs:

import numpy as np
import pandas as pd
from anndata import AnnData

def compute_expected_nans(adata: AnnData, interactions: pd.DataFrame, cluster_key: str, threshold: float) -> int:
    """Compute expected NaN count in ligrec results based on expression thresholds.

    A value is NaN if either ligand expression in cluster1 or receptor expression in cluster2
    is below threshold.
    """
    # Convert to dense if sparse and compute expression fractions
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    clusters = adata.obs[cluster_key]
    frac = (
        pd.DataFrame((X > 0).astype(int), index=adata.obs_names, columns=adata.var_names)
        .assign(cluster=clusters)
        .groupby("cluster", observed=True)
        .mean()
    )

    # Count NaNs using boolean operations
    cluster_pairs = pd.MultiIndex.from_product([frac.index, frac.index])
    ligand_mask = frac.loc[cluster_pairs.get_level_values(0), interactions["source"]].values < threshold
    receptor_mask = frac.loc[cluster_pairs.get_level_values(1), interactions["target"]].values < threshold

    return np.sum(ligand_mask | receptor_mask)

@selmanozleyen selmanozleyen requested a review from ilan-gold May 19, 2025 12:47
Comment on lines +488 to +490
A value in the result becomes NaN if either:
- The ligand's mask is False in the source cluster, OR
- The receptor's mask is False in the target cluster
Contributor


Sorry, doesn't this imply they both have to be True to be non-NaN? Wouldn't that exclude Gene2→Gene3?

Member Author


You are right, thanks for checking it out. It turns out I hadn't fully understood it, but I think I've got it now and made the test a bit more specific as well. I also fixed the explanation.

@selmanozleyen selmanozleyen requested a review from ilan-gold May 20, 2025 08:36
Contributor

@ilan-gold ilan-gold left a comment


Small comment. Once it's addressed and CI still passes, this is good to merge from my perspective.


  mean = groups.mean().values.T  # (n_genes, n_clusters)
- mask = groups.apply(lambda c: ((c > 0).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
+ mask = groups.apply(lambda c: ((c > 0).astype(int).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
Contributor


Let's link to the issue explaining why this astype cast is there. And use "int64" to be explicit (which is what int resolves to anyway).
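For context, the explicitness matters because plain `int` maps to the platform's default integer, which is not `int64` everywhere (notably 32-bit on some Windows/NumPy combinations). A quick illustrative check:

```python
import numpy as np

# np.dtype(int) resolves to the platform default integer; on most 64-bit
# Linux/macOS builds it is int64, but that is not guaranteed everywhere.
platform_default = np.dtype(int)

# "int64" is unambiguous on every platform.
explicit = np.dtype("int64")
```

Spelling out `"int64"` removes the platform dependence from the cast.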

@selmanozleyen selmanozleyen merged commit fef6966 into main May 20, 2025
7 checks passed


Development

Successfully merging this pull request may close these issues.

  • Ligand-receptor plot always report Error
  • After removing rows with only NaN interactions, none remain.

5 participants