
Ligrec Reproducibility and Bug Fix in Sparse Data #991

Merged
selmanozleyen merged 10 commits into main from fix/ligrec-cast
May 20, 2025

Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Apr 22, 2025

Fixes #871 and #840

When comparing version 1.2.4 and main, I noticed that on main the zeros were integers while the non-zero values were floats, whereas in the old version they were all floats. I think this caused rounding problems and integer divisions. I have some ideas on how to expose these bugs in unit tests, but I wanted to open this PR first to show the fix.

Ligrec needs two fixes, but I separated the other one to simplify this PR. Normally we would also need to modify the numba code, because it doesn't parallelize due to an assertion exiting the loop early. Those changes are in this branch: main...fix/ligrec

In my local tests the notebooks are now reproducible. We used notebook tests on the CI in moscot; do you think we can also do this in squidpy, @timtreis?

@selmanozleyen selmanozleyen linked an issue Apr 22, 2025 that may be closed by this pull request
@timtreis
Member

I'm not a big fan of running notebooks as tests; that's usually unnecessarily computationally expensive if the functions themselves are well tested 🤔 Within SpatialData we do that for a few heavier notebooks on a dedicated runner, but the ligrec function seems light enough not to warrant that.

@Zethson
Member

Zethson commented May 2, 2025

@timtreis I'd always try to run all tutorial notebooks in the CI, either as a separate job via a tutorials submodule, or with nbconvert + render on release. Pros:

  1. Ensures that the tutorial notebooks actually run.
  2. Forces you to keep the tutorials and their datasets small
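As a sketch of what such a CI job could drive, here is a minimal pytest-style helper; the `docs/notebooks/` path, the helper names, and the use of nbclient are all assumptions for illustration, not squidpy's actual setup:

```python
# Hypothetical notebook-CI helper: collect tutorial notebooks and execute
# each one, failing if any cell raises. The directory layout is assumed.
import pathlib


def notebook_paths(root="docs/notebooks"):
    # Skip the checkpoint copies that Jupyter leaves behind.
    return sorted(
        p
        for p in pathlib.Path(root).glob("**/*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


def run_notebook(path, timeout=600):
    # nbclient raises CellExecutionError if any cell errors out, which
    # makes a pytest parametrized over notebook_paths() fail that notebook.
    import nbformat
    from nbclient import NotebookClient

    nb = nbformat.read(str(path), as_version=4)
    NotebookClient(nb, timeout=timeout).execute()
```

A CI workflow would then run something like `pytest` parametrized over `notebook_paths()`, on a schedule or per release rather than per push if runtime is a concern.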

@selmanozleyen
Member Author

I tried to write a unit test for it, but I am not sure how much sense it makes because I am not very familiar with the context. In general I noticed more NaN values without the fix, so I wrote a test based on that. When I run the code, the number of NaN p-values in my unit test is:

  • without fix (fails): 572k NaN p-values
  • with fix (passes): 424k NaN p-values

So I just added an assertion that the number of NaNs is less than 500k. I am open to any suggestions on this.
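The heuristic boils down to counting NaN entries in the returned p-value table. A minimal sketch, where the helper name is made up and the table shape simply mirrors the (cluster pair × interaction) layout described above:

```python
import numpy as np
import pandas as pd


def count_nan_pvalues(pvalues: pd.DataFrame) -> int:
    """Number of NaN entries in a (cluster pair x interaction) p-value table."""
    return int(np.isnan(pvalues.to_numpy()).sum())


# Tiny illustration: a 2x2 table containing two NaNs.
pvals = pd.DataFrame([[0.01, np.nan], [np.nan, 0.2]])
assert count_nan_pvalues(pvals) == 2
```

The test would then assert `count_nan_pvalues(result["pvalues"]) < 500_000` against the real ligrec output.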

@selmanozleyen selmanozleyen requested review from flying-sheep and timtreis and removed request for flying-sheep and timtreis May 8, 2025 15:44
@flying-sheep
Member

flying-sheep commented May 9, 2025

The docs say:

NaN p-values mark combinations for which the mean expression of one of the interacting components was 0 or it didn’t pass the threshold percentage of cells being expressed within a given cluster.

Is it possible to figure out the number of NaNs we should see?

If not for these data, then maybe for synthetic data?
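For synthetic data this is indeed computable by hand from the docstring's rule: build a per-cluster expression-fraction table and count the ordered cluster pairs where the ligand or receptor fraction falls below the threshold. A sketch with made-up names and numbers:

```python
import numpy as np
import pandas as pd

# Fraction of cells expressing each gene, per cluster (illustrative numbers).
frac = pd.DataFrame(
    {"GeneA": [0.9, 0.2], "GeneB": [0.6, 0.7]},
    index=["c0", "c1"],
)
interactions = pd.DataFrame({"source": ["GeneA"], "target": ["GeneB"]})
threshold = 0.5

# A (source, target) pair yields NaN if the ligand fraction in the source
# cluster or the receptor fraction in the target cluster is below threshold.
pairs = pd.MultiIndex.from_product([frac.index, frac.index])
ligand_low = frac.loc[pairs.get_level_values(0), interactions["source"]].to_numpy() < threshold
receptor_low = frac.loc[pairs.get_level_values(1), interactions["target"]].to_numpy() < threshold

# GeneA is under-expressed in c1, so the two (c1, *) pairs are expected NaNs.
expected_nans = int(np.sum(ligand_low | receptor_low))
```

With data constructed this way, the test can assert an exact NaN count instead of an empirical upper bound.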

@selmanozleyen
Member Author

selmanozleyen commented May 17, 2025

The reason for the bug

I found the exact reason why this happens, and I'd like to write a more explicit solution for it. Summing a sparse boolean column keeps the boolean dtype, so c.sum() effectively behaves like a logical or and returns True/False instead of a count, whereas the dense sum returns integer counts. The snippet below shows the mismatch:

import pandas as pd
import numpy as np
import scipy.sparse as sp
# Create a simple sparse matrix with some small values
data = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0]
])

# Convert to sparse DataFrame
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sp.csc_matrix(data),
    columns=['Gene1', 'Gene2', 'Gene3']
)

dense_df = pd.DataFrame(data, columns=['Gene1', 'Gene2', 'Gene3'])

# Let's look at both DataFrames first
print("Sparse DataFrame:")
print(sparse_df)
print("Dense DataFrame:")
print(dense_df)

sparse_gt0 = sparse_df > 0

dense_gt0 = dense_df > 0

print("Sparse sum")
sparse_sums = sparse_gt0.sum()
print(sparse_sums)

print("Dense sum")
dense_sums = dense_gt0.sum()
print(dense_sums)
Sparse DataFrame:
   Gene1  Gene2  Gene3
0    1.0    0.1      0
1      0    1.0      0
2      0    1.0      0
Dense DataFrame:
   Gene1  Gene2  Gene3
0    1.0    0.1    0.0
1    0.0    1.0    0.0
2    0.0    1.0    0.0
Sparse sum
Gene1     True
Gene2     True
Gene3    False
dtype: Sparse[bool, False]
Dense sum
Gene1    1
Gene2    3
Gene3    0
dtype: int64
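The mismatch above suggests the fix used in this PR: cast the boolean mask to an integer dtype before summing. A minimal sketch on a single sparse column, using pandas' `SparseArray` directly so it runs without scipy:

```python
import pandas as pd

# One "gene" column stored sparsely: two cells express it, two don't.
col = pd.Series(pd.arrays.SparseArray([1.0, 0.0, 0.1, 0.0], fill_value=0.0))

expressed = col > 0                       # sparse boolean mask
count = expressed.astype("int64").sum()   # cast first, then sum -> a real count
fraction = count / len(col)               # 2 / 4 -> 0.5
```

With the explicit cast the per-column expression fraction is computed from a true count rather than from a boolean.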

A function to compute the expected number of NaNs:

import numpy as np
import pandas as pd
from anndata import AnnData

def compute_expected_nans(adata: AnnData, interactions: pd.DataFrame, cluster_key: str, threshold: float) -> int:
    """Compute expected NaN count in ligrec results based on expression thresholds.

    A value is NaN if either ligand expression in cluster1 or receptor expression in cluster2
    is below threshold.
    """
    # Convert to dense if sparse and compute expression fractions
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    clusters = adata.obs[cluster_key]
    frac = (
        pd.DataFrame((X > 0).astype(int), index=adata.obs_names, columns=adata.var_names)
        .assign(cluster=clusters)
        .groupby("cluster", observed=True)
        .mean()
    )

    # Count NaNs using boolean operations
    cluster_pairs = pd.MultiIndex.from_product([frac.index, frac.index])
    ligand_mask = frac.loc[cluster_pairs.get_level_values(0), interactions["source"]].values < threshold
    receptor_mask = frac.loc[cluster_pairs.get_level_values(1), interactions["target"]].values < threshold

    return np.sum(ligand_mask | receptor_mask)

@selmanozleyen selmanozleyen requested a review from ilan-gold May 19, 2025 12:47
Comment on lines +488 to +490
A value in the result becomes NaN if either:
- The ligand's mask is False in the source cluster, OR
- The receptor's mask is False in the target cluster
Contributor


Sorry, doesn't this imply they both have to be True to be non-NaN? Wouldn't that exclude Gene2→Gene3?

Member Author


You are right, thanks for checking it out. It turns out I hadn't fully understood it, but I think I've got it now and made the test a bit more specific as well. I also fixed the explanation.

@selmanozleyen selmanozleyen requested a review from ilan-gold May 20, 2025 08:36
Contributor

@ilan-gold ilan-gold left a comment


Small comment. Once it's addressed and CI still passes, this is good to merge from my perspective.


  mean = groups.mean().values.T  # (n_genes, n_clusters)
- mask = groups.apply(lambda c: ((c > 0).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
+ mask = groups.apply(lambda c: ((c > 0).astype(int).sum() / len(c)) >= threshold).values.T  # (n_genes, n_clusters)
Contributor


Let's link to the issue explaining why this astype cast is there. And use "int64" to be explicit (which is what int resolves to anyway).
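For context, the explicitness matters because plain `int` maps to the platform's default integer, which is not `int64` everywhere (notably 32-bit on some Windows/NumPy combinations). A quick illustrative check:

```python
import numpy as np

# np.dtype(int) resolves to the platform default integer; on most 64-bit
# Linux/macOS builds it is int64, but that is not guaranteed everywhere.
platform_default = np.dtype(int)

# "int64" is unambiguous on every platform.
explicit = np.dtype("int64")
```

Spelling out `"int64"` removes the platform dependence from the cast.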

@selmanozleyen selmanozleyen merged commit fef6966 into main May 20, 2025
7 checks passed


Development

Successfully merging this pull request may close these issues.

  • Ligand-receptor plot always report Error
  • After removing rows with only NaN interactions, none remain.

5 participants