Commit 6b44468

New metric - ratio of inconsistent peak (#114)
* first prototype of new implementation
* add scikit tda and persistent peak as alternative output
* update description
* update workflow script
* add missing comma - facepalm
* add small epsilon to harmonypy to fix kmeans bug
* increase bras chunk to 1000
* testing bras with jax gpu
* testing bras with cuda
* missing pip and sci-b metrics *facepalm
* downgrading image
* update the image name
* testing openproblems image
* undo changes to run script
* downgrade scib metrics package
* reverting as gpu doesn't work
* testing bin shifting
* increase chunk size for bras
* small changes to the bin
* disable metrics
* disable two metrics again
* update cytovi
* switched training to TF32
* remove persistent peaks
* removed scaling from cytovi
* increase batch size
* reduce max epochs and train size
* reverting config to default values
* update description
* adding scaling back into cytovi
* add changelog
1 parent 971a82a commit 6b44468

17 files changed: +539 -81 lines changed

CHANGELOG.md

Lines changed: 4 additions & 1 deletion
@@ -57,7 +57,6 @@
 * Added CytoNorm with aggregate of samples as controls (`methods/cytonorm_no_controls`).
 * Added parameters to tune CytoNorm.
 
-
 * Added CytoNorm correction to a goal batch (PR #92).
 * Added cyCombine correction to a reference batch (PR #90).
 * Added `metrics/bras` (PR #91).
@@ -66,6 +65,8 @@
 
 * Added processing scripts for CLL dataset (PR #106).
 
+* Added new metric `ratio_inconsistent_peaks` (PR #114).
+
 ## MAJOR CHANGES
 
 * Updated file schema (PR #18):
@@ -100,6 +101,8 @@
 
 * Fix problems identified during a full run (PR #99).
 
+* Update CytoVI (PR #114).
+
 ## MINOR CHANGES
 
 * Enabled unit tests (PR #2).
scripts/run_benchmark/run_full_seqeracloud.sh

Lines changed: 2 additions & 3 deletions
@@ -17,16 +17,15 @@ cat > /tmp/params.yaml << HERE
 input_states: s3://openproblems-data/resources/task_cyto_batch_integration/datasets/**/state.yaml
 rename_keys: 'input_censored_split1:output_censored_split1;input_censored_split2:output_censored_split2;input_unintegrated:output_unintegrated'
 output_state: "state.yaml"
-settings: '{"metrics_exclude": ["cms"], "methods_include": ["mnnpy", "cytovi"]}'
 publish_dir: "$publish_dir"
 HERE
 
 tw launch https://github.com/openproblems-bio/task_cyto_batch_integration.git \
-  --revision build/fix_failed_stuff \
+  --revision build/main \
   --pull-latest \
   --main-script target/nextflow/workflows/run_benchmark/main.nf \
   --workspace 53907369739130 \
   --params-file /tmp/params.yaml \
   --entry-name auto \
   --config common/nextflow_helpers/labels_tw.config \
-  --labels task_cyto_batch_integration,mnnnpy
+  --labels task_cyto_batch_integration,test_subset

src/control_methods/shuffle_integration/config.vsh.yaml

Lines changed: 3 additions & 4 deletions
@@ -3,10 +3,9 @@ name: shuffle_integration
 label: Shuffle Integration
 summary: Randomly shuffle cells in the whole dataset.
 description: |
-  This negative control randomly permutes cell-to-sample (hence batch)
-  assignments while keeping each cell's measured markers unchanged.
-  This destroys any biological and batch specific structure but preserves marker expression.
-
+  This negative control randomly shuffles all cells in the input data,
+  destroying any biological structure (e.g., sample to cell mapping or batch assignments).
+
   Purpose:
   - Provide a baseline to verify that integration methods outperform
     random assignment of cells to batches.
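
The new description can be read as a plain row permutation of the marker matrix while the sample and batch labels stay where they are. The block below is a minimal sketch of that reading using only the standard library. The function name `shuffle_cells` and the toy data are hypothetical and are not taken from the component's script, which is not shown in this diff.

```python
import random


def shuffle_cells(marker_rows, seed=0):
    """Return a copy of marker_rows in a random order.

    Row contents are untouched; only the position of each row changes,
    so any label vector kept in the original order no longer matches.
    """
    rng = random.Random(seed)
    order = list(range(len(marker_rows)))
    rng.shuffle(order)
    return [list(marker_rows[i]) for i in order]


# Hypothetical toy data: 6 cells x 2 markers; batch labels stay in place.
marker_rows = [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2], [5.0, 9.0], [5.1, 9.1], [5.2, 9.2]]
batch = [1, 1, 1, 2, 2, 2]
shuffled = shuffle_cells(marker_rows, seed=42)
```

Every measured cell is still present after the shuffle, so marginal marker distributions over the whole dataset are unchanged; only the link between a row and its labels is broken.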

src/control_methods/shuffle_integration_by_batch/config.vsh.yaml

Lines changed: 3 additions & 5 deletions
@@ -3,11 +3,9 @@ name: shuffle_integration_by_batch
 label: Shuffle Integration — within batches
 summary: Randomly reassign cells to any samples within the same batch.
 description: |
-  This negative-control method randomly permutes cell-to-cell type assignments.
-  Cells remain assigned to their original batch (batch effects preserved).
-  Within each batch, cells are reassigned to random samples, destroying
-  biological/sample-specific structure (e.g., KO vs WT differences).
-
+  This negative-control method randomly shuffles cells within each batch independently,
+  destroying cell to sample mapping while preserving batch-specific distributions.
+
   Purpose:
   - Evaluate whether an integration method preserves differences between samples
     and biological groups while removing batch effects.
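
One way to picture the within-batch control is to shuffle the sample labels separately inside each batch, so that no label ever crosses a batch boundary. This is a hedged sketch under that assumption. `shuffle_samples_within_batch` and the toy labels are illustrative names and are not part of the repository.

```python
import random
from collections import defaultdict


def shuffle_samples_within_batch(batch, sample, seed=0):
    """Permute sample labels among the cells of each batch independently."""
    rng = random.Random(seed)
    positions = defaultdict(list)
    for i, b in enumerate(batch):
        positions[b].append(i)

    new_sample = list(sample)
    for idx in positions.values():
        labels = [sample[i] for i in idx]
        rng.shuffle(labels)  # labels only move between cells of this batch
        for i, label in zip(idx, labels):
            new_sample[i] = label
    return new_sample


# Hypothetical toy data: two batches, two samples per batch.
batch = [1, 1, 1, 1, 2, 2, 2, 2]
sample = ["KO_a", "KO_a", "WT_b", "WT_b", "KO_c", "KO_c", "WT_d", "WT_d"]
new_sample = shuffle_samples_within_batch(batch, sample, seed=7)
```

Because each batch keeps exactly the same cells, its marker distribution is preserved, while the KO/WT assignment of any individual cell becomes random.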

src/control_methods/shuffle_integration_by_cell_type/config.vsh.yaml

Lines changed: 9 additions & 5 deletions
@@ -3,17 +3,21 @@ name: shuffle_integration_by_cell_type
 label: Shuffle Integration — within cell type
 summary: Randomly reassign cells to any cell types
 description: |
-  This negative-control method randomly permutes cell-to-cell type assignments.
-  Cells will be assigned to any cell types, regardless of their original cell type
-  or sample of origin or batch of origin.
+  This negative-control method randomly shuffles cells within each cell type independently,
+  destroying batch structure while preserving cell type-specific distributions.
+  This serves as a negative control that maintains biological groupings but
+  eliminates batch grouping in each cell type.
 
   Purpose:
   - Evaluate whether an integration method preserves differences between cell types
     while removing batch effects.
 
   Example:
-  - A Neutrophil from a KO sample in batch 1 may be reassigned to any cell type
-    (B cell, T cell, Monocyte, etc.) from any sample in any batch.
+  - A Neutrophil in batch 1 from KO sample may be reassigned to a Neutrophil in batch 2
+    KO or WT sample or remain in a KO sample in batch 1 but assigned to different donor,
+    or moved to a WT sample in batch 1 or 2, or remain in the same sample,
+    but it will never be re-assigned to another cell type.
+
 # status: disabled
 resources:
   - type: python_script
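
The invariant stated in the new example ("it will never be re-assigned to another cell type") can be checked on an index permutation that only swaps positions sharing a cell type label. The sketch below assumes that interpretation. `permutation_within_groups` is a hypothetical helper, not the component's actual code.

```python
import random


def permutation_within_groups(groups, seed=0):
    """Build an index permutation that only moves cells within their own group.

    perm[i] is the index of the cell whose measurements land at position i.
    """
    rng = random.Random(seed)
    by_group = {}
    for i, g in enumerate(groups):
        by_group.setdefault(g, []).append(i)

    perm = list(range(len(groups)))
    for idx in by_group.values():
        sources = idx[:]
        rng.shuffle(sources)
        for dst, src in zip(idx, sources):
            perm[dst] = src
    return perm


# Hypothetical labels for six cells.
cell_type = ["Neutrophil", "B cell", "Neutrophil", "T cell", "B cell", "Neutrophil"]
perm = permutation_within_groups(cell_type, seed=1)
```

Applying `perm` to the rows of the marker matrix moves a Neutrophil's measurements only to another Neutrophil's position, whatever that position's batch or sample happens to be.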

src/methods/cytovi/config.vsh.yaml

Lines changed: 6 additions & 7 deletions
@@ -38,13 +38,13 @@ arguments:
     type: integer
     default: 1
     description: Number of layers.
-  - name: --n_clusters
+  - name: --max_epochs
     type: integer
-    default: 20
-    description: Number of clusters to use for subsampling.
-  - name: --subsample_fraction
+    default: 1000
+    description: Number of epochs to train the model.
+  - name: --train_size
     type: double
-    default: 0.5
+    default: 0.9
     description: Fraction of cells to subsample from each cluster for training.
 
 # Resources required to run the component
@@ -68,11 +68,10 @@ engines:
         packages:
           - anndata>=0.11.0
           - scanpy[skmisc]>=1.10
-          - scvi-tools==1.4.0
+          - scvi-tools==1.4.0.post1
           - pyyaml
           - requests
           - jsonschema
-          - scikit-learn
         github:
           - openproblems-bio/core#subdirectory=packages/python/openproblems
 

src/methods/cytovi/script.py

Lines changed: 37 additions & 30 deletions
@@ -1,25 +1,37 @@
+import time
+
 import anndata as ad
 import numpy as np
+import scvi
+import torch
 from scvi.external import cytovi
-from sklearn.cluster import KMeans
-from threadpoolctl import threadpool_limits
+
+# from sklearn.cluster import KMeans
+# from threadpoolctl import threadpool_limits
 
 ## VIASH START
 par = {
     "input": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split2.h5ad",
     "output": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/output_cytovi_split2.h5ad",
     "n_hidden": 128,
     "n_layers": 1,
-    "n_clusters": 10,
-    "subsample_fraction": 0.5,
+    "max_epochs": 1000,
+    "train_size": 0.9,
 }
 meta = {"name": "cytovi"}
 ## VIASH END
 
+# setting calculation to TF32 to speed up training
+torch.backends.cuda.matmul.allow_tf32 = True
+
+# increase num workers for data loading
+scvi.settings.num_workers = 95
+
 print("Reading and preparing input files", flush=True)
 adata = ad.read_h5ad(par["input"])
 
 adata.obs["batch_str"] = adata.obs["batch"].astype(str)
+adata.obs["sample_key_str"] = adata.obs["sample"].astype(str)
 
 markers_to_correct = adata.var[adata.var["to_correct"]].index.to_numpy()
 markers_not_correct = adata.var[~adata.var["to_correct"]].index.to_numpy()
@@ -33,41 +45,36 @@
     adata=adata_to_correct,
     transformed_layer_key="preprocessed",
     batch_key="batch_str",
+    scaled_layer_key="scaled",
     inplace=True,
 )
 
-print("Clustering using k-means with k =", par["n_clusters"], flush=True)
-# cluster data using Kmeans
-with threadpool_limits(limits=1):
-    adata_to_correct.obs["clusters"] = (
-        KMeans(n_clusters=par["n_clusters"], random_state=0)
-        .fit_predict(adata_to_correct.layers["scaled"])
-        .astype(str)
-    )
-# concatenate obs so we can use it for subsampling
-adata_to_correct.obs["sample_cluster"] = (
-    adata_to_correct.obs["sample"].astype(str) + "_" + adata_to_correct.obs["clusters"]
-)
-# subsample cells without replacement
-print("Subsampling cells", flush=True)
-subsampled_cells = adata_to_correct.obs.groupby("sample_cluster")[
-    "sample_cluster"
-].apply(lambda x: x.sample(n=round(len(x) * par["subsample_fraction"]), replace=False))
-# need the cell id included in the subsample
-subsampled_cells_idx = [x[1] for x in subsampled_cells.index.to_list()]
-
-adata_subsampled = adata_to_correct[subsampled_cells_idx, :].copy()
-
 print(
-    f"Train CytoVI on subsampled data containing {adata_subsampled.shape[0]} cells",
+    f"Train CytoVI on {adata_to_correct.shape[0]} cells",
     flush=True,
 )
 
-cytovi.CYTOVI.setup_anndata(adata_subsampled, layer="scaled", batch_key="batch_str")
+cytovi.CYTOVI.setup_anndata(
+    adata_to_correct,
+    layer="scaled",
+    batch_key="batch_str",
+    sample_key="sample_key_str",
+)
+
 model = cytovi.CYTOVI(
-    adata=adata_subsampled, n_hidden=par["n_hidden"], n_layers=par["n_layers"]
+    adata_to_correct, n_hidden=par["n_hidden"], n_layers=par["n_layers"]
+)
+
+print("Start training CytoVI model", flush=True)
+
+start = time.time()
+model.train(
+    batch_size=8192,
+    max_epochs=par["max_epochs"],
+    train_size=par["train_size"],
 )
-model.train()
+end = time.time()
+print(f"Training took {end - start:.2f} seconds", flush=True)
 
 # get batch corrected data
 print("Correcting data", flush=True)
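
With the k-means subsampling removed, the model now sees all cells, and `train_size` and `batch_size` decide how they are split and batched. The arithmetic below is a rough sketch of what `train_size=0.9` and `batch_size=8192` imply for one epoch. The helper `training_plan` and the cell count are hypothetical, and the exact rounding used inside scvi-tools may differ from the `ceil` assumed here.

```python
import math


def training_plan(n_cells, train_size=0.9, batch_size=8192):
    """Estimate train/validation sizes and minibatches per epoch.

    Assumption: the training set is ceil(train_size * n_cells) and the
    remainder is held out for validation.
    """
    n_train = math.ceil(train_size * n_cells)
    n_val = n_cells - n_train
    batches_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, batches_per_epoch


# Hypothetical dataset of one million cells.
n_train, n_val, batches_per_epoch = training_plan(1_000_000)
```

Under these assumptions a million-cell dataset needs roughly 110 gradient steps per epoch, which is why a large batch size matters once subsampling is gone. Note that the retained description of `--train_size` in `config.vsh.yaml` still refers to subsampling from each cluster.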

src/methods/harmonypy/script.py

Lines changed: 8 additions & 4 deletions
@@ -4,15 +4,16 @@
 
 ## VIASH START
 par = {
-    "input": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split2.h5ad",
-    "output": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/output_harmony_split2.h5ad",
+    "input": "/Users/putri.g/Documents/cytobenchmark/debug_general/_viash_par/input_1/censored_split1.h5ad",
+    "output": "/Users/putri.g/Documents/cytobenchmark/debug_general/_viash_par/output_1/output_harmony_split1.h5ad",
 }
 meta = {"name": "harmonypy"}
 ## VIASH END
 
 print("Reading and preparing input files", flush=True)
 adata = ad.read_h5ad(par["input"])
 
+# harmony can't handle integer batch labels
 adata.obs["batch_str"] = adata.obs["batch"].astype(str)
 
 markers_to_correct = adata.var[adata.var["to_correct"]].index.to_numpy()
@@ -21,10 +22,13 @@
 adata_to_correct = adata[:, markers_to_correct].copy()
 
 print("Run harmony", flush=True)
-# harmony can't handle integer batch labels
+
+# TODO numerical instability in kmeans causing problem with harmony.
+# so adding a very small value to all entries to make sure there are no zeros
+epsilon = 1e-20
 
 out = harmonypy.run_harmony(
-    data_mat=adata_to_correct.layers["preprocessed"],
+    data_mat=adata_to_correct.layers["preprocessed"] + epsilon,
     meta_data=adata_to_correct.obs,
     vars_use="batch_str",
 )
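
It may help to see what adding `1e-20` actually does at floating-point precision. The sketch below uses plain Python floats (double precision) and made-up values. The dtype of the `preprocessed` layer is not visible in this diff, so treat this as an illustration rather than a description of the real data.

```python
epsilon = 1e-20

# Hypothetical marker values, including an exact zero.
values = [0.0, 1e-3, 1.0, 250.0]
shifted = [v + epsilon for v in values]

# True where the addition changed the stored number.
changed = [s != v for s, v in zip(shifted, values)]
print(changed)  # [True, False, False, False]
```

Only the exact zero moves (to `1e-20`); the other entries are far larger than the epsilon relative to their precision and are stored unchanged. This matches the stated intent of the comment, "make sure there are no zeros", without perturbing the rest of the matrix.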

src/metrics/bras/config.vsh.yaml

Lines changed: 9 additions & 15 deletions
@@ -1,16 +1,9 @@
-# The API specifies which type of component this is.
-# It contains specifications for:
-# - The input/output files
-# - Common parameters
-# - A unit test
 __merge__: ../../api/comp_metric.yaml
 
 # A unique identifier for your component (required).
 # Can contain only lowercase letters or underscores.
 name: bras
-
-
-
+status: disabled
 # Metadata for your component
 info:
   metrics:
@@ -56,13 +49,6 @@ info:
       # Whether a higher value represents a 'better' solution (required)
       maximize: true
 
-# Component-specific parameters (optional)
-# arguments:
-# - name: "--n_neighbors"
-# type: "integer"
-# default: 5
-# description: Number of neighbors to use.
-
 # Resources required to run the component
 resources:
   # The script of your component (required)
@@ -73,6 +59,14 @@ resources:
 
 engines:
   # Specifications for the Docker image for this component.
+  # testing gpu jax version
+  # - type: docker
+  # image: openproblems/base_pytorch_nvidia:1.1
+  # setup:
+  # - type: python
+  # packages:
+  # - jax[cuda_12_pip]
+  # - scib-metrics~=0.5.6
   - type: docker
     image: python:3.11
     setup:
