Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
65a35a9
first prototype of new implementation
ghar1821 Oct 12, 2025
3996bfa
add scikit tda and persistent peak as alternative output
ghar1821 Oct 12, 2025
4a793f6
update description
ghar1821 Oct 12, 2025
0bb39a0
update workflow script
ghar1821 Oct 12, 2025
d36e9c1
add missing comma - facepalm
ghar1821 Oct 12, 2025
f7be6f2
add small epsilon to harmonypy to fix kmeans bug
ghar1821 Oct 13, 2025
a0740cc
increase bras chunk to 1000
ghar1821 Oct 13, 2025
b6beb83
testing bras with jax gpu
ghar1821 Oct 14, 2025
8e65e03
testing bras with cuda
ghar1821 Oct 14, 2025
9569f6b
missing pip and sci-b metrics *facepalm
ghar1821 Oct 14, 2025
42e8bd2
downgrading image
ghar1821 Oct 14, 2025
a7cef47
update the image name
ghar1821 Oct 14, 2025
913a70b
testing openproblems image
ghar1821 Oct 14, 2025
f8c9a4d
undo changes to run script
ghar1821 Oct 14, 2025
320fd3c
downgrade scib metrics package
ghar1821 Oct 14, 2025
7c98d96
reverting as gpu doesn't work
ghar1821 Oct 14, 2025
d9e146f
testing bin shifting
ghar1821 Nov 1, 2025
ea44a38
increase chunk size for bras
ghar1821 Nov 1, 2025
c50a0c0
small changes to the bin
ghar1821 Nov 3, 2025
695562a
disable metrics
ghar1821 Nov 4, 2025
48fb41c
disable two metrics again
ghar1821 Nov 4, 2025
9563252
update cytovi
ghar1821 Nov 18, 2025
d361dd0
switched training to TF32
ghar1821 Nov 19, 2025
328929e
remove persistent peaks
ghar1821 Nov 20, 2025
51552a4
removed scaling from cytovi
ghar1821 Nov 20, 2025
165dc99
increase batch size
ghar1821 Nov 22, 2025
65e48a6
reduce max epochs and train size
ghar1821 Nov 22, 2025
e94a30e
reverting config to default values
ghar1821 Nov 22, 2025
96d688b
update description
ghar1821 Nov 23, 2025
962143a
Merge branch 'main' into update_n_inconsistent_peak
ghar1821 Nov 24, 2025
ca35329
adding scaling back into cytovi
ghar1821 Nov 24, 2025
f9fe9f2
add changelog
ghar1821 Dec 10, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,6 @@
* Added CytoNorm with aggregate of samples as controls (`methods/cytonorm_no_controls`).
* Added parameters to tune CytoNorm.


* Added CytoNorm correction to a goal batch (PR #92).
* Added cyCombine correction to a reference batch (PR #90).
* Added `metrics/bras` (PR #91).
Expand All @@ -66,6 +65,8 @@

* Added processing scripts for CLL dataset (PR #106).

* Added new metric `ratio_inconsistent_peaks` (PR #114).

## MAJOR CHANGES

* Updated file schema (PR #18):
Expand Down Expand Up @@ -100,6 +101,8 @@

* Fix problems identified during a full run (PR #99).

* Update CytoVI (PR #114).

## MINOR CHANGES

* Enabled unit tests (PR #2).
Expand Down
5 changes: 2 additions & 3 deletions scripts/run_benchmark/run_full_seqeracloud.sh
Original file line number Diff line number Diff line change
Expand Up @@ -17,16 +17,15 @@ cat > /tmp/params.yaml << HERE
input_states: s3://openproblems-data/resources/task_cyto_batch_integration/datasets/**/state.yaml
rename_keys: 'input_censored_split1:output_censored_split1;input_censored_split2:output_censored_split2;input_unintegrated:output_unintegrated'
output_state: "state.yaml"
settings: '{"metrics_exclude": ["cms"], "methods_include": ["mnnpy", "cytovi"]}'
publish_dir: "$publish_dir"
HERE

tw launch https://github.com/openproblems-bio/task_cyto_batch_integration.git \
--revision build/fix_failed_stuff \
--revision build/main \
--pull-latest \
--main-script target/nextflow/workflows/run_benchmark/main.nf \
--workspace 53907369739130 \
--params-file /tmp/params.yaml \
--entry-name auto \
--config common/nextflow_helpers/labels_tw.config \
--labels task_cyto_batch_integration,mnnnpy
--labels task_cyto_batch_integration,test_subset
7 changes: 3 additions & 4 deletions src/control_methods/shuffle_integration/config.vsh.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,9 @@ name: shuffle_integration
label: Shuffle Integration
summary: Randomly shuffle cells in the whole dataset.
description: |
This negative control randomly permutes cell-to-sample (hence batch)
assignments while keeping each cell's measured markers unchanged.
This destroys any biological and batch specific structure but preserves marker expression.

This negative control randomly shuffles all cells in the input data,
destroying any biological structure (e.g., sample to cell mapping or batch assignments).

Purpose:
- Provide a baseline to verify that integration methods outperform
random assignment of cells to batches.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,9 @@ name: shuffle_integration_by_batch
label: Shuffle Integration — within batches
summary: Randomly reassign cells to any samples within the same batch.
description: |
This negative-control method randomly permutes cell-to-cell type assignments.
Cells remain assigned to their original batch (batch effects preserved).
Within each batch, cells are reassigned to random samples, destroying
biological/sample-specific structure (e.g., KO vs WT differences).

This negative-control method randomly shuffles cells within each batch independently,
destroying cell to sample mapping while preserving batch-specific distributions.

Purpose:
- Evaluate whether an integration method preserves differences between samples
and biological groups while removing batch effects.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,17 +3,21 @@ name: shuffle_integration_by_cell_type
label: Shuffle Integration — within cell type
summary: Randomly reassign cells to any cell types
description: |
This negative-control method randomly permutes cell-to-cell type assignments.
Cells will be assigned to any cell types, regardless of their original cell type
or sample of origin or batch of origin.
This negative-control method randomly shuffles cells within each cell type independently,
destroying batch structure while preserving cell type-specific distributions.
This serves as a negative control that maintains biological groupings but
eliminates batch grouping in each cell type.

Purpose:
- Evaluate whether an integration method preserves differences between cell types
while removing batch effects.

Example:
- A Neutrophil from a KO sample in batch 1 may be reassigned to any cell type
(B cell, T cell, Monocyte, etc.) from any sample in any batch.
- A Neutrophil in batch 1 from KO sample may be reassigned to a Neutrophil in batch 2
KO or WT sample or remain in a KO sample in batch 1 but assigned to different donor,
or moved to a WT sample in batch 1 or 2, or remain in the same sample,
but it will never be re-assigned to another cell type.

# status: disabled
resources:
- type: python_script
Expand Down
13 changes: 6 additions & 7 deletions src/methods/cytovi/config.vsh.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -38,13 +38,13 @@ arguments:
type: integer
default: 1
description: Number of layers.
- name: --n_clusters
- name: --max_epochs
type: integer
default: 20
description: Number of clusters to use for subsampling.
- name: --subsample_fraction
default: 1000
description: Number of epochs to train the model.
- name: --train_size
type: double
default: 0.5
default: 0.9
description: Fraction of cells to subsample from each cluster for training.

# Resources required to run the component
Expand All @@ -68,11 +68,10 @@ engines:
packages:
- anndata>=0.11.0
- scanpy[skmisc]>=1.10
- scvi-tools==1.4.0
- scvi-tools==1.4.0.post1
- pyyaml
- requests
- jsonschema
- scikit-learn
github:
- openproblems-bio/core#subdirectory=packages/python/openproblems

Expand Down
67 changes: 37 additions & 30 deletions src/methods/cytovi/script.py
Original file line number Diff line number Diff line change
@@ -1,25 +1,37 @@
import time

import anndata as ad
import numpy as np
import scvi
import torch
from scvi.external import cytovi
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

# from sklearn.cluster import KMeans
# from threadpoolctl import threadpool_limits

## VIASH START
par = {
"input": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split2.h5ad",
"output": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/output_cytovi_split2.h5ad",
"n_hidden": 128,
"n_layers": 1,
"n_clusters": 10,
"subsample_fraction": 0.5,
"max_epochs": 1000,
"train_size": 0.9,
}
meta = {"name": "cytovi"}
## VIASH END

# setting calculation to TF32 to speed up training
torch.backends.cuda.matmul.allow_tf32 = True

# increase num workers for data loading
scvi.settings.num_workers = 95

print("Reading and preparing input files", flush=True)
adata = ad.read_h5ad(par["input"])

adata.obs["batch_str"] = adata.obs["batch"].astype(str)
adata.obs["sample_key_str"] = adata.obs["sample"].astype(str)

markers_to_correct = adata.var[adata.var["to_correct"]].index.to_numpy()
markers_not_correct = adata.var[~adata.var["to_correct"]].index.to_numpy()
Expand All @@ -33,41 +45,36 @@
adata=adata_to_correct,
transformed_layer_key="preprocessed",
batch_key="batch_str",
scaled_layer_key="scaled",
inplace=True,
)

print("Clustering using k-means with k =", par["n_clusters"], flush=True)
# cluster data using Kmeans
with threadpool_limits(limits=1):
adata_to_correct.obs["clusters"] = (
KMeans(n_clusters=par["n_clusters"], random_state=0)
.fit_predict(adata_to_correct.layers["scaled"])
.astype(str)
)
# concatenate obs so we can use it for subsampling
adata_to_correct.obs["sample_cluster"] = (
adata_to_correct.obs["sample"].astype(str) + "_" + adata_to_correct.obs["clusters"]
)
# subsample cells without replacement
print("Subsampling cells", flush=True)
subsampled_cells = adata_to_correct.obs.groupby("sample_cluster")[
"sample_cluster"
].apply(lambda x: x.sample(n=round(len(x) * par["subsample_fraction"]), replace=False))
# need the cell id included in the subsample
subsampled_cells_idx = [x[1] for x in subsampled_cells.index.to_list()]

adata_subsampled = adata_to_correct[subsampled_cells_idx, :].copy()

print(
f"Train CytoVI on subsampled data containing {adata_subsampled.shape[0]} cells",
f"Train CytoVI on {adata_to_correct.shape[0]} cells",
flush=True,
)

cytovi.CYTOVI.setup_anndata(adata_subsampled, layer="scaled", batch_key="batch_str")
cytovi.CYTOVI.setup_anndata(
adata_to_correct,
layer="scaled",
batch_key="batch_str",
sample_key="sample_key_str",
)

model = cytovi.CYTOVI(
adata=adata_subsampled, n_hidden=par["n_hidden"], n_layers=par["n_layers"]
adata_to_correct, n_hidden=par["n_hidden"], n_layers=par["n_layers"]
)

print("Start training CytoVI model", flush=True)

start = time.time()
model.train(
batch_size=8192,
max_epochs=par["max_epochs"],
train_size=par["train_size"],
)
model.train()
end = time.time()
print(f"Training took {end - start:.2f} seconds", flush=True)

# get batch corrected data
print("Correcting data", flush=True)
Expand Down
12 changes: 8 additions & 4 deletions src/methods/harmonypy/script.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,16 @@

## VIASH START
par = {
"input": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split2.h5ad",
"output": "resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/output_harmony_split2.h5ad",
"input": "/Users/putri.g/Documents/cytobenchmark/debug_general/_viash_par/input_1/censored_split1.h5ad",
"output": "/Users/putri.g/Documents/cytobenchmark/debug_general/_viash_par/output_1/output_harmony_split1.h5ad",
}
meta = {"name": "harmonypy"}
## VIASH END

print("Reading and preparing input files", flush=True)
adata = ad.read_h5ad(par["input"])

# harmony can't handle integer batch labels
adata.obs["batch_str"] = adata.obs["batch"].astype(str)

markers_to_correct = adata.var[adata.var["to_correct"]].index.to_numpy()
Expand All @@ -21,10 +22,13 @@
adata_to_correct = adata[:, markers_to_correct].copy()

print("Run harmony", flush=True)
# harmony can't handle integer batch labels

# TODO numerical instability in kmeans causing problem with harmony.
# so adding a very small value to all entries to make sure there are no zeros
epsilon = 1e-20

out = harmonypy.run_harmony(
data_mat=adata_to_correct.layers["preprocessed"],
data_mat=adata_to_correct.layers["preprocessed"] + epsilon,
meta_data=adata_to_correct.obs,
vars_use="batch_str",
)
Expand Down
24 changes: 9 additions & 15 deletions src/metrics/bras/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -1,16 +1,9 @@
# The API specifies which type of component this is.
# It contains specifications for:
# - The input/output files
# - Common parameters
# - A unit test
__merge__: ../../api/comp_metric.yaml

# A unique identifier for your component (required).
# Can contain only lowercase letters or underscores.
name: bras



status: disabled
# Metadata for your component
info:
metrics:
Expand Down Expand Up @@ -56,13 +49,6 @@ info:
# Whether a higher value represents a 'better' solution (required)
maximize: true

# Component-specific parameters (optional)
# arguments:
# - name: "--n_neighbors"
# type: "integer"
# default: 5
# description: Number of neighbors to use.

# Resources required to run the component
resources:
# The script of your component (required)
Expand All @@ -73,6 +59,14 @@ resources:

engines:
# Specifications for the Docker image for this component.
# testing gpu jax version
# - type: docker
# image: openproblems/base_pytorch_nvidia:1.1
# setup:
# - type: python
# packages:
# - jax[cuda_12_pip]
# - scib-metrics~=0.5.6
- type: docker
image: python:3.11
setup:
Expand Down
2 changes: 2 additions & 0 deletions src/metrics/bras/script.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@
labels=ct_labels_s1,
batch=batch_labels_s1,
metric="euclidean",
chunk_size=512,
)

batch_labels_s2 = integrated_s2.obs["batch"].values
Expand All @@ -67,6 +68,7 @@
labels=ct_labels_s2,
batch=batch_labels_s2,
metric="euclidean",
chunk_size=512,
)

bras_score = np.mean([bras_s1, bras_s2])
Expand Down
2 changes: 1 addition & 1 deletion src/metrics/n_inconsistent_peaks/config.vsh.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
__merge__: ../../api/comp_metric.yaml

name: n_inconsistent_peaks
# status: disabled
status: disabled

info:
metrics:
Expand Down
Loading