generated from openproblems-bio/task_template
Add scPRINT method #13
Merged
Changes from 12 commits
15 commits
a49d642 Add scPRINT component files (lazappi)
d5b31d6 Load and preprocess data for scPRINT (lazappi)
c9e8a74 Try running model... (lazappi)
5ece7e2 Merge remote-tracking branch 'origin/main' into feature/no-ref/add-sc… (lazappi)
feab72b Adjust scPRINT installation (lazappi)
195a34e Embed and save scPRINT output (lazappi)
8eaea11 Detect available cores (lazappi)
d271396 Adjust arguments if GPU available (lazappi)
37a3f50 Add model argument to scPRINT (lazappi)
62704e4 Add scPRINT to benchmark workflow (lazappi)
4774c98 Make scPRINT inherit from base method (lazappi)
c3ce8df style code (rcannood)
18b8600 Apply suggestions from code review (lazappi)
5162d00 Remove test workflow file (lazappi)
7d4906a Fix test data path (lazappi)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| __merge__: /src/api/base_method.yaml | ||
| | ||
| name: scprint | ||
| label: scPRINT | ||
| summary: scPRINT is a large transformer model built for the inference of gene networks | ||
| description: | | ||
| scPRINT is a large transformer model built for the inference of gene networks | ||
| (connections between genes explaining the cell's expression profile) from | ||
| scRNAseq data. | ||
| | ||
| It uses novel encoding and decoding of the cell expression profile and new | ||
| pre-training methodologies to learn a cell model. | ||
| | ||
| scPRINT can be used to perform the following analyses: | ||
| | ||
| - expression denoising: increase the resolution of your scRNAseq data | ||
| - cell embedding: generate a low-dimensional representation of your dataset | ||
| - label prediction: predict the cell type, disease, sequencer, sex, and | ||
| ethnicity of your cells | ||
| - gene network inference: generate a gene network from any cell or cell | ||
| cluster in your scRNAseq dataset | ||
| | ||
| references: | ||
| doi: | ||
| - 10.1101/2024.07.29.605556 | ||
| | ||
| links: | ||
| documentation: https://cantinilab.github.io/scPRINT/ | ||
| repository: https://github.com/cantinilab/scPRINT | ||
| | ||
| info: | ||
| preferred_normalization: counts | ||
| method_types: [embedding] | ||
| variants: | ||
| scprint_large: | ||
| model: "large" | ||
| scprint_medium: | ||
| model: "medium" | ||
| scprint_small: | ||
| model: "small" | ||
| | ||
| arguments: | ||
| - name: "--model" | ||
| type: "string" | ||
| description: String representing the scPRINT model to use | ||
| choices: ["large", "medium", "small"] | ||
| default: "large" | ||
| | ||
| resources: | ||
| - type: python_script | ||
| path: script.py | ||
| - path: /src/utils/read_anndata_partial.py | ||
| | ||
| engines: | ||
| - type: docker | ||
| image: openproblems/base_pytorch_nvidia:1.0.0 | ||
| setup: | ||
| - type: python | ||
| pip: | ||
| - huggingface_hub | ||
| - scprint | ||
| - type: docker | ||
| run: lamin init --storage ./main --name main --schema bionty | ||
| - type: python | ||
| script: import bionty as bt; bt.core.sync_all_sources_to_latest() | ||
| - type: docker | ||
| run: lamin load anonymous/main | ||
| - type: python | ||
| script: from scdataloader.utils import populate_my_ontology; populate_my_ontology() | ||
| | ||
| runners: | ||
| - type: executable | ||
| - type: nextflow | ||
| directives: | ||
| label: [midtime, midmem, midcpu, gpu] | ||
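
The `arguments` section above declares a single `--model` option restricted to three choices, and the script further down turns that value into a checkpoint file name. The following is a minimal sketch of that contract using only the standard library. It is not how Viash generates its argument handling; `build_parser` and `checkpoint_filename` are hypothetical helper names introduced here for illustration.

```python
import argparse

# Choices and default copied from the `arguments` section of the config.
MODEL_CHOICES = ["large", "medium", "small"]
DEFAULT_MODEL = "large"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the argument parsing that Viash generates."""
    parser = argparse.ArgumentParser(prog="scprint")
    parser.add_argument(
        "--model",
        type=str,
        choices=MODEL_CHOICES,
        default=DEFAULT_MODEL,
        help="String representing the scPRINT model to use",
    )
    return parser


def checkpoint_filename(model: str) -> str:
    """Mirror the f-string the script passes to hf_hub_download."""
    if model not in MODEL_CHOICES:
        raise ValueError(f"Unknown model '{model}', expected one of {MODEL_CHOICES}")
    return f"{model}.ckpt"


if __name__ == "__main__":
    args = build_parser().parse_args(["--model", "medium"])
    print(checkpoint_filename(args.model))
```

Each entry under `variants` in the config corresponds to one of these `--model` values, so the three variants differ only in which checkpoint file is downloaded.
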
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| import anndata as ad | ||
| from scdataloader import Preprocessor | ||
| import sys | ||
| from huggingface_hub import hf_hub_download | ||
| from scprint.tasks import Embedder | ||
| from scprint import scPrint | ||
| import scprint | ||
| import torch | ||
| import os | ||
| | ||
| ## VIASH START | ||
| par = { | ||
| "input": "resources_test/.../input.h5ad", | ||
| "output": "output.h5ad", | ||
| "model": "large", | ||
| } | ||
| meta = {"name": "scprint", "resources_dir": "src/utils"} | ||
| ## VIASH END | ||
| | ||
| sys.path.append(meta["resources_dir"]) | ||
| from read_anndata_partial import read_anndata | ||
| | ||
| print(f"====== scPRINT version {scprint.__version__} ======", flush=True) | ||
| | ||
| print("\n>>> Reading input data...", flush=True) | ||
| input = read_anndata(par["input"], X="layers/counts", obs="obs", var="var", uns="uns") | ||
| if input.uns["dataset_organism"] == "homo_sapiens": | ||
| input.obs["organism_ontology_term_id"] = "NCBITaxon:9606" | ||
| elif input.uns["dataset_organism"] == "mus_musculus": | ||
| input.obs["organism_ontology_term_id"] = "NCBITaxon:10090" | ||
| else: | ||
| raise ValueError( | ||
| f"scPRINT requires human or mouse data, not '{input.uns['dataset_organism']}'" | ||
| ) | ||
| adata = input.copy() | ||
| | ||
| print("\n>>> Preprocessing data...", flush=True) | ||
| preprocessor = Preprocessor( | ||
| # Lower this threshold for test datasets | ||
| min_valid_genes_id=1000 if input.n_vars < 2000 else 10000, | ||
| # Turn off cell filtering to return results for all cells | ||
| filter_cell_by_counts=False, | ||
| min_nnz_genes=False, | ||
| do_postp=False, | ||
| # Skip ontology checks | ||
| skip_validate=True, | ||
| ) | ||
| adata = preprocessor(adata) | ||
| | ||
| print(f"\n>>> Downloading '{par['model']}' model...", flush=True) | ||
| model_checkpoint_file = hf_hub_download( | ||
| repo_id="jkobject/scPRINT", filename=f"{par['model']}.ckpt" | ||
| ) | ||
| print(f"Model checkpoint file: '{model_checkpoint_file}'", flush=True) | ||
| model = scPrint.load_from_checkpoint( | ||
| model_checkpoint_file, | ||
| transformer="normal", # Don't use this for GPUs with flashattention | ||
| precpt_gene_emb=None, | ||
| ) | ||
| | ||
| print("\n>>> Embedding data...", flush=True) | ||
| if torch.cuda.is_available(): | ||
| print("CUDA is available, using GPU", flush=True) | ||
| precision = "16" | ||
| dtype = torch.float16 | ||
| else: | ||
| print("CUDA is not available, using CPU", flush=True) | ||
| precision = "32" | ||
| dtype = torch.float32 | ||
| n_cores_available = len(os.sched_getaffinity(0)) | ||
| print(f"Using {n_cores_available} worker cores") | ||
| embedder = Embedder( | ||
| how="random expr", | ||
| max_len=4000, | ||
| add_zero_genes=0, | ||
| num_workers=n_cores_available, | ||
| doclass=False, | ||
| doplot=False, | ||
| precision=precision, | ||
| dtype=dtype, | ||
| ) | ||
| embedded, _ = embedder(model, adata, cache=False) | ||
| | ||
| print("\n>>> Storing output...", flush=True) | ||
| output = ad.AnnData( | ||
| obs=input.obs[[]], | ||
| var=input.var[[]], | ||
| obsm={ | ||
| "X_emb": embedded.obsm["scprint"], | ||
| }, | ||
| uns={ | ||
| "dataset_id": input.uns["dataset_id"], | ||
| "normalization_id": input.uns["normalization_id"], | ||
| "method_id": meta["name"], | ||
| }, | ||
| ) | ||
| print(output) | ||
| | ||
| print("\n>>> Writing output AnnData to file...", flush=True) | ||
| output.write_h5ad(par["output"], compression="gzip") | ||
| | ||
| print("\n>>> Done!", flush=True) | ||
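
Three small decisions in the script above are easy to check in isolation: mapping the dataset organism to an NCBI Taxonomy term, choosing precision based on GPU availability, and counting usable worker cores. The sketch below restates them as standalone functions using only the standard library. The function names are hypothetical, the dtype is returned as a plain string rather than a `torch` dtype so the example runs without PyTorch, and the `os.cpu_count()` fallback is an addition for platforms without `os.sched_getaffinity`, not something the script itself does.

```python
import os

# Mapping copied from the script; other organisms are rejected.
ORGANISM_TO_TAXON = {
    "homo_sapiens": "NCBITaxon:9606",
    "mus_musculus": "NCBITaxon:10090",
}


def organism_ontology_term(dataset_organism: str) -> str:
    """Return the ontology term the script writes to obs, or raise for other organisms."""
    try:
        return ORGANISM_TO_TAXON[dataset_organism]
    except KeyError:
        raise ValueError(
            f"scPRINT requires human or mouse data, not '{dataset_organism}'"
        ) from None


def select_precision(cuda_available: bool) -> tuple[str, str]:
    """Return (precision, dtype name): half precision on GPU, full precision on CPU."""
    if cuda_available:
        return "16", "float16"
    return "32", "float32"


def available_cores() -> int:
    """Count the cores this process may use.

    os.sched_getaffinity is not available on every platform, so this sketch
    falls back to os.cpu_count() (an assumption, not part of the script).
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(organism_ontology_term("homo_sapiens"))
    print(select_precision(cuda_available=False))
    print(f"Using {available_cores()} worker cores")
```

Using the affinity mask rather than the total CPU count matters inside containers and schedulers, where a process is often restricted to fewer cores than the host has.
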
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| workflow auto { | ||
| findStates(params, meta.config) | ||
| | view{"In: $it"} | ||
| | meta.workflow.run( | ||
| auto: [publish: "state"] | ||
| ) | ||
| } | ||
| | ||
| workflow run_wf { | ||
| take: | ||
| input_ch | ||
| | ||
| main: | ||
| output_ch = input_ch | ||
| | view{"Mid: $it"} | ||
| | ||
| emit: | ||
| output_ch | ||
| } |