1111[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)
1212[![PyPI-Server](https://img.shields.io/pypi/v/singler.svg)](https://pypi.org/project/singler/)
1313[![Monthly Downloads](https://static.pepy.tech/badge/singler/month)](https://pepy.tech/project/singler)
14- ![Unit tests](https://github.com/SingleR-inc/singler-py/actions/workflows/pypi-test.yml/badge.svg)
14+ ![Unit tests](https://github.com/SingleR-inc/singler-py/actions/workflows/run-tests.yml/badge.svg)
1515
1616# Tinder for single-cell data
1717
1818## Overview
1919
20- This package provides Python bindings to the [C++ implementation](https://github.com/SingleR-inc/singlepp) of the [SingleR method](https://github.com/SingleR-inc/SingleR),
20+ This package provides Python bindings to the [C++ implementation](https://github.com/SingleR-inc/singlepp) of the [**SingleR** method](https://github.com/SingleR-inc/SingleR),
2121originally developed by [Aran et al. (2019)](https://www.nature.com/articles/s41590-018-0276-y).
2222It is designed to annotate cell types by matching cells to known references based on their expression profiles.
2323So kind of like Tinder, but for cells.
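
To make "matching cells to known references based on their expression profiles" concrete, here is a deliberately simplified sketch that scores one cell against each reference profile with a Spearman rank correlation and picks the highest-scoring label. It is a toy illustration, not the package's implementation: the real method also involves marker selection and fine-tuning, and the helper names (`ranks`, `spearman`, `best_match`) and all numbers below are made up for this example.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def best_match(cell, references):
    """Score a cell against every reference profile and return the top label."""
    scores = {label: spearman(cell, profile) for label, profile in references.items()}
    return max(scores, key=scores.get), scores


# Hypothetical expression values over four genes (same gene order everywhere).
references = {
    "B-cells": [9.0, 1.0, 0.5, 7.0],
    "Monocytes": [0.5, 8.0, 6.0, 1.0],
}
cell = [1.0, 5.0, 4.0, 0.0]

label, scores = best_match(cell, references)
print(label, scores)
```

Rank-based correlation is used here because it depends only on the ordering of genes within a profile, which makes the comparison less sensitive to differences in scale between the test data and the reference.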
2424
2525## Quick start
2626
27- Firstly, let's load in the famous PBMC 4k dataset from 10X Genomics:
27+ Firstly, let's load in the famous PBMC 4k dataset from 10X Genomics.
28+ Any [`SummarizedExperiment`](https://github.com/biocpy/SummarizedExperiment) can be used here.
2829
2930``` python
30- import singlecellexperiment as sce
31- data = sce.read_tenx_h5(" pbmc4k-tenx.h5" , realize_assays = True )
32- mat = data.assay(" counts" )
33- features = [str (x) for x in data.row_data[" name" ]]
31+ import singlecellexperiment
32+ sce = singlecellexperiment.read_tenx_h5(" pbmc4k-tenx.h5" , realize_assays = True )
33+ # # class: SingleCellExperiment
34+ # # dimensions: (33694, 4340)
35+ # # assays(1): ['counts']
36+ # # row_data columns(2): ['id', 'name']
37+ # # row_names(0):
38+ # # column_data columns(0): []
39+ # # column_names(0):
40+ # # main_experiment_name:
41+ # # reduced_dims(0): []
42+ # # alternative_experiments(0): []
43+ # # row_pairs(0): []
44+ # # column_pairs(0): []
45+ # # metadata(0):
3446```
3547
36- or if you are coming from scverse ecosystem, i.e. ` AnnData ` , simply read the object as ` SingleCellExperiment ` and extract the matrix and the features.
37- Read more on [ SingleCellExperiment here] ( https://biocpy.github.io/tutorial/chapters/experiments/single_cell_experiment.html ) .
38-
39-
40- ``` python
41- import singlecellexperiment as sce
42-
43- sce_adata = sce.SingleCellExperiment.from_anndata(adata)
44-
45- # or from a h5ad file
46- sce_h5ad = sce.read_h5ad(" tests/data/adata.h5ad" )
47- ```
48-
49- Now, we fetch the Blueprint/ENCODE reference:
48+ Now, we fetch the Blueprint/ENCODE reference from the [**celldex**](https://pypi.org/project/celldex) package:
5049
5150``` python
5251import celldex
53-
5452ref_data = celldex.fetch_reference(" blueprint_encode" , " 2024-02-26" , realize_assays = True )
53+ # # class: SummarizedExperiment
54+ # # dimensions: (19859, 259)
55+ # # assays(1): ['logcounts']
56+ # # row_data columns(0): []
57+ # # row_names(19859): ['TSPAN6', 'TNMD', 'DPM1', ..., 'MIR522', 'LINC00550', 'GIMAP1-GIMAP5']
58+ # # column_data columns(3): ['label.main', 'label.fine', 'label.ont']
59+ # # column_names(259): ['mature.neutrophil', 'CD14.positive..CD16.negative.classical.monocyte', 'mature.neutrophil.1', ..., 'fibroblast.of.dermis.1', 'epithelial.cell.of.umbilical.artery.1', 'dermis.lymphatic.vessel.endothelial.cell.1']
60+ # # metadata(0):
5561```
5662
57- We can annotate each cell in ` mat ` with the reference:
63+ We annotate each cell in ` sce ` against the reference.
64+ This yields a data frame that contains all of the assignments and the scores for each label:
5865
5966``` python
6067import singler
6168results = singler.annotate_single(
62- test_data = mat ,
63- test_features = features ,
69+ test_data = sce ,
70+ test_features = sce.get_row_data()["name"],
6471 ref_data = ref_data,
6572 ref_labels = ref_data.get_column_data().column(" label.main" ),
6673)
67- ```
68-
69- The ` results ` data frame contains all of the assignments and the scores for each label:
70-
71- ``` python
72- results.column(" best" )
73- # # ['Monocytes',
74- # # 'Monocytes',
75- # # 'Monocytes',
76- # # 'CD8+ T-cells',
77- # # 'CD4+ T-cells',
78- # # 'CD8+ T-cells',
79- # # 'Monocytes',
80- # # 'Monocytes',
81- # # 'B-cells',
82- # # ...
83- # # ]
84-
85- results.column(" scores" ).column(" Macrophages" )
86- # # array([0.35935275, 0.40833545, 0.37430726, ..., 0.32135929, 0.29728435,
87- # # 0.40208581])
74+ print (results)
75+ # # BiocFrame with 4340 rows and 3 columns
76+ # # best scores delta
77+ # # <list> <BiocFrame> <ndarray[float64]>
78+ # # [0] Monocytes 0.2562168476981947:0.1254343439610945... 0.4378177347327983
79+ # # [1] Monocytes 0.2834593285584352:0.1350551446328624... 0.06708042619997218
80+ # # [2] Monocytes 0.27001789110872965:0.149733483922888... 0.29630159290612557
81+ # # ... ... ...
82+ # # [4337] NK cells 0.22504679944584366:0.128832705528845... 0.09253938940916262
83+ # # [4338] B-cells 0.21466213533061748:0.143717963254983... 0.06727011631382662
84+ # # [4339] Monocytes 0.2880677943712168:0.1327331541412791... 0.06576621116161818
85+ # # ------
86+ # # metadata(2): used markers
87+
88+ print (results[" scores" ][" Macrophages" ])
89+ # # [0.3553803 0.40346796 0.3680465 ... 0.32339334 0.29082273 0.39644526]
8890```
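
A common next step is to summarize how many cells received each label. The sketch below assumes the `best` column can be treated as a plain list of strings (the printed output above reports its type as `<list>`); the `best` list here is a hypothetical stand-in so that the snippet runs without downloading any data.

```python
from collections import Counter

# Hypothetical stand-in for the "best" column of the results;
# with real results, substitute the actual column of assigned labels.
best = ["Monocytes", "Monocytes", "Monocytes", "NK cells", "B-cells", "Monocytes"]

# Tally the number of cells assigned to each label.
counts = Counter(best)

# Print labels from most to least frequent.
for label, n in counts.most_common():
    print(f"{label}: {n}")
```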
8991
9092## Calling low-level functions
@@ -95,40 +97,42 @@ This allows us to re-use the same reference for multiple datasets without repeat
9597
9698``` python
9799built = singler.train_single(
98- ref_data = ref_data.assay( " logcounts " ) ,
100+ ref_data = ref_data,
99101 ref_labels = ref_data.get_column_data().column(" label.main" ),
100102 ref_features = ref_data.get_row_names(),
101- test_features = features,
103+ test_features = sce.get_row_data()["name"]
102104)
103105```
104106
105- And finally , we apply the pre-built reference to the test dataset to obtain our label assignments.
107+ Then , we apply the pre-built reference to the test dataset to obtain our label assignments.
106108This can be repeated with different datasets that have the same features as ` test_features= ` .
107109
108110``` python
109111output = singler.classify_single(sce, ref_prebuilt = built)
112+ print (output)
113+ # # BiocFrame with 4340 rows and 3 columns
114+ # # best scores delta
115+ # # <list> <BiocFrame> <ndarray[float64]>
116+ # # [0] Monocytes 0.2562168476981947:0.1254343439610945... 0.4378177347327983
117+ # # [1] Monocytes 0.2834593285584352:0.1350551446328624... 0.06708042619997218
118+ # # [2] Monocytes 0.27001789110872965:0.149733483922888... 0.29630159290612557
119+ # # ... ... ...
120+ # # [4337] NK cells 0.22504679944584366:0.128832705528845... 0.09253938940916262
121+ # # [4338] B-cells 0.21466213533061748:0.143717963254983... 0.06727011631382662
122+ # # [4339] Monocytes 0.2880677943712168:0.1327331541412791... 0.06576621116161818
123+ # # ------
124+ # # metadata(2): used markers
110125```
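
When re-using a pre-built reference across datasets, it can help to confirm that each new dataset really has the same features as the ones supplied during training. The helper below is a hypothetical sketch, not part of **singler**; it assumes that "the same features" means identical feature names in identical row order.

```python
def check_features(trained_features, new_features):
    """Return True if both feature lists match exactly; raise ValueError otherwise."""
    trained = list(trained_features)
    new = list(new_features)
    if len(trained) != len(new):
        raise ValueError(
            f"expected {len(trained)} features but the new dataset has {len(new)}"
        )
    # Compare position by position so that the first mismatch can be reported.
    for i, (a, b) in enumerate(zip(trained, new)):
        if a != b:
            raise ValueError(f"feature mismatch at position {i}: {a!r} != {b!r}")
    return True


# Hypothetical feature names for illustration.
trained = ["TSPAN6", "TNMD", "DPM1"]
print(check_features(trained, ["TSPAN6", "TNMD", "DPM1"]))

try:
    check_features(trained, ["TSPAN6", "DPM1", "TNMD"])
except ValueError as err:
    print(f"cannot re-use the pre-built reference: {err}")
```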
111126
112- ## output
113- BiocFrame with 4340 rows and 3 columns
114- best scores delta
115- <list> <BiocFrame> <ndarray[float64]>
116- [0] Monocytes 0.33265560369962943:0.407117403330602... 0.40706830113982534
117- [1] Monocytes 0.4078771641637374:0.4783396310685646... 0.07000418564184802
118- [2] Monocytes 0.3517036021728629:0.4076971245524348... 0.30997293412307647
119- ... ... ...
120- [4337] NK cells 0.3472631136865701:0.3937898240670208... 0.09640242155786138
121- [4338] B-cells 0.26974632191999887:0.334862058137758... 0.061215905058676856
122- [4339] Monocytes 0.39390119034537324:0.468867490667427... 0.06678168346812047
123-
124127## Integrating labels across references
125128
126- We can use annotations from multiple references through the ` annotate_integrated() ` function:
129+ We can use annotations from multiple references through the ` annotate_integrated() ` function.
130+ This annotates the test dataset against each reference individually to obtain the best per-reference label,
131+ and then it compares across references to find the best label from all references.
127132
128133``` python
129134import singler
130135import celldex
131-
132136blueprint_ref = celldex.fetch_reference(" blueprint_encode" , " 2024-02-26" , realize_assays = True )
133137immune_cell_ref = celldex.fetch_reference(" dice" , " 2024-02-26" , realize_assays = True )
134138
@@ -142,36 +146,21 @@ integrated_res = singler.annotate_integrated(
142146 blueprint_ref.get_column_data().column(" label.main" ),
143147 immune_cell_ref.get_column_data().column(" label.main" )
144148 ],
145- test_features = features,
146- num_threads = 6
149+ test_features = sce.get_row_data()["name"]
147150)
148- ```
149151
150- This annotates the test dataset against each reference individually to obtain the best per-reference label,
151- and then it compares across references to find the best label from all references.
152-
153- ``` python
154- integrated_res[" integrated" ].column(" best_label" )
155- # # ['Monocytes',
156- # # 'Monocytes',
157- # # 'Monocytes',
158- # # 'CD8+ T-cells',
159- # # 'CD4+ T-cells',
160- # # 'CD8+ T-cells',
161- # # 'Monocytes',
162- # # 'Monocytes',
163- # # ...
164- # # ]
165-
166- integrated_res[" integrated" ].column(" best_reference" )
167- # # [0,
168- # # 0,
169- # # 0,
170- # # 0,
171- # # 0,
172- # # 0,
173- # # 0,
174- # # 0,
175- # # ...
176- # # ]
152+ print (integrated_res[" integrated" ])
153+ # # BiocFrame with 4340 rows and 4 columns
154+ # # best_label best_reference scores delta
155+ # # <list> <ndarray[uint32]> <BiocFrame> <ndarray[float64]>
156+ # # [0] Monocytes 0 Monocytes:0.4601040318745395:Monocyte... 0.07172619402931646
157+ # # [1] Monocytes 0 Monocytes:0.5569436588644365:Monocyte... 0.10337145230321299
158+ # # [2] Monocytes 0 Monocytes:0.460675384672641:Monocytes... 0.06302300967458618
159+ # # ... ... ... ...
160+ # # [4337] NK cells 0 NK cells:0.5639386082584756:NK cells:... 0.02453897370863012
161+ # # [4338] B-cells 0 B-cells:0.49462921210156113:B cells:0... 0.0259893105975339
162+ # # [4339] Monocytes 0 Monocytes:0.49997247809330014:Monocyt... 0.08744894116986357
177163```
164+
165+ The ``best_label`` column contains the best label for each cell across all references,
166+ while the ``best_reference`` column specifies the reference (as an index into ``ref_data``, in the same order) that contains the best label.
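
Because `best_reference` holds positions rather than names, it is often convenient to translate it back into readable reference names. The sketch below uses hypothetical stand-in lists for the two columns; the `ref_names` list is an assumption that simply mirrors the order in which the references were passed as `ref_data`.

```python
# Names of the references, in the same order as they were passed as ref_data.
ref_names = ["blueprint_encode", "dice"]

# Hypothetical stand-ins for the "best_label" and "best_reference" columns.
best_label = ["Monocytes", "NK cells", "B cells"]
best_reference = [0, 0, 1]

# Pair each label with the name of the reference it came from.
annotated = [
    f"{label} ({ref_names[ref_index]})"
    for label, ref_index in zip(best_label, best_reference)
]
print(annotated)
```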