Commit f01e59f

committed
Updated CHANGELOG and streamlined README with new results.
1 parent b36040b commit f01e59f

2 files changed

+93
-94
lines changed

CHANGELOG.md

Lines changed: 11 additions & 1 deletion
@@ -1,6 +1,16 @@
11
# Changelog
22

3-
## Version 0.4.1 - 0.4.2
3+
## Version 0.5.0
4+
5+
- Updates to work with the latest versions of dependencies, namely **mattress**, **knncolle**, **scrapper**.
6+
- `annotate_integrated()` now returns a named `NamedList` for easier interpretation.
7+
- Accept named references in `train_integrated()` and propagate this to the results of `classify_integrated()`.
8+
9+
## Version 0.4.2
10+
11+
- Remove **pandas** as a dependency.
12+
13+
## Version 0.4.1
414

515
- Added the `aggregate_reference()` function to aggregate references for speed.
616
This is conveniently used via the `aggregate=` option in `train_single()`.

README.md

Lines changed: 82 additions & 93 deletions
@@ -11,80 +11,82 @@
1111
[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)
1212
[![PyPI-Server](https://img.shields.io/pypi/v/singler.svg)](https://pypi.org/project/singler/)
1313
[![Monthly Downloads](https://static.pepy.tech/badge/singler/month)](https://pepy.tech/project/singler)
14-
![Unit tests](https://github.com/SingleR-inc/singler-py/actions/workflows/pypi-test.yml/badge.svg)
14+
![Unit tests](https://github.com/SingleR-inc/singler-py/actions/workflows/run-tests.yml/badge.svg)
1515

1616
# Tinder for single-cell data
1717

1818
## Overview
1919

20-
This package provides Python bindings to the [C++ implementation](https://github.com/SingleR-inc/singlepp) of the [SingleR method](https://github.com/SingleR-inc/SingleR),
20+
This package provides Python bindings to the [C++ implementation](https://github.com/SingleR-inc/singlepp) of the [**SingleR** method](https://github.com/SingleR-inc/SingleR),
2121
originally developed by [Aran et al. (2019)](https://www.nature.com/articles/s41590-018-0276-y).
2222
It is designed to annotate cell types by matching cells to known references based on their expression profiles.
2323
So kind of like Tinder, but for cells.
2424

2525
## Quick start
2626

27-
Firstly, let's load in the famous PBMC 4k dataset from 10X Genomics:
27+
Firstly, let's load in the famous PBMC 4k dataset from 10X Genomics.
28+
Any [`SummarizedExperiment`](https://github.com/biocpy/SummarizedExperiment) can be used here.
2829

2930
```python
30-
import singlecellexperiment as sce
31-
data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
32-
mat = data.assay("counts")
33-
features = [str(x) for x in data.row_data["name"]]
31+
import singlecellexperiment
32+
sce = singlecellexperiment.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
33+
## class: SingleCellExperiment
34+
## dimensions: (33694, 4340)
35+
## assays(1): ['counts']
36+
## row_data columns(2): ['id', 'name']
37+
## row_names(0):
38+
## column_data columns(0): []
39+
## column_names(0):
40+
## main_experiment_name:
41+
## reduced_dims(0): []
42+
## alternative_experiments(0): []
43+
## row_pairs(0): []
44+
## column_pairs(0): []
45+
## metadata(0):
3446
```
3547

36-
or if you are coming from scverse ecosystem, i.e. `AnnData`, simply read the object as `SingleCellExperiment` and extract the matrix and the features.
37-
Read more on [SingleCellExperiment here](https://biocpy.github.io/tutorial/chapters/experiments/single_cell_experiment.html).
38-
39-
40-
```python
41-
import singlecellexperiment as sce
42-
43-
sce_adata = sce.SingleCellExperiment.from_anndata(adata)
44-
45-
# or from a h5ad file
46-
sce_h5ad = sce.read_h5ad("tests/data/adata.h5ad")
47-
```
48-
49-
Now, we fetch the Blueprint/ENCODE reference:
48+
Now, we fetch the Blueprint/ENCODE reference from the [**celldex**](https://pypi.org/project/celldex) package:
5049

5150
```python
5251
import celldex
53-
5452
ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
53+
## class: SummarizedExperiment
54+
## dimensions: (19859, 259)
55+
## assays(1): ['logcounts']
56+
## row_data columns(0): []
57+
## row_names(19859): ['TSPAN6', 'TNMD', 'DPM1', ..., 'MIR522', 'LINC00550', 'GIMAP1-GIMAP5']
58+
## column_data columns(3): ['label.main', 'label.fine', 'label.ont']
59+
## column_names(259): ['mature.neutrophil', 'CD14.positive..CD16.negative.classical.monocyte', 'mature.neutrophil.1', ..., 'fibroblast.of.dermis.1', 'epithelial.cell.of.umbilical.artery.1', 'dermis.lymphatic.vessel.endothelial.cell.1']
60+
## metadata(0):
5561
```
5662

57-
We can annotate each cell in `mat` with the reference:
63+
We annotate each cell in `sce` against the reference.
64+
This yields a data frame that contains all of the assignments and the scores for each label:
5865

5966
```python
6067
import singler
6168
results = singler.annotate_single(
62-
test_data = mat,
63-
test_features = features,
69+
test_data = sce,
70+
test_features = sce.get_row_data()["name"],
6471
ref_data = ref_data,
6572
ref_labels = ref_data.get_column_data().column("label.main"),
6673
)
67-
```
68-
69-
The `results` data frame contains all of the assignments and the scores for each label:
70-
71-
```python
72-
results.column("best")
73-
## ['Monocytes',
74-
## 'Monocytes',
75-
## 'Monocytes',
76-
## 'CD8+ T-cells',
77-
## 'CD4+ T-cells',
78-
## 'CD8+ T-cells',
79-
## 'Monocytes',
80-
## 'Monocytes',
81-
## 'B-cells',
82-
## ...
83-
## ]
84-
85-
results.column("scores").column("Macrophages")
86-
## array([0.35935275, 0.40833545, 0.37430726, ..., 0.32135929, 0.29728435,
87-
## 0.40208581])
74+
print(results)
75+
## BiocFrame with 4340 rows and 3 columns
76+
## best scores delta
77+
## <list> <BiocFrame> <ndarray[float64]>
78+
## [0] Monocytes 0.2562168476981947:0.1254343439610945... 0.4378177347327983
79+
## [1] Monocytes 0.2834593285584352:0.1350551446328624... 0.06708042619997218
80+
## [2] Monocytes 0.27001789110872965:0.149733483922888... 0.29630159290612557
81+
## ... ... ...
82+
## [4337] NK cells 0.22504679944584366:0.128832705528845... 0.09253938940916262
83+
## [4338] B-cells 0.21466213533061748:0.143717963254983... 0.06727011631382662
84+
## [4339] Monocytes 0.2880677943712168:0.1327331541412791... 0.06576621116161818
85+
## ------
86+
## metadata(2): used markers
87+
88+
print(results["scores"]["Macrophages"])
89+
## [0.3553803 0.40346796 0.3680465 ... 0.32339334 0.29082273 0.39644526]
8890
```
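
One way to use this table downstream is to tally the assigned labels and flag cells with a small `delta`. The sketch below is plain Python over the six rows printed above and does not call **singler** itself. It assumes that a small `delta` indicates a less certain assignment; the 0.1 cutoff and the names `rows`, `flag_ambiguous` and `min_delta` are illustrative assumptions, not part of the package.

```python
from collections import Counter

# (best, delta) pairs copied from the six rows printed above.
rows = [
    ("Monocytes", 0.4378177347327983),
    ("Monocytes", 0.06708042619997218),
    ("Monocytes", 0.29630159290612557),
    ("NK cells", 0.09253938940916262),
    ("B-cells", 0.06727011631382662),
    ("Monocytes", 0.06576621116161818),
]


def flag_ambiguous(rows, min_delta=0.1):
    """Return positions of rows whose delta is below min_delta.

    min_delta is a hypothetical cutoff chosen for illustration only.
    """
    return [i for i, (_, delta) in enumerate(rows) if delta < min_delta]


# Count how often each label was assigned in this excerpt.
label_counts = Counter(best for best, _ in rows)

# Positions refer to this six-row excerpt, not to the original cell indices.
ambiguous = flag_ambiguous(rows)

print(label_counts)
print(ambiguous)
```

With real results, the same logic would be applied to the full `best` and `delta` columns rather than to a hand-copied excerpt.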
8991

9092
## Calling low-level functions
@@ -95,40 +97,42 @@ This allows us to re-use the same reference for multiple datasets without repeat
9597

9698
```python
9799
built = singler.train_single(
98-
ref_data = ref_data.assay("logcounts"),
100+
ref_data = ref_data,
99101
ref_labels = ref_data.get_column_data().column("label.main"),
100102
ref_features = ref_data.get_row_names(),
101-
test_features = features,
103+
test_features = sce.get_row_data()["name"]
102104
)
103105
```
104106

105-
And finally, we apply the pre-built reference to the test dataset to obtain our label assignments.
107+
Then, we apply the pre-built reference to the test dataset to obtain our label assignments.
106108
This can be repeated with different datasets that have the same features as `test_features=`.
107109

108110
```python
109111
output = singler.classify_single(mat, ref_prebuilt=built)
112+
print(output)
113+
## BiocFrame with 4340 rows and 3 columns
114+
## best scores delta
115+
## <list> <BiocFrame> <ndarray[float64]>
116+
## [0] Monocytes 0.2562168476981947:0.1254343439610945... 0.4378177347327983
117+
## [1] Monocytes 0.2834593285584352:0.1350551446328624... 0.06708042619997218
118+
## [2] Monocytes 0.27001789110872965:0.149733483922888... 0.29630159290612557
119+
## ... ... ...
120+
## [4337] NK cells 0.22504679944584366:0.128832705528845... 0.09253938940916262
121+
## [4338] B-cells 0.21466213533061748:0.143717963254983... 0.06727011631382662
122+
## [4339] Monocytes 0.2880677943712168:0.1327331541412791... 0.06576621116161818
123+
## ------
124+
## metadata(2): used markers
110125
```
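
Before re-using a pre-built reference on another dataset, it can be worth checking that the new dataset's features match those supplied as `test_features=` during training. The helper below is a minimal sketch in plain Python; `same_features` is a hypothetical name, and the check assumes that the features must match in order as well as in content. The gene names are taken from the reference row names shown earlier and serve only as toy input.

```python
def same_features(trained_features, new_features):
    """Return True if both feature sequences are identical, element by element.

    Hypothetical helper: assumes row order matters, so a reordered
    feature list is treated as a mismatch.
    """
    return list(trained_features) == list(new_features)


# Toy feature lists for illustration.
trained = ["TSPAN6", "TNMD", "DPM1"]
matching = ["TSPAN6", "TNMD", "DPM1"]
reordered = ["TNMD", "TSPAN6", "DPM1"]

print(same_features(trained, matching))
print(same_features(trained, reordered))
```

If the check fails, the safer option is to call `train_single()` again with the new dataset's features.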
111126

112-
## output
113-
BiocFrame with 4340 rows and 3 columns
114-
best scores delta
115-
<list> <BiocFrame> <ndarray[float64]>
116-
[0] Monocytes 0.33265560369962943:0.407117403330602... 0.40706830113982534
117-
[1] Monocytes 0.4078771641637374:0.4783396310685646... 0.07000418564184802
118-
[2] Monocytes 0.3517036021728629:0.4076971245524348... 0.30997293412307647
119-
... ... ...
120-
[4337] NK cells 0.3472631136865701:0.3937898240670208... 0.09640242155786138
121-
[4338] B-cells 0.26974632191999887:0.334862058137758... 0.061215905058676856
122-
[4339] Monocytes 0.39390119034537324:0.468867490667427... 0.06678168346812047
123-
124127
## Integrating labels across references
125128

126-
We can use annotations from multiple references through the `annotate_integrated()` function:
129+
We can use annotations from multiple references through the `annotate_integrated()` function.
130+
This annotates the test dataset against each reference individually to obtain the best per-reference label,
131+
and then it compares across references to find the best label from all references.
127132

128133
```python
129134
import singler
130135
import celldex
131-
132136
blueprint_ref = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
133137
immune_cell_ref = celldex.fetch_reference("dice", "2024-02-26", realize_assays=True)
134138

@@ -142,36 +146,21 @@ integrated_res = singler.annotate_integrated(
142146
blueprint_ref.get_column_data().column("label.main"),
143147
immune_cell_ref.get_column_data().column("label.main")
144148
],
145-
test_features = features,
146-
num_threads = 6
149+
test_features = features
147150
)
148-
```
149151

150-
This annotates the test dataset against each reference individually to obtain the best per-reference label,
151-
and then it compares across references to find the best label from all references.
152-
153-
```python
154-
integrated_res["integrated"].column("best_label")
155-
## ['Monocytes',
156-
## 'Monocytes',
157-
## 'Monocytes',
158-
## 'CD8+ T-cells',
159-
## 'CD4+ T-cells',
160-
## 'CD8+ T-cells',
161-
## 'Monocytes',
162-
## 'Monocytes',
163-
## ...
164-
## ]
165-
166-
integrated_res["integrated"].column("best_reference")
167-
## [0,
168-
## 0,
169-
## 0,
170-
## 0,
171-
## 0,
172-
## 0,
173-
## 0,
174-
## 0,
175-
## ...
176-
## ]
152+
print(integrated_res["integrated"])
153+
## BiocFrame with 4340 rows and 4 columns
154+
## best_label best_reference scores delta
155+
## <list> <ndarray[uint32]> <BiocFrame> <ndarray[float64]>
156+
## [0] Monocytes 0 Monocytes:0.4601040318745395:Monocyte... 0.07172619402931646
157+
## [1] Monocytes 0 Monocytes:0.5569436588644365:Monocyte... 0.10337145230321299
158+
## [2] Monocytes 0 Monocytes:0.460675384672641:Monocytes... 0.06302300967458618
159+
## ... ... ... ...
160+
## [4337] NK cells 0 NK cells:0.5639386082584756:NK cells:... 0.02453897370863012
161+
## [4338] B-cells 0 B-cells:0.49462921210156113:B cells:0... 0.0259893105975339
162+
## [4339] Monocytes 0 Monocytes:0.49997247809330014:Monocyt... 0.08744894116986357
177163
```
164+
165+
The ``best_label`` column contains the best label for each cell across all references,
166+
while the ``best_reference`` specifies the reference (in the same order as ``ref_data``) that contains the best label.
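
Because ``best_reference`` is a positional index, it can help to translate it back into a reference name. The sketch below is plain Python using the six rows printed above. It assumes that ``ref_data`` was passed in the same order as the label lists (Blueprint/ENCODE first, DICE second); `ref_names` and `name_references` are hypothetical names introduced for illustration.

```python
# Reference names, assumed to follow the order of ref_data in the call above.
ref_names = ["blueprint_encode", "dice"]

# Values copied from the six rows printed above.
best_label = ["Monocytes", "Monocytes", "Monocytes", "NK cells", "B-cells", "Monocytes"]
best_reference = [0, 0, 0, 0, 0, 0]


def name_references(indices, names):
    """Map each positional reference index to its name (hypothetical helper)."""
    return [names[i] for i in indices]


named = name_references(best_reference, ref_names)

# Pair each label with the reference it came from.
pairs = list(zip(best_label, named))
print(pairs)
```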
