
Commit cbaf43e

Merge pull request #6418 from pavanvidem/scatac-sinto-param
Fix a param in scATAC-seq preprocessing tutorial and update the tools
2 parents 5ad8e18 + 4f6e37c commit cbaf43e

6 files changed: +1345 -18 lines changed

topics/single-cell/tutorials/scatac-preprocessing-tenx/tutorial.md

Lines changed: 18 additions & 8 deletions
@@ -41,6 +41,15 @@ contributions:
   authorship:
     - pavanvidem
 
+answer_histories:
+  - label: UseGalaxy.eu - scATAC-seq FASTQ to Count Matrix
+    history: https://usegalaxy.eu/u/videmp/h/scatac-seq-fastq-to-count-matrix
+    date: 2025-10-15
+  - label: UseGalaxy.eu - scATAC-seq Count Matrix Filtering
+    history: https://usegalaxy.eu/u/videmp/h/scatac-seq-count-matrix-filtering
+    date: 2025-10-15
+
+
 gitter: Galaxy-Training-Network/galaxy-single-cell
 
 ---
@@ -242,6 +251,7 @@ An ATAC-seq fragment file is a BED file with Tn5 integration sites, the cell bar
 > - *"Regular expression used to extract cell barcode from read name"*: `[^:]*` (matches all characters up to the first colon)
 > - *"Number of bases to shift Tn5 insertion position by on the forward strand"*: `4`
 > - *"Number of bases to shift Tn5 insertion position by on the reverse strand"*: `-5`
+> - *"Take cell barcode into account when collapsing duplicate fragments"*: `Yes`
 >
 > 1. {% tool [bedtools SortBED](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_sortbed/2.30.0+galaxy2) %} with the following parameters:
 > - *"Sort the following BED/bedGraph/GFF/VCF/EncodePeak file *"*: `fragments BED` (output of **Sinto fragments** {% icon tool%})`
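Review note: the parameters in this hunk can be illustrated with a small standalone sketch. It is not Sinto's implementation; the function names and the toy fragment tuples are hypothetical and only show what the barcode regular expression, the `4`/`-5` shifts, and the barcode-aware duplicate collapsing are expected to do.

```python
import re

# Same pattern as in the tutorial step: everything up to the first colon.
BARCODE_RE = re.compile(r"[^:]*")


def extract_barcode(read_name):
    """Return the part of the read name before the first colon."""
    return BARCODE_RE.match(read_name).group(0)


def shift_fragment(start, end, shift_plus=4, shift_minus=-5):
    """Apply the Tn5 offsets: +4 on the forward end, -5 on the reverse end."""
    return start + shift_plus, end + shift_minus


def collapse_duplicates(fragments, use_barcode=True):
    """Count identical fragments; hypothetical stand-in for duplicate collapsing.

    Each fragment is a (chrom, start, end, barcode) tuple. With
    use_barcode=True, fragments at the same coordinates but from different
    cells are kept apart.
    """
    counts = {}
    for chrom, start, end, barcode in fragments:
        key = (chrom, start, end, barcode) if use_barcode else (chrom, start, end)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Toy data (made up): two cells share the same fragment coordinates.
toy = [
    ("chr1", 100, 200, "AAAC"),
    ("chr1", 100, 200, "AAAC"),
    ("chr1", 100, 200, "GGGT"),
]
with_barcode = collapse_duplicates(toy, use_barcode=True)
without_barcode = collapse_duplicates(toy, use_barcode=False)
```

With the toy data, the barcode-aware mode keeps two distinct fragments (one per cell), while ignoring the barcode merges all three into one. That is the behaviour the added `Yes` setting is meant to select.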
@@ -313,7 +323,7 @@ For count matrix creation, we will use **Build count matrix** from **EpiScanpy**
 > >
 > > > <solution-title></solution-title>
 > > >
-> > > 1. There were initially 1064 regions in the `narrow Peaks` file. Now there are 891 regions after deduplication. More than 15% (173) of regions have the same peak boundaries.
+> > > 1. There were initially 1229 regions in the `narrow Peaks` file. Now there are 1046 regions after deduplication. Around 15% (184) of regions have the same peak boundaries.
 > > >
 > > {: .solution}
 > >
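Review note: a quick arithmetic check of the figures in this hunk, assuming the number of removed regions is simply the difference between the counts before and after deduplication. Under that assumption the updated counts give 183 removed regions, one fewer than the 184 stated in the new text; the helper name is hypothetical.

```python
def dedup_summary(before, after):
    """Return (regions removed, percentage of the original regions removed)."""
    removed = before - after
    return removed, round(100 * removed / before, 1)


old_removed, old_pct = dedup_summary(1064, 891)    # figures on the removed line
new_removed, new_pct = dedup_summary(1229, 1046)   # figures on the added line

print(old_removed, old_pct)  # 173 16.3
print(new_removed, new_pct)  # 183 14.9
```

The "Around 15%" wording holds for the new counts (about 14.9%), but the count in parentheses may be worth re-checking against the answer history.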
@@ -360,16 +370,16 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
 > > <question-title></question-title>
 > >
 > > ```
-> > AnnData object with n_obs × n_vars = 18426 × 891
+> > AnnData object with n_obs × n_vars = 27388 × 1046
 > > ```
 > >
 > > 1. How many observations are there? What do they represent?
 > > 2. How many variables are there? What do they represent?
 > >
 > > > <solution-title></solution-title>
 > > >
-> > > 1. There are 18,426 observations, representing the cells.
-> > > 2. There are 891 variables, representing the peaks.
+> > > 1. There are 27,388 observations, representing the cells.
+> > > 2. There are 1046 variables, representing the peaks.
 > > >
 > > {: .solution}
 > >
@@ -385,7 +395,7 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
 > >
 > > ```
 > > [n_obs x n_vars]
-> > - 18426 x 891
+> > - 27388 x 1046
 > > ```
 > > * For more specific queries, {% tool [Inspect AnnData](toolshed.g2.bx.psu.edu/repos/iuc/anndata_inspect/anndata_inspect/0.7.5+galaxy1) %} is required.
 > {: .comment}
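Review note: the dimensions quoted in these hunks can be read off the `AnnData` summary line mechanically. The sketch below is a minimal, standard-library-only illustration; the `parse_dims` helper is hypothetical and not part of any Galaxy tool or of the `anndata` package.

```python
import re

# Matches the "n_obs × n_vars = <rows> × <cols>" part of the summary line.
SUMMARY_RE = re.compile(r"n_obs × n_vars = (\d+) × (\d+)")


def parse_dims(summary_line):
    """Return (n_obs, n_vars) parsed from an AnnData summary line."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        raise ValueError("no n_obs × n_vars summary found")
    return int(match.group(1)), int(match.group(2))


n_obs, n_vars = parse_dims("AnnData object with n_obs × n_vars = 27388 × 1046")
# n_obs counts the observations (cells), n_vars the variables (peaks).
```

Here `n_obs` is 27388 and `n_vars` is 1046, matching the updated solution text: observations are cells and variables are peaks.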
@@ -411,7 +421,7 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
 > >
 > > > <solution-title></solution-title>
 > > >
-> > > The file is a table with 18,426 lines (observations or cells) and 891 columns (variables or peaks): the count matrix for each of the 891 peaks and 18,426 cells. The 1st row contains the peak location as an annotation of the columns and the 1st column the barcodes of the cells as an annotation of the rows.
+> > > The file is a table with 27,388 lines (observations or cells) and 1046 columns (variables or peaks): the count matrix for each of the 1046 peaks and 27,388 cells. The 1st row contains the peak location as an annotation of the columns and the 1st column the barcodes of the cells as an annotation of the rows.
 > > >
 > > {: .solution}
 > >
@@ -516,7 +526,7 @@ First remove any potential empty features or barcodes. A non-empty cell should h
 > >
 > > > <solution-title></solution-title>
 > > >
-> > > The resulting matrix has dimensions of 1815 x 67766, i.e., more than 99.5% of the cells and less than 4% of features were filtered out. This indicates the high sparsity of the count matrix.
+> > > The resulting matrix has dimensions of 1815 x 67766, i.e., more than 99.5% of the cells and less than 4% of features were filtered out. This indicates the high sparsity of the count matrix.
 > > >
 > > {: .solution}
 > >
@@ -620,7 +630,7 @@ To determine decent filtering thresholds, we will further look at some histogram
 > > > 1. The plots show a histogram of the number of cells sharing a feature. As we initially pooled the data from all the cells to detect the peaks, it is expected to see only a small number of cells have more than 10000 peaks in common.
 > > > 2. The red vertical line of our 5 cells threshold is nearly at the left end of the histogram representing the majority of the features have at least 5 cells in common.
 > > > From the log scale plot it is also clear that there is a sharp increase in the feature commonness from at least 10 cells (x-axis 1.0).
-> > > So our threshold of 5 is a decent cutoff for filtering out the features. From the plots, only a very few non-informative features are left to be filtered out.
+> > > So our threshold of 5 is a decent cutoff for filtering out the features. From the plots, only a very few non-informative features are left to be filtered out.
 > > >
 > > {: .solution}
 > >
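Review note: the context lines in this hunk relate cell counts to positions on a log-scale x-axis ("10 cells (x-axis 1.0)") and describe a 5-cell cutoff. The sketch below makes that relation explicit, assuming the axis is base-10 logarithmic; the feature names and counts are made up, and `filter_features` is a hypothetical helper, not the EpiScanpy function used in the tutorial.

```python
import math


def log10_position(n_cells):
    """Position of a cell count on a base-10 logarithmic axis."""
    return math.log10(n_cells)


def filter_features(n_cells_per_feature, min_cells=5):
    """Keep only features that are open in at least `min_cells` cells."""
    return {
        feature: n_cells
        for feature, n_cells in n_cells_per_feature.items()
        if n_cells >= min_cells
    }


# Hypothetical per-feature cell counts.
toy_counts = {"peak_a": 2, "peak_b": 5, "peak_c": 40}
kept = filter_features(toy_counts, min_cells=5)

ten_cells_pos = log10_position(10)   # 1.0, as quoted in the solution text
five_cells_pos = log10_position(5)   # roughly 0.7, left of the sharp increase
```

On such an axis the 5-cell threshold sits at about 0.7, to the left of the increase described at 1.0, which is consistent with the statement that only a few features fall below the cutoff.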

topics/single-cell/tutorials/scatac-preprocessing-tenx/workflows/scATAC-seq-Count-Matrix-Filtering-test.yml

Lines changed: 0 additions & 7 deletions
This file was deleted.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+- doc: Test outline for scATAC-seq-Count-Matrix-Filtering
+  job:
+    scATAC-seq Anndata:
+      class: File
+      location: https://zenodo.org/api/files/d554c6c9-a28d-47bc-96be-5e34bd58266d/atac_pbmc_1k_uniq_peaks.h5ad
+      filetype: h5ad
+  outputs:
+    Filtered Anndata:
+      asserts:
+        has_h5_keys:
+          keys: "obs/nb_features"
+          keys: "obs/log_nb_features"
+          keys: "var/n_cells"
+          keys: "var/commonness"
+    Anndata Info:
+      asserts:
+        has_line:
+          line: "AnnData object with n_obs × n_vars = 1024 × 67719"
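Review note: the `has_line` assertion above expects the `Anndata Info` output to contain one specific line. A rough standalone equivalent is sketched below, assuming the assertion means "some line of the output equals the expected string"; the helper and the sample output are hypothetical, and Galaxy's own implementation may differ in details such as whitespace handling.

```python
def output_has_line(output_text, expected_line):
    """Return True if any line of the output equals the expected line."""
    return any(line == expected_line for line in output_text.splitlines())


EXPECTED = "AnnData object with n_obs × n_vars = 1024 × 67719"

# Hypothetical tool output; only the first line is taken from the test file.
sample_output = EXPECTED + "\n    obs: 'nb_features', 'log_nb_features'\n"

matches = output_has_line(sample_output, EXPECTED)
```

A check like this fails as soon as the filtered dimensions change, so the expected line needs updating whenever the tool versions or thresholds in the workflow change.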
