@@ -242,6 +251,7 @@ An ATAC-seq fragment file is a BED file with Tn5 integration sites, the cell bar
242
251
> - *"Regular expression used to extract cell barcode from read name"*: `[^:]*` (matches all characters up to the first colon)
243
252
> - *"Number of bases to shift Tn5 insertion position by on the forward strand"*: `4`
244
253
> - *"Number of bases to shift Tn5 insertion position by on the reverse strand"*: `-5`
254
+
> - *"Take cell barcode into account when collapsing duplicate fragments"*: `Yes`
245
255
>
246
256
> 1. {% tool [bedtools SortBED](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_sortbed/2.30.0+galaxy2) %} with the following parameters:
247
257
> - *"Sort the following BED/bedGraph/GFF/VCF/EncodePeak file *"*: `fragments BED` (output of **Sinto fragments** {% icon tool %})
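To make the Sinto parameters in this hunk concrete, here is a minimal Python sketch of what the barcode regular expression and the two strand shifts do to a single read. The read name and positions are hypothetical, and the helper functions `extract_barcode` and `shift_position` are illustrative names only; this assumes the shift is a plain addition and is not a description of how Sinto implements it internally.

```python
import re

# Values taken from the Sinto fragments parameters quoted in this hunk.
BARCODE_REGEX = re.compile(r"[^:]*")  # all characters up to the first colon
SHIFT_FORWARD = 4
SHIFT_REVERSE = -5


def extract_barcode(read_name: str) -> str:
    """Return everything before the first colon in the read name."""
    return BARCODE_REGEX.match(read_name).group(0)


def shift_position(position: int, is_reverse: bool) -> int:
    """Apply the strand-specific offset to a Tn5 insertion position.

    Assumption: the offset is simply added to the coordinate.
    """
    return position + (SHIFT_REVERSE if is_reverse else SHIFT_FORWARD)


# Hypothetical read name and coordinates, used only for illustration.
read_name = "AAACGAAAGCGCAATG:NB501:0001"
barcode = extract_barcode(read_name)
forward_site = shift_position(1000, is_reverse=False)
reverse_site = shift_position(1200, is_reverse=True)
print(barcode, forward_site, reverse_site)
```

With these made-up inputs, the barcode is the text before the first colon, the forward-strand site moves 4 bases downstream, and the reverse-strand site moves 5 bases upstream.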
@@ -313,7 +323,7 @@ For count matrix creation, we will use **Build count matrix** from **EpiScanpy**
313
323
> >
314
324
> > > <solution-title></solution-title>
315
325
> > >
316
-
> > > 1. There were initially 1064 regions in the `narrow Peaks` file. Now there are 891 regions after deduplication. More than 15% (173) of regions have the same peak boundaries.
326
+
> > > 1. There were initially 1229 regions in the `narrow Peaks` file. Now there are 1046 regions after deduplication. Around 15% (183) of the regions were duplicates with the same peak boundaries.
317
327
> > >
318
328
> > {: .solution}
319
329
> >
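The deduplication arithmetic in the solution above can be checked with a short sketch. The peak coordinates below are invented toy data, not taken from the tutorial's `narrow Peaks` file, and the approach (keeping the first occurrence of each identical region) is an assumption about what deduplication means here rather than the tool's actual implementation.

```python
# Toy peak list of (chromosome, start, end); coordinates are invented.
peaks = [
    ("chr1", 100, 200),
    ("chr1", 100, 200),  # same boundaries as the previous region
    ("chr1", 300, 450),
    ("chr2", 50, 120),
    ("chr2", 50, 120),   # same boundaries as the previous region
    ("chr2", 500, 640),
]

# dict.fromkeys keeps the first occurrence of each region and preserves order.
unique_peaks = list(dict.fromkeys(peaks))

n_removed = len(peaks) - len(unique_peaks)
percent_removed = 100 * n_removed / len(peaks)

print(f"{len(peaks)} regions before, {len(unique_peaks)} after deduplication")
print(f"{n_removed} duplicates removed ({percent_removed:.1f}%)")
```

The same subtraction and ratio, applied to the region counts before and after deduplication, give the number and percentage of duplicated regions reported in the solution.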
@@ -360,16 +370,16 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
> > 1. How many observations are there? What do they represent?
367
377
> > 2. How many variables are there? What do they represent?
368
378
> >
369
379
> > > <solution-title></solution-title>
370
380
> > >
371
-
> > > 1. There are 18,426 observations, representing the cells.
372
-
> > > 2. There are 891 variables, representing the peaks.
381
+
> > > 1. There are 27,388 observations, representing the cells.
382
+
> > > 2. There are 1046 variables, representing the peaks.
373
383
> > >
374
384
> > {: .solution}
375
385
> >
@@ -385,7 +395,7 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
385
395
> >
386
396
> > ```
387
397
> > [n_obs x n_vars]
388
-
> > - 18426 x 891
398
+
> > - 27388 x 1046
389
399
> > ```
390
400
> > * For more specific queries, {% tool [Inspect AnnData](toolshed.g2.bx.psu.edu/repos/iuc/anndata_inspect/anndata_inspect/0.7.5+galaxy1) %} is required.
391
401
> {: .comment}
@@ -411,7 +421,7 @@ Because the `AnnData` format is an extension of the HDF5 format, i.e. a binary f
411
421
> >
412
422
> > > <solution-title></solution-title>
413
423
> > >
414
-
> > > The file is a table with 18,426 lines (observations or cells) and 891 columns (variables or peaks): the count matrix for each of the 891 peaks and 18,426 cells. The 1st row contains the peak location as an annotation of the columns and the 1st column the barcodes of the cells as an annotation of the rows.
424
+
> > > The file is a table with 27,388 lines (observations or cells) and 1046 columns (variables or peaks): the count matrix for each of the 1046 peaks and 27,388 cells. The 1st row contains the peak location as an annotation of the columns and the 1st column the barcodes of the cells as an annotation of the rows.
415
425
> > >
416
426
> > {: .solution}
417
427
> >
@@ -516,7 +526,7 @@ First remove any potential empty features or barcodes. A non-empty cell should h
516
526
> >
517
527
> > > <solution-title></solution-title>
518
528
> > >
519
-
> > > The resulting matrix has dimensions of 1815 x 67766, i.e., more than 99.5% of the cells and less than 4% of features were filtered out. This indicates the high sparsity of the count matrix.
529
+
> > > The resulting matrix has dimensions of 1815 x 67766, i.e., more than 99.5% of the cells and less than 4% of features were filtered out. This indicates the high sparsity of the count matrix.
520
530
> > >
521
531
> > {: .solution}
522
532
> >
@@ -620,7 +630,7 @@ To determine decent filtering thresholds, we will further look at some histogram
620
630
> > > 1. The plots show a histogram of the number of cells sharing a feature. As we initially pooled the data from all the cells to detect the peaks, it is expected to see only a small number of cells have more than 10000 peaks in common.
621
631
> > > 2. The red vertical line marking our threshold of 5 cells is nearly at the left end of the histogram, indicating that the majority of the features are shared by at least 5 cells.
622
632
> > > From the log scale plot it is also clear that there is a sharp increase in the feature commonness from at least 10 cells (x-axis 1.0).
623
-
> > > So our threshold of 5 is a decent cutoff for filtering out the features. From the plots, only a very few non-informative features are left to be filtered out.
633
+
> > > So our threshold of 5 is a decent cutoff for filtering out the features. From the plots, only a very few non-informative features are left to be filtered out.
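As an illustration of the threshold discussed in this solution, the following sketch keeps only features that are open in at least 5 cells. The matrix and the feature names are invented toy data, and the logic is a plain-Python approximation of the idea, not the EpiScanpy implementation.

```python
MIN_CELLS = 5  # threshold discussed in the solution text

# Toy binary count matrix: rows are cells, columns are features (invented data).
counts = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [0, 0, 1],
]
feature_names = ["peak_a", "peak_b", "peak_c"]  # hypothetical names

# For each feature, count the cells in which it has a non-zero value.
cells_per_feature = [
    sum(1 for row in counts if row[j] > 0) for j in range(len(feature_names))
]

# Keep features shared by at least MIN_CELLS cells.
kept = [
    name for name, n in zip(feature_names, cells_per_feature) if n >= MIN_CELLS
]

print(cells_per_feature, kept)
```

In this toy case `peak_b` is present in a single cell and is dropped, while the two features found in 5 cells each are kept.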
topics/single-cell/tutorials/scatac-preprocessing-tenx/workflows/scATAC-seq-Count-Matrix-Filtering-test.yml