Questions about ArchR synthetic doublet generation #594

danielcgingerich · 2021-03-05T17:54:44Z

I love the ideas behind the ArchR and Scrublet doublet predictions - very interesting and creative. I would like to know more about how the synthetic doublets are generated.

Section 2.1 of the ArchR manual says: "To predict which “cells” are actually doublets, we synthesize in silico doublets from the data by mixing the reads from thousands of combinations of individual cells." The associated figure shows groups of 2 cells being combined and then divided by two.

First question: are all synthetic doublets divided by two, as shown in the figure? Wouldn't a doublet contain the sum of fragments from two cells rather than dividing them by two? Also, wouldn't this method not be able to properly identify doublets of the same cell type? For example, a hybrid of two astrocytes would still look like an astrocyte, leading to false positives. For this reason, I believe a higher fragment count should also be accounted for in doublet generation.
In the Nature Genetics paper: "To validate this approach, we carried out scATAC-seq on a mixture of ten human cell lines (n = 38,072 cells), allowing for genotype-based identification of doublets via demuxlet as a ground-truth comparison for computational identification of doublets by ArchR". This identifies doublets based only on genotype. What about cells that failed to dissociate properly during nuclei isolation (i.e. same donor, same genotype)? Depending on the tissue composition, cells that didn't dissociate could often be the same cell type, which leads me back to my first question.

I feel like a solution to these problems could be to generate two synthetic doublets from each cell-cell pair: the first using your original method, and the second accounting for the higher read count of doublets (i.e. the sum of the fragments).
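To make the two variants concrete, here is a minimal sketch in Python. It is not ArchR's code: the per-feature count vectors, the function names (`make_doublet`, `both_doublets`, `depth_normalize`) and the simple total-count normalization are all assumptions made for illustration only.

```python
def make_doublet(cell_a, cell_b, mode="mean"):
    """Combine two per-feature count vectors into one synthetic doublet.

    mode="mean" divides the combined counts by two (as the figure shows);
    mode="sum" keeps the higher total count proposed in this post.
    """
    if len(cell_a) != len(cell_b):
        raise ValueError("cells must have the same number of features")
    summed = [a + b for a, b in zip(cell_a, cell_b)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / 2 for s in summed]
    raise ValueError("mode must be 'mean' or 'sum'")


def both_doublets(cell_a, cell_b):
    """Return the (mean, sum) pair of doublets for one cell-cell pair."""
    return make_doublet(cell_a, cell_b, "mean"), make_doublet(cell_a, cell_b, "sum")


def depth_normalize(profile):
    """Scale a profile so its entries add up to 1 (hypothetical normalization)."""
    total = sum(profile)
    return [x / total for x in profile] if total else list(profile)


# Toy example: two cells measured over three features.
cell_a = [2, 0, 4]
cell_b = [0, 2, 2]
mean_doublet, sum_doublet = both_doublets(cell_a, cell_b)
print(mean_doublet, sum_doublet)
print(depth_normalize(mean_doublet), depth_normalize(sum_doublet))
```

The two variants differ only by a factor of two in total counts, so any step that normalizes for depth maps them to the same profile. Under that assumption the summed doublet adds fragment-count information and nothing else.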

rcorces (Maintainer) · 2021-03-05T19:31:42Z

Please note that this is a question and not a bug and has therefore been moved to the Discussions forum.
I'm not actually 100% sure how @jgranja24 coded this, and the diagram may or may not be 100% correct, but the concept is the same.

> Wouldn't a doublet contain the sum of fragments from two cells rather than dividing them by two

Sometimes but not always - I think you would be surprised how unbalanced doublets can be.

> wouldn't this method not be able to properly identify doublets of the same cell type?

Yes - we probably could be clearer about this in the paper and manual. Homotypic doublets are not captured, but I haven't seen any approach that doesn't use genotype and can accurately capture homotypic doublets. We show the performance of using just the number of fragments in the manuscript, and it is not very effective.

> I feel like a solution to these problems could be to generate two synthetic doublets from each cell-cell pair: the first using your original method, and the second accounting for the higher read count of doublets (i.e. sum of the fragments)

You're welcome to benchmark this yourself on our cell line data but I think you'll find your hypothesis to not play out as expected. A cursory check shows this though I did this very quickly:

5 replies

danielcgingerich Mar 5, 2021
Author

@rcorces thanks for the explanation! I am using ArchR now - a wonderful tool!

marvinquiet Oct 12, 2021

Thank you for the discussion above, which is very helpful and inspiring. I have a quick follow-up question about how doublets are generated in detail.

Please do not hesitate to correct me if I am wrong. My understanding is, say there are 1,000 cells, then C(1000, 2) new "cells" are averaged to be doublets. However, if this is the case, let's imagine 100 of them belong to one cell type C1, then some newly generated "doublets" will kind of lie within the C1 cell type cluster. Then by using KNN, all cells in C1 will be identified as "doublets" in this way.

I could not think this through. Could you please give me some hints or explanations? Many thanks!
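As a worked count for the 1,000-cell example, the sketch below tallies how many of the C(1000, 2) possible pairs would fall within C1. It assumes pairs are drawn uniformly at random, which this thread does not confirm for ArchR, and `pair_fractions` is a hypothetical helper name.

```python
from math import comb


def pair_fractions(n_cells, n_type):
    """Count all unordered cell pairs and the share that involve one cell type.

    Returns (total pairs, fraction with both cells of the type,
    fraction with exactly one cell of the type).
    """
    total = comb(n_cells, 2)
    within = comb(n_type, 2)              # both cells from the type
    mixed = n_type * (n_cells - n_type)   # exactly one cell from the type
    return total, within / total, mixed / total


total, frac_within, frac_mixed = pair_fractions(1000, 100)
print(total)                  # 499500 possible pairs
print(round(frac_within, 4))  # 0.0099, i.e. about 1% are C1-C1 pairs
print(round(frac_mixed, 4))   # 0.1802, i.e. about 18% pair C1 with another type
```

Under uniform pairing, C1 holds 10% of the cells but only about 1% of the possible pairs are C1-C1, so same-type synthetic doublets are a small minority in this scenario.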

rcorces Oct 12, 2021
Maintainer

I don't think your interpretation is correct. Ultimately, ArchR is looking for enrichment. If there is sufficient heterogeneity in your dataset (which ArchR checks for), then you wouldn't see enrichment in the situation you describe, and these cells would not be incorrectly called doublets.

marvinquiet Oct 12, 2021

Thank you @rcorces so much for your quick reply!

Yeah, I noticed in the supplementary note, there was a sentence saying "By iterating this procedure N times (user-defined, default 3 times the total number of cells), we can compute binomial enrichment statistics (assuming every cell could be a doublet with equal probability) for each single cell based on the presence of nearby simulated projected doublets (in the LSI or UMAP subspace defined by the user)".
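One way to read the quoted sentence is sketched below in Python. This is an illustration under stated assumptions, not ArchR's implementation: cells are toy 2D points, simulated doublets are formed by averaging coordinates directly in that space rather than by projecting mixed reads, every pair is used once instead of random sampling, and the names `knn_hit_counts`, `binom_upper_tail` and `doublet_enrichment` are hypothetical. The null model is that each simulated doublet includes any given cell among its k nearest cells with probability k / n_cells.

```python
import math
from itertools import combinations


def knn_hit_counts(cells, doublets, k):
    """For each real cell, count the simulated doublets that have it among their k nearest cells."""
    counts = [0] * len(cells)
    for d in doublets:
        nearest = sorted(range(len(cells)), key=lambda i: math.dist(cells[i], d))[:k]
        for i in nearest:
            counts[i] += 1
    return counts


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for j in range(x, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def doublet_enrichment(cells, doublets, k=1):
    """Return (enrichment, p-value) per cell under the equal-probability null."""
    n_cells = len(cells)
    if not 0 < k < n_cells:
        raise ValueError("k must satisfy 0 < k < number of cells")
    counts = knn_hit_counts(cells, doublets, k)
    p_null = k / n_cells                  # chance a doublet lands next to a given cell
    expected = len(doublets) * p_null     # expected hits per cell under the null
    return [(c / expected, binom_upper_tail(c, len(doublets), p_null)) for c in counts]


# Toy layout: two tight clusters plus one cell sitting between them.
cluster_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
cluster_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
between = (5.5, 5.5)
cells = cluster_a + cluster_b + [between]

# Simplification: average coordinates of every pair instead of sampling and projecting.
doublets = [tuple((u + v) / 2 for u, v in zip(a, b)) for a, b in combinations(cells, 2)]

scores = doublet_enrichment(cells, doublets, k=1)
for cell, (enrichment, p_value) in zip(cells, scores):
    print(cell, round(enrichment, 2), p_value)
```

In this toy layout the cell between the clusters collects all of the cross-cluster doublets and gets the highest enrichment, while same-cluster doublets land back on their own cluster and are spread across its members.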

Can I understand it this way: simulated doublets that reside within clusters are likely to have the same neighbors across iterations, while true doublets might have different neighbors?

rcorces Oct 12, 2021
Maintainer

No, I think your interpretation is backwards. Simulated doublets are less likely to be projected into clusters.
