treatment vs control pairwise analysis ArchR #696

cswoboda · 2021-04-16T16:30:51Z

cswoboda
Apr 16, 2021

Hello!

I'd like to ask if there are any field best practices associated with pseudobulk profile generation and peak calling when looking at differences in chromatin accessibility in a celltype across conditions.

For example, the ArchR manual states that "ArchR makes multiple such pseudo-bulk samples for each desired cell grouping, hence the term pseudo-bulk replicates. The underlying assumption in this process is that the single cells that are being grouped together are sufficiently similar that we do not care to understand the differences between them."

Now, pseudobulk profile generation is done on a per-sample basis. So, as long as you ensure your profiles are generated on a per-sample basis by specifying replicates and minimum cells, you should be okay to generate them based on cluster. Peak calling becomes the next issue.

So peak calling is done on the pseudobulk replicates, using MACS2. The iterative overlap method is where this becomes confusing for me.

It's unclear to me what one should consider when adding the reproducible peak set. The manual only explains the iterative overlap peak method using cell types. If I group by sample, how does this differ?

I would love some insight from anyone doing this analysis into what has worked for them. I'm sitting at an epistemological struggle with my data: I do see some differences, but my inexperience with this type of analysis leads me to skepticism because I've performed it so many different ways and gotten different results. I've been working on this data for about two months now and am at a loss, so any dialogue with others would be incredible for me.

Also, to the ArchR creators, thanks so much for this package. It's been the easiest-to-use ATAC-seq workflow for me by far, and I've not found a better resource for understanding both how ATAC-seq is performed and what downstream analysis is possible with ATAC-seq data. You've done amazing work!

Thanks,

Casey

Answered by rcorces

rcorces · 2021-04-16T16:59:06Z

rcorces
Apr 16, 2021
Maintainer

Thanks for your kind words about the package and documentation. Always nice to hear that it is working for people since most of our feedback comes in the form of bug reports.

> It's unclear to me what one should consider when adding the reproducible peak set. The manual only explains the iterative overlap peak method using cell types. If I group by sample, how does this differ?

Can you clarify this a bit more? In case the documentation isn't clear, there is a difference between making pseudobulk replicates on a per-sample basis and "grouping by sample". If you group based on cluster (or cell type), ArchR tries to prevent you from making multiple pseudobulk replicates that fail to capture biological variability. For example, if you took all cells in a cluster and divided them into three equal-sized but randomly-selected groups, you would have multiple biological donors per pseudobulk replicate and this would obscure biological variability. Instead, ArchR attempts to create pseudobulk replicates that contain cells from only a specific sample. This is what we refer to as sample-aware. But the grouping is still being performed on a cluster. Does that clarify things?

4 replies

cswoboda Apr 16, 2021
Author

Hey @rcorces,
What I mean by that is that at the peak calling stage, I would try to call peaks using the group by "sample" parameter rather than by the pseudobulk clusters or "Clusters2". I saw on an issues thread that in the case of pairwise analysis between samples it may be recommended to call peaks based on the lowest grouping of cells (i.e. control and experimental). Maybe this is further obscuring things, so I'm going to detail my workflow for clarity:

I have 3 Control samples and 4 Experimental samples, all integrated following the preprocessing methods outlined in the manual. CellType A is reduced in number in the experimental groups by about 80% (a condition of the experiment), but these cells still integrate with CellType A from Control. I have at least 200 cells of CellType A in every single sample, so I set up my group coverages command as follows:
projHeme4 <- addGroupCoverages(ArchRProj = projHeme2, groupBy = "Clusters2", minCells = 200, maxCells = 1000, minReplicates = 3, maxReplicates = 7)
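
The "at least 200 cells of CellType A in every sample" premise can be checked before running addGroupCoverages() by tabulating cells per cluster and sample. The sketch below is only an illustration on a toy data frame standing in for the project's cellColData. The column names `Clusters2` and `Sample` follow this thread, while the sample names, the cell counts, and the `min_cells` variable are invented for the example.

```r
# Toy stand-in for the project's cell metadata; all values are invented.
cell_meta <- data.frame(
  Sample    = rep(c("Ctrl1", "Ctrl2", "Exp1"), times = c(500, 450, 260)),
  Clusters2 = c(rep(c("CellTypeA", "Other"), times = c(300, 200)),
                rep(c("CellTypeA", "Other"), times = c(250, 200)),
                rep(c("CellTypeA", "Other"), times = c(60, 200))),
  stringsAsFactors = FALSE
)

min_cells <- 200  # hypothetical: same value as minCells in the call above

# Rows are clusters, columns are samples.
counts <- table(cell_meta$Clusters2, cell_meta$Sample)
print(counts)

# Samples that cannot supply min_cells cells of CellTypeA on their own.
short_samples <- colnames(counts)[counts["CellTypeA", ] < min_cells]
print(short_samples)
```

In this toy data only one sample falls below the threshold. With a real project, the same table would be built from the actual cell metadata.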

The pseudobulk replicates are generated based on the idea that you don't care about the differences between the cells you are grouping together. In my case, this is only true when those replicates are made in what you're referring to as a sample-aware fashion, but the grouping is the cluster, as you clarified.

This should ensure that there are seven pseudobulk replicates of celltype A, three from control and four from experimental conditions. When I perform peak calling using MACS2, I would then group by "Clusters2" with the following line of code:

projHeme4 <- addReproduciblePeakSet(
  ArchRProj = projHeme4,
  groupBy = "Clusters2",
  pathToMacs2 = pathToMacs2
)

With your expertise, do you think this would be the most robust method of generating a peak matrix capable of identifying differences between CellTypeA-Exp and CelltypeA-Control doing a pairwise test like in the manual?

rcorces Apr 16, 2021
Maintainer

> I would try to call peaks using the group by "sample" parameter rather than by the pseudobulk clusters or "Clusters2"

This is not possible: whatever you pass to groupBy when you perform addGroupCoverages() must be passed to groupBy when you perform addReproduciblePeakSet().

I'm still not really following your logic, but I think you are concerned that if you perform addGroupCoverages() based on cluster, then you might lose the ability to call peaks based on condition (exp. vs control). Assuming you have enough cells, the best solution is to create a new column in cellColData that represents the product of cluster and condition (i.e. "Cluster1-exp" and "Cluster1-control").

For example

projHeme4 <- addCellColData(
  ArchRProj = projHeme4,
  data = paste0(projHeme4@cellColData$Clusters2, "_x_", projHeme4@cellColData$Sample),
  name = "ClusterBySample",
  cells = getCellNames(projHeme4),
  force = TRUE
)

> head(projHeme4@cellColData$ClusterBySample)
[1] "GMP_x_scATAC_BMMC_R1"   "CD4.N_x_scATAC_BMMC_R1" "PreB_x_scATAC_BMMC_R1"  "PreB_x_scATAC_BMMC_R1"  "PreB_x_scATAC_BMMC_R1" 
[6] "GMP_x_scATAC_BMMC_R1"
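
The example above pastes cluster and sample together. For a cluster-by-condition label such as "Cluster1-exp", each sample would first need to be mapped to its condition. A minimal base R sketch on a toy data frame follows. The sample names, the `sample_to_condition` lookup, and the `ClusterByCondition` column name are hypothetical and not taken from this thread.

```r
# Toy cell metadata; sample and cluster names are invented for illustration.
cell_meta <- data.frame(
  Sample    = c("Ctrl1", "Ctrl2", "Exp1", "Exp1", "Ctrl1"),
  Clusters2 = c("CellTypeA", "CellTypeA", "CellTypeA", "Other", "Other"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from sample name to experimental condition.
sample_to_condition <- c(Ctrl1 = "control", Ctrl2 = "control", Exp1 = "exp")

condition <- unname(sample_to_condition[cell_meta$Sample])
stopifnot(!anyNA(condition))  # every sample must appear in the lookup

# One label per cell, in the same "_x_" style as the example above.
cell_meta$ClusterByCondition <- paste0(cell_meta$Clusters2, "_x_", condition)
print(table(cell_meta$ClusterByCondition))
```

In an ArchR project, a vector built this way could be passed as `data` to addCellColData() as in the call above, and the new column name would then be used as groupBy in both addGroupCoverages() and addReproduciblePeakSet().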

cswoboda Apr 19, 2021
Author

Hey @rcorces,
That's exactly what I was looking for. Apologies for my lack of communicative clarity; I appreciate your time in helping me out. I'm going to mark this as answered for now and will perform the analysis, and if there are any concerns I'll hop back onto this thread. Thanks again!

Casey

YaoyJiang Apr 19, 2024

I would like to ask: what if I have 2 or 3 replicates for one condition? For example, "Cluster1-1-exp" and "Cluster1-2-exp", "Cluster1-1-ctrl" and "Cluster1-2-ctrl".

cswoboda · 2021-04-28T20:34:08Z

cswoboda
Apr 28, 2021
Author

Hi @rcorces, thanks again for all your help. I actually think I figured out the problem with my analysis and I'd like to get your feedback if you wouldn't mind. In my data, it's suspected that CellType A-treatment has much more open chromatin than CellType A-control. When we look at the nFrags summary, on average CellType A-treatment has double the number of fragments per cell compared to CellType A-control, and this is the only cell type in the dataset in which this occurs. When I do a pairwise analysis between these groups using getMarkerFeatures() and include nFrags as a bias, I'm assuming that this significantly confounds the results when comparing between control and treatment. Do you think just using TSSEnrichment as a bias for this analysis would be sufficient, or would removing the log10(nFrags) metric make for a flawed comparison?

3 replies

rcorces Apr 28, 2021
Maintainer

I would not remove the nFrags bias argument. I tend to think that global changes in chromatin accessibility are often artefactual and don't reflect true biology. The question you are asking is "is this particular peak more accessible in control vs treatment?". Imagine you sequence those to very different depths or they have highly different quality metrics; without accounting for that, those technical differences could be mistaken for differences in accessibility.
That being said, you need to assess your system and the biology that you are interested in and make decisions about how you want to do the analysis.

cswoboda Apr 28, 2021
Author

Hi @rcorces, thanks for such a quick reply! We expected the treatment group to have much higher chromatin accessibility due to the model we're using. What we see is that on a global (sample-wide) scale, QC metrics vary rather little across samples, but at cell type resolution, nFrags is on average double in only this one population compared to control, and this trend is consistent across our 8 sample replicates. I don't want to take too much of your time, as you're right that we need to make the decisions on our end about how best to do the analysis if this is indeed a biologically relevant change.

In terms of ArchR, I'm having trouble understanding exactly where bias fits into this usage of getMarkerFeatures(), so it's hard for me to think through some of the analytical caveats. Looking at the source code and from what I've read in the manual, the bias allows you to identify cells of similar quality to your group of interest, in unique fragments and TSSEnrichment for example, so you can ensure you're comparing cells with similar profiles. So when you don't specify the bgdGroups argument, ArchR will pull these cells from all the other clusters in order to identify marker peaks. If I do specify bgdGroups for a pairwise test, I'm assuming ArchR will perform a similar function by looking within the background group to find cells that match on TSSEnrichment and nFrags. If the cells of the background group have on average 50% of the nFrags count of your group of interest, would it even be able to identify a suitable bias-matched control without using the lowest quality cells from one dataset and the highest from another?

rcorces Apr 29, 2021
Maintainer

> If the cells of the background group have on average 50% of the nFrags count of your group of interest, would it even be able to identify a suitable bias-matched control without using the lowest quality cells from one dataset and the highest from another?

Then it will choose cells from the background group that have more than the average number of fragments.
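
To make this exchange concrete, here is a toy illustration of bias matching on log10(nFrags) alone. It is not ArchR's implementation: it is a simple greedy nearest-neighbour match written for this example, run on simulated depths in which the foreground has roughly twice the fragments of the background. The function `match_background` and all values are hypothetical.

```r
set.seed(1)

# Simulated log10(nFrags); foreground cells have about 2x the fragments.
fg <- rnorm(200,  mean = log10(20000), sd = 0.15)  # group of interest
bg <- rnorm(1000, mean = log10(10000), sd = 0.15)  # background group

# For each foreground cell, pick the closest background cell not yet used.
# Assumes the background has at least as many cells as the foreground.
match_background <- function(fg, bg) {
  stopifnot(length(bg) >= length(fg))
  available <- rep(TRUE, length(bg))
  picked <- integer(length(fg))
  for (i in seq_along(fg)) {
    d <- abs(bg - fg[i])
    d[!available] <- Inf
    j <- which.min(d)
    picked[i] <- j
    available[j] <- FALSE
  }
  picked
}

picked <- match_background(fg, bg)

# Compare mean depth of: all background, matched background, foreground.
print(c(all_bg     = mean(bg),
        matched_bg = mean(bg[picked]),
        fg         = mean(fg)))
```

In this simulation the matched background cells sit above the background average, consistent with the reply above. When the two depth distributions barely overlap, the matched cells may still fall short of the foreground depth, which is the caveat raised in the question.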

DaianeH · 2023-05-30T19:41:25Z

DaianeH
May 30, 2023

Hi,

I've been reading this thread while researching what to plot from my ArchR object to show global differences in open chromatin between samples of two conditions (WT and TP53-mutated). I plotted nFrags with the following code:

library(ggplot2)
library(ggpubr)

# Per-cell metadata from the ArchR project as a plain data frame
meta <- as.data.frame(sc_atac@cellColData)

# Boxplots of nFrags per cluster, split by condition, with t-test significance labels
ggplot(data = meta, aes(x = Cluster, y = nFrags, color = Status)) +
  geom_boxplot() +
  stat_compare_means(method = "t.test", size = 4, label = "p.signif") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black")
  ) +
  xlab("Cluster")

Which generated the following plot:

What can I conclude from this plot? That:
a) There are significant differences in chromatin accessibility across most of my clusters between the WT and TP53-mutated conditions, with more open chromatin in the WT samples; or
b) The WT samples are simply datasets of better quality, as indicated by the higher nFrags?

Thank you so much in advance,

1 reply

mary77 Aug 24, 2023

Did you find the answer to your questions? I am seeing the same trend in my data
