Separate cellNames based on biological variable during the creation of metacell aggregates #1199

RegnerM2015 · 2021-11-30T20:46:37Z

I am working with a scATAC-seq dataset that includes both healthy and diseased tissue samples.

When performing peak-to-gene linkage and co-accessibility analyses, the single-cell observations are aggregated into metacells via kNN to create more informative observations for computing correlations.
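To make the aggregation step concrete, here is a minimal base R sketch of the idea on toy data. It is not ArchR's implementation: the embedding, the count matrix, and the names `emb`, `counts`, `knnIdx` and `metacells` are all made up for illustration, and the overlap filtering that ArchR applies is left out.

```r
set.seed(1)

# Toy stand-ins (assumptions): 200 cells in a 5-dimensional embedding and a
# 50-peak x 200-cell count matrix. Real data would come from the ArchRProject.
nCells <- 200
emb <- matrix(rnorm(nCells * 5), nrow = nCells,
              dimnames = list(paste0("cell", seq_len(nCells)), NULL))
counts <- matrix(rpois(50 * nCells, lambda = 0.2), nrow = 50,
                 dimnames = list(paste0("peak", 1:50), rownames(emb)))

k <- 20        # cells per aggregate
nSeeds <- 10   # number of aggregates to build

# Pick seed cells, then take each seed's k nearest cells (Euclidean distance).
# The seed itself is always its own nearest neighbor.
seeds <- sample(seq_len(nCells), nSeeds)
d <- as.matrix(dist(emb))
knnIdx <- t(sapply(seeds, function(s) order(d[s, ])[seq_len(k)]))

# Sum the counts of the cells in each aggregate: one column per metacell.
metacells <- sapply(seq_len(nSeeds), function(i) {
  rowSums(counts[, knnIdx[i, ], drop = FALSE])
})
```

Each column of `metacells` pools the counts of `k` cells, so correlations across columns are computed on much denser observations than the single cells.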

In this case, I think it is important for cells that make up a metacell aggregate to be all from the same biological condition/group. In other words, healthy cells should form healthy aggregates and disease cells should form disease aggregates. Creating metacells that include both healthy and disease cells could obscure the downstream results.

To ensure that the aggregate metacells contain only cells from one biological condition or group, I made some slight changes to the source code of addCoAccessibility and addPeak2GeneLinks.

I modified addCoAccessibility to create addCoAccessibility.mod by adding two parameters, group1 and group2, which are character vectors of cellNames (barcodes). In this case, group1 is a vector of disease cellNames and group2 is a vector of healthy cellNames. After the cellsToUse parameter is checked, I first subset the reducedDims to only the cells in group1. Using these group1 cells, I proceed as usual: subsample cells to get idx, run the kNN search, filter out aggregates that overlap too much with earlier ones, and convert the result to a list of cellNames, knnObj.group1. I then repeat these steps after subsetting the reducedDims to only the cells in group2, which gives knnObj.group2. Finally, I concatenate the two lists into one list that is used to construct the metacell aggregate matrices.
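The steps above can be sketched in base R on toy data. This is only an illustration of the per-group logic, not the modified ArchR function: `knnWithinGroup` is a hypothetical helper, the embedding is random, and the overlap filtering step is omitted. Looping over a named list also shows how the same idea could extend beyond two groups.

```r
set.seed(1)

# Toy embedding (assumption): 120 cells, first 70 "disease", last 50 "healthy".
emb <- matrix(rnorm(120 * 5), nrow = 120,
              dimnames = list(paste0("cell", 1:120), NULL))
groups <- list(
  disease = rownames(emb)[1:70],
  healthy = rownames(emb)[71:120]
)

# Hypothetical helper: build kNN aggregates using only the cells in `cells`.
knnWithinGroup <- function(emb, cells, k, nSeeds) {
  sub <- emb[cells, , drop = FALSE]
  # Sample with replacement only when the group has fewer cells than seeds.
  seeds <- sample(seq_len(nrow(sub)), nSeeds, replace = nrow(sub) < nSeeds)
  d <- as.matrix(dist(sub))
  # For each seed, return the names of its k nearest cells within the group.
  lapply(seeds, function(s) rownames(sub)[order(d[s, ])[seq_len(k)]])
}

# One list of aggregates per group, then concatenated into a single list.
perGroup <- lapply(groups, knnWithinGroup, emb = emb, k = 10, nSeeds = 15)
knnList <- unlist(perGroup, recursive = FALSE, use.names = FALSE)
```

Because the neighbor search only ever sees one group's cells, every entry of `knnList` contains cells from a single condition by construction.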

##########################################################################################
# Co-accessibility Methods
##########################################################################################

#' Add Peak Co-Accessibility to an ArchRProject
#' 
#' This function will add co-accessibility scores to peaks in a given ArchRProject
#'
#' @param ArchRProj An `ArchRProject` object.
#' @param reducedDims The name of the `reducedDims` object (i.e. "IterativeLSI") to retrieve from the designated `ArchRProject`.
#' @param dimsToUse A vector containing the dimensions from the `reducedDims` object to use in clustering.
#' @param scaleDims A boolean value that indicates whether to z-score the reduced dimensions for each cell. This is useful for minimizing
#' the contribution of strong biases (dominating early PCs) and lowly abundant populations. However, this may lead to stronger sample-specific
#' biases since it is over-weighting latent PCs. If set to `NULL` this will scale the dimensions based on the value of `scaleDims` when the
#' `reducedDims` were originally created during dimensionality reduction. This idea was introduced by Timothy Stuart.
#' @param corCutOff A numeric cutoff for the correlation of each dimension to the sequencing depth. If the dimension has a correlation to
#' sequencing depth that is greater than the `corCutOff`, it will be excluded from analysis.
#' @param cellsToUse A character vector of cellNames to compute coAccessibility on if desired to run on a subset of the total cells.
#' @param k The number of k-nearest neighbors to use for creating single-cell groups for correlation analyses.
#' @param knnIteration The number of k-nearest neighbor groupings to test for passing the supplied `overlapCutoff`.
#' @param overlapCutoff The maximum allowable overlap between the current group and all previous groups to permit the current group be
#' added to the group list during k-nearest neighbor calculations.
#' @param maxDist The maximum allowable distance in basepairs between two peaks to consider for co-accessibility.
#' @param scaleTo The total insertion counts from the designated group of single cells is summed across all relevant peak regions from
#' the `peakSet` of the `ArchRProject` and normalized to the total depth provided by `scaleTo`.
#' @param log2Norm A boolean value indicating whether to log2 transform the single-cell groups prior to computing co-accessibility correlations.
#' @param seed A number to be used as the seed for random number generation required in knn determination. It is recommended to keep track
#' of the seed used so that you can reproduce results downstream.
#' @param threads The number of threads to be used for parallel computing.
#' @param verbose A boolean value that determines whether standard output should be printed.
#' @param logFile The path to a file to be used for logging ArchR output.
#' @param group1 A character vector of cellNames (barcodes) for the first biological group (e.g. disease). kNN aggregates are built within this group only, which prevents cells from different biological conditions from being combined into the same aggregate. Supply together with `group2`.
#' @param group2 A character vector of cellNames (barcodes) for the second biological group (e.g. healthy). kNN aggregates are built within this group only. Supply together with `group1`.
#' @export
addCoAccessibility.mod <- function(
  ArchRProj = NULL,
  reducedDims = "IterativeLSI",
  dimsToUse = 1:30,
  scaleDims = NULL,
  corCutOff = 0.75,
  cellsToUse = NULL,
  k = 100, 
  knnIteration = 500, 
  overlapCutoff = 0.8, 
  maxDist = 100000,
  scaleTo = 10^4,
  log2Norm = TRUE,
  seed = 1, 
  threads = getArchRThreads(),
  verbose = TRUE,
  logFile = createLogFile("addCoAccessibility"),
  group1 = NULL,
  group2 = NULL
){
  
  .validInput(input = ArchRProj, name = "ArchRProj", valid = c("ArchRProj"))
  .validInput(input = reducedDims, name = "reducedDims", valid = c("character"))
  .validInput(input = dimsToUse, name = "dimsToUse", valid = c("numeric", "null"))
  .validInput(input = scaleDims, name = "scaleDims", valid = c("boolean", "null"))
  .validInput(input = corCutOff, name = "corCutOff", valid = c("numeric", "null"))
  .validInput(input = cellsToUse, name = "cellsToUse", valid = c("character", "null"))
  .validInput(input = k, name = "k", valid = c("integer"))
  .validInput(input = knnIteration, name = "knnIteration", valid = c("integer"))
  .validInput(input = overlapCutoff, name = "overlapCutoff", valid = c("numeric"))
  .validInput(input = maxDist, name = "maxDist", valid = c("integer"))
  .validInput(input = scaleTo, name = "scaleTo", valid = c("numeric"))
  .validInput(input = log2Norm, name = "log2Norm", valid = c("boolean"))
  .validInput(input = threads, name = "threads", valid = c("integer"))
  .validInput(input = verbose, name = "verbose", valid = c("boolean"))
  .validInput(input = logFile, name = "logFile", valid = c("character"))
  .validInput(input = group1, name = "group1", valid = c("character", "null"))
  .validInput(input = group2, name = "group2", valid = c("character", "null"))
  
  tstart <- Sys.time()
  .startLogging(logFile = logFile)
  .logThis(mget(names(formals()),sys.frame(sys.nframe())), "addCoAccessibility Input-Parameters", logFile = logFile)
  
  set.seed(seed)
  
  #Get Peak Set
  peakSet <- getPeakSet(ArchRProj)
  
  #Get Reduced Dims 
  rD <- getReducedDims(ArchRProj, reducedDims = reducedDims, corCutOff = corCutOff, dimsToUse = dimsToUse)
  if(!is.null(cellsToUse)){
    rD <- rD[cellsToUse, ,drop=FALSE]
  }
  
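  #Both groups are needed: subsetting rD by a NULL group leaves no cells to sample from
  if(xor(is.null(group1), is.null(group2))){
    stop("group1 and group2 must either both be supplied or both be NULL")
  }
  if(!is.null(group1) && !is.null(group2)){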
    rD <- rD[group1, ,drop=FALSE]
    
    #Subsample (group 1)
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix (group 1)
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap (group 1)
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff (group 1)
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List (group 1)
    knnObj.group1 <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
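    #Get Reduced Dims (group 2), applying the same cellsToUse subset as for group 1
    rD <- getReducedDims(ArchRProj, reducedDims = reducedDims, corCutOff = corCutOff, dimsToUse = dimsToUse)
    if(!is.null(cellsToUse)){
      rD <- rD[cellsToUse, ,drop=FALSE]
    }
    rD <- rD[group2, ,drop=FALSE]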
    
    #Subsample (group 2)
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix (group 2)
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap (group 2)
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff (group 2)
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List (group 2)
    knnObj.group2 <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
    # Concatenate SimpleList knnObjects
    knnObj <- append(knnObj.group1,knnObj.group2)

    #Check Chromosomes
    chri <- gtools::mixedsort(.availableChr(getArrowFiles(ArchRProj), subGroup = "PeakMatrix"))
    chrj <- gtools::mixedsort(unique(paste0(seqnames(getPeakSet(ArchRProj)))))
    stopifnot(identical(chri,chrj))
    
    #Create Ranges
    peakSummits <- resize(peakSet, 1, "center")
    peakWindows <- resize(peakSummits, maxDist, "center")
    
    #Create Pairwise Things to Test
    o <- DataFrame(findOverlaps(peakSummits, peakWindows, ignore.strand = TRUE))
    o <- o[o[,1] != o[,2],]
    o$seqnames <- seqnames(peakSet)[o[,1]]
    o$idx1 <- peakSet$idx[o[,1]]
    o$idx2 <- peakSet$idx[o[,2]]
    o$correlation <- -999.999
    o$Variability1 <- 0.000
    o$Variability2 <- 0.000
    
    #Peak Matrix ColSums
    cS <- .getColSums(getArrowFiles(ArchRProj), chri, verbose = FALSE, useMatrix = "PeakMatrix")
    gS <- unlist(lapply(seq_along(knnObj), function(x) sum(cS[knnObj[[x]]], na.rm=TRUE)))
    
    for(x in seq_along(chri)){
      
      .logDiffTime(sprintf("Computing Co-Accessibility %s (%s of %s)", chri[x], x, length(chri)), t1=tstart, verbose=verbose, logFile=logFile)
      
      #Features
      featureDF <- mcols(peakSet)[BiocGenerics::which(seqnames(peakSet) == chri[x]),]
      featureDF$seqnames <- chri[x]
      
      #Group Matrix
      groupMat <- .getGroupMatrix(
        ArrowFiles = getArrowFiles(ArchRProj), 
        featureDF = featureDF, 
        groupList = knnObj, 
        useMatrix = "PeakMatrix",
        threads = threads,
        verbose = FALSE
      )
      
      #Scale
      groupMat <- t(t(groupMat) / gS) * scaleTo
      
      if(log2Norm){
        groupMat <- log2(groupMat + 1)
      }
      
      #Correlations
      idx <- BiocGenerics::which(o$seqnames==chri[x])
      corVals <- rowCorCpp(idxX = o[idx,]$idx1, idxY = o[idx,]$idx2, X = as.matrix(groupMat), Y = as.matrix(groupMat))
      .logThis(head(corVals), paste0("SubsetCorVals-", x), logFile = logFile)
      
      rowVars <- as.numeric(matrixStats::rowVars(groupMat))
      
      o[idx,]$correlation <- as.numeric(corVals)
      o[idx,]$Variability1 <- rowVars[o[idx,]$idx1]
      o[idx,]$Variability2 <- rowVars[o[idx,]$idx2]
      
      .logThis(groupMat, paste0("SubsetGroupMat-", x), logFile = logFile)
      .logThis(o[idx,], paste0("SubsetCoA-", x), logFile = logFile)
      
    }
    
    o$idx1 <- NULL
    o$idx2 <- NULL
    o <- o[!is.na(o$correlation),]
    
    o$TStat <- (o$correlation / sqrt((pmax(1-o$correlation^2, 0.00000000000000001, na.rm = TRUE))/(length(knnObj)-2))) #T-statistic P-value
    o$Pval <- 2*pt(-abs(o$TStat), length(knnObj) - 2)
    o$FDR <- p.adjust(o$Pval, method = "fdr")
    o$VarQuantile1 <- .getQuantiles(o$Variability1)
    o$VarQuantile2 <- .getQuantiles(o$Variability2)
    
    mcols(peakSet) <- NULL
    o@metadata$peakSet <- peakSet
    
    metadata(ArchRProj@peakSet)$CoAccessibility <- o
    
    .endLogging(logFile = logFile)
    
    ArchRProj
    
  }else{
    #Subsample
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List
    knnObj <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
    #Check Chromosomes
    chri <- gtools::mixedsort(.availableChr(getArrowFiles(ArchRProj), subGroup = "PeakMatrix"))
    chrj <- gtools::mixedsort(unique(paste0(seqnames(getPeakSet(ArchRProj)))))
    stopifnot(identical(chri,chrj))
    
    #Create Ranges
    peakSummits <- resize(peakSet, 1, "center")
    peakWindows <- resize(peakSummits, maxDist, "center")
    
    #Create Pairwise Things to Test
    o <- DataFrame(findOverlaps(peakSummits, peakWindows, ignore.strand = TRUE))
    o <- o[o[,1] != o[,2],]
    o$seqnames <- seqnames(peakSet)[o[,1]]
    o$idx1 <- peakSet$idx[o[,1]]
    o$idx2 <- peakSet$idx[o[,2]]
    o$correlation <- -999.999
    o$Variability1 <- 0.000
    o$Variability2 <- 0.000
    
    #Peak Matrix ColSums
    cS <- .getColSums(getArrowFiles(ArchRProj), chri, verbose = FALSE, useMatrix = "PeakMatrix")
    gS <- unlist(lapply(seq_along(knnObj), function(x) sum(cS[knnObj[[x]]], na.rm=TRUE)))
    
    for(x in seq_along(chri)){
      
      .logDiffTime(sprintf("Computing Co-Accessibility %s (%s of %s)", chri[x], x, length(chri)), t1=tstart, verbose=verbose, logFile=logFile)
      
      #Features
      featureDF <- mcols(peakSet)[BiocGenerics::which(seqnames(peakSet) == chri[x]),]
      featureDF$seqnames <- chri[x]
      
      #Group Matrix
      groupMat <- .getGroupMatrix(
        ArrowFiles = getArrowFiles(ArchRProj), 
        featureDF = featureDF, 
        groupList = knnObj, 
        useMatrix = "PeakMatrix",
        threads = threads,
        verbose = FALSE
      )
      
      #Scale
      groupMat <- t(t(groupMat) / gS) * scaleTo
      
      if(log2Norm){
        groupMat <- log2(groupMat + 1)
      }
      
      #Correlations
      idx <- BiocGenerics::which(o$seqnames==chri[x])
      corVals <- rowCorCpp(idxX = o[idx,]$idx1, idxY = o[idx,]$idx2, X = as.matrix(groupMat), Y = as.matrix(groupMat))
      .logThis(head(corVals), paste0("SubsetCorVals-", x), logFile = logFile)
      
      rowVars <- as.numeric(matrixStats::rowVars(groupMat))
      
      o[idx,]$correlation <- as.numeric(corVals)
      o[idx,]$Variability1 <- rowVars[o[idx,]$idx1]
      o[idx,]$Variability2 <- rowVars[o[idx,]$idx2]
      
      .logThis(groupMat, paste0("SubsetGroupMat-", x), logFile = logFile)
      .logThis(o[idx,], paste0("SubsetCoA-", x), logFile = logFile)
      
    }
    
    o$idx1 <- NULL
    o$idx2 <- NULL
    o <- o[!is.na(o$correlation),]
    
    o$TStat <- (o$correlation / sqrt((pmax(1-o$correlation^2, 0.00000000000000001, na.rm = TRUE))/(length(knnObj)-2))) #T-statistic P-value
    o$Pval <- 2*pt(-abs(o$TStat), length(knnObj) - 2)
    o$FDR <- p.adjust(o$Pval, method = "fdr")
    o$VarQuantile1 <- .getQuantiles(o$Variability1)
    o$VarQuantile2 <- .getQuantiles(o$Variability2)
    
    mcols(peakSet) <- NULL
    o@metadata$peakSet <- peakSet
    
    metadata(ArchRProj@peakSet)$CoAccessibility <- o
    
    .endLogging(logFile = logFile)
    
    ArchRProj
    
  }
}
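A quick way to confirm that the grouping behaves as intended is to count how many conditions appear in each aggregate. The sketch below uses toy data; `condition` and `knnList` are made-up stand-ins. With the peak-to-gene function shown later in this post, the real list should be available as the `KNNList` entry in the metadata of the saved `seRNA` and `seATAC` objects.

```r
# Toy stand-ins (assumptions): a named vector giving each cell's condition and
# a list of aggregates, each a character vector of cell names.
condition <- c(rep("disease", 6), rep("healthy", 6))
names(condition) <- paste0("cell", 1:12)
knnList <- list(
  c("cell1", "cell2", "cell3"),
  c("cell4", "cell5", "cell6"),
  c("cell7", "cell8", "cell9"),
  c("cell10", "cell11", "cell3")   # deliberately mixed for illustration
)

# Number of distinct conditions represented in each aggregate.
nConditions <- vapply(knnList, function(cells) {
  length(unique(condition[cells]))
}, integer(1))

# An aggregate is pure when all of its cells share one condition.
isPure <- nConditions == 1L
```

If the group-restricted kNN worked, `all(isPure)` should be `TRUE` on the real aggregates; any `FALSE` entry points to an aggregate that mixes conditions.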

I made the same changes to addPeak2GeneLinks, creating addPeak2GeneLinks.mod:

##########################################################################################
# Peak2Gene Links Methods
##########################################################################################

#' Add Peak2GeneLinks to an ArchRProject
#' 
#' This function will add peak-to-gene links to a given ArchRProject
#' 
#' @param ArchRProj An `ArchRProject` object.
#' @param reducedDims The name of the `reducedDims` object (i.e. "IterativeLSI") to retrieve from the designated `ArchRProject`.
#' @param dimsToUse A vector containing the dimensions from the `reducedDims` object to use in clustering.
#' @param scaleDims A boolean value that indicates whether to z-score the reduced dimensions for each cell. This is useful for minimizing
#' the contribution of strong biases (dominating early PCs) and lowly abundant populations. However, this may lead to stronger sample-specific
#' biases since it is over-weighting latent PCs. If set to `NULL` this will scale the dimensions based on the value of `scaleDims` when the
#' `reducedDims` were originally created during dimensionality reduction. This idea was introduced by Timothy Stuart.
#' @param corCutOff A numeric cutoff for the correlation of each dimension to the sequencing depth. If the dimension has a
#' correlation to sequencing depth that is greater than the `corCutOff`, it will be excluded from analysis.
#' @param cellsToUse A character vector of cellNames to compute peak-to-gene links on if desired to run on a subset of the total cells.
#' @param k The number of k-nearest neighbors to use for creating single-cell groups for correlation analyses.
#' @param knnIteration The number of k-nearest neighbor groupings to test for passing the supplied `overlapCutoff`.
#' @param overlapCutoff The maximum allowable overlap between the current group and all previous groups to permit the current
#' group be added to the group list during k-nearest neighbor calculations.
#' @param maxDist The maximum allowable distance in basepairs between two peaks to consider for co-accessibility.
#' @param scaleTo The total insertion counts from the designated group of single cells is summed across all relevant peak regions
#' from the `peakSet` of the `ArchRProject` and normalized to the total depth provided by `scaleTo`.
#' @param log2Norm A boolean value indicating whether to log2 transform the single-cell groups prior to computing co-accessibility correlations.
#' @param predictionCutoff A numeric describing the cutoff for RNA integration to use when picking cells for groupings.
#' @param addEmpiricalPval Add empirical p-values based on randomly correlating peaks and genes not on the same seqname.
#' @param seed A number to be used as the seed for random number generation required in knn determination. It is recommended
#' to keep track of the seed used so that you can reproduce results downstream.
#' @param threads The number of threads to be used for parallel computing.
#' @param verbose A boolean value that determines whether standard output should be printed.
#' @param logFile The path to a file to be used for logging ArchR output.
#' @param group1 A character vector of cellNames (barcodes) for the first biological group (e.g. disease). kNN aggregates are built within this group only, which prevents cells from different biological conditions from being combined into the same aggregate. Supply together with `group2`.
#' @param group2 A character vector of cellNames (barcodes) for the second biological group (e.g. healthy). kNN aggregates are built within this group only. Supply together with `group1`.
#' @export
addPeak2GeneLinks.mod <- function(
  ArchRProj = NULL,
  reducedDims = "IterativeLSI",
  useMatrix = "GeneIntegrationMatrix",
  dimsToUse = 1:30,
  scaleDims = NULL,
  corCutOff = 0.75,
  cellsToUse = NULL,
  k = 100, 
  knnIteration = 500, 
  overlapCutoff = 0.8, 
  maxDist = 250000,
  scaleTo = 10^4,
  log2Norm = TRUE,
  predictionCutoff = 0.4,
  addEmpiricalPval = FALSE,
  seed = 1, 
  threads = max(floor(getArchRThreads() / 2), 1),
  verbose = TRUE,
  logFile = createLogFile("addPeak2GeneLinks"),
  group1 = NULL,
  group2 = NULL
){
  
  .validInput(input = ArchRProj, name = "ArchRProj", valid = c("ArchRProj"))
  .validInput(input = reducedDims, name = "reducedDims", valid = c("character"))
  .validInput(input = dimsToUse, name = "dimsToUse", valid = c("numeric", "null"))
  .validInput(input = scaleDims, name = "scaleDims", valid = c("boolean", "null"))
  .validInput(input = corCutOff, name = "corCutOff", valid = c("numeric", "null"))
  .validInput(input = cellsToUse, name = "cellsToUse", valid = c("character", "null"))
  .validInput(input = k, name = "k", valid = c("integer"))
  .validInput(input = knnIteration, name = "knnIteration", valid = c("integer"))
  .validInput(input = overlapCutoff, name = "overlapCutoff", valid = c("numeric"))
  .validInput(input = maxDist, name = "maxDist", valid = c("integer"))
  .validInput(input = scaleTo, name = "scaleTo", valid = c("numeric"))
  .validInput(input = log2Norm, name = "log2Norm", valid = c("boolean"))
  .validInput(input = threads, name = "threads", valid = c("integer"))
  .validInput(input = verbose, name = "verbose", valid = c("boolean"))
  .validInput(input = logFile, name = "logFile", valid = c("character"))
  .validInput(input = group1, name = "group1", valid = c("character", "null"))
  .validInput(input = group2, name = "group2", valid = c("character", "null"))
  
  tstart <- Sys.time()
  .startLogging(logFile = logFile)
  .logThis(mget(names(formals()),sys.frame(sys.nframe())), "addPeak2GeneLinks Input-Parameters", logFile = logFile)
  
  .logDiffTime(main="Getting Available Matrices", t1=tstart, verbose=verbose, logFile=logFile)
  AvailableMatrices <- getAvailableMatrices(ArchRProj)
  
  if("PeakMatrix" %ni% AvailableMatrices){
    stop("PeakMatrix not in AvailableMatrices")
  }
  
  if(useMatrix %ni% AvailableMatrices){
    stop(paste0(useMatrix, " not in AvailableMatrices"))
  }
  
  ArrowFiles <- getArrowFiles(ArchRProj)
  
  tstart <- Sys.time()
  
  dfAll <- .safelapply(seq_along(ArrowFiles), function(x){
    cNx <- paste0(names(ArrowFiles)[x], "#", h5read(ArrowFiles[x], paste0(useMatrix, "/Info/CellNames")))
    pSx <- tryCatch({
      h5read(ArrowFiles[x], paste0(useMatrix, "/Info/predictionScore"))
    }, error = function(e){
      if(getArchRVerbose()) message("No predictionScore found. Continuing without predictionScore!")
      rep(9999999, length(cNx))
    })
    DataFrame(
      cellNames = cNx,
      predictionScore = pSx
    )
  }, threads = threads) %>% Reduce("rbind", .)
  
  .logDiffTime(
    sprintf("Filtered Low Prediction Score Cells (%s of %s, %s)", 
            sum(dfAll[,2] < predictionCutoff), 
            nrow(dfAll), 
            round(sum(dfAll[,2] < predictionCutoff) / nrow(dfAll), 3)
    ), t1=tstart, verbose=verbose, logFile=logFile)
  
  keep <- sum(dfAll[,2] >= predictionCutoff) / nrow(dfAll)
  dfAll <- dfAll[which(dfAll[,2] >= predictionCutoff),]
  
  set.seed(seed)
  
  #Get Peak Set
  peakSet <- getPeakSet(ArchRProj)
  .logThis(peakSet, "peakSet", logFile = logFile)
  
  #Gene Info
  geneSet <- .getFeatureDF(ArrowFiles, useMatrix, threads = threads)
  geneStart <- GRanges(geneSet$seqnames, IRanges(geneSet$start, width = 1), name = geneSet$name, idx = geneSet$idx)
  .logThis(geneStart, "geneStart", logFile = logFile)
  
  #Get Reduced Dims
  rD <- getReducedDims(ArchRProj, reducedDims = reducedDims, corCutOff = corCutOff, dimsToUse = dimsToUse)
  if(!is.null(cellsToUse)){
    rD <- rD[cellsToUse, ,drop=FALSE]
  }
  
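  #Both groups are needed: subsetting rD by a NULL group leaves no cells to sample from
  if(xor(is.null(group1), is.null(group2))){
    stop("group1 and group2 must either both be supplied or both be NULL")
  }
  if(!is.null(group1) && !is.null(group2)){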
    
    rD <- rD[group1, ,drop=FALSE]
    
    #Subsample (group 1)
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix (group 1)
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap (group 1)
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff (group 1)
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List (group 1)
    knnObj.group1 <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
    #Get Reduced Dims (group 2)
    rD <- getReducedDims(ArchRProj, reducedDims = reducedDims, corCutOff = corCutOff, dimsToUse = dimsToUse)
    if(!is.null(cellsToUse)){
      rD <- rD[cellsToUse, ,drop=FALSE]
    }
    rD <- rD[group2, ,drop=FALSE]
    
    #Subsample (group 2)
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix (group 2)
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap (group 2)
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff (group 2)
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List (group 2)
    knnObj.group2 <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
    # Concatenate SimpleList knnObjects
    knnObj <- append(knnObj.group1,knnObj.group2)
    
    #Check Chromosomes
    chri <- gtools::mixedsort(unique(paste0(seqnames(peakSet))))
    chrj <- gtools::mixedsort(unique(paste0(seqnames(geneStart))))
    chrij <- intersect(chri, chrj)
    
    #Features
    geneDF <- mcols(geneStart)
    peakDF <- mcols(peakSet)
    geneDF$seqnames <- seqnames(geneStart)
    peakDF$seqnames <- seqnames(peakSet)
    
    #Group Matrix RNA
    .logDiffTime(main="Getting Group RNA Matrix", t1=tstart, verbose=verbose, logFile=logFile)
    groupMatRNA <- .getGroupMatrix(
      ArrowFiles = getArrowFiles(ArchRProj), 
      featureDF = geneDF, 
      groupList = knnObj, 
      useMatrix = useMatrix,
      threads = threads,
      verbose = FALSE
    )
    rawMatRNA <- groupMatRNA
    .logThis(groupMatRNA, "groupMatRNA", logFile = logFile)
    
    #Group Matrix ATAC
    .logDiffTime(main="Getting Group ATAC Matrix", t1=tstart, verbose=verbose, logFile=logFile)
    groupMatATAC <- .getGroupMatrix(
      ArrowFiles = getArrowFiles(ArchRProj), 
      featureDF = peakDF, 
      groupList = knnObj, 
      useMatrix = "PeakMatrix",
      threads = threads,
      verbose = FALSE
    )
    rawMatATAC <- groupMatATAC
    .logThis(groupMatATAC, "groupMatATAC", logFile = logFile)
    
    .logDiffTime(main="Normalizing Group Matrices", t1=tstart, verbose=verbose, logFile=logFile)
    
    groupMatRNA <- t(t(groupMatRNA) / colSums(groupMatRNA)) * scaleTo
    groupMatATAC <- t(t(groupMatATAC) / colSums(groupMatATAC)) * scaleTo
    
    if(log2Norm){
      groupMatRNA  <- log2(groupMatRNA + 1)
      groupMatATAC <- log2(groupMatATAC + 1)    
    }
    
    names(geneStart) <- NULL
    
    seRNA <- SummarizedExperiment(
      assays = SimpleList(RNA = groupMatRNA, RawRNA = rawMatRNA), 
      rowRanges = geneStart
    )
    metadata(seRNA)$KNNList <- knnObj
    .logThis(seRNA, "seRNA", logFile = logFile)
    
    names(peakSet) <- NULL
    
    seATAC <- SummarizedExperiment(
      assays = SimpleList(ATAC = groupMatATAC, RawATAC = rawMatATAC), 
      rowRanges = peakSet
    )
    metadata(seATAC)$KNNList <- knnObj
    .logThis(seATAC, "seATAC", logFile = logFile)
    
    rm(groupMatRNA, groupMatATAC)
    gc()
    
    #Overlaps
    .logDiffTime(main="Finding Peak Gene Pairings", t1=tstart, verbose=verbose, logFile=logFile)
    o <- DataFrame(
      findOverlaps(
        .suppressAll(resize(seRNA, 2 * maxDist + 1, "center")), 
        resize(rowRanges(seATAC), 1, "center"), 
        ignore.strand = TRUE
      )
    )
    
    #Get Distance from Fixed point A B 
    o$distance <- distance(rowRanges(seRNA)[o[,1]] , rowRanges(seATAC)[o[,2]] )
    colnames(o) <- c("B", "A", "distance")
    
    #Null Correlations
    if(addEmpiricalPval){
      .logDiffTime(main="Computing Background Correlations", t1=tstart, verbose=verbose, logFile=logFile)
      nullCor <- .getNullCorrelations(seATAC, seRNA, o, 1000)
    }
    
    .logDiffTime(main="Computing Correlations", t1=tstart, verbose=verbose, logFile=logFile)
    o$Correlation <- rowCorCpp(as.integer(o$A), as.integer(o$B), assay(seATAC), assay(seRNA))
    o$VarAssayA <- .getQuantiles(matrixStats::rowVars(assay(seATAC)))[o$A]
    o$VarAssayB <- .getQuantiles(matrixStats::rowVars(assay(seRNA)))[o$B]
    o$TStat <- (o$Correlation / sqrt((pmax(1-o$Correlation^2, 0.00000000000000001, na.rm = TRUE))/(ncol(seATAC)-2))) #T-statistic P-value
    o$Pval <- 2*pt(-abs(o$TStat), ncol(seATAC) - 2)
    o$FDR <- p.adjust(o$Pval, method = "fdr")
    out <- o[, c("A", "B", "Correlation", "FDR", "VarAssayA", "VarAssayB")]
    colnames(out) <- c("idxATAC", "idxRNA", "Correlation", "FDR", "VarQATAC", "VarQRNA")  
    mcols(peakSet) <- NULL
    names(peakSet) <- NULL
    metadata(out)$peakSet <- peakSet
    metadata(out)$geneSet <- geneStart
    
    if(addEmpiricalPval){
      out$EmpPval <- 2*pnorm(-abs(((out$Correlation - mean(nullCor[[2]])) / sd(nullCor[[2]]))))
      out$EmpFDR <- p.adjust(out$EmpPval, method = "fdr")
    }
    
    #Save Group Matrices
    dir.create(file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks"), showWarnings = FALSE)
    outATAC <- file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks", "seATAC-Group-KNN.rds")
    .safeSaveRDS(seATAC, outATAC, compress = FALSE)
    outRNA <- file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks", "seRNA-Group-KNN.rds")
    .safeSaveRDS(seRNA, outRNA, compress = FALSE)
    metadata(out)$seATAC <- outATAC
    metadata(out)$seRNA <- outRNA
    
    metadata(ArchRProj@peakSet)$Peak2GeneLinks <- out
    
    .logDiffTime(main="Completed Peak2Gene Correlations!", t1=tstart, verbose=verbose, logFile=logFile)
    .endLogging(logFile = logFile)
    
    ArchRProj
  }else{
    #Subsample
    idx <- sample(seq_len(nrow(rD)), knnIteration, replace = !nrow(rD) >= knnIteration)
    
    #KNN Matrix
    .logDiffTime(main="Computing KNN", t1=tstart, verbose=verbose, logFile=logFile)
    knnObj <- .computeKNN(data = rD, query = rD[idx,], k = k)
    
    #Determine Overlap
    .logDiffTime(main="Identifying Non-Overlapping KNN pairs", t1=tstart, verbose=verbose, logFile=logFile)
    keepKnn <- determineOverlapCpp(knnObj, floor(overlapCutoff * k))
    
    #Keep Above Cutoff
    knnObj <- knnObj[keepKnn==0,]
    .logDiffTime(paste0("Identified ", nrow(knnObj), " Groupings!"), t1=tstart, verbose=verbose, logFile=logFile)
    
    #Convert To Names List
    knnObj <- lapply(seq_len(nrow(knnObj)), function(x){
      rownames(rD)[knnObj[x, ]]
    }) %>% SimpleList
    
    #Check Chromosomes
    chri <- gtools::mixedsort(unique(paste0(seqnames(peakSet))))
    chrj <- gtools::mixedsort(unique(paste0(seqnames(geneStart))))
    chrij <- intersect(chri, chrj)
    
    #Features
    geneDF <- mcols(geneStart)
    peakDF <- mcols(peakSet)
    geneDF$seqnames <- seqnames(geneStart)
    peakDF$seqnames <- seqnames(peakSet)
    
    #Group Matrix RNA
    .logDiffTime(main="Getting Group RNA Matrix", t1=tstart, verbose=verbose, logFile=logFile)
    groupMatRNA <- .getGroupMatrix(
      ArrowFiles = getArrowFiles(ArchRProj), 
      featureDF = geneDF, 
      groupList = knnObj, 
      useMatrix = useMatrix,
      threads = threads,
      verbose = FALSE
    )
    rawMatRNA <- groupMatRNA
    .logThis(groupMatRNA, "groupMatRNA", logFile = logFile)
    
    #Group Matrix ATAC
    .logDiffTime(main="Getting Group ATAC Matrix", t1=tstart, verbose=verbose, logFile=logFile)
    groupMatATAC <- .getGroupMatrix(
      ArrowFiles = getArrowFiles(ArchRProj), 
      featureDF = peakDF, 
      groupList = knnObj, 
      useMatrix = "PeakMatrix",
      threads = threads,
      verbose = FALSE
    )
    rawMatATAC <- groupMatATAC
    .logThis(groupMatATAC, "groupMatATAC", logFile = logFile)
    
    .logDiffTime(main="Normalizing Group Matrices", t1=tstart, verbose=verbose, logFile=logFile)
    
    groupMatRNA <- t(t(groupMatRNA) / colSums(groupMatRNA)) * scaleTo
    groupMatATAC <- t(t(groupMatATAC) / colSums(groupMatATAC)) * scaleTo
    
    if(log2Norm){
      groupMatRNA  <- log2(groupMatRNA + 1)
      groupMatATAC <- log2(groupMatATAC + 1)    
    }
    
    names(geneStart) <- NULL
    
    seRNA <- SummarizedExperiment(
      assays = SimpleList(RNA = groupMatRNA, RawRNA = rawMatRNA), 
      rowRanges = geneStart
    )
    metadata(seRNA)$KNNList <- knnObj
    .logThis(seRNA, "seRNA", logFile = logFile)
    
    names(peakSet) <- NULL
    
    seATAC <- SummarizedExperiment(
      assays = SimpleList(ATAC = groupMatATAC, RawATAC = rawMatATAC), 
      rowRanges = peakSet
    )
    metadata(seATAC)$KNNList <- knnObj
    .logThis(seATAC, "seATAC", logFile = logFile)
    
    rm(groupMatRNA, groupMatATAC)
    gc()
    
    #Overlaps
    .logDiffTime(main="Finding Peak Gene Pairings", t1=tstart, verbose=verbose, logFile=logFile)
    o <- DataFrame(
      findOverlaps(
        .suppressAll(resize(seRNA, 2 * maxDist + 1, "center")), 
        resize(rowRanges(seATAC), 1, "center"), 
        ignore.strand = TRUE
      )
    )
    
    #Get Distance from Fixed point A B 
    o$distance <- distance(rowRanges(seRNA)[o[,1]] , rowRanges(seATAC)[o[,2]] )
    colnames(o) <- c("B", "A", "distance")
    
    #Null Correlations
    if(addEmpiricalPval){
      .logDiffTime(main="Computing Background Correlations", t1=tstart, verbose=verbose, logFile=logFile)
      nullCor <- .getNullCorrelations(seATAC, seRNA, o, 1000)
    }
    
    .logDiffTime(main="Computing Correlations", t1=tstart, verbose=verbose, logFile=logFile)
    o$Correlation <- rowCorCpp(as.integer(o$A), as.integer(o$B), assay(seATAC), assay(seRNA))
    o$VarAssayA <- .getQuantiles(matrixStats::rowVars(assay(seATAC)))[o$A]
    o$VarAssayB <- .getQuantiles(matrixStats::rowVars(assay(seRNA)))[o$B]
    o$TStat <- (o$Correlation / sqrt((pmax(1-o$Correlation^2, 0.00000000000000001, na.rm = TRUE))/(ncol(seATAC)-2))) #T-statistic P-value
    o$Pval <- 2*pt(-abs(o$TStat), ncol(seATAC) - 2)
    o$FDR <- p.adjust(o$Pval, method = "fdr")
    out <- o[, c("A", "B", "Correlation", "FDR", "VarAssayA", "VarAssayB")]
    colnames(out) <- c("idxATAC", "idxRNA", "Correlation", "FDR", "VarQATAC", "VarQRNA")  
    mcols(peakSet) <- NULL
    names(peakSet) <- NULL
    metadata(out)$peakSet <- peakSet
    metadata(out)$geneSet <- geneStart
    
    if(addEmpiricalPval){
      out$EmpPval <- 2*pnorm(-abs(((out$Correlation - mean(nullCor[[2]])) / sd(nullCor[[2]]))))
      out$EmpFDR <- p.adjust(out$EmpPval, method = "fdr")
    }
    
    #Save Group Matrices
    dir.create(file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks"), showWarnings = FALSE)
    outATAC <- file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks", "seATAC-Group-KNN.rds")
    .safeSaveRDS(seATAC, outATAC, compress = FALSE)
    outRNA <- file.path(getOutputDirectory(ArchRProj), "Peak2GeneLinks", "seRNA-Group-KNN.rds")
    .safeSaveRDS(seRNA, outRNA, compress = FALSE)
    metadata(out)$seATAC <- outATAC
    metadata(out)$seRNA <- outRNA
    
    metadata(ArchRProj@peakSet)$Peak2GeneLinks <- out
    
    .logDiffTime(main="Completed Peak2Gene Correlations!", t1=tstart, verbose=verbose, logFile=logFile)
    .endLogging(logFile = logFile)
    
    ArchRProj
  }
  

  
}
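
The per-group aggregation idea can be sketched independently of ArchR internals. The following is only a toy illustration in base R, not the modified function itself: `knn_by_group` and its arguments are hypothetical names, brute-force Euclidean distances stand in for `.computeKNN`, and the overlap filtering step (`determineOverlapCpp`) is left out.

```r
# Hypothetical helper (not part of ArchR): build kNN aggregates within each group.
# rD      : numeric matrix of reduced dimensions, rownames = cell names
# groups  : list of character vectors of cell names, one per condition
# k       : number of cells per aggregate
# n_seeds : number of seed cells sampled per group
knn_by_group <- function(rD, groups, k = 5, n_seeds = 10) {
  out <- lapply(groups, function(cells) {
    # Restrict the embedding to the cells of this group only
    sub <- rD[intersect(cells, rownames(rD)), , drop = FALSE]
    stopifnot(nrow(sub) >= k)
    # Sample seed cells, with replacement only if the group is too small
    seeds <- sample(seq_len(nrow(sub)), n_seeds, replace = nrow(sub) < n_seeds)
    # Brute-force pairwise distances; acceptable for a toy example only
    d <- as.matrix(dist(sub))
    # For each seed, keep the names of its k nearest cells (the seed included)
    lapply(seeds, function(i) rownames(sub)[order(d[i, ])[seq_len(k)]])
  })
  # Concatenate the per-group lists into one flat list of aggregates
  unlist(out, recursive = FALSE, use.names = FALSE)
}

# Toy data: 40 cells in 3 dimensions, split into two conditions
set.seed(1)
rD <- matrix(rnorm(40 * 3), nrow = 40, dimnames = list(paste0("cell", 1:40), NULL))
groups <- list(disease = paste0("cell", 1:20), healthy = paste0("cell", 21:40))
aggregates <- knn_by_group(rD, groups, k = 5, n_seeds = 4)
```

Because each kNN search only sees cells from one group, every resulting aggregate is pure by construction; the concatenated list plays the role of the combined knnObj list.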

Are these changes reasonable and accurate? Both modified functions successfully run to completion on my end. I would really appreciate your help and advice on this matter.

rcorces · 2021-12-01T17:01:00Z

rcorces
Dec 1, 2021
Maintainer

I can't comment on the exact code changes but the concept seems fine. Though I feel like you could accomplish this just by doing a project subset first.

3 replies

RegnerM2015 Dec 1, 2021
Author

By separating healthy vs. disease cells during metacell aggregate formation, I can compute the correlations across healthy and disease metacell aggregates together. This analysis would be different from computing correlations using only healthy cells (subset 1) and then computing correlations using only disease cells (subset 2). The former case is what I am aiming for.

I am currently assessing whether this change is needed at all by counting the number of metacell aggregates that contain both healthy and disease cells. If <5% of metacell aggregates are "healthy - disease mosaics" then it may not be a problem at all.

I will keep you updated!

Thanks

RegnerM2015 Dec 1, 2021
Author

Update:

Only ~7% of metacell aggregates contain both healthy and disease cells when running the default version of addCoAccessibility (see histogram below). The x axis is the proportion of cells in an aggregate that come from the minority condition, i.e. the smaller of healthy cells/total cells and disease cells/total cells for that aggregate.

Therefore, this concern when running the default may not be as important as I thought. I suspect this would depend on the dataset and how LSI embeds healthy and disease cells in low dimensional space which could be affected by a number of factors (both biological and technical).
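
The mosaic check described above can be sketched in base R. This is a minimal illustration under assumptions, not the exact code used for the histogram: `minority_fraction` is a hypothetical helper, `knn_list` stands for a list of cell-name vectors such as the one stored in `metadata(seATAC)$KNNList`, and `condition` is assumed to be a named character vector mapping every cell name to its condition.

```r
# Hypothetical helper: for each aggregate, the fraction of cells that belong
# to the minority condition (0 means the aggregate is pure).
# knn_list  : list of character vectors of cell names
# condition : named character vector, cell name -> condition label
minority_fraction <- function(knn_list, condition) {
  lv <- unique(condition)
  vapply(knn_list, function(cells) {
    # Count cells per condition, keeping zero counts for absent conditions
    tab <- table(factor(condition[cells], levels = lv))
    min(tab) / length(cells)
  }, numeric(1))
}

# Toy example: six cells, four aggregates of three cells each
condition <- c(a = "healthy", b = "healthy", c = "healthy",
               d = "disease", e = "disease", f = "disease")
knn_list <- list(c("a", "b", "c"), c("d", "e", "f"),
                 c("a", "d", "e"), c("b", "c", "f"))

frac <- minority_fraction(knn_list, condition)

# Share of aggregates that mix both conditions ("mosaics")
mosaic_share <- mean(frac > 0)

# Binned distribution of the minority fraction, as would feed a histogram
bins <- hist(frac, breaks = seq(0, 0.5, by = 0.1), plot = FALSE)
```

With two conditions the minority fraction lies between 0 and 0.5, so `mean(frac > 0)` gives the share of mosaic aggregates that can be compared against a threshold such as 5%.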

rcorces Dec 1, 2021
Maintainer

Interesting! Thanks for posting this!
