Update the documentation and fix a small bug with duplicated empty gene names.

jracle85 · jracle85 · commit 50a4f404f96c · 2023-07-12T11:27:07.000+02:00
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,27 +1,27 @@
-Package: EPIC
-Type: Package
-Title: Estimate the Proportion of Immune and Cancer cells
-Version: 1.1.6
-Authors@R: as.person(c(
-  "Julien Racle <julien.racle@unil.ch> [aut, cre]",
-  "David Gfeller <david.gfeller@unil.ch> [aut]"
-  ))
-Description: Package implementing EPIC method to estimate the proportion of
-    immune, stromal, endothelial and cancer or other cells from bulk gene
-    expression data.
-    It is based on reference gene expression profiles for the main non-malignant
-    cell types and it predicts the proportion of these cells and of the
-    remaining "other cells" (that are mostly cancer cells) for which no
-    reference profile is given.
-Depends:
-    R (>= 3.2.0)
-License: file LICENSE
-LazyData: TRUE
-RoxygenNote: 7.2.1
-Suggests:
-    testthat,
-    knitr,
-    rmarkdown
-Imports:
-    stats
-VignetteBuilder: knitr
+Package: EPIC
+Type: Package
+Title: Estimate the Proportion of Immune and Cancer cells
+Version: 1.1.7
+Authors@R: as.person(c(
+  "Julien Racle <julien.racle@unil.ch> [aut, cre]",
+  "David Gfeller <david.gfeller@unil.ch> [aut]"
+  ))
+Description: Package implementing EPIC method to estimate the proportion of
+    immune, stromal, endothelial and cancer or other cells from bulk gene
+    expression data.
+    It is based on reference gene expression profiles for the main non-malignant
+    cell types and it predicts the proportion of these cells and of the
+    remaining "other cells" (that are mostly cancer cells) for which no
+    reference profile is given.
+Depends:
+    R (>= 3.2.0)
+License: file LICENSE
+LazyData: TRUE
+RoxygenNote: 7.2.1
+Suggests:
+    testthat,
+    knitr,
+    rmarkdown
+Imports:
+    stats
+VignetteBuilder: knitr
diff --git a/NAMESPACE b/NAMESPACE
@@ -1,3 +1,3 @@
-# Generated by roxygen2: do not edit by hand
-
-export(EPIC)
+# Generated by roxygen2: do not edit by hand
+
+export(EPIC)
diff --git a/NEWS b/NEWS
@@ -1,3 +1,13 @@
+Version 1.1.7
+------------------------------------------------------------------------
+* Small changes in the documentation (in particular, explaining in the
+  README's FAQ section when to use the *mRNAProportions* or *cellFractions*).
+* Removed the warning message about unknown *mRNA_cell* values that was written
+  nearly in all runs (writing the caution message about this directly in the FAQ
+  section).
+* Corrected a bug when there were duplicated *empty* gene names (i.e., genes
+  named simply as "").
+
 Version 1.1.6
 ------------------------------------------------------------------------
 * Changed person of contact for commercial licenses to Nadette Bulgin.
diff --git a/R/EPIC_descr.R b/R/EPIC_descr.R
@@ -5,8 +5,8 @@
 #' estimate the proportion of immune, stromal, endothelial and cancer or other
 #' cells from bulk gene expression data.
 #'
-#' See the package \link[=../doc/info.html]{vignette} and function definitions
-#' below.
+#' See the package vignette (command in the R console: \emph{vignette("EPIC")} )
+#' and function definitions below.
 #'
 #' @section EPIC functions:
 #' \code{\link{EPIC}} is the main function to call to estimate the
diff --git a/R/EPIC_fun.R b/R/EPIC_fun.R
@@ -112,7 +112,11 @@
 #' @return A list of 3 matrices:\describe{
 #'  \item{\code{mRNAProportions}}{(\code{nSamples} x (\code{nCellTypes+1})) the
 #'    proportion of mRNA coming from all cell types with a ref profile + the
-#'    uncharacterized other cell.}
+#'    uncharacterized other cell. Please note that if working with reconstructed
+#'    in silico bulk samples built for example from single-cell RNA-seq data,
+#'    then you should compare the 'true' proportions against these
+#'    'mRNAProportions', while if working with true bulk samples, then you should
+#'    compare the cell proportions against the 'cellFractions'.}
 #'  \item{\code{cellFractions}}{(\code{nSamples} x (\code{nCellTypes+1})) this
 #'    gives the proportion of cells from each cell type after accounting for
 #'    the mRNA / cell value.}
@@ -392,18 +396,20 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL,
   if (anyNA(tInds)){
     defaultInd <- match("default", names(mRNA_cell))
     if (is.na(defaultInd)){
-      tStr <- paste(" and no default value is given for this mRNA per cell,",
-                    "so we cannot estimate the cellFractions, only",
-                    "the mRNA proportions")
+      warning("mRNA_cell value unknown for some cell types: ",
+        paste(colnames(mRNAProportions)[is.na(tInds)], collapse=", "),
+        " and no default value is given for the mRNA per cell, so we cannot ",
+        "estimate the cellFractions, only the mRNA proportions")
     } else {
-      tStr <- paste(" - using the default value of", mRNA_cell[defaultInd],
-                    "for these but this might bias the true cell proportions from",
-                    "all cell types.")
+      # warning("mRNA_cell value unknown for some cell types: ",
+      #   paste(colnames(mRNAProportions)[is.na(tInds)], collapse=", "),
+      #   " - using the default value of", mRNA_cell[defaultInd], " for these but ",
+      #   "this might bias the true cell proportions from all cell types.")
+      # Not indicating this warning message as it comes about always if the
+      # user doesn't define additional mRNA_cell values by himself. Instead,
+      # I've indicated this warning in the documentation directly.
+      tInds[is.na(tInds)] <- defaultInd
     }
-    warning("mRNA_cell value unknown for some cell types: ",
-            paste(colnames(mRNAProportions)[is.na(tInds)], collapse=", "),
-            tStr)
-    tInds[is.na(tInds)] <- defaultInd
   }
   cellFractions <- t( t(mRNAProportions) / mRNA_cell[tInds])
   cellFractions <- cellFractions / rowSums(cellFractions, na.rm=FALSE)
@@ -465,15 +471,17 @@ merge_duplicates <- function(mat, warn=TRUE, in_type=NULL){
     if (warn){
       warning("There are ", length(dupl_genes), " duplicated gene names",
         ifelse(!is.null(in_type), paste(" in the", in_type), ""),
-        ". We'll use the median value for each of these cases.")
+        " (e.g., ", paste0("'", dupl_genes[1:(min(5, length(dupl_genes)))],
+        "'", collapse=", "), "). We'll use the median value for ",
+        "each of these cases.")
     }
     mat_dupl <- mat[rownames(mat) %in% dupl_genes,,drop=F]
     mat_dupl_names <- rownames(mat_dupl)
     mat <- mat[!dupl,,drop=F]
     # First put the dupl cases in a separate matrix and keep only the unique
     # gene names in the mat matrix.
-    mat[dupl_genes,] <- t(sapply(dupl_genes, FUN=function(cgene)
-      apply(mat_dupl[mat_dupl_names == cgene,,drop=F], MARGIN=2, FUN=median)))
+    mat[match(dupl_genes, rownames(mat)),] <- t(sapply(dupl_genes, FUN=function(cgene)
+      apply(mat_dupl[mat_dupl_names == cgene,,drop=F], MARGIN=2, FUN=stats::median)))
   }
   return(mat)
 }
diff --git a/README.Rmd b/README.Rmd
@@ -84,6 +84,21 @@ and David Gfeller ([david.gfeller@unil.ch](mailto:david.gfeller@unil.ch)).
 
 
 ## FAQ
+##### Which proportions returned by EPIC should I use?
+* EPIC is returning two proportion values: *mRNAProportions* and *cellFractions*, 
+where the 2nd represents the true proportion of cells coming from the different
+cell types when considering differences in mRNA expression between cell types.
+So in principle, it is best to consider these *cellFractions*.
+
+  However, please note, that when the goal is to benchmark EPIC predictions, if
+the 'bulk samples' correspond in fact to in silico samples reconstructed for
+example from single-cell RNA-seq data, then it is usually better to compare the
+'true' proportions against the *mRNAProportions* from EPIC. Indeed, when
+building such in silico samples, the fact that different cell types express
+different amount of mRNA is usually not taken into account. On the other side,
+if working with true bulk samples, then you should compare the true cell
+proportions (measured e.g., by FACS) against the *cellFractions*.
+
 ##### What do the "*other cells*" represent?
 * EPIC predicts the proportions of the various cell types for which we have
 gene expression reference profiles (and corresponding gene signatures). But,
@@ -99,7 +114,7 @@ epithelial cells for example.
 Please make sure that your bulk data is in the form of a matrix (and also
 your reference gene expression profiles if using custom ones).
 
-##### What is the meaning of the warning message telling that some mRNA_cell values are unknown?
+##### Is there some caution to consider about the *cellFractions* and *mRNA_cell* values?
 * As described in our manuscript, EPIC first estimates the proportion of mRNA
 per cell type in the bulk and then it uses the fact that some cell types have
 more mRNA copies per cell than other to normalize this and obtain an estimate of
@@ -108,10 +123,10 @@ if you need the one or the other). For this normalization we had either measured
 the amount of mRNA per cell or found it in the literature (fig. 1 – fig.
 supplement 2 of our paper). However we don’t currently have such values for the
 endothelial cells and CAFs. Therefore for these two cell types, we use an average
-value, which might not reflect their true value and this is the reason why we
-output this message. If you have some values for these mRNA/cell abundances, you
-can also add them into EPIC, with help of the parameter "*mRNA_cell*" or
-“*mRNA_cell_sub*” (and that would be great to share these values).
+value, which might not reflect their true value and this could bias a bit the
+predictions, especially for these cell types. If you have some values for these
+mRNA/cell abundances, you can also add them into EPIC, with help of the parameter
+"*mRNA_cell*" or “*mRNA_cell_sub*” (and that would be great to share these values).
 
     If the mRNA proportions of these cell types are low, then even if you don't
 correct the results with their true mRNA/cell abundances, it would not really
diff --git a/README.md b/README.md
@@ -85,6 +85,24 @@ Julien Racle (<julien.racle@unil.ch>), and David Gfeller
 
 ## FAQ
 
+##### Which proportions returned by EPIC should I use?
+
+- EPIC is returning two proportion values: *mRNAProportions* and
+  *cellFractions*, where the 2nd represents the true proportion of cells
+  coming from the different cell types when considering differences in
+  mRNA expression between cell types. So in principle, it is best to
+  consider these *cellFractions*.
+
+  However, please note, that when the goal is to benchmark EPIC
+  predictions, if the ‘bulk samples’ correspond in fact to in silico
+  samples reconstructed for example from single-cell RNA-seq data, then
+  it is usually better to compare the ‘true’ proportions against the
+  *mRNAProportions* from EPIC. Indeed, when building such in silico
+  samples, the fact that different cell types express different amount
+  of mRNA is usually not taken into account. On the other side, if
+  working with true bulk samples, then you should compare the true cell
+  proportions (measured e.g., by FACS) against the *cellFractions*.
+
 ##### What do the “*other cells*” represent?
 
 - EPIC predicts the proportions of the various cell types for which we
@@ -104,7 +122,7 @@ Julien Racle (<julien.racle@unil.ch>), and David Gfeller
   matrix (and also your reference gene expression profiles if using
   custom ones).
 
-##### What is the meaning of the warning message telling that some mRNA_cell values are unknown?
+##### Is there some caution to consider about the *cellFractions* and *mRNA_cell* values?
 
 - As described in our manuscript, EPIC first estimates the proportion of
   mRNA per cell type in the bulk and then it uses the fact that some
@@ -115,11 +133,11 @@ Julien Racle (<julien.racle@unil.ch>), and David Gfeller
   mRNA per cell or found it in the literature (fig. 1 – fig. supplement
   2 of our paper). However we don’t currently have such values for the
   endothelial cells and CAFs. Therefore for these two cell types, we use
-  an average value, which might not reflect their true value and this is
-  the reason why we output this message. If you have some values for
-  these mRNA/cell abundances, you can also add them into EPIC, with help
-  of the parameter “*mRNA_cell*” or “*mRNA_cell_sub*” (and that would be
-  great to share these values).
+  an average value, which might not reflect their true value and this
+  could bias a bit the predictions, especially for these cell types. If
+  you have some values for these mRNA/cell abundances, you can also add
+  them into EPIC, with help of the parameter “*mRNA_cell*” or
+  “*mRNA_cell_sub*” (and that would be great to share these values).
 
   If the mRNA proportions of these cell types are low, then even if you
   don’t correct the results with their true mRNA/cell abundances, it
diff --git a/inst/doc/EPIC.Rmd b/inst/doc/EPIC.Rmd
@@ -80,6 +80,21 @@ and David Gfeller ([david.gfeller@unil.ch](mailto:david.gfeller@unil.ch)).
 
 
 ## FAQ
+##### Which proportions returned by EPIC should I use?
+* EPIC is returning two proportion values: *mRNAProportions* and *cellFractions*, 
+where the 2nd represents the true proportion of cells coming from the different
+cell types when considering differences in mRNA expression between cell types.
+So in principle, it is best to consider these *cellFractions*.
+
+  However, please note, that when the goal is to benchmark EPIC predictions, if
+the 'bulk samples' correspond in fact to in silico samples reconstructed for
+example from single-cell RNA-seq data, then it is usually better to compare the
+'true' proportions against the *mRNAProportions* from EPIC. Indeed, when
+building such in silico samples, the fact that different cell types express
+different amount of mRNA is usually not taken into account. On the other side,
+if working with true bulk samples, then you should compare the true cell
+proportions (measured e.g., by FACS) against the *cellFractions*.
+
 ##### What do the "*other cells*" represent?
 * EPIC predicts the proportions of the various cell types for which we have
 gene expression reference profiles (and corresponding gene signatures). But,
@@ -95,7 +110,7 @@ epithelial cells for example.
 Please make sure that your bulk data is in the form of a matrix (and also
 your reference gene expression profiles if using custom ones).
 
-##### What is the meaning of the warning message telling that some mRNA_cell values are unknown?
+##### Is there some caution to consider about the *cellFractions* and *mRNA_cell* values?
 * As described in our manuscript, EPIC first estimates the proportion of mRNA
 per cell type in the bulk and then it uses the fact that some cell types have
 more mRNA copies per cell than other to normalize this and obtain an estimate of
@@ -104,10 +119,10 @@ if you need the one or the other). For this normalization we had either measured
 the amount of mRNA per cell or found it in the literature (fig. 1 – fig.
 supplement 2 of our paper). However we don’t currently have such values for the
 endothelial cells and CAFs. Therefore for these two cell types, we use an average
-value, which might not reflect their true value and this is the reason why we
-output this message. If you have some values for these mRNA/cell abundances, you
-can also add them into EPIC, with help of the parameter "*mRNA_cell*" or
-“*mRNA_cell_sub*” (and that would be great to share these values).
+value, which might not reflect their true value and this could bias a bit the
+predictions, especially for these cell types. If you have some values for these
+mRNA/cell abundances, you can also add them into EPIC, with help of the parameter
+"*mRNA_cell*" or “*mRNA_cell_sub*” (and that would be great to share these values).
 
     If the mRNA proportions of these cell types are low, then even if you don't
 correct the results with their true mRNA/cell abundances, it would not really
diff --git a/inst/doc/EPIC.html b/inst/doc/EPIC.html
@@ -12,7 +12,7 @@
 
 <meta name="author" content="Julien Racle and David Gfeller" />
 
-<meta name="date" content="2023-03-13" />
+<meta name="date" content="2023-07-12" />
 
 <title>EPIC package</title>
 
@@ -340,7 +340,7 @@
 
 <h1 class="title toc-ignore">EPIC package</h1>
 <h4 class="author">Julien Racle and David Gfeller</h4>
-<h4 class="date">2023-03-13</h4>
+<h4 class="date">2023-07-12</h4>
 
 
 
@@ -409,6 +409,26 @@ <h2>Contact information</h2>
 </div>
 <div id="faq" class="section level2">
 <h2>FAQ</h2>
+<div id="which-proportions-returned-by-epic-should-i-use" class="section level5">
+<h5>Which proportions returned by EPIC should I use?</h5>
+<ul>
+<li><p>EPIC is returning two proportion values: <em>mRNAProportions</em>
+and <em>cellFractions</em>, where the 2nd represents the true proportion
+of cells coming from the different cell types when considering
+differences in mRNA expression between cell types. So in principle, it
+is best to consider these <em>cellFractions</em>.</p>
+<p>However, please note, that when the goal is to benchmark EPIC
+predictions, if the ‘bulk samples’ correspond in fact to in silico
+samples reconstructed for example from single-cell RNA-seq data, then it
+is usually better to compare the ‘true’ proportions against the
+<em>mRNAProportions</em> from EPIC. Indeed, when building such in silico
+samples, the fact that different cell types express different amount of
+mRNA is usually not taken into account. On the other side, if working
+with true bulk samples, then you should compare the true cell
+proportions (measured e.g., by FACS) against the
+<em>cellFractions</em>.</p></li>
+</ul>
+</div>
 <div id="what-do-the-other-cells-represent" class="section level5">
 <h5>What do the “<em>other cells</em>” represent?</h5>
 <ul>
@@ -433,9 +453,9 @@ <h5>I receive an error message “<em>attempt to set ‘colnames’ on an
 ones).</li>
 </ul>
 </div>
-<div id="what-is-the-meaning-of-the-warning-message-telling-that-some-mrna_cell-values-are-unknown" class="section level5">
-<h5>What is the meaning of the warning message telling that some
-mRNA_cell values are unknown?</h5>
+<div id="is-there-some-caution-to-consider-about-the-cellfractions-and-mrna_cell-values" class="section level5">
+<h5>Is there some caution to consider about the <em>cellFractions</em>
+and <em>mRNA_cell</em> values?</h5>
 <ul>
 <li><p>As described in our manuscript, EPIC first estimates the
 proportion of mRNA per cell type in the bulk and then it uses the fact
@@ -446,11 +466,12 @@ <h5>What is the meaning of the warning message telling that some
 mRNA per cell or found it in the literature (fig. 1 – fig. supplement 2
 of our paper). However we don’t currently have such values for the
 endothelial cells and CAFs. Therefore for these two cell types, we use
-an average value, which might not reflect their true value and this is
-the reason why we output this message. If you have some values for these
-mRNA/cell abundances, you can also add them into EPIC, with help of the
-parameter “<em>mRNA_cell</em>” or “<em>mRNA_cell_sub</em>” (and that
-would be great to share these values).</p>
+an average value, which might not reflect their true value and this
+could bias a bit the predictions, especially for these cell types. If
+you have some values for these mRNA/cell abundances, you can also add
+them into EPIC, with help of the parameter “<em>mRNA_cell</em>” or
+“<em>mRNA_cell_sub</em>” (and that would be great to share these
+values).</p>
 <p>If the mRNA proportions of these cell types are low, then even if you
 don’t correct the results with their true mRNA/cell abundances, it would
 not really have a big impact on the results. On the other side, if there
diff --git a/man/EPIC.Rd b/man/EPIC.Rd
diff --git a/man/EPIC.package.Rd b/man/EPIC.package.Rd
diff --git a/vignettes/EPIC.Rmd b/vignettes/EPIC.Rmd
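
Note on the `merge_duplicates` change in `R/EPIC_fun.R`: the switch from `mat[dupl_genes,]` to `mat[match(dupl_genes, rownames(mat)),]` appears to address the duplicated empty gene names mentioned in NEWS. R's character subscripts never match an empty name (see `?Extract`), whereas `match()` does. The following is a minimal sketch of that behaviour on a toy matrix; it mirrors the lines of the hunk but is not the package function itself, and the names `mat_unique`, `medians` and `old_style` are hypothetical.

```r
# Toy expression matrix with a duplicated *empty* gene name ("").
# Rows: geneA = (1, 10), "" = (2, 20), "" = (4, 40), geneB = (8, 80).
mat <- matrix(c(1, 2, 4, 8,
                10, 20, 40, 80),
              nrow = 4,
              dimnames = list(c("geneA", "", "", "geneB"), c("s1", "s2")))

dupl <- duplicated(rownames(mat))
dupl_genes <- unique(rownames(mat)[dupl])   # here: ""
mat_dupl <- mat[rownames(mat) %in% dupl_genes, , drop = FALSE]
mat_dupl_names <- rownames(mat_dupl)
mat_unique <- mat[!dupl, , drop = FALSE]    # keeps geneA, "", geneB

# Median per sample for each duplicated gene name (same logic as the hunk).
medians <- t(sapply(dupl_genes, FUN = function(cgene)
  apply(mat_dupl[mat_dupl_names == cgene, , drop = FALSE],
        MARGIN = 2, FUN = stats::median)))

# Old style: indexing rows by name. Character subscripts do not match "",
# so this assignment is expected to fail; tryCatch keeps the sketch running.
old_style <- tryCatch({
  tmp <- mat_unique
  tmp[dupl_genes, ] <- medians
  "ok"
}, error = function(e) "error")

# New style: match() finds the position of "", so the row is updated in place.
mat_unique[match(dupl_genes, rownames(mat_unique)), ] <- medians
print(old_style)
print(mat_unique)
```

After the position-based assignment, the row named "" holds the per-sample medians of the two original empty-named rows, and the other rows are unchanged.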