MSKCC-Epi-Bio · hfuchs5 · May 9, 2024 · Jun 6, 2024
diff --git a/R/utils-exported-helpers.R b/R/utils-exported-helpers.R
@@ -147,3 +147,51 @@ extract_patient_id <- function(sample_id) {
   return(patient_id)
 }
 
+
+# Add point mutations ------------------------------------------------
+
+#' Annotates point mutations of interest
+#'
+#' @param df Raw maf dataframe containing mutation data
+#' @param gene_name Hugo symbol of gene of interest
+#' @param chr_num Number of the chromosome with gene of interest
+#' @param start_pos String giving the start position of the point mutation, or a regular expression matched against `start_position`.
+#' @param new_name String providing a name for this point mutation
+#' @param mutually_exclusive Boolean determining if the point mutation should be
+#' mutually exclusive from any other mutation on the gene or not. The default is `TRUE`.
+#' @return A data frame with updated Hugo symbols for point mutations
+#' @export
+#'
+
+add_point_mut <- function(df, gene_name, chr_num, start_pos, new_name,
+                          mutually_exclusive = TRUE) {
+
+  df <- .clean_and_check_cols(df) %>%
+    mutate(start_position = as.character(start_position))
+
+  if (mutually_exclusive){
+    pm_df <- df %>%
+      mutate(hugo_symbol = case_when(
+        chromosome == chr_num & grepl(start_pos, start_position) ~ new_name,
+        TRUE ~ hugo_symbol
+      ))
+  } else {
+
+    # select only the rows that have the point mutation of interest
+    pm_only <- df %>%
+      mutate(hugo_symbol = case_when(
+        chromosome == chr_num & grepl(start_pos, start_position) ~ new_name,
+        TRUE ~ hugo_symbol
+      )) %>%
+      filter(.data$hugo_symbol == new_name)
+
+    # The point mutation will be counted as both the original
+    # hugo_symbol and also the new point mutation, so combine rows
+    pm_df <- df %>%
+      rbind(pm_only)
+
+  }
+
+  return(pm_df)
+
+}
diff --git a/vignettes/data-processing-vignette.Rmd b/vignettes/data-processing-vignette.Rmd
@@ -119,7 +119,7 @@ gnomeR::cna %>%
 
 
 ## Preparing Data For Analysis
- 
+
 ### Process Data with `create_gene_binary()`
 
 Often the first step to analyzing genomic data is organizing it in an event matrix. This matrix will have one row for each sample in your cohort and one column for each type of genomic event.  Each cell will take a value of `0` (no event on that gene/sample), `1` (event on that gene/sample) or `NA` (missing data or gene not tested on panel). The `create_gene_binary()` function helps you process your data into this format for use in downstream analysis. 
@@ -170,6 +170,45 @@ colnames(all_bin)
 - `specify_panel`- If you are working across a set of samples that was sequenced on several different gene panels, this argument will insert NAs for the genes that weren't tested for any given sample. You can pass a string `"impact"` indicating automatically guessing panels and processsing IMPACT samples based on ID, or you can pass a data frame with columns sample_id and gene_panel for more fine grained control of NA annotation. 
 - `recode_aliases` - Sometimes genes have several accepted names or change names over time. This can be an issue if genes are coded under multiple names in studies, or if you are working across studies. By default, this function will search for aliases for genes in your data set and resolved them to their current most common name. 
 
+### Point Mutations: a Special Consideration
+
+Sometimes researchers are interested in a specific change to the DNA sequence at a particular position on a chromosome (a point mutation), rather than in any alteration of a gene.
+
+For example, two distinct mutations of the TERT gene, C228T and C250T, have been associated with brain tumorigenesis ([1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8945783/)). In both cases, the nucleotide cytosine (C) is replaced by thymine (T). For more background on the biological mechanisms of TERT regulation, see [this article](https://jcp.bmj.com/content/72/4/281).
+
+Let's say you want to differentiate C228T and C250T from any other TERT mutations. With a MAF file, you can use the `chromosome` and `start_position` variables to pinpoint the location of these mutations on the TERT gene and assign them a new `hugo_symbol` using `add_point_mut()`. If you are unsure of these values, [Gene Cards](https://www.genecards.org/) is a helpful resource. Below are examples using the [MSK-IMPACT Clinical Sequencing Cohort](https://www.cbioportal.org/study/summary?id=msk_impact_2017) from cBioPortal, which was used in a [prior study](https://pubmed.ncbi.nlm.nih.gov/28481359/).
+
+The key arguments of `add_point_mut()` are:
+
+- `chr_num` - the number of the chromosome on which the gene of interest is located
+- `start_pos` - the start position of the mutation we are interested in, which can be found on Gene Cards
+- `new_name` - a string giving the name the point mutation should be called moving forward
+- `mutually_exclusive` - logical, default = `TRUE`, indicating whether this point mutation should be treated as mutually exclusive from other mutations on the gene of interest. This affects mutational frequency calculations and the interpretation of results. For example, say you had 10 TERT mutations across 20 individuals and two of them are TERT.C228T point mutations. By default, the mutational frequencies would be calculated as 40% for TERT (8/20) and 10% for TERT.C228T (2/20). If set to `FALSE`, the mutation frequencies would be 50% for any TERT mutation (10/20) and 10% for TERT.C228T (2/20).
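+
+The frequency arithmetic in the `mutually_exclusive` description can be checked with a small toy calculation. This is only a sketch in base R: the data frame `toy`, its `is_c228t` flag, and the assumption that each of the 10 mutated samples carries exactly one TERT mutation are hypothetical and are not part of the package or of the IMPACT data.
+
+```{r}
+# Toy cohort (hypothetical data): 20 samples, 10 of which carry one TERT
+# mutation each; 2 of those 10 are the C228T point mutation.
+n_samples <- 20
+toy <- data.frame(
+  sample_id = paste0("S", 1:10),
+  hugo_symbol = "TERT",
+  is_c228t = c(TRUE, TRUE, rep(FALSE, 8)),
+  stringsAsFactors = FALSE
+)
+
+# mutually_exclusive = TRUE: point-mutation rows are relabeled,
+# so they no longer count towards TERT
+excl <- ifelse(toy$is_c228t, "TERT.C228T", toy$hugo_symbol)
+freq_excl <- table(excl) / n_samples
+freq_excl # TERT = 0.4 (8/20), TERT.C228T = 0.1 (2/20)
+
+# mutually_exclusive = FALSE: point-mutation rows keep their original
+# symbol and are added again under the new name
+incl <- c(toy$hugo_symbol, rep("TERT.C228T", sum(toy$is_c228t)))
+freq_incl <- table(incl) / n_samples
+freq_incl # TERT = 0.5 (10/20), TERT.C228T = 0.1 (2/20)
+```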
+
+```{r}
+# load in IMPACT data
+# would need an API token from cBioPortal to access data
+
+cbioportalR::set_cbioportal_db("public")
+
+impact_df <- cbioportalR::get_mutations_by_study(study_id = 'msk_impact_2017')
+
+# find the total number of TERT mutations in the dataset
+impact_df %>%
+  count(hugoGeneSymbol) %>%
+  filter(hugoGeneSymbol == "TERT")
+
+TERT_point_muts <- impact_df %>%
+  filter(hugoGeneSymbol == "TERT") %>%
+  # the start position can be a fixed value or a regular expression
+  # pattern that is searched for in the data
+  add_point_mut(chr_num = 5, start_pos = '1295228', new_name = 'TERT.C228T', mutually_exclusive = TRUE) %>%
+  add_point_mut(chr_num = 5, start_pos = '250$', new_name = 'TERT.C250T', mutually_exclusive = TRUE)
+
+# see new counts of hugo symbols
+count(TERT_point_muts, hugo_symbol)
+
+```
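+
+Because `add_point_mut()` matches `start_pos` against `start_position` with `grepl()`, the value is treated as a regular expression and an unanchored value matches any position that contains it. The sketch below uses a made-up vector of positions (not taken from the IMPACT data) to show the difference between an unanchored value, an anchored exact match, and a suffix pattern.
+
+```{r}
+# Hypothetical start positions, stored as character strings
+positions <- c("1295228", "1295250", "11295228", "2501")
+
+# An unanchored value also matches longer positions that contain it
+substring_match <- grepl("1295228", positions)
+substring_match # TRUE FALSE TRUE FALSE
+
+# Anchoring with ^ and $ restricts the match to the exact position
+exact_match <- grepl("^1295228$", positions)
+exact_match # TRUE FALSE FALSE FALSE
+
+# A suffix pattern matches every position ending in "250"
+suffix_match <- grepl("250$", positions)
+suffix_match # FALSE TRUE FALSE FALSE
+```
+
+If only one exact position should be relabeled, an anchored pattern such as `'^1295228$'` is the safer choice.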
+
 ### Collapse Data with `summarize_by_gene()`
 
 If the type of alteration event (mutation, amplification, deletion, structural variant) does not matter for your analysis, and you want to see if any event occurred for a gene, pipe your `create_gene_binary()` object through the `summarize_by_gene()` function. As you can see, this compresses all alteration types of the same gene into one column. So, where in `all_bin` there was an `ERG.fus` column but no `ERG` column, now `summarize_by_gene()` only has an `ERG` column with a `1` for any type of event.