GO_tools.R

The GO_tools.R script provides a suite of functions for processing and analyzing Gene Ontology (GO) terms in R. It offers utilities to clean, update, and retrieve detailed metadata for GO terms via the QuickGO API, summarize GO terms via the ReviGO API, and perform GO enrichment analysis using hypergeometric statistics.

Usage

Install required packages:

install.packages(c("httr", "jsonlite", "xml2", "rlist", "dplyr", "XML", "readr"))

Clone the repository and source the GO_tools.R script from this repository:

#download GO tools from github repo
system("git clone https://github.com/gabrielnegreira/GO_tools.git")
#source it
source("GO_tools/GO_tools.R")

This will load the functions to the environment.

Overview of Functions

1. `clean_GO_terms()`

This function takes a vector of GO terms and uses the QuickGO API to look for updated terms. It identifies whether each term is primary or secondary, and for secondary terms, it determines the corresponding primary term. It also flags obsolete terms and suggest replacements for them when applicable. It returns a dataframe with these information, including the best_term column which contains a suggested replacement for the input term.

Usage Example:

clean_GO_terms() only requires a vector of GO terms as input:

#inport example file
go_df <- read.table("GO_tools/example.txt", comment.char = "%", header = FALSE, sep = "", col.names = c("GO_ID", "p_value"), stringsAsFactors = FALSE)

head(go_df)

GO_ID	p_value
GO:0040017	0.0001894
GO:0051272	0.0001894
GO:0016049	0.0019016
GO:0008361	0.0022678
GO:0007091	0.0023467
GO:0030335	0.0029797

cleaned_terms <- clean_GO_terms(go_df$GO_ID) 
cleaned_terms %>%
  filter(input_term != best_term)

The result looks like this:

input_term	is_in_db	is_secondary	primary_term	db_term	db_term_name	is_obsolete	db_term_aspect	replaced_by	best_term	best_term_name	best_term_aspect
GO:0006465	TRUE	FALSE	NA	GO:0006465	obsolete signal peptide processing	TRUE	biological_process	GO:0051604	GO:0051604	protein maturation	biological_process
GO:0043193	TRUE	TRUE	GO:0045893	GO:0045893	positive regulation of DNA-templated transcription	FALSE	biological_process	NA	GO:0045893	positive regulation of DNA-templated transcription	biological_process
GO:0030384	TRUE	TRUE	GO:0046488	GO:0046488	phosphatidylinositol metabolic process	FALSE	biological_process	NA	GO:0046488	phosphatidylinositol metabolic process	biological_process
GO:0015893	TRUE	TRUE	GO:0042908	GO:0042908	xenobiotic transport	FALSE	biological_process	NA	GO:0042908	xenobiotic transport	biological_process
GO:0046489	TRUE	TRUE	GO:0006661	GO:0006661	phosphatidylinositol biosynthetic process	FALSE	biological_process	NA	GO:0006661	phosphatidylinositol biosynthetic process	biological_process
GO:0005830	TRUE	TRUE	GO:0022626	GO:0022626	cytosolic ribosome	FALSE	cellular_component	NA	GO:0022626	cytosolic ribosome	cellular_component
GO:0004289	TRUE	FALSE	NA	GO:0004289	obsolete subtilase activity	TRUE	molecular_function	GO:0004252	GO:0004252	serine-type endopeptidase activity	molecular_function
GO:0008022	TRUE	FALSE	NA	GO:0008022	obsolete protein C-terminus binding	TRUE	molecular_function	GO:0005515	GO:0005515	protein binding	molecular_function

We can then replace the terms by the suggested ones:

go_df <- go_df %>%
  left_join(cleaned_terms %>% rename(GO_ID = input_term) %>% select(GO_ID, best_term), by = "GO_ID")

go_df %>%
  filter(GO_ID != best_term)

GO_ID	p_value	cleaned_GO_ID
GO:0006465	0.0049359	GO:0051604
GO:0043193	0.0066019	GO:0045893
GO:0030384	0.0142193	GO:0046488
GO:0015893	0.0184365	GO:0042908
GO:0046489	0.0425017	GO:0006661
GO:0005830	0.0338005	GO:0022626
GO:0004289	0.0108298	GO:0004252
GO:0008022	0.0265540	GO:0005515

2. `revigo_query()`

This function summarizes GO terms into simpler, higher representative terms using the ReviGO API. This function can take either a vector of GO terms or a data frame with GO terms and associated values (e.g., p-values).

Parameters:

cutoff: Similarity cutoff for grouping terms (default is "0.7").
valueType: Type of the provided value (default "PValue"). Other options include "Higher", "Lower", "HigherAbsolute", and "HigherAbsLog2".
speciesTaxon: NCBI taxon ID (default "0").
measure: Similarity measure to use; options include "SIMREL", "LIN", "RESNIK", "JIANG".
removeObsolete: Logical flag indicating whether to remove obsolete terms (default is TRUE).

See the Revigo documentation for more details.

Usage

#run `revigo_query()`
reduced_terms <- revigo_query(go_df$cleaned_GO_ID)
reduced_terms  %>%
  filter(term_id != repr_id) %>%
  head()

The result looks like this:

term_id	name	frequency	value	uniqueness	dispensability	eliminated	representative	go_type	repr_id	repr_name
GO:0030334	regulation of cell migration	0.355%	NaN	0.73	0.87	True	40013	biological_process	GO:0040013	negative regulation of locomotion
GO:0030335	positive regulation of cell migration	0.216%	NaN	0.70	0.91	True	40013	biological_process	GO:0040013	negative regulation of locomotion
GO:0030336	negative regulation of cell migration	0.064%	NaN	0.75	0.76	True	40013	biological_process	GO:0040013	negative regulation of locomotion
GO:0040017	positive regulation of locomotion	0.240%	NaN	0.71	0.91	True	40013	biological_process	GO:0040013	negative regulation of locomotion
GO:0006661	phosphatidylinositol biosynthetic process	0.457%	NaN	0.91	0.85	True	46488	biological_process	GO:0046488	phosphatidylinositol metabolic process
GO:0001710	mesodermal cell fate commitment	0.005%	NaN	0.83	0.85	True	48333	biological_process	GO:0048333	mesodermal cell differentiation

To use it to replace terms by their reduced alternatives, we can do this:

#arrange by reduced terms
go_df <- go_df %>%
  filter(cleaned_GO_ID != reduced_GO_ID) %>%
  group_by(reduced_GO_ID) %>%
  mutate(n_terms = n()) %>%
  arrange(desc(n_terms)) %>%
  select(-n_terms)

#print table  
go_df %>%
  head()

GO_ID	p_value	cleaned_GO_ID	reduced_GO_ID
GO:0040017	0.0001894	GO:0040017	GO:0040013
GO:0030335	0.0029797	GO:0030335	GO:0040013
GO:0030334	0.0140864	GO:0030334	GO:0040013
GO:0030336	0.0321750	GO:0030336	GO:0040013
GO:0004861	0.0002106	GO:0004861	GO:0004867
GO:0004860	0.0108788	GO:0004860	GO:0004867

Notice how the GO:0040017, GO:0030335, GO:0030334, and GO:0030336 terms were colapsed to the GO:0040013 term.

3. `get_GO_data`

This function fetches additional metadata for a given vector of GO terms by querying the QuickGO API.

Usage

go_meta <- get_GO_data(unique(go_df$reduced_GO_ID))
go_meta %>%
  head()

Result:

input_term	is_in_db	is_secondary	primary_term	db_term	name	is_obsolete	aspect	replaced_by
GO:0040013	TRUE	FALSE	NA	GO:0040013	negative regulation of locomotion	FALSE	biological_process	NA
GO:0004867	TRUE	FALSE	NA	GO:0004867	serine-type endopeptidase inhibitor activity	FALSE	molecular_function	NA
GO:0050678	TRUE	FALSE	NA	GO:0050678	regulation of epithelial cell proliferation	FALSE	biological_process	NA
GO:0030134	TRUE	FALSE	NA	GO:0030134	COPII-coated ER to Golgi transport vesicle	FALSE	cellular_component	NA
GO:0005254	TRUE	FALSE	NA	GO:0005254	chloride channel activity	FALSE	molecular_function	NA
GO:0016209	TRUE	FALSE	NA	GO:0016209	antioxidant activity	FALSE	molecular_function	NA

We can then use it to append metadata of interest to our main dataframe:

#append name to go_df
go_df <- go_df %>%
  left_join(go_meta %>%
              rename(reduced_GO_ID = input_term, reduced_GO_name = name, reduced_GO_type = aspect) %>%
              select(reduced_GO_ID, reduced_GO_name, reduced_GO_type),
            by = "reduced_GO_ID")

go_df %>%
  head()

GO_ID	p_value	cleaned_GO_ID	reduced_GO_ID	reduced_GO_name	reduced_GO_type
GO:0040017	0.0001894	GO:0040017	GO:0040013	negative regulation of locomotion	biological_process
GO:0030335	0.0029797	GO:0030335	GO:0040013	negative regulation of locomotion	biological_process
GO:0030334	0.0140864	GO:0030334	GO:0040013	negative regulation of locomotion	biological_process
GO:0030336	0.0321750	GO:0030336	GO:0040013	negative regulation of locomotion	biological_process
GO:0004861	0.0002106	GO:0004861	GO:0004867	serine-type endopeptidase inhibitor activity	molecular_function
GO:0004860	0.0108788	GO:0004860	GO:0004867	serine-type endopeptidase inhibitor activity	molecular_function

4. `GO_enrich`

This function performs a GO enrichment analysis using hypergeometric statistics and Fisher's exact test. It compares a set of GO terms (e.g., from differentially expressed genes) against a reference set (e.g., the entire genome).

Parameters:

set_terms: A vector of GO terms from the test set.
ref_terms: A vector of GO terms from the reference set.
test_type: Specifies the alternative hypothesis. Options are:
"enrichment" (uses "greater"),
"depletion" (uses "less"),
"both" (uses "two.sided").

Usage:

#perform GO enrichment analysis
set_terms <- go_df$reduced_GO_ID[1:6]
ref_terms <- go_df$reduced_GO_ID
GO_enrich(set_terms, ref_terms)

term	count_in_set	pvalue	count_in_ref	prop_in_set	prop_in_ref	log2enrichment
GO:0040013	4	0.0020506	4	0.6666667	0.1818182	1.874469
GO:0004867	2	0.1688312	3	0.3333333	0.1363636	1.289507

Notes

Internet Connection:
Both the QuickGO and Revigo APIs require an active internet connection.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GO_tools.R

Usage

Overview of Functions

1. `clean_GO_terms()`

2. `revigo_query()`

3. `get_GO_data`

4. `GO_enrich`

Notes

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GO_tools.R

Usage

Overview of Functions

1. clean_GO_terms()

2. revigo_query()

3. get_GO_data

4. GO_enrich

Notes

1. `clean_GO_terms()`

2. `revigo_query()`

3. `get_GO_data`

4. `GO_enrich`