GeNLP: An interactive web application for microbial gene exploration and prediction
🌐 Visit the GeNLP website
This repository contains the implementation of GeNLP, a user-friendly web server to explore gene relationships!
The server is based on a pre-trained published language model:
"Deciphering microbial gene function using natural language processing"
Weights and trained model are available on the paper's GitHub repository.
The application contains two main modes:
The map in the main display is an interactive map, where each gene is represented by a dot. The map is color-coded by functional group, where unknown proteins are colored in light grey. The map supports zoom-in and zoom-out.
Upon sufficient zoom-in the points are clickable, providing additional information on a given gene family.
The data points are color-coded based on their functional groups, which have been adapted from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. A legend depicting these functional groups is located in the bottom left corner of the visualization.
Using the dropdown menu, select an option from the following:
- KEGG ortholog: search by KEGG ortholog group identifier (KO).
  For example, the KO K07464 (cas4):
The orange circles mark the location of the highlighted interactive points of cas4 representatives.
- Functional category: Highlight a specific functional category.
  For example, selecting Bacterial Motility Proteins will result in all related proteins being highlighted as interactive points:
- Gene description: Highlight genes sharing the same gene description.
  For example, selecting CRISPR system Cascade subunit casB will result in all related proteins being highlighted as interactive points:
- Gene Name: Highlight genes according to their gene name.
  For example, selecting cas3 will result in all related proteins being highlighted as interactive points:
At this zoom level all points are interactive, and the selected points identified as cas3 are highlighted with a white edge.
- Neighbors: Highlight the 10 closest genes for a selected gene family.
  For example, selecting the word CRISPR (which corresponds to a CRISPR array identifier) will highlight its neighbors as interactive points:
Notice: The distance calculations were performed in a 300-dimensional space. As a result, it is possible that the genes closest to each other may appear to be far apart in the two-dimensional projections. A banner noting this appears when entering the web application, and again when performing a search for neighbors.
- Model word: search by KEGG ortholog group (KO) sub-cluster or by a hypothetical identifier (used for uncharacterized genes). Multiple selections are supported. The aforementioned description also applies when selecting points in interactive mode.
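The notice on the Neighbors option can be illustrated with a small sketch. It is not the GeNLP implementation: the embedding is random, the word names are made up, Euclidean distance is an assumption (the distance metric is not named here), and the "projection" simply keeps the first two coordinates as a crude stand-in for whatever 2D projection the map uses. It only shows why the 10 closest words in 300 dimensions need not be the 10 closest dots on a 2D map.

```python
import math
import random


def k_nearest(vectors, query, k=10):
    """Return the k words closest to `query` by Euclidean distance (assumed metric)."""
    q = vectors[query]
    others = (word for word in vectors if word != query)
    return sorted(others, key=lambda word: math.dist(q, vectors[word]))[:k]


rng = random.Random(0)

# Hypothetical embedding: 200 made-up words, 300 dimensions each.
embedding = {
    f"word_{i}": [rng.gauss(0, 1) for _ in range(300)] for i in range(200)
}

# Crude stand-in for a 2D projection: keep only the first two coordinates.
projection = {word: vec[:2] for word, vec in embedding.items()}

near_300d = k_nearest(embedding, "word_0")
near_2d = k_nearest(projection, "word_0")
shared = set(near_300d) & set(near_2d)
print(f"{len(shared)} of 10 neighbors agree between the 300D and 2D views")
```

The two neighbor lists are computed from different coordinates, so any overlap between them is incidental; the lists shown in the application come from the 300-dimensional space.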
When a specific word (point) is selected, relevant information about the associated gene is shown; the fields vary depending on whether the gene is known or unknown. In addition, three interactive panels are available: NEIGHBORS, FUNC PRED, and TAX MAP, offering further options for exploration and analysis.
For hypothetical proteins, the field Prediction confidence denotes whether the prediction assigned by the model is reliable. The score is obtained from the functional classifier: high prediction confidence means the score passed our defined cutoff, whereas low prediction confidence means the score fell below the cutoff (for more technical details see the manuscript).
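As a minimal sketch of how a cutoff turns classifier scores into a confidence label, consider the snippet below. The function name, the category scores, the cutoff value of 0.5, the choice of the top-scoring category as the prediction, and treating a score equal to the cutoff as passing are all assumptions for illustration; the actual cutoff and procedure are described in the manuscript.

```python
def label_prediction(scores, cutoff):
    """Pick the top-scoring category and label its confidence against a cutoff.

    `scores` maps a functional category to a classifier score.
    A score at or above `cutoff` is labeled "high" (assumed convention).
    """
    category = max(scores, key=scores.get)
    score = scores[category]
    confidence = "high" if score >= cutoff else "low"
    return category, score, confidence


# Hypothetical scores for one uncharacterized gene family.
example_scores = {
    "Prokaryotic defense system": 0.82,
    "Bacterial motility proteins": 0.07,
    "Transporters": 0.11,
}
print(label_prediction(example_scores, cutoff=0.5))
```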
Upon selecting the NEIGHBORS tab, a graph displaying the ten closest gene families will be presented. Clicking on a specific neighbor within the graph will trigger a zoom-in effect on the corresponding neighboring genes, providing a more detailed view.
Upon selecting the FUNC PRED tab, a graph displaying the prediction score per inspected category will be presented.
In the TAX MAP tab, a graph showcases the taxonomic distribution at the order level for genes obtained from identified organisms within the database. The top 10 orders are displayed, while the remaining categories are consolidated under 'other.' Additionally, the percentage of genes mapped to a known taxonomy within the entire gene family is provided.
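The summary described for the TAX MAP tab can be sketched as follows. This is an illustration rather than the server's code: the function name and the input format (one order name per gene, with `None` for genes that have no known taxonomy) are assumptions.

```python
from collections import Counter


def order_distribution(orders, top_n=10):
    """Summarize per-gene taxonomic orders.

    Returns (distribution, percent_mapped): counts for the `top_n` most common
    orders with the remainder grouped under "other", and the percentage of
    genes in the family that map to a known taxonomy.
    """
    known = [order for order in orders if order is not None]
    top = Counter(known).most_common(top_n)
    distribution = dict(top)
    remainder = len(known) - sum(count for _, count in top)
    if remainder:
        distribution["other"] = remainder
    percent_mapped = 100 * len(known) / len(orders) if orders else 0.0
    return distribution, percent_mapped


# Hypothetical gene family: 5 genes with a known order, 5 without.
family = ["Enterobacterales"] * 3 + ["Bacillales"] * 2 + [None] * 5
print(order_distribution(family, top_n=1))
```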
You can download all the results using the designated download button, including gene family sequences, neighbor information, and predictions. Note that the function prediction feature (FUNC PRED) is deactivated for known genes, and the taxonomic mapping (TAX MAP) is available only for genes that appear in the WGS Genomes dataset.
This mode allows users to submit a sequence query, either as a FASTA file or by directly pasting a protein sequence (with a preceding > header line).
The web server will provide the predictions for a specific gene or set of genes.
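To check a query locally before submitting it, a small parser along these lines can confirm that every sequence is preceded by a > header line. This is only a sketch of the expected input shape; the function name and the example sequences are made up, and the server's own validation may differ.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up protein fragments, used only to show the format.
query = """>my_protein hypothetical example
mkvlaa
GTWQ
>second
MSTN
"""
for name, sequence in parse_fasta(query):
    print(name, len(sequence))
```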
The results will be accessible for download and will be displayed in the information bar on the left. This will encompass essential information about each gene family, along with details regarding the quality of the hit when mapping a sequence to a model word in our database. It is important to note that we only support sequences that exhibit a substantial hit to our database. Sequences that are rare or do not demonstrate a significant hit may not be linked to our resources, as the model was not trained on them.