This repo was branched from the CAFA-evaluator and additional functionality was added. Visit the original CAFA-evaluator Wiki for more information about the algorithm.
The two main new functionalities are:
-
Flexible terms-of-interest A file can be passed in with a list of ontology terms that will be evaluated for all proteins. Only terms included in the file will be evaluated. This can be used to exclude terms that were added to the ontology since predictions were collected or to exclude terms that have been deemed uninformative or for whatever reason should not be included in evaluation.
In the figure below, the terms of interest are shown with a red outline. These terms will be evaluated for all proteins for which there are annotations in the ground truth file.
-
Protein-specific known annotations To evaluate under the "Partial Knowledge" evaluation setting, any annotations known previous to prediction should be excluded from evaluation. This exclusion is done similarly to the terms-of-interst, but for each individual protein.
In the figure below, the annotations are shown for a single protein. Terms with annotations known before the prediction phase are shown in yellow in the Cellular Component and Molecular Function aspects of the Gene Ontology. These terms should be excluded from evaluation. New annotations are shown to the right in blue. Evaluation should happen only for these terms. If the newly annotated terms do not appear in the terms-of-interest file, they will not be evaluated.
A known-annotations file can be passed in for any evaluation setting and will not have and effect in No-Knowledge or Limited-Knowledge evaluation. For example in the figure below, if a known-annotations file is passed in with the terms in yellow listed, the Limited-Knowlege evaluation of the Biological Process (BP) aspect will produce the same results as if the file was not passed in at all. This is because there are no known terms in BP for this process and they will not appear in the known annotations file.
Please cite the original CAFA Evaluator and, once available, the forthcoming CAFA 5 papers if you use this code in published research
CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods
D Piovesan, D Zago, P Joshi, MC De Paolis Kaluza, M Mehdiabadi, R Ramola, AM Monzon, W Reade, I Friedberg, P Radivojac, SCE Tosatto
Bioinformatics Advances (2024) - DOI: 10.1093/bioadv/vbae043
Crowdsourcing the fifth critical assessment of protein function annotation algorithms (CAFA 5) yields improvement in protein function prediction
TBD
TBD - DOI: TBD
The program can be executing the command line interface or as a library. Both the command line and the library accept the following positional arguments:
-
Ontology file in OBO format
-
NEW: Terms of Interst file contains term names that appear in the OBO file to be included in evaluation
-
NEW: Known annotations file contains annotations known before the evaluation phase
-
Prediction folder contain prediction files. Files can be organized into sub-folders, sub-folders are processed recursively and the sub-folder name is used as prefix for the method name
-
Ground truth file containing targets and associated ontology terms
Example input files are provided inside the data/example folder.
When executed from the command line the script logs information about the calculation in the console (standard error) and
will create a folder named results containing the evaluation results.
A different folder can be specified using the -out_dir option.
The original CAFA-evaluator functionality works as before without additional imput arguments
python3 /path/to/CAFA-evaluator/src/cafaeval/__main__.py ontology_file prediction_folder ground_truth_file
NEW: By default, all terms in the ontology will be considered in the evaluation. To include only specific terms, provide a terms-of-interest file
python3 /path/to/CAFA-evaluator/src/cafaeval/__main__.py ontology_file prediction_folder ground_truth_file -toi terms_of_interest_file
NEW: To evaluate Partial Knowledge annotations, the known annotations file must be passed in with the -known option.
python3 /path/to/CAFA-evaluator/src/cafaeval/__main__.py ontology_file prediction_folder ground_truth_file -known known_annotations_file
NEW: You can pass in both terms-of-interest and known annotations:
python3 /path/to/CAFA-evaluator/src/cafaeval/__main__.py ontology_file prediction_folder ground_truth_file -toi terms_of_interest_file -known known_annotations_file
Prediction file - Tab separated file with the target ID, term ID and score columns.
A0A0A6YY25 GO:0010468 3.6396e-05
O54963 GO:0010033 0.0617
O54963 GO:1990841 0.035
X1WHY6 GO:0010033 0.1352
X1WHY6 GO:0048731 0.0478
...Ground truth file - Tab separated file with the target ID and term ID. Additional columns are discarded.
A0A0A6YY25 GO:0009892
A0A0A6YY25 GO:0010468
O54963 GO:1990841
O54963 GO:0010033
O88898 GO:1990841
X1WHY6 GO:0048609
X1WHY6 GO:0048731
...
Information accretion file (optional) - If not provided, the weighted and S statistics are not generated. Information accretion (IA) can be calculated as described in Wyatt and Radivojac, Bioinformatics, 2013 and implemented in https://github.com/claradepaolis/InformationAccretion
GO:0009892 0.3286
GO:0010468 0.0184
GO:0008150 0
GO:1990841 3.1613
...
NEW: Known annotations (optional) - Tab separated file with target ID and term ID. File containing known annotations to exclude from partial-knowledge evaluation. If not provided, all terms will be used in evaluation
A0A009IHW8 GO:0072523
A0A009IHW8 GO:0046700
O54963 GO:0045596
O54963 GO:2000064
O54963 GO:0062014
...
NEW: Terms of Interest (optional) - File with term ID to include in evaluation for all proteins, one ID per line.
If not provided, all terms will be used in evaluation.
This file is used to specify terms that will be evaluated, usually used to exclude terms in the ontology that have since
been obsoleted or are not of interest for the evaluation.
GO:0055039
GO:0072523
GO:0003882
GO:0010139
...
Output files are generated in the results folder. The same files are gerated by both
the command line and the write_results function.
evaluation_all.tsvcorresponds to the first object returned by thecafa_evalfunction.evaluation_best_< metric >.tsvcorresponds to the second object returned by thecafa_evalfunction. A different file for each metric is created.
Note: Weighted scores are generated only if the Information Accretion file is provided.
| Argument | Default value | Description |
|---|---|---|
| -out_dir | 'results' | Output directory (tsv files + log). Either relative to current path or absolute |
| -ia | Information accretion file | |
| -known | Known annotations for each protein that should be excluded in partial-knowledge evaluation | |
| -toi | Terms of interest file (terms considered for all proteins) | |
| -no_orphans | False (flag) | Exclude orphans nodes (e.g. roots) from the calculation |
| -norm | 'cafa' | Normalization strategy. cafa normalize precision by the number of predicted targets and recall by the number of targets in the ground truth. pred normalize by the number of predicted targets. gt normalize by the number of ground truth proteins |
| -prop | 'max' | Ancestor propagation strategy. max propagate the max score of the traversed subgraph iteratively. fill propagate with max until a different score is found |
| -th_step | 0.01 | Step size of prediction score thresholds to consider in the range [0, 1). A smaller step, means more calculation |
| -max_terms | Number of terms for protein and namespace to consider in the evaluation. Parsing stops when the target limit for every namespace is reached. The score is not checked, meaning that terms are not sorted before the check, and the check is performed before propagation. | |
| -threads | 4 | Parallel threads. 0 means use all available CPU threads. Do not use multi thread if you are short in memory |
