The current description of the output format contains a lot of ambiguities and several algorithms interpret them in different ways.
A non-exhaustive list:
aa_scores
- biatNovo-DDA: a string of comma-separated negative float values with two decimal digits:
  "-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"
- pepnet: space-separated positive float values with six decimal digits between square brackets:
  [0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
- pi-HelixNovo: pipe-separated positive float values with two decimal digits:
  0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31
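Until the spec pins down one encoding, a downstream consumer could normalize all three observed formats into a plain list of floats. This is only a sketch based on the examples above; `parse_aa_scores` is a hypothetical helper, not part of the codebase:

```python
def parse_aa_scores(raw: str) -> list[float]:
    """Parse any of the three aa_scores encodings seen in the wild.

    Handles (assumed from the examples in this issue):
    - comma-separated values:        "-3.20,-3.77,..."   (biatNovo-DDA)
    - space-separated in brackets:   "[0.164515 ...]"    (pepnet)
    - pipe-separated values:         "0.03|0.03|..."     (pi-HelixNovo)
    """
    raw = raw.strip().strip('"')
    if raw.startswith('[') and raw.endswith(']'):
        # pepnet style: drop the brackets, split on whitespace
        return [float(s) for s in raw[1:-1].split()]
    if '|' in raw:
        # pi-HelixNovo style
        return [float(s) for s in raw.split('|')]
    # biatNovo-DDA style (comma-separated)
    return [float(s) for s in raw.split(',')]
```

A normalizer like this only papers over the ambiguity for evaluation; the real fix is still a single mandated encoding in the spec.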
spectrum_id
- pepnet: includes the subfolder and file extension:
  9_species_human/151009_exo4_1.mgf:0
- all the other algorithms: filename without file extension:
  151009_exo4_1:0
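The pepnet variant could be reduced to the bare `filename:index` form the other algorithms emit. A minimal sketch, assuming the index is always the final colon-separated field (`normalize_spectrum_id` is a hypothetical helper name):

```python
import os


def normalize_spectrum_id(spectrum_id: str) -> str:
    """Reduce e.g. '9_species_human/151009_exo4_1.mgf:0' to '151009_exo4_1:0'.

    Already-normalized ids pass through unchanged.
    """
    # Split off the trailing spectrum index at the last colon.
    path_part, _, index = spectrum_id.rpartition(':')
    # Drop any subfolder and the file extension from the path part.
    stem = os.path.splitext(os.path.basename(path_part))[0]
    return f"{stem}:{index}"
```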
sequence
- biatNovo-DDA: sometimes has nan as the sequence:
  151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
score
- biatNovo-DDA: sometimes has no score value:
  151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
- biatNovo-DDA and casanovo use negative score values; instanovo, pepnet, and pi-HelixNovo use positive values. The spec should clarify whether score means a log probability or a confidence score.
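If the spec decides that the negative values are natural-log probabilities and the positive values are already confidences, the two conventions could be mapped onto a common [0, 1] scale. This is only one possible reading of the ambiguity, sketched here with a hypothetical `as_confidence` helper:

```python
import math


def as_confidence(score: float) -> float:
    """Map a raw model score to a [0, 1] confidence.

    Assumption (not settled by the spec): negative values are natural-log
    probabilities, non-negative values are already confidences.
    """
    return math.exp(score) if score < 0 else score
```

Note that this mapping cannot distinguish a log probability of 0.0 (i.e. probability 1) from a confidence of 0.0, which is another reason the spec should mandate a single convention rather than leave it to post-hoc conversion.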
Are extra columns allowed?
- biatNovo-DDA:
  spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max
- casanovo:
  sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores
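If the answer turns out to be "extra columns are tolerated but ignored", the evaluation side could project each output file down to just the required columns before validating. A sketch under that assumption (`keep_spec_columns` and the exact required set are taken from this issue, not from any existing code):

```python
import csv
import io

REQUIRED = ["spectrum_id", "sequence", "score", "aa_scores"]


def keep_spec_columns(csv_text: str) -> str:
    """Project a CSV with extra algorithm-specific columns down to the
    four columns the spec asks for, preserving row order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=REQUIRED)
    writer.writeheader()
    for row in reader:
        writer.writerow({c: row[c] for c in REQUIRED})
    return out.getvalue()
```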
Ideally, a validator would be added to the codebase that checks the output file and complains if output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or to give wrong results.
Something like:
```python
import csv
import os
import re
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class CSVRow(BaseModel):
    sequence: str
    score: float
    aa_scores: str
    spectrum_id: str

    @field_validator('sequence')
    @classmethod
    def validate_sequence(cls, v: str) -> str:
        valid_aa = set("GASPVTCLINDQKEMHFRYW")
        # Strip UNIMOD modification tags before checking the residues.
        stripped_seq = re.sub(r'\[UNIMOD:\d+\]', '', v)
        if not all(aa in valid_aa for aa in stripped_seq):
            invalid_aa = set(stripped_seq) - valid_aa
            raise ValueError(f"Invalid amino acid(s) {', '.join(invalid_aa)} in sequence {v}")
        return v

    @field_validator('score')
    @classmethod
    def validate_score(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("Score must be between 0 and 1")
        return v

    @field_validator('aa_scores')
    @classmethod
    def validate_aa_scores(cls, v: str) -> str:
        try:
            [float(score) for score in v.split(',')]
        except ValueError:
            raise ValueError("Invalid aa_scores format. Must be a string of comma-separated floats.")
        return v

    @field_validator('spectrum_id')
    @classmethod
    def validate_spectrum_id(cls, v: str) -> str:
        if '/' in v:
            raise ValueError("spectrum_id cannot contain forward slashes")
        if not re.match(r'^.+:\d+$', v):
            raise ValueError("Invalid spectrum_id format. Must be in the format 'filename:index'")
        return v


def validate_csv(file_path: str) -> List[CSVRow]:
    validated_rows = []
    with open(file_path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        if set(reader.fieldnames or []) != {'sequence', 'score', 'aa_scores', 'spectrum_id'}:
            raise ValueError("CSV file must contain columns: sequence, score, aa_scores, spectrum_id")
        for row in reader:
            try:
                validated_rows.append(CSVRow(**row))
            except ValidationError as e:
                print(f"Validation error in row: {row}")
                print(e)
    return validated_rows


# Example usage
if __name__ == "__main__":
    output_dir = "outputs/9_species_human"
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            validated_data = validate_csv(file_path)
            print(f"Successfully validated {len(validated_data)} rows in {filename}")
        except ValueError as e:
            print(f"Validation failed for {filename}: {str(e)}")
```