Output file format not strictly enough defined. #33

@BioGeek

The current description of the output format is ambiguous in several places, and different algorithms interpret it in different ways.

A non-exhaustive list:

aa_scores

  • biatNovo-DDA: a string of comma-separated negative float values with two decimal digits: "-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"
  • pepnet: space-separated positive float values with six decimal digits, enclosed in square brackets:
    [0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
  • pi-HelixNovo: pipe-separated positive float values with up to two decimal digits:
    0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31
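
To illustrate how much slack a reader currently needs, here is a minimal sketch of a tolerant parser that accepts all three aa_scores styles listed above. The helper name `parse_aa_scores` is hypothetical and not part of the codebase; a stricter spec would make it unnecessary.

```python
import re
from typing import List


def parse_aa_scores(raw: str) -> List[float]:
    """Parse an aa_scores string in any of the three observed styles.

    Hypothetical helper: accepts comma-, space- or pipe-separated floats,
    optionally wrapped in square brackets.
    """
    # Drop surrounding whitespace and any leading/trailing square brackets.
    body = raw.strip().strip("[]")
    # Split on commas, pipes, or runs of whitespace; ignore empty tokens.
    tokens = [t for t in re.split(r"[,|\s]+", body) if t]
    # float() raises ValueError on anything that is not a number.
    return [float(t) for t in tokens]


print(parse_aa_scores("-3.20,-3.77,-4.74"))    # biatNovo-DDA style
print(parse_aa_scores("[0.164515 0.235235]"))  # pepnet style
print(parse_aa_scores("0.03|1.0|0.67"))        # pi-HelixNovo style
```

This only unifies the separators. It says nothing about whether the values are log probabilities or confidences, which the spec still has to settle.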

spectrum_id

  • pepnet: includes the subfolder and file extension: 9_species_human/151009_exo4_1.mgf:0
  • all the other algorithms: filename without file extension: 151009_exo4_1:0
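
A rough sketch of how the pepnet-style identifier could be mapped onto the form the other algorithms use, assuming the intended format is `filename:index` without a folder or extension. The function name `normalize_spectrum_id` is made up for this example.

```python
import posixpath


def normalize_spectrum_id(raw: str) -> str:
    """Reduce 'folder/file.ext:index' to 'file:index' (hypothetical helper)."""
    # Split off the trailing index after the last colon.
    path, sep, index = raw.rpartition(":")
    if not sep or not index.isdigit():
        raise ValueError(f"expected 'filename:index', got {raw!r}")
    # Keep only the final path component, then drop its extension.
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    if not stem:
        raise ValueError(f"empty filename in {raw!r}")
    return f"{stem}:{index}"


print(normalize_spectrum_id("9_species_human/151009_exo4_1.mgf:0"))  # 151009_exo4_1:0
print(normalize_spectrum_id("151009_exo4_1:0"))                      # 151009_exo4_1:0
```

One caveat: a filename that legitimately contains a dot would lose everything after the last dot, so this is a workaround rather than a substitute for fixing the spec.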

sequence

  • biatNovo-DDA: sometimes has nan as sequence:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,

score

  • biatNovo-DDA: sometimes has no score value:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
  • biatNovo-DDA and casanovo use negative score values, whereas instanovo, pepnet and pi-HelixNovo use positive values. The spec should clarify whether score means a log probability or a confidence score.
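
The rows with a nan sequence or an empty score can be detected with a few lines. The sketch below assumes the example row follows the biatNovo-DDA header listed in this issue; `is_missing` and `find_incomplete_rows` are hypothetical names.

```python
import csv
import io
from typing import List, Optional


def is_missing(value: Optional[str]) -> bool:
    """True for None, empty strings, and the literal lowercase text 'nan'.

    The comparison is case-sensitive on purpose: 'NAN' is a valid peptide
    (Asn-Ala-Asn) and must not be treated as missing.
    """
    text = (value or "").strip()
    return text == "" or text == "nan"


def find_incomplete_rows(csv_text: str) -> List[str]:
    """Return the spectrum_id of every row lacking a sequence or a score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row.get("spectrum_id") or ""
        for row in reader
        if is_missing(row.get("sequence")) or is_missing(row.get("score"))
    ]


# Header and row as reported for biatNovo-DDA in this issue.
sample = (
    "spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,"
    "precursor_charge,protein_access_id,scan_list_middle,"
    "scan_list_original,predicted_score_max\n"
    "151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,"
    "151009_exo4_1:11001,151009_exo4_1:11001,\n"
)
print(find_incomplete_rows(sample))  # ['151009_exo4_1:11001']
```

Whether such rows should be rejected, skipped, or counted as wrong predictions is a decision for the spec; the sketch only finds them.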

Are extra columns allowed?

  • biatNovo-DDA: spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max
  • casanovo: sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores
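
If extra columns turn out to be allowed, the header check could require the four core columns and merely report the rest. A minimal sketch, assuming the required set is the one used by the proposed validator further down; `check_columns` is a hypothetical name.

```python
from typing import Iterable, List, Optional, Tuple

# The four columns the proposed validator treats as required.
REQUIRED_COLUMNS = ("sequence", "score", "aa_scores", "spectrum_id")


def check_columns(
    fieldnames: Optional[Iterable[str]],
) -> Tuple[List[str], List[str]]:
    """Return (missing required columns, extra columns) for a CSV header."""
    present = set(fieldnames or [])
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    extra = sorted(present - set(REQUIRED_COLUMNS))
    return missing, extra


# The casanovo header quoted above.
casanovo_header = (
    "sequence,PSM_ID,accession,unique,database,database_version,"
    "search_engine,score,modifications,retention_time,charge,"
    "exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,"
    "start,end,aa_scores"
).split(",")

missing, extra = check_columns(casanovo_header)
print("missing:", missing)  # missing: []
print("extra columns:", len(extra))  # extra columns: 15
```

A strict spec would instead fail whenever `extra` is non-empty. Either choice works as long as it is written down.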

Ideally, a validator would be added to the codebase that checks the output file and complains if output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Something like:

import csv
import os
import re
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class CSVRow(BaseModel):
    sequence: str
    score: float
    aa_scores: str
    spectrum_id: str

    @field_validator('sequence')
    @classmethod
    def validate_sequence(cls, v: str) -> str:
        valid_aa = set("GASPVTCLINDQKEMHFRYW")
        
        stripped_seq = re.sub(r'\[UNIMOD:\d+\]', '', v)
        
        if not all(aa in valid_aa for aa in stripped_seq):
            invalid_aa = set(stripped_seq) - valid_aa
            raise ValueError(f"Invalid amino acid(s) {', '.join(invalid_aa)} in sequence {v}")
        
        return v
    
    @field_validator('score')
    @classmethod
    def validate_score(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("Score must be between 0 and 1")
        return v

    @field_validator('aa_scores')
    @classmethod
    def validate_aa_scores(cls, v: str) -> str:
        # Every comma-separated token must parse as a float.
        try:
            for score in v.split(','):
                float(score)
        except ValueError:
            raise ValueError("Invalid aa_scores format. Must be a string of comma-separated floats.")

        return v

    @field_validator('spectrum_id')
    @classmethod
    def validate_spectrum_id(cls, v: str) -> str:
        if '/' in v:
            raise ValueError("spectrum_id cannot contain forward slashes")
        if not re.match(r'^.+:\d+$', v):
            raise ValueError("Invalid spectrum_id format. Must be in the format 'filename:index'")
        return v


def validate_csv(file_path: str) -> List[CSVRow]:
    validated_rows = []
    
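    # newline='' is what the csv module recommends for files passed to a reader.
    with open(file_path, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)

        # fieldnames is None for an empty file; treat that as "no columns".
        if set(reader.fieldnames or []) != {'sequence', 'score', 'aa_scores', 'spectrum_id'}:
            raise ValueError("CSV file must contain exactly the columns: sequence, score, aa_scores, spectrum_id")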

        for row in reader:
            try:
                validated_row = CSVRow(**row)
                validated_rows.append(validated_row)
            except ValidationError as e:
                print(f"Validation error in row: {row}")
                print(e)
    
    return validated_rows

# Example usage
if __name__ == "__main__":
    output_dir = "outputs/9_species_human"
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            validated_data = validate_csv(file_path)
            print(f"Successfully validated {len(validated_data)} rows in {filename}")
        except ValueError as e:
            print(f"Validation failed for {filename}: {str(e)}")
