Output file format not strictly enough defined.

The current description of the output format is ambiguous in several places, and the algorithms resolve those ambiguities in different ways.

A non-exhaustive list:

 `aa_scores`
---
  * biatNovo-DDA: a string of comma-separated negative float values with two decimal digits: `"-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"`
  * pepnet: space-separated positive float values with six decimal digits, enclosed in square brackets:
  `[0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]`
  * pi-HelixNovo: pipe-separated positive float values with up to two decimal digits:
  `0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31` 
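
Until the spec settles on one separator, the evaluation code could accept all three variants listed above. The following is only a minimal sketch: `parse_aa_scores` is a hypothetical helper, not something that exists in the codebase, and it assumes that commas, pipes and whitespace never occur inside a single value.

```python
import re
from typing import List


def parse_aa_scores(raw: str) -> List[float]:
    """Parse an aa_scores cell written with comma, space or pipe separators.

    Hypothetical helper for illustration; not part of the existing codebase.
    """
    # Drop surrounding whitespace, double quotes and square brackets.
    cleaned = raw.strip().strip('"').strip("[]").strip()
    if not cleaned:
        return []
    # Split on any run of commas, pipes or whitespace.
    return [float(token) for token in re.split(r"[,|\s]+", cleaned)]


if __name__ == "__main__":
    # Shortened versions of the three examples above.
    print(parse_aa_scores('"-3.20,-3.77,-4.74"'))
    print(parse_aa_scores("[0.164515 0.235235 0.218719]"))
    print(parse_aa_scores("0.03|1.0|0.67"))
```

A tolerant parser like this only hides the problem on the reading side; the spec would still need to name one canonical format for writers.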
  
 `spectrum_id`
---
  * pepnet: includes the subfolder and file extension: `9_species_human/151009_exo4_1.mgf:0`
  * all the other algorithms: filename without file extension: `151009_exo4_1:0` 
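
If the spec adopts the `filename:index` form used by most algorithms, the pepnet-style IDs could be normalized along these lines. This is a sketch under assumptions: `normalize_spectrum_id` is a hypothetical name, and it assumes the index follows the last colon and that the file name itself contains no dots other than the one before the extension.

```python
from pathlib import PurePosixPath


def normalize_spectrum_id(raw: str) -> str:
    """Reduce 'subfolder/file.ext:index' to 'file:index'.

    Hypothetical helper for illustration; not part of the existing codebase.
    """
    # Split off the index after the last colon.
    file_part, sep, index = raw.rpartition(":")
    if not sep or not index.isdigit():
        raise ValueError(f"Expected 'filename:index', got {raw!r}")
    # .stem drops any leading directories and the final file extension.
    return f"{PurePosixPath(file_part).stem}:{index}"


if __name__ == "__main__":
    print(normalize_spectrum_id("9_species_human/151009_exo4_1.mgf:0"))
    print(normalize_spectrum_id("151009_exo4_1:0"))
```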

`sequence`
---
* biatNovo-DDA: sometimes has `nan` as sequence:
`151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,`

`score`
---
* biatNovo-DDA: sometimes has no score value:
`151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,`
* biatNovo-DDA and casanovo use negative score values, whereas instanovo, pepnet and pi-HelixNovo use positive values. The spec should therefore clarify whether `score` is a log probability or a confidence score.
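
Rows like the biatNovo-DDA example above, with `nan` as the sequence and an empty score, could be flagged before evaluation. The sketch below pairs that example row with the biatNovo-DDA header quoted in the extra-columns section; `is_missing_prediction` is a hypothetical helper, and treating such rows as "no prediction" is an assumption, since the spec does not yet say how they should be handled.

```python
import csv
import io
import math

# Header and row copied from the biatNovo-DDA output quoted in this issue.
HEADER = (
    "spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,"
    "precursor_charge,protein_access_id,scan_list_middle,"
    "scan_list_original,predicted_score_max"
)
ROW = (
    "151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,"
    "151009_exo4_1:11001,151009_exo4_1:11001,"
)


def is_missing_prediction(row: dict) -> bool:
    """Return True if the row has no usable sequence or score (hypothetical helper)."""
    sequence = (row.get("sequence") or "").strip()
    score = (row.get("score") or "").strip()
    # An empty or literal 'nan' sequence counts as missing.
    if sequence == "" or sequence.lower() == "nan":
        return True
    if score == "":
        return True
    try:
        return math.isnan(float(score))
    except ValueError:
        # A score that cannot be parsed as a float is also unusable.
        return True


rows = list(csv.DictReader(io.StringIO(HEADER + "\n" + ROW + "\n")))

if __name__ == "__main__":
    for row in rows:
        print(row["spectrum_id"], is_missing_prediction(row))
```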
  
Are extra columns allowed? 
  ---
* biatNovo-DDA: `spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max`
* casanovo: `sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores`
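
One way to answer this question is to require the four core columns and merely report everything else. The sketch below assumes that the required set is `sequence`, `score`, `aa_scores` and `spectrum_id`; `check_columns` is a hypothetical helper, and whether extra columns should be tolerated is precisely what the spec still needs to decide.

```python
from typing import List, Tuple

# Assumed set of required columns; the spec would have to confirm this.
REQUIRED = {"sequence", "score", "aa_scores", "spectrum_id"}


def check_columns(header_line: str) -> Tuple[List[str], List[str]]:
    """Return (missing required columns, extra columns) for a CSV header line."""
    columns = [name.strip() for name in header_line.split(",")]
    missing = sorted(REQUIRED - set(columns))
    extra = [name for name in columns if name not in REQUIRED]
    return missing, extra


if __name__ == "__main__":
    casanovo_header = (
        "sequence,PSM_ID,accession,unique,database,database_version,"
        "search_engine,score,modifications,retention_time,charge,"
        "exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,"
        "start,end,aa_scores"
    )
    missing, extra = check_columns(casanovo_header)
    print("missing:", missing)
    print("extra:", extra)
```

With this approach both headers above would pass the required-column check, and the extra columns would only be listed as a warning.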




Ideally, a validator would be added to the codebase that checks each output file and complains if the `output.csv` is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Something like:

```python
import csv
from typing import List
from pydantic import BaseModel, field_validator, ValidationError
import re
import os


class CSVRow(BaseModel):
    sequence: str
    score: float
    aa_scores: str
    spectrum_id: str

    @field_validator('sequence')
    @classmethod
    def validate_sequence(cls, v: str) -> str:
        valid_aa = set("GASPVTCLINDQKEMHFRYW")
        
        stripped_seq = re.sub(r'\[UNIMOD:\d+\]', '', v)
        
        if not all(aa in valid_aa for aa in stripped_seq):
            invalid_aa = set(stripped_seq) - valid_aa
            raise ValueError(f"Invalid amino acid(s) {', '.join(invalid_aa)} in sequence {v}")
        
        return v
    
    @field_validator('score')
    @classmethod
    def validate_score(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("Score must be between 0 and 1")
        return v

    @field_validator('aa_scores')
    @classmethod
    def validate_aa_scores(cls, v: str) -> str:
        try:
            # Parse each entry to make sure it is a valid float
            for score in v.split(','):
                float(score)
        except ValueError:
            raise ValueError("Invalid aa_scores format. Must be a string of comma-separated floats.")
        
        return v

    @field_validator('spectrum_id')
    @classmethod
    def validate_spectrum_id(cls, v: str) -> str:
        if '/' in v:
            raise ValueError("spectrum_id cannot contain forward slashes")
        if not re.match(r'^.+:\d+$', v):
            raise ValueError("Invalid spectrum_id format. Must be in the format 'filename:index'")
        return v


def validate_csv(file_path: str) -> List[CSVRow]:
    validated_rows = []
    
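    # newline='' is what the csv module recommends when opening files
    with open(file_path, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        
        # fieldnames is None for an empty file, so fall back to an empty list
        if set(reader.fieldnames or []) != {'sequence', 'score', 'aa_scores', 'spectrum_id'}:
            raise ValueError("CSV file must contain columns: sequence, score, aa_scores, spectrum_id")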

        for row in reader:
            try:
                validated_row = CSVRow(**row)
                validated_rows.append(validated_row)
            except ValidationError as e:
                print(f"Validation error in row: {row}")
                print(e)
    
    return validated_rows

# Example usage
if __name__ == "__main__":
    output_dir = "outputs/9_species_human"
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            validated_data = validate_csv(file_path)
            print(f"Successfully validated {len(validated_data)} rows in {filename}")
        except ValueError as e:
            print(f"Validation failed for {filename}: {str(e)}")
```

