Skip to content

Conversation

@dmartmillan
Copy link
Collaborator

@dmartmillan dmartmillan commented May 20, 2025

Be able to parse any tabular, coma and semicolon file.

I decided not to set [\t,;] as the DELIMITER_DEFAULT because it can lead to incorrect parsing of lines where fields are separated by tabs, but commas or semicolons appear inside the fields, like in this example line:

CCDC66	0	.	GRCh37	3	56627026	56627026	+	Missense_Mutation	SNP	G	G	A	novel	.	TCGA-AB-2854-03B-01W-0728-08	TCGA-AB-2854-11B-01W-0729-08	G	G	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	c.965G>A	p.Ser322Asn	p.S322N	ENST00000394672	8/18	310	299	11	245	245	0	CCDC66,missense_variant,p.Ser322Asn,ENST00000394672,;CCDC66,missense_variant,p.Ser288Asn,ENST00000326595,;CCDC66,missense_variant,p.Ser322Asn,ENST00000436465,;CCDC66,intron_variant,,ENST00000422222,;CCDC66,non_coding_transcript_exon_variant,,ENST00000484623,;CCDC66,3_prime_UTR_variant,,ENST00000471681,;CCDC66,3_prime_UTR_variant,,ENST00000341455,;CCDC66,non_coding_transcript_exon_variant,,ENST00000494672,;CCDC66,upstream_gene_variant,,ENST00000468108,;	AENSG00000180376	ENST00000394672	Transcript	missense_variant	1035	965	322	S/N	aGt/aAt	.	.	.	1	CCDC66	HGNC	27709	protein_coding	YES	CCDS46852.1	ENSP00000378167	CCD66_HUMAN	F8WCY0_HUMAN	UPI000020ADBC	.	tolerated(0.11)	benign(0.046)	8/18	.	hmmpanther:PTHR22736:SF1,hmmpanther:PTHR22736	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	MODERATE	.	SNV	.	.	...	.	.	.	.	.	.	.	.	wga	NONE	MUTECT|MUSE	TTTCAGTGCTG	.	2

Now, the first line (header) of each file is checked to determine which delimiter will work, using the following function located in openvariant/variant/variant.py:

def _detect_delimiter(line: str):
    """Detects the dominant delimiter in a line"""
    counts = {
        '\t': line.count('\t'),
        ',': line.count(','),
        ';': line.count(';')
    }
    return max(counts, key=counts.get)

As well, user can determine its own delimiter on annotation file:
delimiter: '[\t,-]'
or
delimiter: \t
or
delimiter: \t,-;

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

@dmartmillan dmartmillan self-assigned this May 20, 2025
@dmartmillan
Copy link
Collaborator Author

Not merge yet!
Need to add some docs.

@dmartmillan dmartmillan added bug Something isn't working documentation Improvements or additions to documentation labels May 20, 2025
@dmartmillan dmartmillan linked an issue May 20, 2025 that may be closed by this pull request
@FedericaBrando
Copy link
Member

FedericaBrando commented May 21, 2025

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

I haven't looked at the code yet, but my fear with this approach is that, for example, a vcf with few columns (say CHR POS REF ALT STRAND) and a lot of other fields in the INFO separated by ;or ,' would results in parsing the INFO column rather than the \t ones.

One example is VCF output of VEP, which condenses every possible annotation of a single variant into the INFO column.

@dmartmillan
Copy link
Collaborator Author

dmartmillan commented May 21, 2025

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

I haven't looked at the code yet, but my fear with this approach is that, for example, a vcf with few columns (say CHR POS REF ALT STRAND) and a lot of other fields in the INFO separated by ;or ,' would results in parsing the INFO column rather than the \t ones.

One example is VCF output of VEP, which condenses every possible annotation of a single variant into the INFO column.

The delimiter is determined only based on the header (the first line) of the file. For example, if the header is tab-separated, the entire file is expected to use tabs as the delimiter.

There are two important considerations:

  • Every file must include a header.
  • The scenario you mentioned should not occur in the header.

@FedericaBrando FedericaBrando requested a review from Copilot May 21, 2025 15:32
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the delimiter configuration and detection logic for parsing annotation files, allowing the parsing of files with tab, comma, or semicolon delimiters without extra configuration. Key changes include:

  • Removing the AnnotationDelimiter enum and DEFAULT_DELIMITER from the configuration.
  • Updating the delimiter detection logic in both the parser and annotation initialization.
  • Adjusting unit tests to reflect the updated delimiter configuration.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/test_annotation/test_annotation.py Removed tests for invalid/non-existent delimiter and added a debug print
tests/data/annotation/invalid_delimiter.yaml Removed invalid example configuration
tests/data/annotation/annotation.yaml Updated the delimiter setting to use a literal tab character
openvariant/variant/variant.py Introduced the _detect_delimiter function and updated parsing logic
openvariant/annotation/config_annotation.py Removed the AnnotationDelimiter enum and DEFAULT_DELIMITER constant
openvariant/annotation/annotation.py Updated delimiter handling in initialization and key validation
Comments suppressed due to low confidence (2)

tests/test_annotation/test_annotation.py:55

  • Consider removing the debug print statement from the unit test to keep test output clean.
print(annotation.delimiter, '	')

openvariant/annotation/annotation.py:38

  • The removal of the delimiter value validation (previously checking allowed values) may allow invalid delimiters to pass unnoticed. Consider reintroducing validation to ensure the provided delimiter meets expected patterns.
if AnnotationGeneralKeys.DELIMITER.value in annot and not isinstance(annot[AnnotationGeneralKeys.DELIMITER.value], str):

Copy link
Member

@FedericaBrando FedericaBrando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test with a VCF (with commented line ## and # in the header) and as a separator ;. And it failed.

##FORMAT=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##FORMAT=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
#CHROM;POS;ID;REF;ALT;QUAL;FILTER;INFO;FORMAT;23P_5232
chr1;115256528;.;TT;TC;11486.3;PASS;AF=0.331738;AO=1440;DP=4387;FAO=1456;FDP=4389;FDVR=10;FR=.;FRO=2933;FSAF=831;FSAR=625;FSRF=1668;FSRR=1265;FWDB=-0.0187354;FXX=0.00113792;GCM=0;HRUN=2;LEN=3;MLLD=233.749;NID=vc.novel.1;OALT=C;OID=.;OMAPALT=TC;OPOS=115256529;OREF=T;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4682;RBI=0.0202428;REFB=-0.00833745;REVB=0.00766507;RO=2921;SAF=827;SAR=613;SPD=0;SRF=1661;SRR=1260;SSEN=0;SSEP=0;SSSB=0.00680865;STB=0.501391;STBP=0.898;TYPE=snp;UFR=.,.,.,.,unknown,.;VARB=0.016353;VCFALT=TCG;VCFPOS=115256528;VCFREF=TTG;FUNC=[{'origPos':'115256528','origRef':'TT','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'TC','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11486:4387:4389:2921:2933:1440:1456:0.331738:613:827:1661:1260:625:831:1668:1265
chr1;115256529;COSM584;T;C;11439.6;PASS;AF=0.329765;AO=1440;DP=4390;FAO=1448;FDP=4391;FDVR=10;FR=.;FRO=2939;FSAF=827;FSAR=621;FSRF=1673;FSRR=1266;FWDB=-0.0193984;FXX=6.82749E-4;GCM=0;HRUN=2;LEN=1;MLLD=235.553;OALT=TC;OID=COSM584;OMAPALT=C;OPOS=115256528;OREF=TT;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4209;RBI=0.0216282;REFB=-0.00712873;REVB=0.00956471;RO=2925;SAF=827;SAR=613;SPD=0;SRF=1662;SRR=1263;SSEN=0;SSEP=0;SSSB=0.00733417;STB=0.501293;STBP=0.907;TYPE=snp;VARB=0.0138076;VCFALT=CG;VCFPOS=115256529;VCFREF=TG;HS;FUNC=[{'origPos':'115256529','origRef':'T','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'C','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11396:4390:4391:2925:2939:1440:1448:0.329765:613:827:1662:1263:621:827:1673:1266
chr2;29416338;ALK;A;<CNV>;100.0;GAIN;HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=29940583;LEN=524246;NUMTILES=12;SD=0.2;CDF_MAPD=.001:3.89382,0.005:4.64302,0.01:5.01559,0.025:5.57397,0.05:6.06513,0.1:6.64428,0.2:7.36448,0.25:7.64358,0.5:8.8,0.75:10.0086,0.8:10.3168,0.9:11.1443,0.95:11.8457,0.975:12.4675,0.99:13.2067,0.995:13.7201,0.999:14.805;RAW_CN=2.68;REF_CN=2;PVAL=7.983E-6;FD=1.34;CI=0.05:6.06513,0.95:11.8457;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Amplification'}];GT:FD:CN;./.:1.34:8.8
chr2;29455169;ALK;C;.;.;PASS;SVTYPE=RNAExonTiles;FUNCTION=ExpressionImbalance;GENE_NAME=ALK;IMBALANCE_SCORE=4.323;IMBALANCE_PVAL=7.0E-4;BREAKPOINT_RANGE=exon15-exon20;NUM_ISOFORMS_DETECTED=0;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'ExpressionImbalance'}];GT:GQ;./.:.

The way the delimiter function is implemented is that it will only get into the _detect_delimiter() if the l_num == 0 and the delimiter is None. The problem is that the _detect_delimiter() function will return a value even if the count for each of the key is 0. In fact, the tie break of the max function is the first key that checks (in our case \t) [source].

So the l_num condition will be met only on the first line of the file (independent if its a comment or not since the skip is implemented later) and in the first hit it will return a \teven if the count of the separator is 0.

We should detect that the line that we are taking to check the delimiter count is the actual header.

        for l_num, line in enumerate(iter(mm_obj.readline, b'')):
            line = line.decode('utf-8')

            # Skip comments
            if (line.startswith('#') or line.startswith('##') or line.startswith('browser') or
                line.startswith('track')) and not line.startswith('#CHROM'):
                continue

            if delimiter is None:
                delimiter = _detect_delimiter(line)

            row_line = re.split(delimiter, line)
            row_line = list(map(lambda w: w.rstrip("\r\n"), row_line))

            if len(row_line) == 0:
                continue

            yield l_num, row_line

Also, if the count is 0, I would raise an error rather than allow the \t by default.

Other than these comments, I overall like the approach! Thanks a lot David 🚀

Copy link
Member

@CarlosLopezElorduy CarlosLopezElorduy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi David! I agree with @FedericaBrando regarding the order when the _detect_delimiter function is called vs the comment check. Other than that, I added a suggestion on a refactor of the _detect_delimiter function itself by using the csv library, as it already includes a class (Sniffer) with a method designed to handle these type of scenarios, so it could be interesting to add.

Other than that, I also like the approach, so looking forward to this! 💪

dmartmillan and others added 3 commits May 22, 2025 14:24
Co-authored-by: Carlos López-Elorduy <[email protected]>
…rly' of github.com:bbglab/openvariant into 47-openvariant-bug-windows-input-files-not-parsed-properly
@dmartmillan
Copy link
Collaborator Author

Added! if you are okay with that, we can merge.

Copy link
Member

@CarlosLopezElorduy CarlosLopezElorduy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found a problem with the function for single column files. Added a suggestion to catch this problem.

dmartmillan and others added 2 commits May 23, 2025 12:49
Copy link
Member

@CarlosLopezElorduy CarlosLopezElorduy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Member

@FedericaBrando FedericaBrando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@dmartmillan dmartmillan merged commit 8e68a46 into develop May 23, 2025
1 check passed
@dmartmillan dmartmillan deleted the 47-openvariant-bug-windows-input-files-not-parsed-properly branch May 23, 2025 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenVariant Bug | Windows input files not parsed properly

4 participants