OpenVariant parse any delimiter #52

dmartmillan · 2025-05-20T15:09:28Z

Be able to parse any tabular, coma and semicolon file.

I decided not to set [\t,;] as the DELIMITER_DEFAULT because it can lead to incorrect parsing of lines where fields are separated by tabs, but commas or semicolons appear inside the fields, like in this example line:

CCDC66	0	.	GRCh37	3	56627026	56627026	+	Missense_Mutation	SNP	G	G	A	novel	.	TCGA-AB-2854-03B-01W-0728-08	TCGA-AB-2854-11B-01W-0729-08	G	G	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	c.965G>A	p.Ser322Asn	p.S322N	ENST00000394672	8/18	310	299	11	245	245	0	CCDC66,missense_variant,p.Ser322Asn,ENST00000394672,;CCDC66,missense_variant,p.Ser288Asn,ENST00000326595,;CCDC66,missense_variant,p.Ser322Asn,ENST00000436465,;CCDC66,intron_variant,,ENST00000422222,;CCDC66,non_coding_transcript_exon_variant,,ENST00000484623,;CCDC66,3_prime_UTR_variant,,ENST00000471681,;CCDC66,3_prime_UTR_variant,,ENST00000341455,;CCDC66,non_coding_transcript_exon_variant,,ENST00000494672,;CCDC66,upstream_gene_variant,,ENST00000468108,;	AENSG00000180376	ENST00000394672	Transcript	missense_variant	1035	965	322	S/N	aGt/aAt	.	.	.	1	CCDC66	HGNC	27709	protein_coding	YES	CCDS46852.1	ENSP00000378167	CCD66_HUMAN	F8WCY0_HUMAN	UPI000020ADBC	.	tolerated(0.11)	benign(0.046)	8/18	.	hmmpanther:PTHR22736:SF1,hmmpanther:PTHR22736	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	MODERATE	.	SNV	.	.	...	.	.	.	.	.	.	.	.	wga	NONE	MUTECT|MUSE	TTTCAGTGCTG	.	2

Now, the first line (header) of each file is checked to determine which delimiter will work, using the following function located in openvariant/variant/variant.py:

def _detect_delimiter(line: str):
    """Detects the dominant delimiter in a line"""
    counts = {
        '\t': line.count('\t'),
        ',': line.count(','),
        ';': line.count(';')
    }
    return max(counts, key=counts.get)

As well, user can determine its own delimiter on annotation file:
delimiter: '[\t,-]'
or
delimiter: \t
or
delimiter: \t,-;

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

dmartmillan · 2025-05-20T15:09:44Z

Not merge yet!
Need to add some docs.

FedericaBrando · 2025-05-21T15:06:30Z

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

I haven't looked at the code yet, but my fear with this approach is that, for example, a vcf with few columns (say CHR POS REF ALT STRAND) and a lot of other fields in the INFO separated by ;or ,' would results in parsing the INFO column rather than the \t ones.

One example is VCF output of VEP, which condenses every possible annotation of a single variant into the INFO column.

dmartmillan · 2025-05-21T15:14:04Z

However, with default set up user doesn't need to add any additional config. to be able to parse tsv,csv and ssv files simultaneously. What do you think?

I haven't looked at the code yet, but my fear with this approach is that, for example, a vcf with few columns (say CHR POS REF ALT STRAND) and a lot of other fields in the INFO separated by ;or ,' would results in parsing the INFO column rather than the \t ones.

One example is VCF output of VEP, which condenses every possible annotation of a single variant into the INFO column.

The delimiter is determined only based on the header (the first line) of the file. For example, if the header is tab-separated, the entire file is expected to use tabs as the delimiter.

There are two important considerations:

Every file must include a header.
The scenario you mentioned should not occur in the header.

Copilot

Pull Request Overview

This PR refactors the delimiter configuration and detection logic for parsing annotation files, allowing the parsing of files with tab, comma, or semicolon delimiters without extra configuration. Key changes include:

Removing the AnnotationDelimiter enum and DEFAULT_DELIMITER from the configuration.
Updating the delimiter detection logic in both the parser and annotation initialization.
Adjusting unit tests to reflect the updated delimiter configuration.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/test_annotation/test_annotation.py	Removed tests for invalid/non-existent delimiter and added a debug print
tests/data/annotation/invalid_delimiter.yaml	Removed invalid example configuration
tests/data/annotation/annotation.yaml	Updated the delimiter setting to use a literal tab character
openvariant/variant/variant.py	Introduced the _detect_delimiter function and updated parsing logic
openvariant/annotation/config_annotation.py	Removed the AnnotationDelimiter enum and DEFAULT_DELIMITER constant
openvariant/annotation/annotation.py	Updated delimiter handling in initialization and key validation

Comments suppressed due to low confidence (2)

tests/test_annotation/test_annotation.py:55

Consider removing the debug print statement from the unit test to keep test output clean.

print(annotation.delimiter, '	')

openvariant/annotation/annotation.py:38

The removal of the delimiter value validation (previously checking allowed values) may allow invalid delimiters to pass unnoticed. Consider reintroducing validation to ensure the provided delimiter meets expected patterns.

if AnnotationGeneralKeys.DELIMITER.value in annot and not isinstance(annot[AnnotationGeneralKeys.DELIMITER.value], str):

FedericaBrando

I did a test with a VCF (with commented line ## and # in the header) and as a separator ;. And it failed.

##FORMAT=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##FORMAT=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
#CHROM;POS;ID;REF;ALT;QUAL;FILTER;INFO;FORMAT;23P_5232
chr1;115256528;.;TT;TC;11486.3;PASS;AF=0.331738;AO=1440;DP=4387;FAO=1456;FDP=4389;FDVR=10;FR=.;FRO=2933;FSAF=831;FSAR=625;FSRF=1668;FSRR=1265;FWDB=-0.0187354;FXX=0.00113792;GCM=0;HRUN=2;LEN=3;MLLD=233.749;NID=vc.novel.1;OALT=C;OID=.;OMAPALT=TC;OPOS=115256529;OREF=T;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4682;RBI=0.0202428;REFB=-0.00833745;REVB=0.00766507;RO=2921;SAF=827;SAR=613;SPD=0;SRF=1661;SRR=1260;SSEN=0;SSEP=0;SSSB=0.00680865;STB=0.501391;STBP=0.898;TYPE=snp;UFR=.,.,.,.,unknown,.;VARB=0.016353;VCFALT=TCG;VCFPOS=115256528;VCFREF=TTG;FUNC=[{'origPos':'115256528','origRef':'TT','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'TC','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11486:4387:4389:2921:2933:1440:1456:0.331738:613:827:1661:1260:625:831:1668:1265
chr1;115256529;COSM584;T;C;11439.6;PASS;AF=0.329765;AO=1440;DP=4390;FAO=1448;FDP=4391;FDVR=10;FR=.;FRO=2939;FSAF=827;FSAR=621;FSRF=1673;FSRR=1266;FWDB=-0.0193984;FXX=6.82749E-4;GCM=0;HRUN=2;LEN=1;MLLD=235.553;OALT=TC;OID=COSM584;OMAPALT=C;OPOS=115256528;OREF=TT;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4209;RBI=0.0216282;REFB=-0.00712873;REVB=0.00956471;RO=2925;SAF=827;SAR=613;SPD=0;SRF=1662;SRR=1263;SSEN=0;SSEP=0;SSSB=0.00733417;STB=0.501293;STBP=0.907;TYPE=snp;VARB=0.0138076;VCFALT=CG;VCFPOS=115256529;VCFREF=TG;HS;FUNC=[{'origPos':'115256529','origRef':'T','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'C','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11396:4390:4391:2925:2939:1440:1448:0.329765:613:827:1662:1263:621:827:1673:1266
chr2;29416338;ALK;A;<CNV>;100.0;GAIN;HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=29940583;LEN=524246;NUMTILES=12;SD=0.2;CDF_MAPD=.001:3.89382,0.005:4.64302,0.01:5.01559,0.025:5.57397,0.05:6.06513,0.1:6.64428,0.2:7.36448,0.25:7.64358,0.5:8.8,0.75:10.0086,0.8:10.3168,0.9:11.1443,0.95:11.8457,0.975:12.4675,0.99:13.2067,0.995:13.7201,0.999:14.805;RAW_CN=2.68;REF_CN=2;PVAL=7.983E-6;FD=1.34;CI=0.05:6.06513,0.95:11.8457;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Amplification'}];GT:FD:CN;./.:1.34:8.8
chr2;29455169;ALK;C;.;.;PASS;SVTYPE=RNAExonTiles;FUNCTION=ExpressionImbalance;GENE_NAME=ALK;IMBALANCE_SCORE=4.323;IMBALANCE_PVAL=7.0E-4;BREAKPOINT_RANGE=exon15-exon20;NUM_ISOFORMS_DETECTED=0;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'ExpressionImbalance'}];GT:GQ;./.:.

The way the delimiter function is implemented is that it will only get into the _detect_delimiter() if the l_num == 0 and the delimiter is None. The problem is that the _detect_delimiter() function will return a value even if the count for each of the key is 0. In fact, the tie break of the max function is the first key that checks (in our case \t) [source].

So the l_num condition will be met only on the first line of the file (independent if its a comment or not since the skip is implemented later) and in the first hit it will return a \teven if the count of the separator is 0.

We should detect that the line that we are taking to check the delimiter count is the actual header.

        for l_num, line in enumerate(iter(mm_obj.readline, b'')):
            line = line.decode('utf-8')

            # Skip comments
            if (line.startswith('#') or line.startswith('##') or line.startswith('browser') or
                line.startswith('track')) and not line.startswith('#CHROM'):
                continue

            if delimiter is None:
                delimiter = _detect_delimiter(line)

            row_line = re.split(delimiter, line)
            row_line = list(map(lambda w: w.rstrip("\r\n"), row_line))

            if len(row_line) == 0:
                continue

            yield l_num, row_line

Also, if the count is 0, I would raise an error rather than allow the \t by default.

Other than these comments, I overall like the approach! Thanks a lot David 🚀

CarlosLopezElorduy

Hi David! I agree with @FedericaBrando regarding the order when the _detect_delimiter function is called vs the comment check. Other than that, I added a suggestion on a refactor of the _detect_delimiter function itself by using the csv library, as it already includes a class (Sniffer) with a method designed to handle these type of scenarios, so it could be interesting to add.

Other than that, I also like the approach, so looking forward to this! 💪

openvariant/variant/variant.py

Co-authored-by: Carlos López-Elorduy <[email protected]>

…rly' of github.com:bbglab/openvariant into 47-openvariant-bug-windows-input-files-not-parsed-properly

dmartmillan · 2025-05-22T13:09:17Z

Added! if you are okay with that, we can merge.

CarlosLopezElorduy

I just found a problem with the function for single column files. Added a suggestion to catch this problem.

openvariant/variant/variant.py

Co-authored-by: Carlos López-Elorduy <[email protected]>

CarlosLopezElorduy

🚀

FedericaBrando

💯

dmartmillan added 2 commits May 20, 2025 15:18

add: be able to parse any delimiter

9727fc9

fix: delimiter check

7a78ba0

dmartmillan self-assigned this May 20, 2025

dmartmillan added bug Something isn't working documentation Improvements or additions to documentation labels May 20, 2025

dmartmillan linked an issue May 20, 2025 that may be closed by this pull request

OpenVariant Bug | Windows input files not parsed properly #47

Closed

dmartmillan requested review from CarlosLopezElorduy and FedericaBrando May 21, 2025 14:56

FedericaBrando requested a review from Copilot May 21, 2025 15:32

Copilot AI reviewed May 21, 2025

View reviewed changes

FedericaBrando requested changes May 21, 2025

View reviewed changes

CarlosLopezElorduy requested changes May 22, 2025

View reviewed changes

openvariant/variant/variant.py Outdated Show resolved Hide resolved

dmartmillan and others added 3 commits May 22, 2025 14:24

fix: update detect delimiter

32f2b3b

Co-authored-by: Carlos López-Elorduy <[email protected]>

fix: skip comments order

7139ff8

Merge branch '47-openvariant-bug-windows-input-files-not-parsed-prope…

e9233a1

…rly' of github.com:bbglab/openvariant into 47-openvariant-bug-windows-input-files-not-parsed-properly

dmartmillan requested review from CarlosLopezElorduy and FedericaBrando May 22, 2025 13:07

CarlosLopezElorduy requested changes May 22, 2025

View reviewed changes

openvariant/variant/variant.py Outdated Show resolved Hide resolved

dmartmillan and others added 2 commits May 23, 2025 12:49

fix: update detect delimiter

a731246

Co-authored-by: Carlos López-Elorduy <[email protected]>

fix: added comment

ca427ed

dmartmillan requested a review from CarlosLopezElorduy May 23, 2025 11:57

CarlosLopezElorduy approved these changes May 23, 2025

View reviewed changes

FedericaBrando approved these changes May 23, 2025

View reviewed changes

dmartmillan merged commit 8e68a46 into develop May 23, 2025
1 check passed

dmartmillan deleted the 47-openvariant-bug-windows-input-files-not-parsed-properly branch May 23, 2025 22:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OpenVariant parse any delimiter #52

OpenVariant parse any delimiter #52

Uh oh!

dmartmillan commented May 20, 2025 •

edited

Loading

Uh oh!

dmartmillan commented May 20, 2025

Uh oh!

FedericaBrando commented May 21, 2025 •

edited

Loading

Uh oh!

dmartmillan commented May 21, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

FedericaBrando left a comment •

edited

Loading

Uh oh!

CarlosLopezElorduy left a comment •

edited

Loading

Uh oh!

Uh oh!

dmartmillan commented May 22, 2025

Uh oh!

CarlosLopezElorduy left a comment

Uh oh!

Uh oh!

CarlosLopezElorduy left a comment

Uh oh!

FedericaBrando left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

OpenVariant parse any delimiter #52

OpenVariant parse any delimiter #52

Uh oh!

Conversation

dmartmillan commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dmartmillan commented May 20, 2025

Uh oh!

FedericaBrando commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dmartmillan commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

FedericaBrando left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CarlosLopezElorduy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dmartmillan commented May 22, 2025

Uh oh!

CarlosLopezElorduy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

CarlosLopezElorduy left a comment

Choose a reason for hiding this comment

Uh oh!

FedericaBrando left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dmartmillan commented May 20, 2025 •

edited

Loading

FedericaBrando commented May 21, 2025 •

edited

Loading

dmartmillan commented May 21, 2025 •

edited

Loading

FedericaBrando left a comment •

edited

Loading

CarlosLopezElorduy left a comment •

edited

Loading