OpenVariant parse any delimiter #52
Do not merge yet!
I haven't looked at the code yet, but my fear with this approach is that, for example, a VCF with few columns (say One example is VCF output of VEP, which condenses every possible annotation of a single variant into the INFO column.
The delimiter is determined only based on the header (the first line) of the file. For example, if the header is tab-separated, the entire file is expected to use tabs as the delimiter. There are two important considerations:
Pull Request Overview
This PR refactors the delimiter configuration and detection logic for parsing annotation files, allowing the parsing of files with tab, comma, or semicolon delimiters without extra configuration. Key changes include:
- Removing the AnnotationDelimiter enum and DEFAULT_DELIMITER from the configuration.
- Updating the delimiter detection logic in both the parser and annotation initialization.
- Adjusting unit tests to reflect the updated delimiter configuration.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/test_annotation/test_annotation.py | Removed tests for invalid/non-existent delimiter and added a debug print |
| tests/data/annotation/invalid_delimiter.yaml | Removed invalid example configuration |
| tests/data/annotation/annotation.yaml | Updated the delimiter setting to use a literal tab character |
| openvariant/variant/variant.py | Introduced the _detect_delimiter function and updated parsing logic |
| openvariant/annotation/config_annotation.py | Removed the AnnotationDelimiter enum and DEFAULT_DELIMITER constant |
| openvariant/annotation/annotation.py | Updated delimiter handling in initialization and key validation |
Comments suppressed due to low confidence (2)
tests/test_annotation/test_annotation.py:55
- Consider removing the debug print statement from the unit test to keep test output clean.
print(annotation.delimiter, ' ')
openvariant/annotation/annotation.py:38
- The removal of the delimiter value validation (previously checking allowed values) may allow invalid delimiters to pass unnoticed. Consider reintroducing validation to ensure the provided delimiter meets expected patterns.
if AnnotationGeneralKeys.DELIMITER.value in annot and not isinstance(annot[AnnotationGeneralKeys.DELIMITER.value], str):
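The validation concern above could be addressed along these lines. This is only a sketch: `validate_delimiter` is a hypothetical helper, not part of OpenVariant, and it assumes the configured delimiter is later used as a regular expression pattern (as the `re.split` call in the parser suggests).

```python
import re


def validate_delimiter(value):
    """Return the delimiter unchanged if it is a usable regex pattern.

    Hypothetical helper: rejects non-strings, empty strings, and
    patterns that `re` cannot compile.
    """
    if not isinstance(value, str) or value == "":
        raise ValueError(f"delimiter must be a non-empty string, got {value!r}")
    try:
        re.compile(value)
    except re.error as exc:
        raise ValueError(f"delimiter {value!r} is not a valid pattern") from exc
    return value
```

A check like this would accept `'\t'` or `'[\t,;]'` and reject a malformed pattern such as `'['` when the annotation is loaded, instead of failing later during parsing.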
I tested with a VCF (with ## and # comment lines in the header) using ; as the separator, and it failed.
##FORMAT=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##FORMAT=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
#CHROM;POS;ID;REF;ALT;QUAL;FILTER;INFO;FORMAT;23P_5232
chr1;115256528;.;TT;TC;11486.3;PASS;AF=0.331738;AO=1440;DP=4387;FAO=1456;FDP=4389;FDVR=10;FR=.;FRO=2933;FSAF=831;FSAR=625;FSRF=1668;FSRR=1265;FWDB=-0.0187354;FXX=0.00113792;GCM=0;HRUN=2;LEN=3;MLLD=233.749;NID=vc.novel.1;OALT=C;OID=.;OMAPALT=TC;OPOS=115256529;OREF=T;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4682;RBI=0.0202428;REFB=-0.00833745;REVB=0.00766507;RO=2921;SAF=827;SAR=613;SPD=0;SRF=1661;SRR=1260;SSEN=0;SSEP=0;SSSB=0.00680865;STB=0.501391;STBP=0.898;TYPE=snp;UFR=.,.,.,.,unknown,.;VARB=0.016353;VCFALT=TCG;VCFPOS=115256528;VCFREF=TTG;FUNC=[{'origPos':'115256528','origRef':'TT','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'TC','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11486:4387:4389:2921:2933:1440:1456:0.331738:613:827:1661:1260:625:831:1668:1265
chr1;115256529;COSM584;T;C;11439.6;PASS;AF=0.329765;AO=1440;DP=4390;FAO=1448;FDP=4391;FDVR=10;FR=.;FRO=2939;FSAF=827;FSAR=621;FSRF=1673;FSRR=1266;FWDB=-0.0193984;FXX=6.82749E-4;GCM=0;HRUN=2;LEN=1;MLLD=235.553;OALT=TC;OID=COSM584;OMAPALT=C;OPOS=115256528;OREF=TT;PB=.;PBP=.;PPA=.;PPD=0;QD=10.4209;RBI=0.0216282;REFB=-0.00712873;REVB=0.00956471;RO=2925;SAF=827;SAR=613;SPD=0;SRF=1662;SRR=1263;SSEN=0;SSEP=0;SSSB=0.00733417;STB=0.501293;STBP=0.907;TYPE=snp;VARB=0.0138076;VCFALT=CG;VCFPOS=115256529;VCFREF=TG;HS;FUNC=[{'origPos':'115256529','origRef':'T','normalizedRef':'T','gene':'NRAS','normalizedPos':'115256529','normalizedAlt':'C','polyphen':'0.214','gt':'pos','codon':'CGA','coding':'c.182A>G','sift':'0.0','grantham':'43.0','transcript':'NM_002524.5','function':'missense','protein':'p.Q61R','location':'exonic','origAlt':'C','exon':'3','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Hotspot','CLNACC1':'13900','CLNSIG1':'Pathogenic','CLNREVSTAT1':'criteria_provided,_multiple_submitters,_no_conflicts','CLNID1':'rs11554290'}];GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR;0/1:11396:4390:4391:2925:2939:1440:1448:0.329765:613:827:1662:1263:621:827:1673:1266
chr2;29416338;ALK;A;<CNV>;100.0;GAIN;HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=29940583;LEN=524246;NUMTILES=12;SD=0.2;CDF_MAPD=.001:3.89382,0.005:4.64302,0.01:5.01559,0.025:5.57397,0.05:6.06513,0.1:6.64428,0.2:7.36448,0.25:7.64358,0.5:8.8,0.75:10.0086,0.8:10.3168,0.9:11.1443,0.95:11.8457,0.975:12.4675,0.99:13.2067,0.995:13.7201,0.999:14.805;RAW_CN=2.68;REF_CN=2;PVAL=7.983E-6;FD=1.34;CI=0.05:6.06513,0.95:11.8457;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'Amplification'}];GT:FD:CN;./.:1.34:8.8
chr2;29455169;ALK;C;.;.;PASS;SVTYPE=RNAExonTiles;FUNCTION=ExpressionImbalance;GENE_NAME=ALK;IMBALANCE_SCORE=4.323;IMBALANCE_PVAL=7.0E-4;BREAKPOINT_RANGE=exon15-exon20;NUM_ISOFORMS_DETECTED=0;FUNC=[{'gene':'ALK','oncomineGeneClass':'Gain-of-Function','oncomineVariantClass':'ExpressionImbalance'}];GT:GQ;./.:.
As implemented, the parser only calls _detect_delimiter() when l_num == 0 and the delimiter is None. The problem is that _detect_delimiter() returns a value even if the count for every key is 0: max breaks ties by returning the first key it checks (in our case \t) [source].
So the l_num condition is met only on the first line of the file (regardless of whether it is a comment, since the skip is implemented later), and on that first hit it returns \t even if the separator count is 0.
We should make sure that the line used to count delimiters is the actual header.
for l_num, line in enumerate(iter(mm_obj.readline, b'')):
    line = line.decode('utf-8')
    # Skip comments
    if (line.startswith('#') or line.startswith('##') or line.startswith('browser') or
            line.startswith('track')) and not line.startswith('#CHROM'):
        continue
    if delimiter is None:
        delimiter = _detect_delimiter(line)
    row_line = re.split(delimiter, line)
    row_line = list(map(lambda w: w.rstrip("\r\n"), row_line))
    if len(row_line) == 0:
        continue
    yield l_num, row_line
Also, if the count is 0, I would raise an error rather than allow the \t by default.
Other than these comments, I overall like the approach! Thanks a lot David 🚀
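To illustrate the two points in this comment (the tie-break of max when every count is 0, and detecting the delimiter on the real header rather than on the first physical line), here is a small self-contained sketch. The names `detect_delimiter`, `is_comment`, and `delimiter_from_lines` are hypothetical and do not reproduce the PR's actual `_detect_delimiter`; the candidate set is assumed to be tab, comma, and semicolon.

```python
CANDIDATES = ("\t", ",", ";")  # assumed candidate delimiters


def is_comment(line):
    # Same rule as the snippet above: '#', 'browser' and 'track' lines are
    # comments, except the '#CHROM' header line of a VCF.
    return (line.startswith(("#", "browser", "track"))
            and not line.startswith("#CHROM"))


def detect_delimiter(line):
    counts = {d: line.count(d) for d in CANDIDATES}
    # max() returns the first key on a tie, so with all counts at 0 it
    # would silently pick '\t'; check the winning count explicitly.
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no candidate delimiter found in the header line")
    return best


def delimiter_from_lines(lines):
    # Skip comment lines first, then detect on the first remaining line.
    for line in lines:
        if is_comment(line):
            continue
        return detect_delimiter(line)
    raise ValueError("no header line found")
```

With a file whose first line is a `##FORMAT=...` comment and whose header is `#CHROM;POS;ID;REF;ALT`, `delimiter_from_lines` skips the comment and returns `;`, while a line with no candidate delimiter raises instead of defaulting to a tab.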
Hi David! I agree with @FedericaBrando regarding the order in which the _detect_delimiter function is called relative to the comment check. Beyond that, I added a suggestion to refactor the _detect_delimiter function itself using the csv library, as it already includes a class (Sniffer) with a method designed to handle this type of scenario, so it could be interesting to add.
Other than that, I also like the approach, so looking forward to this! 💪
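For reference, a minimal sketch of what a Sniffer-based detection might look like. The function name `sniff_delimiter` and the candidate string are assumptions for illustration; the actual suggestion made in the review is not shown in this thread.

```python
import csv


def sniff_delimiter(header, candidates="\t,;"):
    """Guess the delimiter of a header line with csv.Sniffer.

    `candidates` restricts the guess to tab, comma, and semicolon.
    csv.Sniffer.sniff raises csv.Error when no delimiter can be determined.
    """
    dialect = csv.Sniffer().sniff(header, delimiters=candidates)
    return dialect.delimiter
```

Restricting the guess through the `delimiters` argument keeps the Sniffer from picking an unrelated character that happens to occur consistently in the header.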
Co-authored-by: Carlos López-Elorduy <[email protected]>
…rly' of github.com:bbglab/openvariant into 47-openvariant-bug-windows-input-files-not-parsed-properly
Added! If you are okay with that, we can merge.
CarlosLopezElorduy
left a comment
I just found a problem with the function for single column files. Added a suggestion to catch this problem.
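The single-column case can be reproduced with csv.Sniffer directly: when the header contains none of the candidate delimiters, `sniff` raises `csv.Error`. The sketch below shows one possible way to catch that, under the assumption that a single-column file should be treated as one field per line; `sniff_or_none` and `split_header` are hypothetical names, and the suggestion actually added to the PR may differ.

```python
import csv


def sniff_or_none(header, candidates="\t,;"):
    # Return the detected delimiter, or None when the Sniffer cannot
    # determine one (for example, a header with a single column).
    try:
        return csv.Sniffer().sniff(header, delimiters=candidates).delimiter
    except csv.Error:
        return None


def split_header(header):
    stripped = header.rstrip("\r\n")
    delimiter = sniff_or_none(header)
    # With no delimiter, the whole line is the only field.
    return [stripped] if delimiter is None else stripped.split(delimiter)
```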
Co-authored-by: Carlos López-Elorduy <[email protected]>
CarlosLopezElorduy
left a comment
🚀
FedericaBrando
left a comment
💯
Adds the ability to parse any tab-, comma-, or semicolon-delimited file.
I decided not to set [\t,;] as the DELIMITER_DEFAULT because it can lead to incorrect parsing of lines where fields are separated by tabs but commas or semicolons appear inside the fields.
Now, the first line (header) of each file is checked to determine which delimiter will work, using the _detect_delimiter function located in openvariant/variant/variant.py.
The user can also set their own delimiter in the annotation file:
    delimiter: '[\t,-]'
or
    delimiter: \t
or
    delimiter: \t,-;
However, with the default setup the user doesn't need to add any additional configuration to parse TSV, CSV, and SSV files. What do you think?
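The reasoning against a catch-all default can be seen with a made-up tab-separated line whose last field contains semicolons and a comma. The line and the helper name `split_fields` below are illustrative only and are not taken from the PR.

```python
import re


def split_fields(line, pattern):
    # Split a line on a regex pattern, dropping the trailing newline.
    return re.split(pattern, line.rstrip("\r\n"))


# Hypothetical tab-separated line with ';' and ',' inside the last field.
line = "chr1\t100\tA\tT\tAF=0.3;DP=40;ANN=x,y\n"

greedy = split_fields(line, r"[\t,;]")  # catch-all character class
tab_only = split_fields(line, r"\t")    # delimiter detected from the header
```

Here `greedy` has 8 fields because the last column is broken apart, while `tab_only` keeps the expected 5 columns, with `AF=0.3;DP=40;ANN=x,y` intact as the final field.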