Description
`scan_vcf` / `read_vcf` panics in Rust when a multi-valued INFO field (Number=R, Number=A, or Number=.) contains `.` (the standard VCF missing value) as one of the comma-separated elements. The panic silently truncates the output instead of raising a Python exception, causing massive data loss (e.g. 5M rows → 229K rows with no error).
Environment
- polars-bio: 0.23.0
- Python: 3.12
- OS: Linux (Ubuntu 24.04)
Minimal Reproducer
import polars_bio as pb
# Create minimal VCF — "." is the standard VCF missing value
vcf_content = """##fileformat=VCFv4.2
##INFO=<ID=ALLELE_ID,Number=R,Type=String,Description="Identifier for each allele">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tA\tG\t30\tPASS\tALLELE_ID=ref1,alt1
chr1\t200\t.\tC\tT\t30\tPASS\tALLELE_ID=.,alt2
"""
with open("/tmp/test.vcf", "w") as f:
    f.write(vcf_content)
# This panics and returns 0 rows (expected: 2)
lf = pb.scan_vcf("/tmp/test.vcf", info_fields=["ALLELE_ID"])
df = lf.collect()
print(f"Rows: {df.height}")  # prints 0, expected 2
Panic Output
thread '<unnamed>' panicked at .../datafusion/bio-format-vcf/src/physical_exec.rs:360:87:
called `Option::unwrap()` on a `None` value
Affected Scope
The bug affects ALL multi-valued INFO fields when any element is `.`:
| Header Number | Type | Panic line | Example value |
|---|---|---|---|
| Number=R | String | 360 | ALLELE_ID=.,alt_id |
| Number=A | String | 360 | TEST=val1,. |
| Number=. | String | 360 | TEST=val1,.,val3 |
| Number=R | Integer | 348 | AD=.,15 |
Scalar fields (Number=1) correctly handle `.` as null, so the bug is specifically in the list-building code paths.
Why This Matters
- Valid VCF: `.` is the standard missing value in the VCF 4.2 spec (§1.2). Fields like `ALLELE_ID` (Number=R) from DRAGEN/Illumina commonly use `ALLELE_ID=.,NM_000157.4:c.1604G>A` where the REF allele has no annotation.
- Silent data loss: The Rust panic does not propagate as a Python exception. Instead, output is silently truncated at the batch boundary where the panic occurred. In a real 5M-variant WGS VCF, we got only 229K rows back: 95.5% of the data was silently lost with no error raised.
- No workaround except exclusion: Users must exclude all potentially affected fields from `info_fields`, which defeats the purpose of having them.
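Until this is fixed, the exclusion workaround and the truncation check can be sketched in plain Python. This is only an illustrative sketch, not part of polars-bio: `split_info_fields` and `count_records` are hypothetical helper names, and the code assumes an uncompressed text VCF whose `##INFO` lines list `ID` and `Number` first, in that order.

```python
import re

# Matches header lines such as ##INFO=<ID=AD,Number=R,Type=Integer,...>
# Assumption: ID and Number are the first two keys, in that order.
_INFO_RE = re.compile(r"^##INFO=<ID=([^,]+),Number=([^,]+),")


def split_info_fields(vcf_path):
    """Return (scalar_ids, multi_valued_ids) read from the VCF header."""
    scalar, multi = [], []
    with open(vcf_path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines are over
            match = _INFO_RE.match(line)
            if match:
                field_id, number = match.groups()
                # Number=0 (flag) and Number=1 hold a single value;
                # every other Number goes through the list-building path.
                (scalar if number in ("0", "1") else multi).append(field_id)
    return scalar, multi


def count_records(vcf_path):
    """Count data lines (non-blank, not starting with '#')."""
    with open(vcf_path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))
```

Passing only the scalar IDs as `info_fields` to `pb.scan_vcf` avoids the affected code paths, and comparing `df.height` against `count_records(path)` after `collect()` makes a silent truncation visible.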
Expected Behavior
`.` elements in multi-valued fields should be represented as null within the list, consistent with how scalar Number=1 fields already handle `.`. For example:
ALLELE_ID=.,alt_id → list[str]: [null, "alt_id"]
AD=.,15 → list[i32]: [null, 15]
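The mapping above can be illustrated with a few lines of Python. This is a sketch of the expected semantics only, not the Rust implementation; `parse_multi_valued` is a hypothetical name, and Python's `None` stands in for an Arrow null.

```python
def parse_multi_valued(raw, cast=str):
    """Split a comma-separated INFO value, mapping "." to None (null)."""
    return [None if item == "." else cast(item) for item in raw.split(",")]


# String-typed field (e.g. Number=R, Type=String)
allele_ids = parse_multi_valued(".,alt_id")
# Integer-typed field (e.g. Number=R, Type=Integer)
depths = parse_multi_valued(".,15", cast=int)
```

Here `allele_ids` is `[None, "alt_id"]` and `depths` is `[None, 15]`, matching the two examples above.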
Root Cause
In `physical_exec.rs`, the list-building code calls `.unwrap()` on the parsed element without handling the `None` case that arises when `.` is encountered. The fix would be to handle `None` as a null element in the Arrow list array builder rather than panicking.