scan_vcf panics on multi-valued INFO fields containing "." (VCF missing value) — silent data truncation #312

@antonkulaga

Description

scan_vcf / read_vcf panics in Rust when a multi-valued INFO field (Number=R, Number=A, or Number=.) contains . (the standard VCF missing value) as one of the comma-separated elements. The panic silently truncates the output instead of raising a Python exception, causing massive data loss (e.g. 5M rows → 229K rows with no error).

Environment

  • polars-bio: 0.23.0
  • Python: 3.12
  • OS: Linux (Ubuntu 24.04)

Minimal Reproducer

import polars_bio as pb

# Create minimal VCF — "." is the standard VCF missing value
vcf_content = """##fileformat=VCFv4.2
##INFO=<ID=ALLELE_ID,Number=R,Type=String,Description="Identifier for each allele">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tA\tG\t30\tPASS\tALLELE_ID=ref1,alt1
chr1\t200\t.\tC\tT\t30\tPASS\tALLELE_ID=.,alt2
"""

with open("/tmp/test.vcf", "w") as f:
    f.write(vcf_content)

# This panics and returns 0 rows (expected: 2)
lf = pb.scan_vcf("/tmp/test.vcf", info_fields=["ALLELE_ID"])
df = lf.collect()
print(f"Rows: {df.height}")  # prints 0, expected 2

Panic Output

thread '<unnamed>' panicked at .../datafusion/bio-format-vcf/src/physical_exec.rs:360:87:
called `Option::unwrap()` on a `None` value

Affected Scope

The bug affects ALL multi-valued INFO fields when any element is .:

| Header Number | Type | Panic line | Example value |
| --- | --- | --- | --- |
| Number=R | String | 360 | ALLELE_ID=.,alt_id |
| Number=A | String | 360 | TEST=val1,. |
| Number=. | String | 360 | TEST=val1,.,val3 |
| Number=R | Integer | 348 | AD=.,15 |

Scalar fields (Number=1) correctly handle . as null — so the bug is specifically in the list-building code paths.
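
To find out which fields in a given file would hit this code path, the VCF can be pre-scanned with plain Python before handing it to polars-bio. The sketch below is only an illustration: `find_affected_info_fields` is a hypothetical helper, not part of polars-bio, and it assumes that every INFO field whose header `Number` is anything other than `0` or `1` is built as a list. It conservatively flags a field as soon as any comma-separated element equals `.`.

```python
import re
from typing import Iterable, Set

# Matches header lines such as: ##INFO=<ID=ALLELE_ID,Number=R,Type=String,...
INFO_HEADER = re.compile(r"^##INFO=<ID=([^,]+),Number=([^,]+),")


def find_affected_info_fields(lines: Iterable[str]) -> Set[str]:
    """Return IDs of multi-valued INFO fields with "." among their elements.

    Hypothetical helper: assumes Number values other than 0 and 1 are
    parsed as lists.
    """
    multi_valued: Set[str] = set()
    affected: Set[str] = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##"):
            match = INFO_HEADER.match(line)
            if match and match.group(2) not in ("0", "1"):
                multi_valued.add(match.group(1))
            continue
        if not line or line.startswith("#"):
            continue  # blank line or the #CHROM column header
        columns = line.split("\t")
        if len(columns) < 8:
            continue  # not a complete record line
        for entry in columns[7].split(";"):
            key, sep, value = entry.partition("=")
            # Flag the field if any comma-separated element is exactly "."
            if sep and key in multi_valued and "." in value.split(","):
                affected.add(key)
    return affected
```

With the reproducer file, `find_affected_info_fields(open("/tmp/test.vcf"))` would report `ALLELE_ID`; the remaining field names can then be passed to `info_fields` until the bug is fixed.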

Why This Matters

  1. Valid VCF: . is the standard missing value in VCF 4.2 spec (§1.2). Fields like ALLELE_ID (Number=R) from DRAGEN/Illumina commonly use ALLELE_ID=.,NM_000157.4:c.1604G>A where the REF allele has no annotation.

  2. Silent data loss: The Rust panic does not propagate as a Python exception. Instead, output is silently truncated at the batch boundary where the panic occurred. In a real 5M-variant WGS VCF, we got only 229K rows back — 95.5% of data silently lost with no error raised.

  3. No workaround except exclusion: Users must exclude all potentially affected fields from info_fields, which defeats the purpose of having them.
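
Because the truncation raises no error, a caller can only detect it by comparing the returned row count against the number of records in the file. The following is a minimal sketch of such a guard using only the standard library. `count_vcf_records` and `check_row_count` are hypothetical helpers, and the check assumes the scan applied no filter or predicate that would legitimately drop rows.

```python
import gzip


def count_vcf_records(path: str) -> int:
    """Count data lines (non-empty, not starting with '#') in a VCF file."""
    # gzip.open also reads bgzip-compressed files, which are multi-member gzip.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def check_row_count(path: str, observed_rows: int) -> None:
    """Raise if the DataFrame has a different row count than the VCF file."""
    expected = count_vcf_records(path)
    if observed_rows != expected:
        raise RuntimeError(
            f"{path}: expected {expected} records, got {observed_rows} rows "
            "(possible silent truncation)"
        )
```

In the reproducer, calling `check_row_count("/tmp/test.vcf", df.height)` after `lf.collect()` would turn the silent truncation into an explicit exception.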

Expected Behavior

. elements in multi-valued fields should be represented as null within the list, consistent with how scalar Number=1 fields already handle .. For example:

ALLELE_ID=.,alt_id  →  list[str]: [null, "alt_id"]
AD=.,15             →  list[i32]: [null, 15]
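
The intended semantics can be written down as a small pure-Python reference. This is only a sketch of the expected mapping, not the polars-bio implementation, and `parse_info_list` is a hypothetical name used for illustration.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def parse_info_list(raw: str, cast: Callable[[str], T]) -> List[Optional[T]]:
    """Split a comma-separated INFO value, mapping the "." element to None."""
    return [None if item == "." else cast(item) for item in raw.split(",")]


print(parse_info_list(".,alt_id", str))  # [None, 'alt_id']
print(parse_info_list(".,15", int))      # [None, 15]
```

A reader that follows this mapping would keep the list length intact, so positional fields such as Number=R stay aligned with REF and ALT.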

Root Cause

In physical_exec.rs, the list-building code calls .unwrap() on the parsed element without handling the None case that arises when . is encountered. The fix would be to handle None as a null element in the Arrow list array builder rather than panicking.
