Skip to content

Faithfully recapitulate metadata via a round-trip through writeVcf + readVcf. #78

@LTLA

Description

@LTLA

writeVcf performs some unnecessary work that interferes with a perfect roundtrip of a VCF to a file and back. At the very least, these discrepancies interfere with my unit tests that check for correct reproduction of a VCF object from a VCF file; they also have some practical consequences as they introduce differences in the MD5 checksums, which then prevents some deduplication mechanisms in backend storage systems. To illustrate:

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
out <- tempfile()
first <- readVcf(fl)
writeVcf(first, out)
roundtrip <- readVcf(out)

all.equal(roundtrip, first)
## [1] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Lengths: 9, 8 > >"
## [2] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Names: Lengths (9, 8) differ (string compare on first 8) > >"
## [3] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component ## “listData”: Length mismatch: comparison on first 8 components > > >"
## [4] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component “listData”: Component “fileDate”: Attributes: < Component “listData”: Component “Value”: 1 string mismatch > > > >"
## [5] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “reference”: Lengths (4, 0) differ (string compare on first 0) > >"

For mismatches 1-3: as documented in its manpage, writeVcf adds some ##contig headers based on the seqinfo() of the input object. However, in this case, all the sequence information is NA, so perhaps writeVcf might be smart enough to omit the ##contig headers altogether, rather than manufacturing some non-informative placeholders:

##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>

For mismatch 4: as documented, writeVcf replaces the fileDate with the current date. Perhaps you could consider an option to respect any existing date in the VCF object, which is useful for retaining provenance, e.g., if I'm using VariantAnnotation to make some changes to an existing VCF file but I want to keep the date of the original file's creation.

For mismatch 5: I'm not sure what happened here, but I'm guessing this has something to do with the extra contig lines.

Session information
R Under development (unstable) (2023-11-29 r85646)
Platform: aarch64-apple-darwin22.5.0
Running under: macOS Ventura 13.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] VariantAnnotation_1.49.2    Rsamtools_2.19.2
 [3] Biostrings_2.71.1           XVector_0.43.0
 [5] SummarizedExperiment_1.33.1 Biobase_2.63.0
 [7] GenomicRanges_1.55.1        GenomeInfoDb_1.39.5
 [9] IRanges_2.37.0              S4Vectors_0.41.3
[11] MatrixGenerics_1.15.0       matrixStats_1.2.0
[13] BiocGenerics_0.49.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.43.0          rjson_0.2.21             lattice_0.22-5
 [4] vctrs_0.6.5              tools_4.4.0              bitops_1.0-7
 [7] generics_0.1.3           curl_5.2.0               parallel_4.4.0
[10] tibble_3.2.1             fansi_1.0.6              AnnotationDbi_1.65.2
[13] RSQLite_2.3.4            blob_1.2.4               pkgconfig_2.0.3
[16] Matrix_1.6-4             BSgenome_1.71.1          dbplyr_2.4.0
[19] lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.4.0
[22] stringr_1.5.1            progress_1.2.3           codetools_0.2-19
[25] yaml_2.3.8               RCurl_1.98-1.13          pillar_1.9.0
[28] crayon_1.5.2             BiocParallel_1.37.0      DelayedArray_0.29.0
[31] cachem_1.0.8             abind_1.4-5              tidyselect_1.2.0
[34] digest_0.6.33            stringi_1.8.3            restfulr_0.0.15
[37] dplyr_1.1.4              biomaRt_2.59.0           fastmap_1.1.1
[40] grid_4.4.0               cli_3.6.2                SparseArray_1.3.2
[43] magrittr_2.0.3           S4Arrays_1.3.1           GenomicFeatures_1.55.1
[46] XML_3.99-0.16            utf8_1.2.4               rappdirs_0.3.3
[49] filelock_1.0.3           prettyunits_1.2.0        bit64_4.0.5
[52] httr_1.4.7               bit_4.0.5                png_0.1-8
[55] hms_1.1.3                memoise_2.0.1            BiocIO_1.13.0
[58] BiocFileCache_2.11.1     rtracklayer_1.63.0       rlang_1.1.2
[61] glue_1.6.2               DBI_1.2.0                xml2_1.3.6
[64] R6_2.5.1                 GenomicAlignments_1.39.0 zlibbioc_1.49.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions