-
Notifications
You must be signed in to change notification settings - Fork 20
Description
writeVcf performs some unnecessary work that interferes with a perfect roundtrip of a VCF to a file and back. At the very least, these discrepancies interfere with my unit tests that check for correct reproduction of a VCF object from a VCF file; they also have some practical consequences as they introduce differences in the MD5 checksums, which then prevents some deduplication mechanisms in backend storage systems. To illustrate:
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
out <- tempfile()
first <- readVcf(fl)
writeVcf(first, out)
roundtrip <- readVcf(out)
all.equal(roundtrip, first)
## [1] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Lengths: 9, 8 > >"
## [2] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Names: Lengths (9, 8) differ (string compare on first 8) > >"
## [3] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component ## “listData”: Length mismatch: comparison on first 8 components > > >"
## [4] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component “listData”: Component “fileDate”: Attributes: < Component “listData”: Component “Value”: 1 string mismatch > > > >"
## [5] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “reference”: Lengths (4, 0) differ (string compare on first 0) > >"For mismatches 1-3: as documented in its manpage, writeVcf adds some ##contig headers based on the seqinfo() of the input object. However, in this case, all the sequence information is NA, so perhaps writeVcf might be smart enough to omit the ##contig headers altogether, rather than manufacturing some non-informative placeholders:
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
For mismatch 4: as documented, writeVcf replaces the fileDate with the current date. Perhaps you could consider an option to respect any existing date in the VCF object, which is useful for retaining provenance, e.g., if I'm using VariantAnnotation to make some changes to an existing VCF file but I want to keep the date of the original file's creation.
For mismatch 5: I'm not sure what happened here, but I'm guessing this has something to do with the extra contig lines.
Session information
R Under development (unstable) (2023-11-29 r85646)
Platform: aarch64-apple-darwin22.5.0
Running under: macOS Ventura 13.6.1
Matrix products: default
BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] VariantAnnotation_1.49.2 Rsamtools_2.19.2
[3] Biostrings_2.71.1 XVector_0.43.0
[5] SummarizedExperiment_1.33.1 Biobase_2.63.0
[7] GenomicRanges_1.55.1 GenomeInfoDb_1.39.5
[9] IRanges_2.37.0 S4Vectors_0.41.3
[11] MatrixGenerics_1.15.0 matrixStats_1.2.0
[13] BiocGenerics_0.49.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.43.0 rjson_0.2.21 lattice_0.22-5
[4] vctrs_0.6.5 tools_4.4.0 bitops_1.0-7
[7] generics_0.1.3 curl_5.2.0 parallel_4.4.0
[10] tibble_3.2.1 fansi_1.0.6 AnnotationDbi_1.65.2
[13] RSQLite_2.3.4 blob_1.2.4 pkgconfig_2.0.3
[16] Matrix_1.6-4 BSgenome_1.71.1 dbplyr_2.4.0
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11 compiler_4.4.0
[22] stringr_1.5.1 progress_1.2.3 codetools_0.2-19
[25] yaml_2.3.8 RCurl_1.98-1.13 pillar_1.9.0
[28] crayon_1.5.2 BiocParallel_1.37.0 DelayedArray_0.29.0
[31] cachem_1.0.8 abind_1.4-5 tidyselect_1.2.0
[34] digest_0.6.33 stringi_1.8.3 restfulr_0.0.15
[37] dplyr_1.1.4 biomaRt_2.59.0 fastmap_1.1.1
[40] grid_4.4.0 cli_3.6.2 SparseArray_1.3.2
[43] magrittr_2.0.3 S4Arrays_1.3.1 GenomicFeatures_1.55.1
[46] XML_3.99-0.16 utf8_1.2.4 rappdirs_0.3.3
[49] filelock_1.0.3 prettyunits_1.2.0 bit64_4.0.5
[52] httr_1.4.7 bit_4.0.5 png_0.1-8
[55] hms_1.1.3 memoise_2.0.1 BiocIO_1.13.0
[58] BiocFileCache_2.11.1 rtracklayer_1.63.0 rlang_1.1.2
[61] glue_1.6.2 DBI_1.2.0 xml2_1.3.6
[64] R6_2.5.1 GenomicAlignments_1.39.0 zlibbioc_1.49.0