1- Io_lib: Version 1.14.15
2- ========================
1+ Io_lib: Version 1.15.0
2+ =======================
33
44Io_lib is a library of file reading and writing code to provide a general
55purpose SAM/BAM/CRAM, trace file (and Experiment File) reading
@@ -33,131 +33,30 @@ See the CHANGES for a summary of older updates or git logs for the
3333full details.
3434
3535
36- Version 1.14.15 (6th December 2022)
37- ---------------
36+ Version 1.15.0 (14th April 2023)
37+ --------------
3838
39- This is primarily a bug fix release.
39+ The first release that no longer warns about CRAM 3.1 being draft.
40+ No changes have been made to the format and it is fully compatible
41+ with the 1.14.x releases.
4042
4143
42- Version 1.14.14 (17th March 2021)
43- ---------------
44+ Technology Demo: 4.0
45+ ====================
4446
45- This is simply a bug fix release. It also updates to the latest
46- htscodecs submodule, now at an official 1.0 release.
47+ The current official GA4GH CRAM version is 3.1.
4748
48- Version 1.14.13 (3rd July 2020)
49- ---------------
49+ The current default CRAM output is 3.0, for maximum compatibility with
50+ other tools. Use the -V3.1 option to select CRAM 3.1 if needed.
5051
51- This release has a mixture of on-going CRAM 4 work (not compatible
52- with previous CRAM 4) and some more general quality of life
53- improvements for all CRAM versions including speed-ups and better
54- multi-threading.
55-
56- Note both CRAM 3.1 and 4.0 are still to be considered unofficial
57- CRAM extensions.
58-
59- Updates:
60-
61- * Scramble can now filter-in or filter-out aux tags during
62- transcoding. This is done using -d and -D options. For example:
63-
64- scramble -D OQ,BI,BD in.bam out.cram
65-
66- removes the GATK added OQ, BI and BD aux tags.
67- Requested by @jhaezebrouck in issue #24.
68-
69- * The Scramble -X <profile> options are now implemented using a
70- CRAM_OPT_PROFILE option. This simplifies the scramble code and
71- makes it easier to call from a library. This also fixes a number of
72- bugs in the order of argument parsing.
73-
74- * Improved CRAM writing speeds.
75-
76- The bam_copy function now only copies the number of used bytes
77- rather than the number of allocated bytes, which can sometimes be
78- substantially smaller. As this was done in the main thread it may
79- have a significant benefit when multi-threading.
80-
81- * Added libdeflate support into CRAM too (in addition to the existing
82- support in BAM). This isn't a huge change to CRAM speeds except at
83- high levels (-8 and -9) which are now slower, but also give a better
84- compression ratio. A modest 2-3% speed gain is visible at low and
85- mid levels, and at -1/-2 to -4 the compression ratio is also
86- improved.
87-
88- * CRAM 3.1 compression level -1 is now 25% faster, but 4% larger.
89- This is achieved by a different choice of compression codecs, most
90- notably disabling the name tokeniser for level 1. Use level 2 for
91- something comparable to the old behaviour.
92-
93- * Added an io_lib/version.h to make it easier to detect the version
94- being compiled against using IOLIB_VERSION macros.
95- Requested by German Tischler in issue #25.
96-
97- * Refactored the cram encoding interface used by biobambam.
98- Implemented by German Tischler in PR #27.
99-
100- * CRAM 4 now uses E_CONST instead of a uni-value version of
101- E_HUFFMAN. Also added offset field to VARINT_SIGNED and
102- VARINT_UNSIGNED which helps for data series that have values from -1
103- to MAXINT.
104-
105- * CRAM 4 container structure has changed so that all values are
106- variable sized integers instead of fixed size.
107-
108- * Further improvements with CRAM 4's use of signed values.
109- - Ref_seq_id in container and slice headers is now signed.
110- - RI (ref ID) data series and NS (mate ref ID) are also now signed
111- as -1 is a valid value.
112- - Embedded ref id is now 0 for unused instead of -1.
113-
114- * Reversed the use of CRAM 4 delta encoding for the B array. It only
115- helps at the moment for ONT signal data, so it needs more work to
116- make it auto-detect when delta makes sense. (Enabling it globally
117- for CRAM4 B aux tags was accidental.)
118-
119- * Htscodecs submodule has gained support for big-endian platforms.
120- Other big-endian improvements to parts of CRAM4 too.
121-
122- Bug fixes:
123-
124- * Fixed CRAM MD tag generation when using the "b" feature code
125- (NB: unused by known CRAM encoders).
126- Also see https://github.com/samtools/htslib/pull/1086 for more details.
127-
128- * Fixed CRAM quality string when using "q" feature code (unused by
129- encoders?) and in lossy-quality mode (maybe utilised in old
130- Cramtools).
131- Also see https://github.com/samtools/htslib/pull/1094 for more details.
132-
133- * Fixed some minor memory leaks.
134-
135- * "Scramble -X archive -1" enabled lzma, which should only have
136- arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
137-
138- * Removed minor compilation warning in printf debugging.
139-
140- * Fixed a 7 year old bug in scram_pileup which couldn't cope with
141- soft-clips being followed by hard-clips.
142-
143-
144- Technology Demo: CRAM 3.1 and 4.0
145- =================================
146-
147- The current official GA4GH CRAM version is 3.0.
148-
149- For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
150- version 3.1, with new compression codecs (but is otherwise identical
151- file layout to 3.0), and 4.0 with a few additional format
52+ For purposes of *EVALUATION ONLY* this release of io_lib also includes
53+ an experimental CRAM version 4.0. The format is very likely to change
54+ and should not be used for production data. CRAM 4.0 includes format
15255modifications, such as 64-bit sizes, deduplication of read names,
15356orientation changes of quality strings and a revised variable sized
154- integer encoding.
57+ integer encoding. It can be enabled using scramble -V4.0.
15558
156- They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
157- It is likely CRAM v4.0 will be official significantly later, but we
158- plan on v3.1 being a recognised GA4GH standard this year.
159-
160- By default enabling either of these will also enable the new codecs.
59+ Enabling CRAM 3.1 or 4.0 will also enable the new codecs.
16160Which codecs are used also depends on the profile specified (eg via
16261"-X small"). Some of the new codecs are considerably slower,
16362especially at decompression, but by default CRAM 3.1 aims to be
@@ -167,79 +66,37 @@ small and archive respectively).
16766
16867Here are some example file sizes and timings with different codecs and
16968levels on 10 million 150bp NovaSeq reads, single threaded. Decode
170- timing is checked using "scram_flagstat -b". Tests were performed
171- on an Intel i5-4570 processor at 3.2GHz.
69+ timing is checked using "scram_flagstat -b".
70+
71+ Table produced with Io_lib 1.15.0 on a laptop with Intel i7-1185G7
72+ CPU running Ubuntu 20.04 under Microsoft's WSL2.
17273
17374| Scramble opts. | Size(MB) | Enc(s)| Dec(s)| Codecs used |
17475| --------------------| --------:| -----:| -----:| ---------------------------|
175- | -O bam | 531.9| 92.3| 7.5| bgzf(zlib) |
176- | -O bam -1 | 611.4| 26.4| 5.4| bgzf(libdeflate) |
177- | -O bam (default) | 539.5| 45.0| 4.9| bgzf(libdeflate) |
178- | -O bam -9 | 499.5| 920.2| 4.9| bgzf(libdeflate) |
179- ||||||
180- | -V2.0 -X fast | 317.7| 38.8| 11.8| (default, level 1) |
181- | -V2.0 (default) | 267.6| 47.0| 10.5| (default) |
182- | -V2.0 -X small | 218.0| 124.6| 33.1| bzip2 |
183- ||||||
184- | -V3.0 -X fast | 264.9| 31.3| 10.8| (default, level 1) |
185- | -V3.0 (default) | 223.7| 34.7| 10.3| (default) |
186- | -V3.0 -X small | 212.3| 88.3| 18.2| bzip2 |
187- | -V3.0 -X archive | 209.4| 98.7| 18.2| bzip2 |
188- ||||||
189- | -V3.1 -X fast | 262.4| 29.1| 9.3| rANS++ |
190- | -V3.1 (default) | 186.4| 33.7| 8.3| rANS++,tok3 |
191- | -V3.1 -X small | 176.8| 74.0| 35.2| rANS++,tok3,fqz |
192- | -V3.1 -X archive | 171.9| 127.9| 34.9| rANS++,tok3,fqz,bzip2,arith|
193- ||||||
194- | -V4.0 -X fast | 251.2| 28.9| 9.6| rANS++ |
195- | -V4.0 (default) | 182.1| 32.9| 8.2| rANS++,tok3 |
196- | -V4.0 -X small | 170.9| 70.9| 35.0| rANS++,tok3,fqz |
197- | -V4.0 -X archive | 166.9| 116.4| 34.2| rANS++,tok3,fqz,bzip2,arith|
198-
199- We also tested on a small human aligned HiSeq run (ERR317482)
200- representing older Illumina data with pre-binning era quality values.
201- This dataset shows less impressive gains with 4.0 over 3.0 in the
202- default profile, but major gains in small profile once fqzcomp quality
203- encoding is enabled.
204-
205- Note for this file, the file sizes are larger meaning less disk
206- caching is possible (the test machine wasn't a memory stressed
207- desktop). Threading was also enabled, albeit with just 4 threads,
208- which further exacerbates I/O bottlenecks. The previous test
209- demonstrated BAM being faster to read than CRAM, but with large files
210- in a more I/O stressed situation this test demonstrates the default
211- profile of CRAM is faster to read than BAM, due to the smaller I/O
212- footprint.
213-
214- NB: the table below was produced with 1.14.12.
215-
216- | Scramble opts. | Size(MB) | Enc(s)| Dec(s)| Codecs used |
217- | -------------------- | --------:| -----:| -----:| --------------------------------|
218- | -t4 -O bam (default) | 6526 | 115.4| 44.7| bgzf(libdeflate) |
76+ | -O bam (default) | 518.2| 65.8| 5.7| bgzf(zlib) |
77+ | -O bam -1 | 584.5| 17.4| 3.5| bgzf(libdeflate) |
78+ | -O bam (default) | 524.6| 27.8| 2.9| bgzf(libdeflate) |
79+ | -O bam -9 | 486.5| 810.4| 3.0| bgzf(libdeflate) |
21980||||||
220- | -t4 -V2.0 -X fast | 3674 | 87.4| 31.4| (default, level 1) |
221- | -t4 -V2.0 (default) | 3435 | 91.4| 30.7| (default) |
222- | -t4 -V2.0 -X small | 3373 | 145.5| 47.8| bzip2 |
223- | -t4 -V2.0 -X archive | 3377 | 166.3| 49.7| bzip2 |
224- | -t4 -V2.0 -X archive -9| 3125 | 1900.6| 76.9| bzip2 |
81+ | -V2.0 -X fast | 294.5| 23.1| 7.8| (default, level 1) |
82+ | -V2.0 (default) | 252.3| 32.9| 8.0| (default) |
83+ | -V2.0 -X small | 208.0| 85.2| 23.5| bzip2 |
84+ | -V2.0 -X archive | 206.0| 88.1| 24.3| bzip2 |
22585||||||
226- | -t4 -V3.0 -X fast | 3620 | 88.3| 29.3| (default, level 1) |
227- | -t4 -V3.0 (default) | 3287 | 90.5| 29.5| (default) |
228- | -t4 -V3.0 -X small | 3238 | 128.5| 40.3| bzip2 |
229- | -t4 -V3.0 -X archive | 3220 | 164.9| 50.0| bzip2 |
230- | -t4 -V3.0 -X archive -9| 3115 | 1866.6| 75.2| bzip2, lzma |
86+ | -V3.0 -X fast | 241.1| 19.7| 8.5| (default, level 1) |
87+ | -V3.0 (default) | 208.5| 23.0| 8.8| (default) |
88+ | -V3.0 -X small | 201.7| 60.0| 14.5| bzip2 |
89+ | -V3.0 -X archive | 199.9| 61.7| 13.6| bzip2 |
23190||||||
232- | -t4 -V3.1 -X fast | 3611 | 87.9| 29.2| rANS++ |
233- | -t4 -V3.1 (default) | 3161 | 88.8| 29.7| rANS++,tok3 |
234- | -t4 -V3.1 -X small | 2249 | 192.2| 146.1| rANS++,tok3,fqz |
235- | -t4 -V3.1 -X archive | 2157 | 235.2| 127.5| rANS++,tok3,fqz,bzip2,arith |
236- | -t4 -V3.1 -X archive -9| 2145 | 480.3| 128.9| rANS++,tok3,fqz,bzip2,arith,lzma|
91+ | -V3.1 -X fast | 237.1| 22.1| 7.9| rANS++ |
92+ | -V3.1 (default) | 175.8| 26.7| 8.9| rANS++,tok3 |
93+ | -V3.1 -X small | 166.9| 47.9| 24.6| rANS++,tok3,fqz |
94+ | -V3.1 -X archive | 162.2| 72.5| 20.5| rANS++,tok3,fqz,bzip2,arith|
23795||||||
238- | -t4 -V4.0 -X fast | 3551 | 87.8| 29.5| rANS++ |
239- | -t4 -V4.0 (default) | 3148 | 88.9| 30.0| rANS++,tok3 |
240- | -t4 -V4.0 -X small | 2236 | 189.7| 142.6| rANS++,tok3,fqz |
241- | -t4 -V4.0 -X archive | 2139 | 226.7| 127.5| rANS++,tok3,fqz,bzip2,arith |
242- | -t4 -V4.0 -X archive -9| 2132 | 453.5| 128.2| rANS++,tok3,fqz,bzip2,arith,lzma|
96+ | -V4.0 -X fast | 227.5| 16.6| 6.2| rANS++ |
97+ | -V4.0 (default) | 172.8| 19.7| 6.3| rANS++,tok3 |
98+ | -V4.0 -X small | 162.3| 34.8| 20.2| rANS++,tok3,fqz |
99+ | -V4.0 -X archive | 157.9| 82.2| 26.2| rANS++,tok3,fqz,bzip2,arith|
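
To make the size trade-offs in the 1.15.0 table easier to compare, the following minimal Python sketch expresses each file size as a percentage of a chosen baseline row. It is not part of io_lib. The figures are copied from the added rows of the table above, while the `ROWS` list, the row labels (including the "libdeflate" suffix on the BAM row) and the `relative_size` helper are illustrative names introduced here.

```python
# Illustrative helper, not part of io_lib.
# Figures copied from the 1.15.0 table: (label, size in MB, encode s, decode s).
# The labels are ours; the BAM row is the bgzf(libdeflate) default row.
ROWS = [
    ("-O bam (default, libdeflate)", 524.6, 27.8, 2.9),
    ("-V3.0 (default)", 208.5, 23.0, 8.8),
    ("-V3.1 (default)", 175.8, 26.7, 8.9),
    ("-V3.1 -X small", 166.9, 47.9, 24.6),
    ("-V3.1 -X archive", 162.2, 72.5, 20.5),
    ("-V4.0 (default)", 172.8, 19.7, 6.3),
]


def relative_size(rows, baseline):
    """Return {label: size as a percentage of the baseline row's size}."""
    sizes = {label: size for label, size, _enc, _dec in rows}
    base = sizes[baseline]  # raises KeyError if the baseline label is unknown
    return {label: round(100.0 * size / base, 1) for label, size in sizes.items()}


if __name__ == "__main__":
    for label, pct in relative_size(ROWS, "-V3.0 (default)").items():
        print(f"{label:30s} {pct:6.1f}% of CRAM 3.0 default")
```

On these figures the CRAM 3.1 default output is about 84% of the CRAM 3.0 default size. Timings are left out of the calculation because they depend heavily on the test machine.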
243100
244101
245102Building