Commit 4407f2a

Update files for release 1.15.
This removes the warning on CRAM 3.1 being in draft, and updates htscodecs to gain fqzcomp speed improvements.
1 parent 771c53c commit 4407f2a

5 files changed, with 63 additions and 192 deletions.

CHANGES

Lines changed: 18 additions & 0 deletions
@@ -1,3 +1,21 @@
+Version 1.15.0 (14th April 2023)
+--------------
+
+Version number bumped to reflect the official status of CRAM 3.1.
+
+Updates:
+
+* Formally accept CRAM 3.1 as an official standard. Warning removed.
+For best compatibility CRAM 3.0 is still the default CRAM, but use
+"-V3.1" to specify the version.
+
+* Updated to latest htscodecs. This has a significant speed
+improvement in encoding with fqzcomp (enabled in "-X small" profile).
+
+Tested on a NovaSeq dataset, encoding from BAM to CRAM was 27% faster.
+Decoding a CRAM with fqzcomp is also around 6% faster.
+
+
 Version 1.14.15 (6th December 2022)
 ---------------
 

README.md

Lines changed: 41 additions & 184 deletions
@@ -1,5 +1,5 @@
-Io_lib: Version 1.14.15
-========================
+Io_lib: Version 1.15.0
+=======================
 
 Io_lib is a library of file reading and writing code to provide a general
 purpose SAM/BAM/CRAM, trace file (and Experiment File) reading
@@ -33,131 +33,30 @@ See the CHANGES for a summary of older updates or git logs for the
 full details.
 
 
-Version 1.14.15 (6th December 2022)
----------------
+Version 1.15.0 (14th April 2023)
+--------------
 
-This is primarily a bug fix release.
+The first release that no longer warns about CRAM 3.1 being draft.
+No changes have been made to the format and it is fully compatible
+with the 1.14.x releases.
 
 
-Version 1.14.14 (17th March 2021)
----------------
+Technology Demo: 4.0
+====================
 
-This is simply a bug fix release. It also updates to the latest
-htscodecs submodule, now at an official 1.0 release.
+The current official GA4GH CRAM version is 3.1.
 
-Version 1.14.13 (3rd July 2020)
----------------
+The current default CRAM output is 3.0, for maximum compatibility with
+other tools. Use the -V3.1 option to select CRAM 3.1 if needed.
 
-This release has a mixture of on-going CRAM 4 work (not compatible
-with previous CRAM 4) and some more general quality of life
-improvements for all CRAM versions including speed-ups and better
-multi-threading.
-
-Note both CRAM 3.1 and 4.0 are still to be considered an unofficial
-CRAM extensions.
-
-Updates:
-
-* Scramble can now filter-in or filter-out aux tags during
-transcoding. This is done using -d and -D options. For example:
-
-scramble -D OQ,BI,BD in.bam out.cram
-
-removes the GATK added OQ, BI and BD aux tags.
-Requested by @jhaezebrouck in issue #24.
-
-* The Scramble -X <profile> options are now implemented using a
-CRAM_OPT_PROFILE option. This simplifies the scramble code and
-makes it easier to call from a library. This also fixes a number of
-bugs in the order of argument parsing.
-
-* Improved CRAM writing speeds.
-
-The bam_copy function now only copies the number of used bytes
-rather than the number of allocated bytes, which can sometimes be
-substantially smaller. As this was done in the main thread it may
-have a significant benefit when multi-threading.
-
-* Added libdeflate support into CRAM too (in addition to the existing
-support in BAM). This isn't a huge change to CRAM speeds except at
-high levels (-8 and -9) which are now slower, but also better
-compression ratio. A modest 2-3% speed gain is visible are low and
-mid levels, and at -1/-2 to -4 the compression ratio is also
-improved.
-
-* CRAM 3.1 compression level -1 is now 25% faster, but 4% larger.
-This is achieved by difference choice of compression codecs, most
-notably disabling the name tokeniser for level 1. Use level 2 for
-something comparable to the old behaviour.
-
-* Added an io_lib/version.h to make it easier to detect the version
-being compiled against using IOLIB_VERSION macros.
-Requested by German Tischler in issue #25.
-
-* Refactored the cram encoding interface used by biobambam.
-Implemented by German Tischler in PR#27.
-
-* CRAM 4 now uses E_CONST instead of a uni-value version of
-E_HUFFMAN. Also added offset field to VARINT_SIGNED and
-VARINT_UNSIGNED which helps for data series that have values from -1
-to MAXINT.
-
-* CRAM 4 container structure has changed so that all values are
-variable sized integers instead of fixed size.
-
-* Further improvements with CRAM 4's use of signed values.
-- Ref_seq_id is container and slice headers are now signed.
-- RI (ref ID) data series and NS (mate ref ID) are also now signed
-as -1 is a valid value.
-- Embedded ref id is now 0 for unusued instead of -1.
-
-* Reversed the use of CRAM 4 delta encoding for the B array. It only
-helps at the moment for ONT signal data, so it needs more work to
-make it auto-detect when delta makes sense. (Enabling it globally
-for CRAM4 B aux tags was accidental.)
-
-* Htscodecs submodule has gained support for big-endian platforms
-Other big-endian improvements to parts of CRAM4 too.
-
-Bug fixes:
-
-* Fixed CRAM MD tag generatin when using the "b" feature code
-(NB: unused by known CRAM encoders).
-Also see https://github.com/samtools/htslib/pull/1086 for more details.
-
-* Fixed CRAM quality string when using "q" feature code (unused by
-encoders?) and in lossy-quality mode (maybe utilised in old
-Cramtools).
-Also see https://github.com/samtools/htslib/pull/1094 for more details.
-
-* Fixed some minor memory leaks.
-
-* "Scramble -X archive -1" enabled lzma, which should only have
-arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
-
-* Removed minor compilation warning in printf debugging.
-
-* Fixed a 7 year old bug in scram_pileup which couldn't cope with
-soft-clips being followed by hard-clips.
-
-
-Technology Demo: CRAM 3.1 and 4.0
-=================================
-
-The current official GA4GH CRAM version is 3.0.
-
-For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
-version 3.1, with new compression codecs (but is otherwise identical
-file layout to 3.0), and 4.0 with a few additional format
+For purposes of *EVALUATION ONLY* this release of io_lib also includes
+an experimental CRAM version 4.0. The format very likely to change
+and should not be used for production data. CRAM 4.0 includes format
 modifications, such as 64-bit sizes, deduplication of read names,
 orientation changes of quality strings and a revised variable sized
-integer encoding.
+integer encoding. It can be enabled using scramble -V4.0
 
-They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
-It is likely CRAM v4.0 will be official significantly later, but we
-plan on v3.1 being a recognised GA4GH standard this year.
-
-By default enabling either of these will also enable the new codecs.
+Enabling CRAM 3.1 or 4.0 will also enable the new codecs.
 Which codecs are used also depends on the profile specified (eg via
 "-X small"). Some of the new codecs are considerably slower,
 especially at decompression, but by default CRAM 3.1 aims to be
@@ -167,79 +66,37 @@ small and archive respectively).
 
 Here are some example file sizes and timings with different codecs and
 levels on 10 million 150bp NovaSeq reads, single threaded. Decode
-timing is checked using "scram_flagstat -b". Tests were performed
-on an Intel i5-4570 processor at 3.2GHz.
+timing is checked using "scram_flagstat -b".
+
+Table produced with Io_lib 1.15.0 on a laptop with Intel i7-1185G7
+CPU running Ubuntu 20.04 under Microsoft's WSL2.
 
 |Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
 |--------------------|--------:|-----:|-----:|---------------------------|
-|-O bam | 531.9| 92.3| 7.5|bgzf(zlib) |
-|-O bam -1 | 611.4| 26.4| 5.4|bgzf(libdeflate) |
-|-O bam (default) | 539.5| 45.0| 4.9|bgzf(libdeflate) |
-|-O bam -9 | 499.5| 920.2| 4.9|bgzf(libdeflate) |
-||||||
-|-V2.0 -X fast | 317.7| 38.8| 11.8|(default, level 1) |
-|-V2.0 (default) | 267.6| 47.0| 10.5|(default) |
-|-V2.0 -X small | 218.0| 124.6| 33.1|bzip2 |
-||||||
-|-V3.0 -X fast | 264.9| 31.3| 10.8|(default, level 1) |
-|-V3.0 (default) | 223.7| 34.7| 10.3|(default) |
-|-V3.0 -X small | 212.3| 88.3| 18.2|bzip2 |
-|-V3.0 -X archive | 209.4| 98.7| 18.2|bzip2 |
-||||||
-|-V3.1 -X fast | 262.4| 29.1| 9.3|rANS++ |
-|-V3.1 (default) | 186.4| 33.7| 8.3|rANS++,tok3 |
-|-V3.1 -X small | 176.8| 74.0| 35.2|rANS++,tok3,fqz |
-|-V3.1 -X archive | 171.9| 127.9| 34.9|rANS++,tok3,fqz,bzip2,arith|
-||||||
-|-V4.0 -X fast | 251.2| 28.9| 9.6|rANS++ |
-|-V4.0 (default) | 182.1| 32.9| 8.2|rANS++,tok3 |
-|-V4.0 -X small | 170.9| 70.9| 35.0|rANS++,tok3,fqz |
-|-V4.0 -X archive | 166.9| 116.4| 34.2|rANS++,tok3,fqz,bzip2,arith|
-
-We also tested on a small human aligned HiSeq run (ERR317482)
-representing older Illumina data with pre-binning era quality values.
-This dataset shows less impressive gains with 4.0 over 3.0 in the
-default profile, but major gains in small profile once fqzcomp quality
-encoding is enabled.
-
-Note for this file, the file sizes are larger meaning less disk
-caching is possible (the test machine wasn't a memory stressed
-desktop). Threading was also enabled, albeit with just 4 threads,
-which further exacerbates I/O bottlenecks. The previous test
-demonstrated BAM being faster to read than CRAM, but with large files
-in a more I/O stressed situation this test demonstrates the default
-profile of CRAM is faster to read than BAM, due to the smaller I/O
-footprint.
-
-NB: the table below was produced with 1.14.12.
-
-|Scramble opts. |Size(MB) |Enc(s)|Dec(s)|Codecs used |
-|-------------------- |--------:|-----:|-----:|--------------------------------|
-|-t4 -O bam (default) | 6526 | 115.4| 44.7|bgzf(libdeflate) |
+|-O bam (default) | 518.2| 65.8| 5.7|bgzf(zlib) |
+|-O bam -1 | 584.5| 17.4| 3.5|bgzf(libdeflate) |
+|-O bam (default) | 524.6| 27.8| 2.9|bgzf(libdeflate) |
+|-O bam -9 | 486.5| 810.4| 3.0|bgzf(libdeflate) |
 ||||||
-|-t4 -V2.0 -X fast | 3674 | 87.4| 31.4|(default, level 1) |
-|-t4 -V2.0 (default) | 3435 | 91.4| 30.7|(default) |
-|-t4 -V2.0 -X small | 3373 | 145.5| 47.8|bzip2 |
-|-t4 -V2.0 -X archive | 3377 | 166.3| 49.7|bzip2 |
-|-t4 -V2.0 -X archive -9| 3125 |1900.6| 76.9|bzip2 |
+|-V2.0 -X fast | 294.5| 23.1| 7.8|(default, level 1) |
+|-V2.0 (default) | 252.3| 32.9| 8.0|(default) |
+|-V2.0 -X small | 208.0| 85.2| 23.5|bzip2 |
+|-V2.0 -X archive | 206.0| 88.1| 24.3|bzip2 |
 ||||||
-|-t4 -V3.0 -X fast | 3620 | 88.3| 29.3|(default, level 1) |
-|-t4 -V3.0 (default) | 3287 | 90.5| 29.5|(default) |
-|-t4 -V3.0 -X small | 3238 | 128.5| 40.3|bzip2 |
-|-t4 -V3.0 -X archive | 3220 | 164.9| 50.0|bzip2 |
-|-t4 -V3.0 -X archive -9| 3115 |1866.6| 75.2|bzip2, lzma |
+|-V3.0 -X fast | 241.1| 19.7| 8.5|(default, level 1) |
+|-V3.0 (default) | 208.5| 23.0| 8.8|(default) |
+|-V3.0 -X small | 201.7| 60.0| 14.5|bzip2 |
+|-V3.0 -X archive | 199.9| 61.7| 13.6|bzip2 |
 ||||||
-|-t4 -V3.1 -X fast | 3611 | 87.9| 29.2|rANS++ |
-|-t4 -V3.1 (default) | 3161 | 88.8| 29.7|rANS++,tok3 |
-|-t4 -V3.1 -X small | 2249 | 192.2| 146.1|rANS++,tok3,fqz |
-|-t4 -V3.1 -X archive | 2157 | 235.2| 127.5|rANS++,tok3,fqz,bzip2,arith |
-|-t4 -V3.1 -X archive | 2145 | 480.3| 128.9|rANS++,tok3,fqz,bzip2,arith,lzma|
+|-V3.1 -X fast | 237.1| 22.1| 7.9|rANS++ |
+|-V3.1 (default) | 175.8| 26.7| 8.9|rANS++,tok3 |
+|-V3.1 -X small | 166.9| 47.9| 24.6|rANS++,tok3,fqz |
+|-V3.1 -X archive | 162.2| 72.5| 20.5|rANS++,tok3,fqz,bzip2,arith|
 ||||||
-|-t4 -V4.0 -X fast | 3551 | 87.8| 29.5|rANS++ |
-|-t4 -V4.0 (default) | 3148 | 88.9| 30.0|rANS++,tok3 |
-|-t4 -V4.0 -X small | 2236 | 189.7| 142.6|rANS++,tok3,fqz |
-|-t4 -V4.0 -X archive | 2139 | 226.7| 127.5|rANS++,tok3,fqz,bzip2,arith |
-|-t4 -V4.0 -X archive -9| 2132 | 453.5| 128.2|rANS++,tok3,fqz,bzip2,arith,lzma|
+|-V4.0 -X fast | 227.5| 16.6| 6.2|rANS++ |
+|-V4.0 (default) | 172.8| 19.7| 6.3|rANS++,tok3 |
+|-V4.0 -X small | 162.3| 34.8| 20.2|rANS++,tok3,fqz |
+|-V4.0 -X archive | 157.9| 82.2| 26.2|rANS++,tok3,fqz,bzip2,arith|
 
 
 Building

configure.ac

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 dnl Process this file with autoconf to produce a configure script.
-AC_INIT(io_lib, 1.14.15)
+AC_INIT(io_lib, 1.15.0)
 IOLIB_VERSION=$PACKAGE_VERSION
 IOLIB_VERSION_MAJOR=`expr "$PACKAGE_VERSION" : '\([[0-9]]*\)'`
 IOLIB_VERSION_MINOR=`expr "$PACKAGE_VERSION" : '[[0-9]]*\.\([[0-9]]*\)'`
@@ -69,7 +69,7 @@ AX_SUBDIRS_CONFIGURE([htscodecs],[[--disable-shared],[--with-pic]])
 # libstaden-read.so.1.1.0
 
 VERS_CURRENT=15
-VERS_REVISION=2
+VERS_REVISION=3
 VERS_AGE=1
 AC_SUBST(VERS_CURRENT)
 AC_SUBST(VERS_REVISION)

progs/scramble.c

Lines changed: 1 addition & 5 deletions
@@ -184,7 +184,7 @@ static int filter_tags(bam_seq_t *s, char *aux_filter, int keep) {
 
 static void usage(FILE *fp) {
     fprintf(fp, " -=- sCRAMble -=- version %s\n", IOLIB_VERSION);
-    fprintf(fp, "Author: James Bonfield, Wellcome Trust Sanger Institute. 2013-2022\n\n");
+    fprintf(fp, "Author: James Bonfield, Wellcome Trust Sanger Institute. 2013-2023\n\n");
 
     fprintf(fp, "Usage: scramble [options] [input_file [output_file]]\n");
 
@@ -504,10 +504,6 @@ int main(int argc, char **argv) {
         fprintf(stderr, "\nWARNING: this version of CRAM is not a recognised GA4GH standard.\n"
                 "Note this CRAM version is a technology demonstration only.\n"
                 "Future versions of Scramble may not be able to read these files.\n\n");
-    } else if (cram_default_version() > 300) {
-        fprintf(stderr, "\nWARNING: this version of CRAM has yet to be formally signed off.\n"
-                "CRAM 3.1 has multiple implementations that have been cross-validated, but\n"
-                "the specification document has not yet been accepted as an official standard.\n\n");
     }
 
     if (argc - optind > 2) {
