---
title: Rust BufWriter and LZ4 Compression
categories:
- rust
discussions:
---
Recently, I've been working on a Rust project again.
It deals with bioinformatics data, which can be quite large,
so I got to play with profiling and optimizing the code.
I've done some of this in the past, but this time it was _actually_ useful.
In this post, I want to talk about a small optimization
in working with LZ4 compression
that made a big difference in runtime performance.

This tool mainly reads in a BAM file
(which contains aligned genome sequence data),
does some processing on it,
and outputs the results in various formats,
chosen by the user.
One of the formats is the internal data structure used by the tool,
which is convenient for debugging and testing.
Since this is Rust, all I had to do was add some `#[derive(Serialize, Deserialize)]` annotations,
choose a good format (I picked [MessagePack](https://msgpack.org/)),
and thanks to [serde](https://serde.rs/),
I had a data format.
Concretely, I made an enum with all the possible structures I want to output
(which include header fields),
and I serialize and write each structure separately,
so that they are concatenated in the output file.
To read it back in,
I wrote a little helper function[^rmp-stream] that
keeps deserializing these enum values until it reaches the end of the file.
So far, so good.

## Compression with LZ4

However, the output file was quite large
-- it's pretty much everything I have in RAM.
I wanted to compress it,
but I also knew that compression is expensive,
and for my debug output I don't really need to squeeze every byte out of it.
I chose [LZ4](https://lz4.github.io/lz4/), via the `lz4` crate.
Its [`Encoder`](https://docs.rs/lz4/1.28.1/lz4/struct.Encoder.html)
implements `Write`,
so we can just wrap our writer in it and continue to use it as before:

```rust
let file = std::fs::File::create("output.msgpack.lz4")?;
let encoder = lz4::EncoderBuilder::new().level(4).build(file)?;
```

Pretty early in my Rust journey,
I learned that file I/O is not buffered by default,
so it's a good idea to wrap the `file` in a `BufWriter`:

```rust
let file = std::fs::File::create("output.msgpack.lz4")?;
let file_buffered = std::io::BufWriter::new(file);
let encoder = lz4::EncoderBuilder::new().level(4).build(file_buffered)?;
```

This then creates a chain like this:

```
MessagePack Serializer -> LZ4 Encoder -> BufWriter -> File
```

## Profiling

When profiling the code (with [samply](https://github.com/mstange/samply/)),
I noticed that the overhead from LZ4 was quite high.
Even after lowering the compression level to 0,
I wasn't happy.
This was slower than the BGZIP compression I use for BCF files!
And that is based on Deflate, which, while heavily optimized,
is not an algorithm that should play in the same league as LZ4.
What is going on here?

I saw that there were **many** stacks with calls to `LZ4F_compressUpdateImpl`.
Looking at [the implementation](https://github.com/lz4/lz4/blob/v1.10.0/lib/lz4frame.c#L977)
with the per-line sample counts,
I saw a lot of calls to `LZ4F_selectCompression`, `LZ4F_compressBound_internal`,
`memcpy` (if the temporary block buffer has space and LZ4 wants to buffer),
`LZ4F_makeBlock`, which writes the block header and checksum,
and finally `XXH32_update`, which computes the checksum for the block.
Why is this being called so much, and why are there so many blocks being made?

LZ4 is a block-based compression algorithm,
which means that it compresses data in chunks.
The chunks we are giving it are the serialized MessagePack values,
which are around 250 bytes each.
This means that for every 250-byte chunk,
we're calling into LZ4 and asking it to compress it.
And for every 250-byte chunk,
it does the entire round of checks, compression, and checksumming.

## Swap the buffer

Knowing that LZ4 works with blocks internally,
I had the idea that I could swap the way I use the buffer:
instead of buffering writes to the file system,
I could buffer writes to the LZ4 encoder.

```rust
let file = std::fs::File::create("output.msgpack.lz4")?;
let encoder = lz4::EncoderBuilder::new().level(4).build(file)?;
let encoder_buffered = std::io::BufWriter::new(encoder);
```

And indeed, this works!
In my initial benchmark, this made this part of the code 1.83 times faster.
An amazing result for basically just swapping two lines of code.

[^rmp-stream]: `serde_json` includes a [`StreamDeserializer`](https://docs.rs/serde_json/1.0.140/serde_json/struct.StreamDeserializer.html) but `rmp_serde` does not, so I wrote one myself. It's not as feature-complete (I think), but you can find it [here](https://github.com/3Hren/msgpack-rust/issues/317#issuecomment-3012814957).