@Trecek commented May 5, 2025

Stabilizing fastxpp Benchmarks

I had AI summarize my messy notes and hyperfine results. Everything seems to be correct.

This is a follow-up to #14, where there were some inconsistent results.

TL;DR

By holding the benchmarking scaffolding static with @no_inline and selectively forcing inlining on the hottest helpers, we:

  • reduce the overall runtime of the read_once path from 1.38–1.60 s to 0.93–1.05 s on the 10× uniprot_sprot fixture,
  • erase the gap between swar and read_once, and
  • deliver around a 1.2× speed‑up over the current orig implementation in the apples‑to‑apples, separate‑executable benchmark, all without algorithmic changes.

Motivation

The existing benchmark numbers have been noisy, likely because the compiler optimizes the benchmark harness together with the implementation under test. This obscures the real cost of each I/O strategy. We want numbers that:

  1. isolate the implementation, not the harness,
  2. expose the true cost of helper functions like strip_newline, read_byte, and read_until, and
  3. guide us toward the next bottleneck.

Header field definition (for now)

[image]

[image]

How do we calculate the last line (if we wanted to)?

[image]

Different read methods for fastxpp

The methods are named terribly, sorry; I'll give them better names later.

There are 4 key steps:
  1. Identify record start ('>')
  2. Read header
  3. SWAR decode the header info field
  4. Read sequence bytes

Besides the original (naive) read method, the main difference between the other three is how we read the sequence bytes (and quality scores, if this were FASTQ), and in particular how we remove the newlines inside the sequence block of a FASTA record.
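To make the flow concrete, here is a rough, self-contained sketch of the four steps over a plain byte buffer. A List[Int] of byte values stands in for the real reader, the helpers are simplified, and step 4 is the orig-style read; this is not the fastxpp code itself.

```mojo
from collections import List

fn parse_first_record(buf: List[Int]) -> List[Int]:
    var seq = List[Int]()
    var i = 0

    # 1: identify record start ('>' is byte 62)
    while i < len(buf) and buf[i] != 62:
        i += 1
    if i == len(buf):
        return seq

    # 2: read the header line (everything up to the newline, byte 10)
    var header = List[Int]()
    while i < len(buf) and buf[i] != 10:
        header.append(buf[i])
        i += 1
    i += 1  # consume the newline

    # 3: SWAR decode of slen / lcnt / bpl from the header info field
    #    (omitted here; this is where the variants below get their hints)

    # 4: read sequence bytes until the next '>' or EOF, dropping newlines
    while i < len(buf) and buf[i] != 62:
        if buf[i] != 10:
            seq.append(buf[i])
        i += 1
    return seq

def main():
    # ">r1\nACGT\nACGT\n" as byte values
    var buf = List[Int](62, 114, 49, 10, 65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(parse_first_record(buf)))  # prints 8
```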

orig

  • Original read method that uses no header comment info

strip_newline

  • Uses header info; does not use bpl, only slen and lcnt
  • SWAR decode
  • In-place compaction to remove newlines (sketched below)
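A minimal sketch of that compaction pass, again over List[Int] as a stand-in byte buffer (not the actual implementation): bulk-read slen + lcnt bytes, then let a write cursor chase the read cursor, dropping newlines as it goes.

```mojo
from collections import List

fn bulk_read(reader: List[Int], n: Int) -> List[Int]:
    # stand-in for one big read of the sequence block, newlines included
    var buf = List[Int]()
    for i in range(n):
        buf.append(reader[i])
    return buf

fn compact_in_place(mut buf: List[Int]) -> Int:
    # write cursor chases the read cursor, dropping '\n' (byte 10)
    var w = 0
    for r in range(len(buf)):
        if buf[r] != 10:
            buf[w] = buf[r]
            w += 1
    return w  # new logical length; bytes past w are stale

def main():
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)  # "ACGT\nACGT\n"
    var slen = 8
    var lcnt = 2
    var block = bulk_read(reader, slen + lcnt)
    print(compact_in_place(block))  # prints 8
```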

swar

  • Uses header info: lcnt, bpl, and slen
  • SWAR decode
  • Uses bytes per line (bpl) to remove newlines with memcpy, jumping over each newline. Passes over the bytes twice (sketched below).
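Roughly, the two passes look like this (a simplified sketch over List[Int]; an element-by-element loop stands in for the memcpy of each line):

```mojo
from collections import List

fn swar_style_read(reader: List[Int], slen: Int, lcnt: Int, bpl: Int) -> List[Int]:
    # pass 1: bulk-read the whole sequence block, newlines and all
    var block = List[Int]()
    for i in range(slen + lcnt):
        block.append(reader[i])

    # pass 2: copy line by line, jumping over the newline after each line
    var seq = List[Int]()
    var src = 0
    while len(seq) < slen:
        var bases = min(bpl - 1, slen - len(seq))  # last line may be short
        for k in range(bases):
            seq.append(block[src + k])  # memcpy stand-in
        src += bases + 1  # skip the '\n'
    return seq

def main():
    # two lines of 4 bases each: bpl = 5 (bases + newline), slen = 8, lcnt = 2
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(swar_style_read(reader, 8, 2, 5)))  # prints 8
```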

read_once

  • Uses header info, but does not need lcnt; only uses slen and bpl
  • SWAR decode
  • Uses bpl to read the bytes up to each newline, then consumes the newline with read_byte.
    Only passes over the bytes once (sketched below)
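And a single-pass sketch of the read_once idea (same List[Int] stand-in; the bulk copy plays the read_until role and the one-byte skip plays the read_byte role):

```mojo
from collections import List

fn read_once_style(reader: List[Int], slen: Int, bpl: Int) -> List[Int]:
    var seq = List[Int]()
    var pos = 0
    while len(seq) < slen:
        var bases = min(bpl - 1, slen - len(seq))  # last line may be short
        for k in range(bases):
            seq.append(reader[pos + k])  # read straight into the destination
        pos += bases
        pos += 1  # read_byte: consume the trailing '\n'
    return seq

def main():
    # two lines of 4 bases each: bpl = 5, slen = 8 (lcnt is never needed)
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(read_once_style(reader, 8, 5)))  # prints 8
```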

Design of the Experiment

Input: 2.6 GB uncompressed FASTA file

| Variant | Bench fn | Read fn | strip_newline | read_byte | read_until |
|---|---|---|---|---|---|
| A | compiler chooses | compiler chooses | @always_inline | compiler chooses | compiler chooses |
| B | @no_inline | @no_inline | @always_inline | compiler chooses | compiler chooses |
| C | @no_inline | @no_inline | @no_inline | @no_inline | compiler chooses |
| D | @no_inline | @no_inline | @always_inline | @always_inline | @always_inline |
| E | @no_inline | compiler chooses | @always_inline | @always_inline | @always_inline |
| F | compiled separately (@no_inline) | compiler chooses | @always_inline | @always_inline | @always_inline |

All builds used the same `mojo build fastxpp_bench.mojo` invocation and were measured with `hyperfine --warmup 3 -r 10` on an otherwise idle machine.
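For reference, variant D roughly corresponds to code shaped like the following (illustrative names only, not the real bench functions): the bench and read functions are pinned with @no_inline so the harness cannot be folded into the implementation, while the hot helpers are forced inline.

```mojo
from collections import List

@always_inline
fn read_byte(buf: List[Int], pos: Int) -> Int:
    # hot helper: forced inline in variants D–F
    return buf[pos]

@no_inline
fn bench_read_once(buf: List[Int]) -> Int:
    # bench/read fn: pinned so the compiler cannot fold it into the harness
    var total = 0
    for i in range(len(buf)):
        total += read_byte(buf, i)
    return total

def main():
    var buf = List[Int](65, 67, 71, 84)
    print(bench_read_once(buf))  # prints 287
```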

Results Snapshot

| Configuration (best from each group) | orig (s) | strip_newline (s) | swar (s) | read_once (s) | Fastest | Speedup vs orig | Δ vs overall fastest (×) |
|---|---|---|---|---|---|---|---|
| Default inlining | 1.38 | 1.30 | 0.99 | 1.16 | swar | 1.39× | 1.06× |
| Bench+read @no_inline | 1.46 | 1.29 | 1.10 | 1.29 | swar | 1.33× | 1.18× |
| Bench+read @no_inline, helpers @no_inline | 1.59 | 1.29 | 1.08 | 1.27 | swar | 1.36× | 1.16× |
| Bench+read @no_inline, helpers @always_inline | 1.49 | 1.28 | 1.06 | 1.05 | read_once | 1.42× | 1.13× |
| Bench @no_inline, helpers @always_inline | 1.13 | 1.26 | 0.95 | 0.93 | read_once | 1.21× | 1.00× |
| Separate executables | 1.14 | 1.17 | 0.95 | 0.93 | read_once | 1.23× | 1.00× |

Observation: Inlining read_byte eliminates most of the delta between swar and read_once. Adding @always_inline to read_until lets read_once nose ahead.

Ordering Sensitivity

The last entry in the bench list is the most sensitive to @no_inline. Inlining the byte-reading functions eliminates most of the difference; only compiling the variants separately removes the rest.

Summary

  1. Land @no_inline on all bench functions to freeze harness behavior.
  2. Force inlining on read_byte and read_until because they are hot in both swar and read_once paths.
  3. Let the compiler handle inlining for the read functions themselves.
  4. Moving forward, read functions should be benchmarked in separate executables.
  5. read_once and swar are the fastest. read_once does not need lcnt, which would shrink the header comment field by 7 digits. However, swar can be modified to derive lcnt with something like:
    var lcnt = (slen + (bpl - 2)) // (bpl - 1)
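Assuming bpl counts the trailing newline (so a full line carries bpl − 1 bases), this is just a ceiling division over the bases per line. For example, with slen = 250 and bpl = 61: lcnt = (250 + 59) // 60 = 5, i.e. four full 60-base lines plus a 10-base final line.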

Next steps

  • Try running strip_newline over the entire buffer rather than per record
  • Remove lcnt: test the swar read function without the lcnt field
  • Give the read functions better names
  • Parallelize over BGZF blocks, using the header info to handle truncated records (a rough sketch of the block partitioning follows this list).
    • We can make decompression and record processing embarrassingly parallel by sending BGZF blocks to threads, with an overlapping block or two between each thread. The header info then tells us when a record is truncated at the end of a thread's own range, and therefore when we need to reach into the overlapping block for the bytes that complete the record. This way no thread has to share information with other threads; each works only on the data local to it.
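A minimal sketch of just the partitioning arithmetic (everything else, including BGZF decompression and record parsing, is out of scope here, and the numbers are made up): each worker owns a contiguous range of blocks plus a one-block overlap it may read into to finish a truncated record.

```mojo
fn print_worker_ranges(n_workers: Int, n_blocks: Int, overlap: Int):
    var per = (n_blocks + n_workers - 1) // n_workers  # ceil(blocks / workers)
    for w in range(n_workers):
        var start = w * per
        var end = min(start + per, n_blocks)      # blocks this worker owns
        var reach = min(end + overlap, n_blocks)  # it may read into the overlap
        print(w, start, end, reach)

def main():
    # e.g. 10 blocks, 3 workers, one overlapping block each
    print_worker_ranges(3, 10, 1)
```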

Review comment on the read implementation:

```mojo
# track length before and after
var before = len(self.seq)
var _want = want
var _total = self.reader.read_bytes(self.seq, _want)
```
Contributor:
What if here we did `var _total = self.reader.read_bytes(self.seq, _want, keep=True)`?

You'd have to adjust the byte math to subtract one. But it avoids an extra read call, which might be nice.

Contributor:

That's not read_until! So just _want + 1?
