@Trecek commented May 5, 2025

Stabilizing fastxpp Benchmarks

I had AI summarize my messy notes and hyperfine results. Everything seems to be correct.

This is a follow-up to #14, where there were some inconsistent results.

TL;DR

By holding the benchmarking scaffolding static with @no_inline and selectively forcing inlining on the hottest helpers, we:

  • reduce the overall runtime of the read_once path from 1.38–1.60 s to 0.93–1.05 s on the 10× uniprot_sprot fixture,
  • erase the gap between swar and read_once, and
  • deliver around a 1.2× speed‑up over the current orig implementation in the apples‑to‑apples, separate‑executable benchmark, all without algorithmic changes.

Motivation

The existing benchmark numbers have been noisy, likely because the compiler optimizes the benchmark harness together with the implementation under test. This obscures the real cost of each I/O strategy. We want numbers that:

  1. isolate the implementation, not the harness,
  2. expose the true cost of helper functions like strip_newline, read_byte, and read_until, and
  3. guide us toward the next bottleneck.

Header field definition (for now)

[image]

[image]

How do we calculate the last line (if we wanted to)?

[image]

Different read methods for fastxpp

The methods are named terribly, sorry; I'll give them better names later.

There are 4 key steps:
  1. Identify record start ('>')
  2. Read header
  3. SWAR decode the header info field
  4. Read sequence bytes

Besides the original (naive) read method, the main difference between the other three is how we read the sequence bytes (and quality scores, if this were FASTQ), and in particular how we remove the newlines inside the sequence block of a FASTA record.
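To make the flow concrete, here is a rough, self-contained sketch of the four steps over a plain byte buffer. A List[Int] of byte values stands in for the real reader, the helpers are simplified, and step 4 is the orig-style read; this is not the fastxpp code itself.

```mojo
from collections import List

fn parse_first_record(buf: List[Int]) -> List[Int]:
    var seq = List[Int]()
    var i = 0

    # 1: identify record start ('>' is byte 62)
    while i < len(buf) and buf[i] != 62:
        i += 1
    if i == len(buf):
        return seq

    # 2: read the header line (everything up to the newline, byte 10)
    var header = List[Int]()
    while i < len(buf) and buf[i] != 10:
        header.append(buf[i])
        i += 1
    i += 1  # consume the newline

    # 3: SWAR decode of slen / lcnt / bpl from the header info field
    #    (omitted here; this is where the variants below get their hints)

    # 4: read sequence bytes until the next '>' or EOF, dropping newlines
    while i < len(buf) and buf[i] != 62:
        if buf[i] != 10:
            seq.append(buf[i])
        i += 1
    return seq

def main():
    # ">r1\nACGT\nACGT\n" as byte values
    var buf = List[Int](62, 114, 49, 10, 65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(parse_first_record(buf)))  # prints 8
```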

orig

  • Original read method that uses no header comment info

strip_newline

  • Uses header info; does not use bpl, only slen and lcnt
  • SWAR decode
  • In-place compaction to remove newlines (sketched below)
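A minimal sketch of that compaction pass, again over List[Int] as a stand-in byte buffer (not the actual implementation): bulk-read slen + lcnt bytes, then let a write cursor chase the read cursor, dropping newlines as it goes.

```mojo
from collections import List

fn bulk_read(reader: List[Int], n: Int) -> List[Int]:
    # stand-in for one big read of the sequence block, newlines included
    var buf = List[Int]()
    for i in range(n):
        buf.append(reader[i])
    return buf

fn compact_in_place(mut buf: List[Int]) -> Int:
    # write cursor chases the read cursor, dropping '\n' (byte 10)
    var w = 0
    for r in range(len(buf)):
        if buf[r] != 10:
            buf[w] = buf[r]
            w += 1
    return w  # new logical length; bytes past w are stale

def main():
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)  # "ACGT\nACGT\n"
    var slen = 8
    var lcnt = 2
    var block = bulk_read(reader, slen + lcnt)
    print(compact_in_place(block))  # prints 8
```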

swar

  • Uses header info: lcnt, bpl, and slen
  • SWAR decode
  • Uses bytes per line (bpl) to remove newlines with memcpy, jumping over each newline. Passes over the bytes twice (sketched below).
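Roughly, the two passes look like this (a simplified sketch over List[Int]; an element-by-element loop stands in for the memcpy of each line):

```mojo
from collections import List

fn swar_style_read(reader: List[Int], slen: Int, lcnt: Int, bpl: Int) -> List[Int]:
    # pass 1: bulk-read the whole sequence block, newlines and all
    var block = List[Int]()
    for i in range(slen + lcnt):
        block.append(reader[i])

    # pass 2: copy line by line, jumping over the newline after each line
    var seq = List[Int]()
    var src = 0
    while len(seq) < slen:
        var bases = min(bpl - 1, slen - len(seq))  # last line may be short
        for k in range(bases):
            seq.append(block[src + k])  # memcpy stand-in
        src += bases + 1  # skip the '\n'
    return seq

def main():
    # two lines of 4 bases each: bpl = 5 (bases + newline), slen = 8, lcnt = 2
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(swar_style_read(reader, 8, 2, 5)))  # prints 8
```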

read_once

  • Uses header info, but does not need lcnt; only uses slen and bpl
  • SWAR decode
  • Uses bpl to read the bytes up to each newline, then consumes the newline with read_byte.
    Only passes over the bytes once (sketched below)
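And a single-pass sketch of the read_once idea (same List[Int] stand-in; the bulk copy plays the read_until role and the one-byte skip plays the read_byte role):

```mojo
from collections import List

fn read_once_style(reader: List[Int], slen: Int, bpl: Int) -> List[Int]:
    var seq = List[Int]()
    var pos = 0
    while len(seq) < slen:
        var bases = min(bpl - 1, slen - len(seq))  # last line may be short
        for k in range(bases):
            seq.append(reader[pos + k])  # read straight into the destination
        pos += bases
        pos += 1  # read_byte: consume the trailing '\n'
    return seq

def main():
    # two lines of 4 bases each: bpl = 5, slen = 8 (lcnt is never needed)
    var reader = List[Int](65, 67, 71, 84, 10, 65, 67, 71, 84, 10)
    print(len(read_once_style(reader, 8, 5)))  # prints 8
```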

Design of the Experiment

Input: 2.6 GB uncompressed FASTA file

| Variant | Bench fn | Read fn | strip_newline | read_byte | read_until |
|---|---|---|---|---|---|
| A | compiler chooses | compiler chooses | @always_inline | compiler chooses | compiler chooses |
| B | @no_inline | @no_inline | @always_inline | compiler chooses | compiler chooses |
| C | @no_inline | @no_inline | @no_inline | @no_inline | compiler chooses |
| D | @no_inline | @no_inline | @always_inline | @always_inline | @always_inline |
| E | @no_inline | compiler chooses | @always_inline | @always_inline | @always_inline |
| F | compiled separately (@no_inline) | compiler chooses | @always_inline | @always_inline | @always_inline |

All builds used the same `mojo build fastxpp_bench.mojo` invocation and were measured with `hyperfine --warmup 3 -r 10` on an otherwise idle machine.
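For reference, variant D roughly corresponds to code shaped like the following (illustrative names only, not the real bench functions): the bench and read functions are pinned with @no_inline so the harness cannot be folded into the implementation, while the hot helpers are forced inline.

```mojo
from collections import List

@always_inline
fn read_byte(buf: List[Int], pos: Int) -> Int:
    # hot helper: forced inline in variants D–F
    return buf[pos]

@no_inline
fn bench_read_once(buf: List[Int]) -> Int:
    # bench/read fn: pinned so the compiler cannot fold it into the harness
    var total = 0
    for i in range(len(buf)):
        total += read_byte(buf, i)
    return total

def main():
    var buf = List[Int](65, 67, 71, 84)
    print(bench_read_once(buf))  # prints 287
```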

Results Snapshot

| Configuration (best from each group) | orig (s) | strip_newline (s) | swar (s) | read_once (s) | Fastest | Speedup vs orig | Δ vs overall fastest (×) |
|---|---|---|---|---|---|---|---|
| Default inlining | 1.38 | 1.30 | 0.99 | 1.16 | swar | 1.39× | 1.06× |
| Bench+read @no_inline | 1.46 | 1.29 | 1.10 | 1.29 | swar | 1.33× | 1.18× |
| Bench+read @no_inline, helpers @no_inline | 1.59 | 1.29 | 1.08 | 1.27 | swar | 1.36× | 1.16× |
| Bench+read @no_inline, helpers @always_inline | 1.49 | 1.28 | 1.06 | 1.05 | read_once | 1.42× | 1.13× |
| Bench @no_inline, helpers @always_inline | 1.13 | 1.26 | 0.95 | 0.93 | read_once | 1.21× | 1.00× |
| Separate executables | 1.14 | 1.17 | 0.95 | 0.93 | read_once | 1.23× | 1.00× |

Observation: Inlining read_byte eliminates most of the delta between swar and read_once. Adding @always_inline to read_until lets read_once nose ahead.

Ordering Sensitivity

The last entry in the bench list is the most sensitive to @no_inline. Inlining the byte-reading functions eliminates most of the difference; only compiling the variants separately removes the rest.

Summary

  1. Land @no_inline on all bench functions to freeze harness behavior.
  2. Force inlining on read_byte and read_until because they are hot in both swar and read_once paths.
  3. Let the compiler handle inlining for the read functions themselves.
  4. Moving forward, read functions should be benchmarked in separate executables.
  5. read_once and swar are the fastest. read_once does not need lcnt, which would shrink the header comment field by 7 digits. However, swar can be modified to derive lcnt with something like:
    var lcnt = (slen + (bpl - 2)) // (bpl - 1)
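Assuming bpl counts the trailing newline (so a full line carries bpl − 1 bases), this is just a ceiling division over the bases per line. For example, with slen = 250 and bpl = 61: lcnt = (250 + 59) // 60 = 5, i.e. four full 60-base lines plus a 10-base final line.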

Next steps

  • Try running strip_newline over the entire buffer rather than per record
  • Remove lcnt: test the swar read function without the lcnt field
  • Give the read functions better names
  • Parallelize over BGZF blocks, using the header info to handle truncated records (a rough sketch of the block partitioning follows this list).
    • We can make decompression and record processing embarrassingly parallel by sending BGZF blocks to threads, with an overlapping block or two between each thread. The header info then tells us when a record is truncated at the end of a thread's own range, and therefore when we need to reach into the overlapping block for the bytes that complete the record. This way no thread has to share information with other threads; each works only on the data local to it.
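A minimal sketch of just the partitioning arithmetic (everything else, including BGZF decompression and record parsing, is out of scope here, and the numbers are made up): each worker owns a contiguous range of blocks plus a one-block overlap it may read into to finish a truncated record.

```mojo
fn print_worker_ranges(n_workers: Int, n_blocks: Int, overlap: Int):
    var per = (n_blocks + n_workers - 1) // n_workers  # ceil(blocks / workers)
    for w in range(n_workers):
        var start = w * per
        var end = min(start + per, n_blocks)      # blocks this worker owns
        var reach = min(end + overlap, n_blocks)  # it may read into the overlap
        print(w, start, end, reach)

def main():
    # e.g. 10 blocks, 3 workers, one overlapping block each
    print_worker_ranges(3, 10, 1)
```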

Review comment on the read implementation:

```mojo
# track length before and after
var before = len(self.seq)
var _want = want
var _total = self.reader.read_bytes(self.seq, _want)
```
Contributor:
What if here we did `var _total = self.reader.read_bytes(self.seq, _want, keep=True)`?

You'd have to adjust the byte math to subtract one. But it avoids an extra read call, which might be nice.

Contributor:

That's not read_until! So just _want + 1?
