trace2fio.py is a utility that takes an eBPF or syscall-level trace of file
I/O activity and reconstructs an equivalent workload description suitable for
use with fio. It lets developers, storage engineers, and kernel researchers
replay realistic I/O patterns captured from real systems in a deterministic
and reproducible way.
Rather than relying on statistical inference or machine learning, trace2fio
uses a deterministic state machine that models file descriptor lifecycles.
This makes it easy to reason about and verify, while remaining flexible enough
to describe complex I/O behaviors.
Modern eBPF frameworks can trace fine-grained I/O events such as open,
read, write, and fsync across all processes. However, converting those
traces into a form usable for benchmarking tools like fio has traditionally
required manual analysis or ad hoc scripts.
trace2fio aims to bridge this gap by:
- Mapping syscall traces to high-level fio jobs automatically.
- Preserving key parameters such as block size, read/write mix, sequential/random access, and direct I/O flags.
- Allowing controlled replay of real-world workloads without intrusive tracing.
- Providing a deterministic, auditable workflow without machine learning heuristics.
The tool parses a syscall trace in CSV or JSONL format with at least the following fields:
ts,pid,comm,syscall,fd,bytes,offset,flags,path,ret
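A minimal parsing sketch in Python, assuming the CSV variant of the schema; the TraceEvent class and parse_trace helper are illustrative names, not trace2fio's actual API:

import csv
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class TraceEvent:
    ts: float
    pid: int
    comm: str
    syscall: str
    fd: int
    bytes: Optional[int]    # empty for open/fsync/close
    offset: Optional[int]   # empty for open/fsync/close
    flags: str
    path: str
    ret: int

def parse_trace(path: str) -> Iterator[TraceEvent]:
    """Yield one TraceEvent per CSV row, leaving absent fields as None."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield TraceEvent(
                ts=float(row["ts"]),
                pid=int(row["pid"]),
                comm=row["comm"],
                syscall=row["syscall"],
                fd=int(row["fd"]),
                bytes=int(row["bytes"]) if row["bytes"] else None,
                offset=int(row["offset"]) if row["offset"] else None,
                flags=row["flags"],
                path=row["path"],
                ret=int(row["ret"]),
            )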
It then applies a state machine model; a code sketch follows the list:
- File descriptor tracking — each (pid, fd) pair is tracked through open → read/write → fsync → close.
- Operation aggregation — offsets, sizes, and timestamps are aggregated per file.
- Pattern inference — block size, sequential vs random pattern, and I/O type (read/write/mixed) are derived.
- fio job generation — each file becomes a [job] section in the output .fio file.
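A condensed sketch of how such a state machine might look, building on the hypothetical TraceEvent records above (FileStats, build_stats, and to_fio_job are invented names; trace2fio's internals may differ):

from collections import defaultdict

class FileStats:
    """Aggregated I/O observed for one file across its fd lifecycle."""
    def __init__(self):
        self.reads = 0
        self.writes = 0
        self.sizes = defaultdict(int)   # I/O size -> occurrence count
        self.total_bytes = 0
        self.sequential = 0
        self.random = 0
        self.direct = False

def build_stats(events):
    open_fds = {}                   # (pid, fd) -> path
    expected = {}                   # (pid, fd) -> next sequential offset
    stats = defaultdict(FileStats)  # path -> FileStats

    for ev in events:
        key = (ev.pid, ev.fd)
        if ev.syscall == "open" and ev.ret >= 0:
            open_fds[key] = ev.path
            expected[key] = 0
            if "O_DIRECT" in ev.flags:
                stats[ev.path].direct = True
        elif ev.syscall in ("read", "write") and key in open_fds:
            st = stats[open_fds[key]]
            if ev.syscall == "read":
                st.reads += 1
            else:
                st.writes += 1
            st.sizes[ev.bytes] += 1
            st.total_bytes += ev.bytes
            # An I/O is sequential if it starts where the last one ended.
            if ev.offset == expected.get(key):
                st.sequential += 1
            else:
                st.random += 1
            expected[key] = ev.offset + ev.bytes
        elif ev.syscall == "close":
            open_fds.pop(key, None)
            expected.pop(key, None)
    return stats

def to_fio_job(st):
    """Derive fio parameters from one file's aggregated stats."""
    seq = st.sequential >= st.random
    if st.reads and st.writes:
        rw = "rw" if seq else "randrw"
    elif st.writes:
        rw = "write" if seq else "randwrite"
    else:
        rw = "read" if seq else "randread"
    return {
        "rw": rw,
        "bs": max(st.sizes, key=st.sizes.get),  # dominant block size
        "size": st.total_bytes,
        "direct": int(st.direct),
    }

Applied to the example trace later in this document, such a sketch would yield rw=write, bs=4096, size=8192, and direct=1.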
The resulting fio configuration can be used to replay the original workload on any system, enabling accurate reproduction and performance analysis.
$ python3 trace2fio.py trace.csv -o workload.fio
$ fio workload.fio
For combined replay of all files as a single job:
$ python3 trace2fio.py trace.csv -o workload.fio --merge
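The merged output is not shown in this example. Assuming --merge folds every per-file job into one section using fio's colon-separated filename list, a hypothetical result could look like:

[merged]
filename=file1.dat:file2.dat
rw=write
bs=4096
size=16384
direct=1

The exact parameters depend on the aggregated trace; file names and sizes here are purely illustrative.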
As a worked example, consider this trace:
ts,pid,comm,syscall,fd,bytes,offset,flags,path,ret
0.001,1234,app,open,3,,,O_RDWR|O_DIRECT,/data/file1,3
0.002,1234,app,write,3,4096,0,,/data/file1,4096
0.004,1234,app,write,3,4096,4096,,/data/file1,4096
0.006,1234,app,fsync,3,,,,/data/file1,0
0.007,1234,app,close,3,,,,/data/file1,0
Produces:
[file1]
filename=file1.dat
rw=write
bs=4096
size=8192
direct=1
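Each generated parameter traces back to the input: bs=4096 from the uniform 4 KiB write size, size=8192 from the two writes totaling 8 KiB, rw=write because the offsets advance contiguously (0, then 4096), and direct=1 from the O_DIRECT flag passed at open time.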
- Determinism over heuristics: All inference is rule-based and explainable.
- Modular expansion: Future versions may add io_uring, iodepth, or multithread detection.
- Trace neutrality: Works with any tracing backend (bpftrace, perf, strace, LTTng) as long as the CSV/JSON schema matches.
- Transparency: Every generated fio parameter can be traced back to source trace data.
- Add support for async I/O (io_uring events) to infer iodepth and ioengine.
- Detect temporal phases in traces (warmup, steady state, cooldown); one possible approach is sketched after this list.
- Integrate with bpftrace scripts to capture required fields automatically.
- Visualize inferred workloads and I/O timelines.
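As a rough illustration of the phase-detection item above, one possible approach (entirely hypothetical, not part of the current tool) is to bucket the trace into fixed time windows and compare per-window throughput against the median:

from statistics import median

def label_phases(events, window=1.0, tol=0.5):
    """Label each time window 'steady' or 'transient' by throughput.

    Hypothetical sketch: windows whose byte throughput falls within
    `tol` of the median are treated as steady state; leading and
    trailing transient windows would correspond to warmup and cooldown.
    """
    buckets = {}
    for ev in events:
        if ev.bytes:                       # only read/write carry a byte count
            w = int(ev.ts // window)
            buckets[w] = buckets.get(w, 0) + ev.bytes
    if not buckets:
        return {}
    med = median(buckets.values())
    return {w: "steady" if abs(b - med) <= tol * med else "transient"
            for w, b in sorted(buckets.items())}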
MIT License — see LICENSE for details.
Luis Chamberlain (linux-kdevops project)