trace2fio — Deterministic Reconstruction of I/O Workloads

Overview

trace2fio.py is a utility that takes an eBPF or syscall-level trace of file I/O activity and reconstructs an equivalent workload description suitable for use with fio. It lets developers, storage engineers, and kernel researchers replay realistic I/O patterns captured from real systems in a deterministic, reproducible way.

Rather than relying on statistical inference or machine learning, trace2fio uses a deterministic state machine that models file descriptor lifecycles. This makes it easy to reason about and verify, while remaining flexible enough to describe complex I/O behaviors.

Motivation

Modern eBPF frameworks can trace fine-grained I/O events such as open, read, write, and fsync across all processes. However, converting those traces into a form usable for benchmarking tools like fio has traditionally required manual analysis or ad-hoc scripts.

trace2fio aims to bridge this gap by:

  • Mapping syscall traces to high-level fio jobs automatically.
  • Preserving key parameters such as block size, read/write mix, sequential/random access, and direct I/O flags.
  • Allowing controlled replay of real-world workloads without intrusive tracing.
  • Providing a deterministic, auditable workflow without machine learning heuristics.

How It Works

The tool parses a syscall trace in CSV or JSONL format with at least the following fields:

ts,pid,comm,syscall,fd,bytes,offset,flags,path,ret
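As an illustrative sketch (not trace2fio.py's actual parser), a record with these fields can be read from the CSV form like this; empty `bytes`/`offset` columns, as on open/fsync/close rows, become `None`:

```python
# Illustrative parsing sketch for the CSV schema above; the real
# trace2fio.py may handle fields differently.
import csv
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass
class TraceEvent:
    ts: float
    pid: int
    comm: str
    syscall: str
    fd: int
    bytes: Optional[int]    # empty for open/fsync/close
    offset: Optional[int]   # empty for open/fsync/close
    flags: str              # e.g. "O_RDWR|O_DIRECT" on open
    path: str
    ret: int

def parse_trace(lines: Iterable[str]) -> List[TraceEvent]:
    """Parse CSV lines (header first) into TraceEvent records."""
    events = []
    for row in csv.DictReader(lines):
        events.append(TraceEvent(
            ts=float(row["ts"]),
            pid=int(row["pid"]),
            comm=row["comm"],
            syscall=row["syscall"],
            fd=int(row["fd"]),
            bytes=int(row["bytes"]) if row["bytes"] else None,
            offset=int(row["offset"]) if row["offset"] else None,
            flags=row["flags"],
            path=row["path"],
            ret=int(row["ret"]),
        ))
    return events
```

The JSONL form would carry the same fields one JSON object per line.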

It then applies a state machine model:

  1. File descriptor tracking — each (pid, fd) pair is tracked through open → read/write → fsync → close.
  2. Operation aggregation — offsets, sizes, and timestamps are aggregated per file.
  3. Pattern inference — block size, sequential vs random pattern, and I/O type (read/write/mixed) are derived.
  4. fio job generation — each file becomes a [job] section in the output .fio file.
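Steps 2-4 can be sketched roughly as follows. This is a hypothetical outline, not trace2fio.py's internals: it assumes parsed events as dicts, takes the block size as the GCD of observed I/O sizes, and calls a file sequential when each operation starts where the previous one ended.

```python
# Hypothetical sketch of aggregation and pattern inference (steps 2-4).
# Assumes each event is a dict with the parsed CSV fields.
from collections import defaultdict
from math import gcd

def infer_jobs(events):
    open_flags = {}               # (pid, fd) -> flags recorded at open
    per_file = defaultdict(list)  # path -> [(syscall, bytes, offset, open flags)]
    for ev in events:
        key = (ev["pid"], ev["fd"])
        if ev["syscall"] == "open":
            open_flags[key] = ev["flags"]
        elif ev["syscall"] in ("read", "write"):
            per_file[ev["path"]].append(
                (ev["syscall"], ev["bytes"], ev["offset"],
                 open_flags.get(key, "")))

    jobs = {}
    for path, ops in per_file.items():
        sizes = [b for _, b, _, _ in ops]
        bs = sizes[0]
        for s in sizes[1:]:
            bs = gcd(bs, s)       # largest block size dividing every I/O
        # Sequential if each op begins where the previous one ended.
        seq = all(o2 == o1 + b1
                  for (_, b1, o1, _), (_, _, o2, _) in zip(ops, ops[1:]))
        kinds = {k for k, _, _, _ in ops}
        rw = "rw" if kinds == {"read", "write"} else kinds.pop()
        if not seq:
            rw = "rand" + rw      # fio's randread/randwrite/randrw
        jobs[path] = {
            "rw": rw,
            "bs": bs,
            "size": sum(sizes),                      # total bytes moved
            "direct": int("O_DIRECT" in ops[0][3]),  # from open flags
        }
    return jobs
```

Run against the example input below, this yields exactly the parameters shown in the generated job: rw=write, bs=4096, size=8192, direct=1.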

The resulting fio configuration can be used to replay the original workload on any system, enabling accurate reproduction and performance analysis.

Example

$ python3 trace2fio.py trace.csv -o workload.fio
$ fio workload.fio

For combined replay of all files as a single job:

$ python3 trace2fio.py trace.csv -o workload.fio --merge
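The README does not show the merged output. As a hedged illustration only (the actual layout trace2fio emits may differ), fio can drive several target files from one job via a colon-separated filename list, so a merged job might look roughly like:

```ini
; Hypothetical merged job -- layout and values are illustrative only.
[merged]
; fio accepts a colon-separated list of files for a single job
filename=file1.dat:file2.dat
rw=write
bs=4096
size=16384
direct=1
```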

Example Input

ts,pid,comm,syscall,fd,bytes,offset,flags,path,ret
0.001,1234,app,open,3,,,O_RDWR|O_DIRECT,/data/file1,3
0.002,1234,app,write,3,4096,0,,/data/file1,4096
0.004,1234,app,write,3,4096,4096,,/data/file1,4096
0.006,1234,app,fsync,3,,,,/data/file1,0
0.007,1234,app,close,3,,,,/data/file1,0

Produces:

[file1]
filename=file1.dat
rw=write
bs=4096
size=8192
direct=1

Design Principles

  • Determinism over heuristics: All inference is rule-based and explainable.
  • Modular expansion: Future versions may add io_uring, iodepth, or multithread detection.
  • Trace neutrality: Works with any tracing backend (bpftrace, perf, strace, LTTng) as long as the CSV/JSONL schema matches.
  • Transparency: Every generated fio parameter can be traced back to source trace data.

Roadmap

  • Add support for async I/O (io_uring events) to infer iodepth and ioengine.
  • Detect temporal phases in traces (warmup, steady state, cooldown).
  • Integrate with bpftrace scripts to capture required fields automatically.
  • Visualize inferred workloads and I/O timelines.

License

MIT License — see LICENSE for details.

Author

Luis Chamberlain — (linux-kdevops project)
