CSV Race

CSV Race is a benchmarking repository for comparing the performance characteristics of CSV parsers across different languages and implementations.

This project was originally created to benchmark and fine-tune my own CSV parser, and to better understand how different parsers behave under a variety of real-world and synthetic workloads. Over time, it evolved into a more general framework for evaluating CSV parsers in a consistent and transparent way.

The goal is twofold:

  • For parser authors: provide a reproducible environment to evaluate and improve performance.
  • For users: help identify parsers that best match their performance, memory, and workload requirements.

While this README highlights selected benchmark results, the repository also includes scripts and tooling that allow you to:

  • Add your own parsers
  • Generate custom datasets
  • Run benchmarks on your own machine
  • Produce your own charts and raw data

For details on running benchmarks yourself, see Running the Benchmarks.

This repository was used extensively during the development of csv-zero and proved invaluable throughout that process.

Before diving into the charts, I strongly recommend reading the Benchmark Methodology section to understand what is — and is not — being measured.


Benchmark Methodology

Benchmarks are easy to misinterpret and easy to get wrong. While this repository aims to be careful and transparent, you should not make decisions solely based on the charts shown here.

CSV parsers vary widely in:

  • Feature sets
  • API design
  • Error handling
  • Memory strategies
  • Suitability for specific workloads

Even if raw performance is your primary concern, you should verify that:

  • The benchmark matches your usage pattern
  • The input data resembles your real data
  • The execution environment is comparable to yours

I have made a best-effort attempt to choose representative test cases and to use each library as intended. If you notice an issue, a mistake, or a missing library, contributions and corrections are very welcome.


Task Definition

The benchmark intentionally focuses on iteration speed, not downstream data processing.

Each parser is required to:

  • Iterate through the entire CSV file
  • Count the total number of fields
  • Use a 64 KB input buffer where configurable

Some libraries do not expose buffer size controls; in those cases, the default behavior is used.

This task isolates parsing overhead and minimizes the impact of allocation, data conversion, or user-level processing.
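
To make the task concrete, here is a minimal sketch in Rust using the csv crate. This is one illustrative implementation, not the actual harness; the real entries live under src/ in their respective languages:

```rust
// Sketch of the benchmark task: stream the file, count every field,
// and do no other work. Uses the Rust csv crate with a 64 KB buffer.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args().nth(1).expect("usage: count <file.csv>");
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)          // treat the header row like any other row
        .buffer_capacity(64 * 1024)  // 64 KB input buffer, per the task rules
        .from_path(path)?;

    let mut fields: u64 = 0;
    let mut record = csv::ByteRecord::new();
    while reader.read_byte_record(&mut record)? {
        fields += record.len() as u64; // count fields, allocate nothing extra
    }
    println!("{fields}");
    Ok(())
}
```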


Test Files

Real-world datasets

After surveying commonly used CSV benchmark files, the following datasets were selected due to their diversity in size, structure, and quoting behavior:

  • game.csv
  • gtfs-mbta-stop-times.csv
  • nfl.csv
  • worldcitiespop.csv

These files include a mix of:

  • Quoted and unquoted fields
  • Escaped characters
  • Varying row and column counts

They are reasonably representative of real-world CSV data.

Generated datasets

To explore additional edge cases and scalability, synthetic datasets are generated using the following naming scheme:

<size>_<mix|no>_quotes_<column-count>_col_<min-field-size>_<max-field-size>.csv

Where:

  • size indicates approximate file size
  • col indicates column count
  • Field contents are random printable ASCII characters (codes 32–126)
  • Field length is randomly chosen in [min-field-size, max-field-size]

If quoting is disabled:

  • ", \, and , are excluded

If quoting is enabled:

  • These characters may appear, and fields are quoted correctly when required
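
For illustration, the field-generation rules above can be sketched as follows. This is a Rust approximation with a hypothetical helper name; the actual generator is src/zig/src/data_gen.zig:

```rust
use rand::Rng;

// Approximates the generation rules described above; the real generator
// is src/zig/src/data_gen.zig. `random_field` is a hypothetical helper.
fn random_field(rng: &mut impl Rng, min: usize, max: usize, allow_quotes: bool) -> String {
    let len = rng.gen_range(min..=max);
    let field: String = (0..len)
        .map(|_| loop {
            let c = rng.gen_range(32u8..127) as char; // printable ASCII
            // In no-quotes mode, exclude '"', '\', and ',' entirely.
            if allow_quotes || !matches!(c, '"' | '\\' | ',') {
                break c;
            }
        })
        .collect();

    // Quote only when required: field contains a quote, comma, or newline.
    if field.contains(|c| matches!(c, '"' | ',' | '\n')) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field
    }
}
```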

Generated test files:

| file | size | quote/escape mode | columns | field length |
|------|------|-------------------|---------|--------------|
| xs_mix_quotes_12_col_0_32.csv | ~1 KB | Contains quoted fields | 12 | 0-32 |
| xs_no_quotes_52_col_0_256.csv | ~333 KB | No quoted fields | 52 | 0-256 |
| m_mix_quotes_12_col_0_32.csv | ~102 MB | Contains quoted fields | 12 | 0-32 |
| m_no_quotes_52_col_0_256.csv | ~32 MB | No quoted fields | 52 | 0-256 |
| xl_mix_quotes_2_col_0_12_many_rows.csv | ~700 MB | Contains quoted fields | 2 | 0-12 |
| xl_no_quotes_52_col_0_256.csv | ~3.2 GB | No quoted fields | 52 | 0-256 |
| xl_mix_quotes_12_col_0_32.csv | ~9.9 GB | Contains quoted fields | 12 | 0-32 |

Collected Metrics

While wall-clock time is the most visible metric, several additional hardware-level metrics are captured to provide deeper insight:

  • Wall Time: Total elapsed time to complete the task.

  • Peak RSS: Maximum resident set size (memory usage) during execution.

  • CPU Instructions: Number of retired machine instructions, independent of clock speed.

  • CPU Cycles: Total cycles elapsed, including stalls and memory waits.

  • Cache References: Number of accesses through the CPU cache hierarchy.

  • Cache Misses: Cache misses across all cache levels (primarily LLC).

  • Branch Misses: Modern CPUs rely heavily on branch prediction to keep their pipelines full. Control-flow constructs such as if, switch, loops, and conditional jumps usually compile down to branch instructions unless the compiler can fully eliminate them.

    When the CPU encounters a branch, it predicts which path will be taken and begins executing instructions speculatively. If the prediction is correct, execution continues with little to no cost. If it is wrong, the CPU must flush part of the pipeline and restart execution, which incurs a noticeable performance penalty.

    A branch miss (or branch misprediction) occurs when the CPU’s prediction does not match the actual control flow.

    This matters for CSV parsing because:

    • Parsers often contain tight loops with many conditionals
    • Decisions depend on input data (e.g. quote handling, escape detection, delimiter checks)
    • Irregular or data-dependent patterns reduce predictability

    CSV files with:

    • Mixed quoted and unquoted fields
    • Escaped characters
    • Varying row and column lengths

    tend to produce less predictable branching behavior than uniform, unquoted data.
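
To make this concrete, consider a deliberately naive per-byte scanner. It is hypothetical and not taken from any benchmarked parser, but it shows where data-dependent branches come from:

```rust
// Each match arm compiles to conditional branches whose outcomes depend on
// the input bytes. On uniform unquoted data the quote branch is almost
// perfectly predictable; on mixed-quoting data it mispredicts regularly.
// (Escaped quotes are ignored here for brevity.)
fn count_fields_naive(data: &[u8]) -> u64 {
    let mut fields = 0;
    let mut in_quotes = false;
    for &b in data {
        match b {
            b'"' => in_quotes = !in_quotes,
            b',' | b'\n' if !in_quotes => fields += 1,
            _ => {}
        }
    }
    fields
}
```

SIMD parsers sidestep much of this by classifying many bytes at once with branch-free vector operations, which is one reason they tend to show fewer branch misses.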


Tooling

Benchmarks are primarily executed using:

  • Poop: A Linux-only benchmarking tool built on top of perf.

On macOS or Windows:

  • Hyperfine can be used, but it only provides wall-time metrics.

If you prefer, you may also use perf directly or substitute alternative tooling.
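
For example (binary names and file paths below are placeholders):

poop './parser-a data.csv' './parser-b data.csv'

perf stat -e instructions,cycles,cache-references,cache-misses,branch-misses ./parser-a data.csv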


Data Visualization

A Python script orchestrates the benchmarks and generates:

  • A CSV file with all raw results
  • A set of charts for selected metrics

The scripts are easily adaptable if you want to:

  • Use different tools
  • Add new metrics
  • Customize visualizations

Benchmark Results

Benchmarks were run on:

  • CPU: AMD Ryzen 5 PRO 5650U
  • Memory: 30 GB
  • OS: Linux 6.17.8-arch1-1

Results will vary significantly across machines and architectures. Contributions with data from other CPUs are very welcome.

To reduce visual noise:

  • Charts show only the top 5 parsers
  • Raw results for all parsers and test cases are available in result-all.csv
  • Charts focus primarily on the four common real-world datasets

Wall Time

[Chart: CSV Parser Wall Time Comparison]

Observations:

  • SIMD-accelerated parsers (simd-csv, zsv, csv-zero) generally dominate, but show reduced advantage on game.csv
  • zsc performs exceptionally well overall but regresses noticeably on game.csv
  • lazycsv (cpp) is the most consistent performer across datasets
  • Surprisingly, for worldcitiespop.csv (no quoted fields), some parsers (csv (rust), lazycsv (cpp)) underperform

For large files:

[Chart: CSV Parser Wall Time Comparison For Larger Files]

Here, zsc (c), lazycsv (cpp), and simdcsv-rust remain consistently strong. csv-zero finishes first across all tested cases.


Peak RSS

[Chart: CSV Parser Peak RSS Comparison]

Most top parsers exhibit stable memory usage regardless of file size. An exception is lazycsv (cpp), which can consume gigabytes of memory on large inputs.


Branch Misses

[Chart: CSV Parser Branch Misses Comparison]

Branch mispredictions often correlate strongly with wall-time performance. Parsers with predictable control flow tend to perform better, especially on complex quoting patterns.


Cache Behavior

[Chart: CSV Parser Cache References Comparison]
[Chart: CSV Parser Cache Misses Comparison]

Cache References

Total number of memory access requests issued by the CPU that are serviced through the cache hierarchy while executing the parser process. This includes loads and stores that may be satisfied by any cache level (L1, L2, or Last Level Cache).

Cache references serve as a proxy for overall memory traffic generated by the parser. Higher values typically indicate more frequent memory accesses, such as per-byte processing, pointer chasing, temporary buffers, or field copying. SIMD-based and streaming parsers often reduce cache references by scanning data in wide vectors and minimizing dependent memory loads.

Cache Misses

Total number of cache access requests that could not be satisfied by the cache level accessed and therefore required fetching data from a lower cache level or main memory. In practice, this counter primarily reflects Last Level Cache (LLC) misses on modern CPUs.

Cache misses are significantly more expensive than cache hits, often incurring tens to hundreds of CPU cycles per miss depending on whether the data is retrieved from L2, L3, or DRAM. High cache miss counts generally indicate poor memory locality, working sets larger than the cache, or unpredictable access patterns. For large CSV files that exceed cache capacity, some level of cache misses is expected; performance differences are driven by how predictable and sequential those misses are.
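
As a generic illustration of that last point (unrelated to any specific parser), compare a sequential scan, where the hardware prefetcher can stream data ahead of use, with a randomized traversal of the same buffer:

```rust
// Both functions touch the same bytes; only the access order differs.
fn sum_sequential(buf: &[u8]) -> u64 {
    // Prefetch-friendly: misses overlap and are largely hidden.
    buf.iter().map(|&b| b as u64).sum()
}

fn sum_shuffled(buf: &[u8], order: &[usize]) -> u64 {
    // `order` is a random permutation of 0..buf.len(): each access lands on
    // an unpredictable cache line, so misses serialize instead of overlapping.
    order.iter().map(|&i| buf[i] as u64).sum()
}
```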

Additional metrics are available in the images directory and raw CSV outputs.


Running the Benchmarks

Requirements

  • Poop (Linux) or Hyperfine
  • Python + matplotlib (for charts)
  • Zig 0.15.2 (for csv-zero and data generation)

Building Parsers

All parsers live under src/.

The only exception is:

  • src/zig/src/data_gen.zig — used for generating test data

To build everything:

make build_all

This requires toolchains for:

  • C
  • C++
  • Rust
  • Go
  • Zig

You are free to remove parsers you don’t care about or add your own. zsc (c) requires a manual build but is straightforward if you follow its upstream instructions.


Selecting Parsers and Test Files

You can customize:

  • Which parsers participate
  • Which test files are used
  • Which metrics are collected

Data Visualization

generate_charts.py:

  • Runs benchmarks using poop
  • Produces output.csv
  • Writes figures to images/

macOS support is not yet available here.


Hyperfine / Manual Runs

If poop is unavailable, you can still run benchmarks manually:

make hyperfine TEST_FILE=/path/to/your/file.csv

Adjust the Makefile targets to control which parsers and tools are used.
