CSV Race is a benchmarking repository for comparing the performance characteristics of CSV parsers across different languages and implementations.
This project was originally created to benchmark and fine-tune my own CSV parser, and to better understand how different parsers behave under a variety of real-world and synthetic workloads. Over time, it evolved into a more general framework for evaluating CSV parsers in a consistent and transparent way.
The goal is twofold:
- For parser authors: provide a reproducible environment to evaluate and improve performance.
- For users: help identify parsers that best match their performance, memory, and workload requirements.
While this README highlights selected benchmark results, the repository also includes scripts and tooling that allow you to:
- Add your own parsers
- Generate custom datasets
- Run benchmarks on your own machine
- Produce your own charts and raw data
For details on running benchmarks yourself, see Running the benchmarks.
This repository was used extensively during the development of `csv-zero` and proved invaluable throughout that process.
Before diving into the charts, I strongly recommend reading the Benchmark Methodology section to understand what is — and is not — being measured.
Benchmarks are easy to misinterpret and easy to get wrong. While this repository aims to be careful and transparent, you should not make decisions solely based on the charts shown here.
CSV parsers vary widely in:
- Feature sets
- API design
- Error handling
- Memory strategies
- Suitability for specific workloads
Even if raw performance is your primary concern, you should verify that:
- The benchmark matches your usage pattern
- The input data resembles your real data
- The execution environment is comparable to yours
I have made a best-effort attempt to choose representative test cases and to use each library as intended. If you notice an issue, a mistake, or a missing library, contributions and corrections are very welcome.
The benchmark intentionally focuses on iteration speed, not downstream data processing.
Each parser is required to:
- Iterate through the entire CSV file
- Count the total number of fields
- Use a 64 KB input buffer where configurable
Some libraries do not expose buffer size controls; in those cases, the default behavior is used.
This task isolates parsing overhead and minimizes the impact of allocation, data conversion, or user-level processing.
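To make the task concrete, here is a minimal sketch of the shape such a harness takes. This is not one of the benchmarked parsers: it ignores quoting entirely (so it miscounts fields that contain embedded delimiters) and assumes a trailing newline, but it shows the required structure, streaming the file through a 64 KB buffer and counting field terminators.

```c
#include <stdio.h>
#include <stdint.h>

#define BUF_SIZE (64 * 1024)  /* 64 KB input buffer, per the benchmark rules */

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file.csv>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    static char buf[BUF_SIZE];
    uint64_t fields = 0;
    size_t n;
    /* Naive scan: treat every ',' and '\n' as a field terminator.
       A real parser must also track quoted regions and escapes. */
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == ',' || buf[i] == '\n') fields++;
        }
    }
    fclose(f);
    printf("%llu\n", (unsigned long long)fields);
    return 0;
}
```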
After surveying commonly used CSV benchmark files, the following datasets were selected due to their diversity in size, structure, and quoting behavior:
- `game.csv`
- `gtfs-mbta-stop-times.csv`
- `nfl.csv`
- `worldcitiespop.csv`
These files include a mix of:
- Quoted and unquoted fields
- Escaped characters
- Varying row and column counts
They are reasonably representative of real-world CSV data.
To explore additional edge cases and scalability, synthetic datasets are generated using the following naming scheme:
`<size>_<mix|no>_quotes_<column-count>_col_<min-field-size>_<max-field-size>.csv`

Where:
- `size` indicates approximate file size
- `col` indicates column count
- Field contents are random printable ASCII characters (32–127)
- Field length is randomly chosen in `[min-field-size, max-field-size]`
If quoting is disabled:
- `"`, `\`, and `,` are excluded
If quoting is enabled:
- These characters may appear, and fields are quoted correctly when required
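The actual generator lives in `src/zig/src/data_gen.zig`; the C sketch below only illustrates the field rules described above. The RFC 4180-style quote doubling is an assumption about the escape format, and the row and column counts in `main` are placeholders.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Emit one random field of length in [min_len, max_len]. With quoting
   disabled, '"', '\\' and ',' are excluded from the charset; with quoting
   enabled they may appear, and the field is quoted when required. */
static void emit_field(FILE *out, int min_len, int max_len, int allow_quotes) {
    int len = min_len + rand() % (max_len - min_len + 1);
    char field[300];                       /* field lengths here are <= 256 */
    for (int i = 0; i < len; i++) {
        int c;
        do {
            c = 32 + rand() % 96;          /* printable ASCII 32-127 */
        } while (!allow_quotes && (c == '"' || c == '\\' || c == ','));
        field[i] = (char)c;
    }
    field[len] = '\0';
    if (allow_quotes && strpbrk(field, ",\"")) {
        fputc('"', out);                   /* quote only when required */
        for (int i = 0; i < len; i++) {
            if (field[i] == '"') fputc('"', out); /* assumed: escape by doubling */
            fputc(field[i], out);
        }
        fputc('"', out);
    } else {
        fputs(field, out);
    }
}

int main(void) {
    srand((unsigned)time(NULL));
    /* Placeholder shape of m_mix_quotes_12_col_0_32.csv: 12 columns, 0-32. */
    for (int row = 0; row < 10; row++) {
        for (int col = 0; col < 12; col++) {
            if (col) fputc(',', stdout);
            emit_field(stdout, 0, 32, /*allow_quotes=*/1);
        }
        fputc('\n', stdout);
    }
    return 0;
}
```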
Generated test files:
| file | size | quote/escape mode | columns | field length |
|---|---|---|---|---|
| `xs_mix_quotes_12_col_0_32.csv` | ~1 KB | Contains quoted fields | 12 | 0-32 |
| `xs_no_quotes_52_col_0_256.csv` | ~333 KB | No quoted fields | 52 | 0-256 |
| `m_mix_quotes_12_col_0_32.csv` | ~102 MB | Contains quoted fields | 12 | 0-32 |
| `m_no_quotes_52_col_0_256.csv` | ~32 MB | No quoted fields | 52 | 0-256 |
| `xl_mix_quotes_2_col_0_12_many_rows.csv` | ~700 MB | Contains quoted fields | 2 | 0-12 |
| `xl_no_quotes_52_col_0_256.csv` | ~3.2 GB | No quoted fields | 52 | 0-256 |
| `xl_mix_quotes_12_col_0_32.csv` | ~9.9 GB | Contains quoted fields | 12 | 0-32 |
While wall-clock time is the most visible metric, several additional hardware-level metrics are captured to provide deeper insight:
- **Wall Time**: Total elapsed time to complete the task.
- **Peak RSS**: Maximum resident set size (memory usage) during execution.
- **CPU Instructions**: Number of retired machine instructions, independent of clock speed.
- **CPU Cycles**: Total cycles elapsed, including stalls and memory waits.
- **Cache References**: Number of accesses through the CPU cache hierarchy.
- **Cache Misses**: Cache misses across all cache levels (primarily LLC).
- **Branch Misses**: Mispredicted branches, explained below.

Modern CPUs rely heavily on branch prediction to keep their pipelines full. Control-flow constructs such as `if`, `switch`, loops, and conditional jumps usually compile down to branch instructions unless the compiler can fully eliminate them.

When the CPU encounters a branch, it predicts which path will be taken and begins executing instructions speculatively. If the prediction is correct, execution continues with little to no cost. If it is wrong, the CPU must flush part of the pipeline and restart execution, which incurs a noticeable performance penalty.

A branch miss (or branch misprediction) occurs when the CPU's prediction does not match the actual control flow.
This matters for CSV parsing because:
- Parsers often contain tight loops with many conditionals
- Decisions depend on input data (e.g. quote handling, escape detection, delimiter checks)
- Irregular or data-dependent patterns reduce predictability
CSV files with:
- Mixed quoted and unquoted fields
- Escaped characters
- Varying row and column lengths
tend to produce less predictable branching behavior than uniform, unquoted data.
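To make the connection to parsing concrete, here is the shape of a typical scalar inner loop (illustrative only, not taken from any parser in this repository):

```c
#include <stdint.h>

/* Each comparison below is a data-dependent branch. On uniform, unquoted
   input the predictor learns the pattern well; mixed quoting makes the
   quote branch far harder to predict. */
static uint64_t count_fields(const char *p, const char *end) {
    uint64_t fields = 0;
    int in_quotes = 0;
    for (; p < end; p++) {
        char c = *p;
        if (c == '"') {
            /* Toggling on every quote also copes with RFC 4180 doubled
               quotes for field counting, since the two quotes are adjacent. */
            in_quotes = !in_quotes;
        } else if (!in_quotes && (c == ',' || c == '\n')) {
            fields++;  /* field terminator outside quotes */
        }
    }
    return fields;
}
```

SIMD parsers replace most of these per-byte branches with wide compares and bitmask arithmetic, which is one reason they tend to show fewer branch misses in the charts below.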
Benchmarks are primarily executed using:
- **Poop**: a Linux-only benchmarking tool built on top of `perf`.
On macOS or Windows:
- Hyperfine can be used, but it only provides wall-time metrics.
If you prefer, you may also use perf directly or substitute alternative tooling.
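For example, the counters above correspond to standard `perf` events (exact event names and availability vary by CPU; the parser binary and dataset path below are placeholders):

```sh
perf stat -e instructions,cycles,cache-references,cache-misses,branch-misses \
    ./your-parser data/worldcitiespop.csv
```

Peak RSS is not a `perf` event; GNU time reports it as "Maximum resident set size":

```sh
/usr/bin/time -v ./your-parser data/worldcitiespop.csv
```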
A Python script orchestrates the benchmarks and generates:
- A CSV file with all raw results
- A set of charts for selected metrics
The scripts are easily adaptable if you want to:
- Use different tools
- Add new metrics
- Customize visualizations
Benchmarks were run on:
- CPU: AMD Ryzen 5 PRO 5650U
- Memory: 30 GB
- OS: Linux 6.17.8-arch1-1
Results will vary significantly across machines and architectures. Contributions with data from other CPUs are very welcome.
To reduce visual noise:
- Charts show only the top 5 parsers
- Raw results for all parsers and test cases are available in `result-all.csv`
- Charts focus primarily on the four common real-world datasets
Observations:
- SIMD-accelerated parsers (`simd-csv`, `zsv`, `csv-zero`) generally dominate, but show a reduced advantage on `game.csv`
- `zsc` performs exceptionally well overall but regresses noticeably on `game.csv`
- `lazycsv (cpp)` is the most consistent performer across datasets
- Surprisingly, for `worldcitiespop.csv` (no quoted fields), some parsers (`csv (rust)`, `lazycsv (cpp)`) underperform
For large files:
Here, `zsc (c)`, `lazycsv (cpp)`, and `simdcsv-rust` remain consistently strong.
`csv-zero` finishes first across all tested cases.
Most top parsers exhibit stable memory usage regardless of file size.
An exception is `lazycsv (cpp)`, which can consume gigabytes of memory on large inputs.
Branch mispredictions often correlate strongly with wall-time performance. Parsers with predictable control flow tend to perform better, especially on complex quoting patterns.
Total number of memory access requests issued by the CPU that are serviced through the cache hierarchy while executing the parser process. This includes loads and stores that may be satisfied by any cache level (L1, L2, or Last Level Cache).
Cache references serve as a proxy for overall memory traffic generated by the parser. Higher values typically indicate more frequent memory accesses, such as per-byte processing, pointer chasing, temporary buffers, or field copying. SIMD-based and streaming parsers often reduce cache references by scanning data in wide vectors and minimizing dependent memory loads.
Total number of cache access requests that could not be satisfied by the cache level accessed and therefore required fetching data from a lower cache level or main memory. In practice, this counter primarily reflects Last Level Cache (LLC) misses on modern CPUs.
Cache misses are significantly more expensive than cache hits, often incurring tens to hundreds of CPU cycles per miss depending on whether the data is retrieved from L2, L3, or DRAM. High cache miss counts generally indicate poor memory locality, working sets larger than the cache, or unpredictable access patterns. For large CSV files that exceed cache capacity, some level of cache misses is expected; performance differences are driven by how predictable and sequential those misses are.
Additional metrics are available in the images directory and raw CSV outputs.
- Poop (Linux) or Hyperfine
- Python + matplotlib (for charts)
- Zig 0.15.2 (for `csv-zero` and data generation)
All parsers live under `src/`.
The only exception is:
- `src/zig/src/data_gen.zig`: used for generating test data
To build everything:
`make build_all`

This requires:
- C
- C++
- Rust
- Go
- Zig
You are free to remove parsers you don’t care about or add your own.
`zsc (c)` requires a manual build, but it is straightforward if you follow its upstream instructions.
You can customize:
- Which parsers participate
- Which test files are used
- Which metrics are collected
`generate_charts.py`:
- Runs benchmarks using `poop`
- Produces `output.csv`
- Writes figures to `images/`
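Assuming it is run from the repository root and takes no mandatory arguments (check the script's argument handling before relying on this), an invocation is simply:

```sh
python generate_charts.py
```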
macOS support is not yet available here.
If `poop` is unavailable, you can still run benchmarks manually:

`make hyperfine TEST_FILE=/path/to/your/file.csv`

Adjust the Makefile targets to control which parsers and tools are used.
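You can also invoke Hyperfine directly against any built parser (the binary name and path below are placeholders); `--warmup` helps ensure the input file is served from the page cache on every timed run:

```sh
hyperfine --warmup 3 './your-parser /path/to/your/file.csv'
```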