
Conversation

@VoletiRam (Contributor)

Background

Currently, valkey-benchmark only supports synthetic data generation through placeholders like __rand_int__ and __data__. This limits realistic performance testing since synthetic data doesn't reflect real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our Full-text search work and believe it would benefit other use cases like JSON operations, VSS, and general data modeling.

Add structured dataset loading capability. Support XML/CSV/TSV file formats. Use __field:fieldname__ placeholders to substitute the corresponding fields from the dataset file. Support natural, variable-length content sizes. Allow mixed placeholder usage, combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers and XML tags. Use --maxdocs to limit how many records are loaded from the dataset.
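For example (the file contents below are illustrative, not part of the PR), a products.csv dataset:

name,category,price
Wireless Mouse,electronics,24.99
Desk Lamp,home,18.50

The header row makes name, category, and price available as fields, and __field:name__ in a command template is replaced with the name value of the record currently being iterated.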

Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using valkeyFormatCommandArgv() (see the sketch after the list below). This ensures:

  • Zero impact on existing functionality
  • Full support for variable-size content
  • Thread-safe atomic record iteration
  • Compatibility with pipelining and threading modes
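
For illustration only, here is a minimal sketch (not the PR's actual code) of how such a separate path can resolve __field:...__ placeholders against the current dataset record and hand the resulting argv/argvlen arrays to valkeyFormatCommandArgv(). The datasetRecord struct, the resolveFieldPlaceholder() helper, and the libvalkey header path are assumptions made for this example.

/* Illustrative sketch only -- struct and helper names are hypothetical,
 * and the libvalkey header path may differ per installation. */
#include <stdlib.h>
#include <string.h>
#include <valkey/valkey.h>   /* valkeyFormatCommandArgv() */

typedef struct {
    char **fieldNames;   /* discovered from CSV/TSV headers or XML tags */
    char **fieldValues;  /* field values of the current record */
    int numFields;
} datasetRecord;

/* If arg is "__field:<name>__", return the record's value for <name>;
 * otherwise return the argument unchanged. */
static const char *resolveFieldPlaceholder(const char *arg, const datasetRecord *rec) {
    const char *prefix = "__field:";
    size_t plen = strlen(prefix), alen = strlen(arg);
    if (alen <= plen + 2 || strncmp(arg, prefix, plen) != 0 ||
        strcmp(arg + alen - 2, "__") != 0)
        return arg;
    size_t namelen = alen - plen - 2;
    for (int i = 0; i < rec->numFields; i++) {
        if (strlen(rec->fieldNames[i]) == namelen &&
            strncmp(rec->fieldNames[i], arg + plen, namelen) == 0)
            return rec->fieldValues[i];
    }
    return "";   /* unknown field: substitute an empty value */
}

/* Build a RESP-encoded command into *target from a command template,
 * substituting dataset fields and using each value's natural length. */
static long long buildDatasetCommand(char **target, int argc,
                                     const char **templateArgv,
                                     const datasetRecord *rec) {
    const char **argv = malloc(sizeof(*argv) * argc);
    size_t *argvlen = malloc(sizeof(*argvlen) * argc);
    for (int i = 0; i < argc; i++) {
        argv[i] = resolveFieldPlaceholder(templateArgv[i], rec);
        argvlen[i] = strlen(argv[i]);
    }
    long long len = valkeyFormatCommandArgv(target, argc, argv, argvlen);
    free(argv);
    free(argvlen);
    return len;
}

Because argvlen[i] comes from the resolved value, variable-length field content is passed through as-is, unlike the fixed-size -d payload used by the existing __data__ placeholder.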

Usage Examples

# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"

# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"

# XML dataset with document limit (sample record layout shown after these examples)
./valkey-benchmark --dataset wiki.xml --xml-root-element doc --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"

# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__
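
The XML example above assumes each record is wrapped in the element named by --xml-root-element. A Wikipedia abstracts-style dump, for instance, looks roughly like this (illustrative snippet, not taken from the PR):

<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>https://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy and movement ...</abstract>
  </doc>
  <!-- more <doc> records ... -->
</feed>

With --xml-root-element doc, each <doc> becomes one record and its child tags (title, abstract, ...) are the auto-discovered fields.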

Full-Text Search Benchmarking

# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Search miss scenarios (non-existent terms)  
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"

Test environment:
Instance: AWS c7i.16xlarge, 64 vCPU

Test dataset: 5M+ Wikipedia XML documents (5.8 GB in memory)

Results (P1/P10 = pipeline depth 1/10):

Configuration              Throughput     CPU Usage   Wall Time   Memory Peak
Single-threaded, P1        93,295 RPS     99%         71.4s       5.8GB
Multi-threaded (10), P1    93,332 RPS     137%        71.5s       5.8GB
Single-threaded, P10       274,499 RPS    96%         36.1s       5.8GB
Multi-threaded (4), P10    344,589 RPS    161%        32.4s       5.8GB

Add structured datasets loading capability. Support XML/CSV/TSV file
formats. Use `__field:fieldname__` placeholders to replace the
corresponding fields from the dataset file. Support natural content size of
varying length. Allow mixed placeholder usage combining dataset fields with
random generators. Enable automatic field discovery from CSV/TSV headers
and XML tags. Use `--maxdocs` to limit the dataset loading.

Signed-off-by: Ram Prasad Voleti <[email protected]>

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 87.34491% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.49%. Comparing base (7fbd4cb) to head (fa33f69).

Files with missing lines   Patch %   Lines
src/valkey-benchmark.c     87.34%    51 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2823      +/-   ##
============================================
+ Coverage     72.47%   72.49%   +0.02%     
============================================
  Files           128      128              
  Lines         70286    70681     +395     
============================================
+ Hits          50937    51242     +305     
- Misses        19349    19439      +90     
Files with missing lines   Coverage Δ
src/valkey-benchmark.c     66.64% <87.34%> (+5.81%) ⬆️

... and 15 files with indirect coverage changes


Fix memory leak in memory reporting

Signed-off-by: Ram Prasad Voleti <[email protected]>
@zuiderkwast linked an issue Nov 11, 2025 that may be closed by this pull request
@JimB123 (Member) left a comment:

Need a full documentation file. Need details on the new file format. Examples.

Maybe seed the file with --help data for the existing cases. This can be an area for future improvement.

Comment on lines +2608 to +2609
" --dataset <file> Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
" All fields auto-detected with natural content lengths.\n",
Member:

Have we reached the point where we need some actual documentation for valkey-benchmark?

The benchmark tool has relied on --help for increasingly complex configurations. Now, we're introducing a dataset configuration file with very limited description. Is it likely that developers will understand how to use this without examining the code?

I suggest that a new documentation file be created (benchmark.md) which can provide details and examples for using valkey-benchmark, including details about this new configuration file.

Contributor Author:

Thank you for the suggestion. I agree --help is not clear enough for most of the configurations the benchmark supports. I will add a detailed explanation with examples.



Development

Successfully merging this pull request may close these issues.

[NEW] Add structured dataset support to valkey-benchmark
