Add structured datasets loading capability in valkey benchmark #2823
base: unstable
Conversation
Add structured dataset loading capability. Support XML/CSV/TSV file formats. Use `__field:fieldname__` placeholders to replace the corresponding fields from the dataset file. Support natural content sizes of varying length. Allow mixed placeholder usage combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers and XML tags. Use `--maxdocs` to limit dataset loading.
Signed-off-by: Ram Prasad Voleti <[email protected]>
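The placeholder mixing described above can be sketched roughly as follows. This is a Python illustration only (the real implementation is C inside valkey-benchmark); the function and regex names here are hypothetical:

```python
import random
import re

# Illustrative sketch: dataset fields (__field:name__) are filled from a
# loaded row, while the existing random generator (__rand_int__) keeps
# working alongside them in the same command template.
PLACEHOLDER = re.compile(r"__field:(\w+)__|__rand_int__")

def expand(template, row, randomizer=random.Random(0)):
    """Replace dataset-field and random placeholders in one command template."""
    def sub(m):
        field = m.group(1)
        if field is not None:
            return row[field]                      # dataset field lookup
        return str(randomizer.randrange(10**12))   # synthetic random int
    return PLACEHOLDER.sub(sub, template)

row = {"title": "Alan Turing", "body": "English mathematician..."}
cmd = expand("HSET doc:__rand_int__ title __field:title__", row)
print(cmd)  # e.g. "HSET doc:<random-int> title Alan Turing"
```

The key point is that field placeholders and the legacy random placeholders can coexist in one template, as the PR description states.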
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## unstable #2823 +/- ##
============================================
+ Coverage 72.47% 72.49% +0.02%
============================================
Files 128 128
Lines 70286 70681 +395
============================================
+ Hits 50937 51242 +305
- Misses 19349 19439 +90
Fix memory leak in memory reporting
Signed-off-by: Ram Prasad Voleti <[email protected]>
JimB123
left a comment
Need a full documentation file. Need details on the new file format. Examples.
Maybe seed the file with --help data for the existing cases. This can be an area for future improvement.
" --dataset <file> Path to CSV/TSV/XML dataset file for field placeholder replacement.\n"
"                  All fields auto-detected with natural content lengths.\n",
Have we reached the point where we need some actual documentation for valkey-benchmark?
The benchmark tool has relied on --help for increasingly complex configurations. Now, we're introducing a dataset configuration file with very limited description. Is it likely that developers will understand how to use this without examining the code?
I suggest that a new documentation file be created (benchmark.md) which can provide details and examples for using valkey-benchmark, including details about this new configuration file.
Thank you for the suggestion. I agree --help is not clear enough for most of the configurations the benchmark supports. I will add a detailed explanation with examples.
Background
Currently, valkey-benchmark only supports synthetic data generation through placeholders like `__rand_int__` and `__data__`. This limits realistic performance testing, since synthetic data doesn't reflect the real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search work and believe it would benefit other use cases like JSON operations, VSS, and general data modeling.
Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using `valkeyFormatCommandArgv()`. This ensures:
Full-Text Search Benchmarking
Test environment:
Instance: AWS c7i.16xlarge, 64 vCPU
Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory
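The automatic field discovery from CSV/TSV headers and the `--maxdocs` row limit mentioned above can be sketched as follows. This is a minimal Python illustration of the assumed semantics, not the actual C implementation; `load_dataset` is a hypothetical name:

```python
import csv
import io

# Illustrative sketch: discover field names from the CSV/TSV header row
# and stop loading once a --maxdocs style limit is reached.
def load_dataset(text, delimiter=",", maxdocs=None):
    """Return (field_names, rows) from CSV/TSV text, honoring a row limit."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for i, row in enumerate(reader):
        if maxdocs is not None and i >= maxdocs:
            break
        rows.append(row)
    return reader.fieldnames, rows

sample = "title,body\nA,first doc\nB,second doc\nC,third doc\n"
fields, docs = load_dataset(sample, maxdocs=2)
print(fields)    # ['title', 'body']
print(len(docs)) # 2
```

For TSV input the same routine would be called with `delimiter="\t"`; XML tag discovery follows the same idea but walks element tags instead of a header row.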