Description
Currently, valkey-benchmark only supports synthetic data generation through placeholders like __rand_int__ and __data__. This limits realistic performance testing, since synthetic data doesn't reflect the usage patterns, data distributions, or content characteristics that real applications work with. We need this capability for our full-text search (FTS) work and believe it would benefit other use cases such as JSON operations, VSS, and general data modeling.
Proposed Solution
Add a --dataset option to valkey-benchmark that loads structured data from files and introduces field-based placeholders:
valkey-benchmark --dataset products.jsonl -n 50000 \
HSET product:__field:id__ name "__field:name__" price __field:price__
New Placeholder Syntax
__field:columnname__: Replaced with data from the specified column of the dataset file.
Supported file structure
CSV: Header row defines field names - title,content,category
TSV: Tab-separated with header - title\tcontent\tcategory
Parquet: Columnar binary format (for FTS) (requires library support)
JSONL: Each line is a JSON object - {"title": "...", "content": "...", "embedding": [...]} (requires library support)
Details
- Pre-load dataset into memory during initialization
- Thread-safe row selection using atomic counters
- Extends existing placeholder system in valkey-benchmark.c
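The row-selection and substitution steps above could be sketched as follows. These are illustrative helpers (dataset_next_row, subst_field are hypothetical names), not the actual placeholder code in valkey-benchmark.c; the point is that an atomic fetch-and-add gives each command instantiation a distinct row without locking:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

static atomic_size_t next_row = 0;

/* Thread-safe round-robin row selection: each caller gets the next
 * row index, wrapping when the dataset is exhausted. */
size_t dataset_next_row(size_t nrows) {
    return atomic_fetch_add(&next_row, 1) % nrows;
}

/* Replace one __field:NAME__ placeholder in `tmpl` with `value`,
 * writing the result to `out` (capacity `outlen`). Returns 0 on
 * success, -1 if the placeholder is absent or `out` is too small. */
int subst_field(const char *tmpl, const char *name,
                const char *value, char *out, size_t outlen) {
    char tag[64];
    snprintf(tag, sizeof(tag), "__field:%s__", name);
    const char *p = strstr(tmpl, tag);
    if (!p) return -1;
    size_t pre = (size_t)(p - tmpl);
    if (pre + strlen(value) + strlen(p + strlen(tag)) + 1 > outlen) return -1;
    memcpy(out, tmpl, pre);                 /* text before the placeholder */
    strcpy(out + pre, value);               /* substituted field value */
    strcat(out, p + strlen(tag));           /* text after the placeholder */
    return 0;
}
```

A real patch would hook this into the existing placeholder-scanning loop so that __field:...__ is handled alongside __rand_int__ and __data__.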
Use Cases
# FTS with real Wikipedia data
valkey-benchmark --dataset wikipedia.csv -n 100000 \
FT.SEARCH articles "@title:__field:title__"
# E-commerce product catalog
valkey-benchmark --dataset products.csv -n 50000 \
HSET product:__field:id__ name "__field:name__" category "__field:category__"