
Conversation


@salvatore-campagna salvatore-campagna commented Feb 24, 2025

This JMH benchmark measures performance for range queries on the @timestamp field under two indexing strategies:

  • with a sparse doc values index (doc values skipper) on the host.name and @timestamp fields, and
  • without a sparse doc values index, using standard doc values, an inverted index on the host.name field, and a BKD tree for the @timestamp field.

It mirrors LogsDB queries from our Rally nightly benchmarks, where latency regressions appeared after introducing sparse doc value indices. By isolating this scenario, we can identify regression causes, capture detailed profiling (e.g., flame graphs), and guide optimizations for range queries in LogsDB.

The parameter values may look arbitrary, but they’re chosen deliberately to avoid “alignment” side effects. Varying batchSize models different log batch sizes per host, while commitEvery ensures Lucene segments flush at intervals that don’t neatly align with batch boundaries. This prevents artificially favorable (or unfavorable) conditions from perfect overlaps. Finally, queryRange covers narrow, medium, and wide time spans to capture different levels of selectivity. Together, these choices recreate a range of realistic scenarios while avoiding misleading alignments between data batches and segment boundaries.
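The misalignment idea can be sketched in plain Java. The numbers below are hypothetical illustrations, not the benchmark's actual parameter values: when commitEvery shares no common factor with batchSize, commit points sweep across many positions inside a batch, so segment boundaries never line up with batch boundaries.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class CommitAlignment {
    // Distinct positions within a batch at which commits fall.
    // A single offset (0) means every segment boundary coincides with a
    // batch boundary; many distinct offsets mean the boundaries drift.
    static Set<Long> commitOffsets(long batchSize, long commitEvery, long totalDocs) {
        Set<Long> offsets = new LinkedHashSet<>();
        for (long doc = commitEvery; doc <= totalDocs; doc += commitEvery) {
            offsets.add(doc % batchSize);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Perfectly aligned: commits always land on a batch boundary.
        System.out.println(commitOffsets(1000, 1000, 100_000));          // [0]
        // Coprime values: commits sweep across all in-batch positions.
        System.out.println(commitOffsets(1000, 313, 1_000_000).size());  // 1000
    }
}
```

Choosing batchSize and commitEvery with a small greatest common divisor maximizes the number of distinct offsets, which is what rules out artificially favorable overlaps.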

NOTE: the Rally queries in the nightly benchmarks also require results to be sorted on the @timestamp field (rather than relying on the order the data is sorted in the index). Still, we use count when running the query instead of a full search operation that returns all data, because in both benchmarking scenarios we expect exactly the same set of documents to be fetched (with the same fetch pattern). We want to focus on what differs between the two setups and rule out what is expected to stay the same. This lets us concentrate the investigation on the actual work done during the search phase rather than on re-sorting documents on @timestamp and fetching them from disk.

@salvatore-campagna salvatore-campagna self-assigned this Feb 24, 2025
@salvatore-campagna salvatore-campagna changed the title feature: benchmark date field with doc values sparse index Benchmark date field range query with doc values sparse index Feb 24, 2025

salvatore-campagna commented Feb 24, 2025

Benchmarking

Note: use a nightly build of async-profiler; the stable release crashes.
Note: on macOS you need to allow execution of libasyncProfiler.dylib in System Settings > Privacy & Security.

Capture CPU events

./gradlew -p benchmarks run \
--args 'DateFieldMapperDocValuesSkipperBenchmark -prof "async:libPath=/Users/salvatore.campagna/async-profiler-3.0-f71c31a-macos/lib/libasyncProfiler.dylib;dir=/Users/salvatore.campagna/workspace/elasticsearch/flamegraph;event=cpu;output=flamegraph"'

indexWriter.addDocument(doc);
}

indexWriter.commit();

I will change this to commit every n documents, so that the query runs against multiple segments per index. This should make the benchmark a bit more realistic.


salvatore-campagna commented Feb 24, 2025

Previously, the index was a single segment, making it hard to see CPU differences between skipper and non-skipper. By adding commitEvery, we force multiple segments, which better reflects real-world usage and highlights skipper’s impact. We also use a single-threaded search executor to isolate any doc-values skipping overhead from concurrency effects.
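Ignoring background merges, commitEvery gives a simple upper bound on how many segments a query has to visit. A quick sketch, where the commitEvery value of 100000 is a hypothetical choice (only the document count comes from the benchmark):

```java
public class SegmentCount {
    // Upper bound on the number of segments produced by committing every
    // `commitEvery` documents; background merges usually reduce this.
    static long maxSegments(long nDocs, long commitEvery) {
        return (nDocs + commitEvery - 1) / commitEvery; // ceiling division
    }

    public static void main(String[] args) {
        // With the benchmark's 1343120 docs and a hypothetical commitEvery
        // of 100000, the query visits up to 14 segments instead of 1.
        System.out.println(maxSegments(1_343_120, 100_000)); // 14
    }
}
```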

rangeEndTimestamp,
rangeQuery
);
return searcher.search(query, nDocs, QUERY_SORT).totalHits.value();

@salvatore-campagna salvatore-campagna Feb 24, 2025


I will use count here instead of search to rule out the work we do when collecting documents. The expectation is that, whether or not we use doc values sparse indices, the fetch phase needs to fetch the same set of documents, which are laid out on disk in the same way in both scenarios. Hopefully, using count will isolate the search phase in the flame graph.

new Runner(options).run();
}

@Param("1343120")

We use a large number of documents here. Even so, we can't really prevent everything from ending up in memory unless we generate a very large set of documents, which is probably not ideal for a JMH benchmark.

@salvatore-campagna salvatore-campagna marked this pull request as ready for review February 25, 2025 10:16
@elasticsearchmachine

Pinging @elastic/es-storage-engine (Team:StorageEngine)


@martijnvg martijnvg left a comment


LGTM 👍

@salvatore-campagna salvatore-campagna merged commit 86a6c93 into elastic:main Feb 26, 2025
17 checks passed