
Conversation

@ankikuma
Contributor

@ankikuma ankikuma commented Jun 16, 2025

This PR addresses ES-12071.

We want to collect metrics for the time spent waiting for the next chunk of a bulk request. This can help with diagnosing high bulk latency when the latency is attributable to external factors such as a slow network connection.

@ankikuma ankikuma added Team:Distributed Indexing Meta label for Distributed Indexing team >non-issue labels Jun 16, 2025
@ankikuma ankikuma marked this pull request as ready for review June 17, 2025 10:57
@ankikuma ankikuma requested a review from a team as a code owner June 17, 2025 10:57
@ankikuma ankikuma requested review from henningandersen and tlrx June 17, 2025 10:57
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:Distributed Indexing Meta label for Distributed Indexing team labels Jun 17, 2025
@ankikuma ankikuma added the Team:Distributed Indexing Meta label for Distributed Indexing team label Jun 17, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Distributed Indexing Meta label for Distributed Indexing team label Jun 17, 2025
@ankikuma ankikuma added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. label Jun 17, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Jun 17, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Jun 17, 2025
Member

@tlrx tlrx left a comment


Looks good, I left some comments

public IncrementalBulkService(
Client client,
IndexingPressure indexingPressure,
BulkOperationWaitForChunkMetrics bulkOperationWaitForChunkMetrics
Member

I wonder what the advantages are of using a dedicated BulkOperationWaitForChunkMetrics object here? Maybe just injecting the MeterRegistry and declaring the histogram metric in IncrementalBulkService would be simpler.

Member

Later on, calling updateWaitForChunkMetrics would update the metric directly instead of delegating to BulkOperationWaitForChunkMetrics too.
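
A minimal sketch of what that alternative could look like, assuming the MeterRegistry API already shown in this diff (registerLongHistogram returning a LongHistogram with a record method); the metric name follows the naming suggested later in this review, and the field, constructor, and method names are illustrative rather than the final ones:

import org.elasticsearch.client.internal.Client;
import org.elasticsearch.index.IndexingPressure;
import org.elasticsearch.telemetry.metric.LongHistogram;
import org.elasticsearch.telemetry.metric.MeterRegistry;

public class IncrementalBulkService {

    private final LongHistogram chunkWaitTimeInMillisHistogram;

    // Constructor takes the MeterRegistry directly instead of a dedicated metrics wrapper.
    public IncrementalBulkService(Client client, IndexingPressure indexingPressure, MeterRegistry meterRegistry) {
        // other fields omitted for brevity
        this.chunkWaitTimeInMillisHistogram = meterRegistry.registerLongHistogram(
            "es.rest.incremental_bulk.wait_for_next_chunk.duration.histogram",
            "Total time in millis spent waiting for the next chunk of a bulk request",
            "ms"
        );
    }

    // Updates the histogram directly, no delegation to a separate metrics object.
    public void recordWaitForNextChunkTime(long waitForNextChunkTimeInMillis) {
        chunkWaitTimeInMillisHistogram.record(waitForNextChunkTimeInMillis);
    }
}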

return incrementalOperation;
}

public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) {
Member

Maybe something like this?

Suggested change
public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) {
public void recordWaitForNextChunkTime(long waitForNextChunkTimeInMillis) {

final IncrementalBulkService incrementalBulkService = new IncrementalBulkService(
client,
indexingLimits,
bulkOperationWaitForChunkMetrics
Member

I would inject telemetryProvider.getMeterRegistry() directly
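
A sketch of that wiring at the construction site, assuming a telemetryProvider is in scope here as it is elsewhere in Node construction:

final IncrementalBulkService incrementalBulkService = new IncrementalBulkService(
    client,
    indexingLimits,
    telemetryProvider.getMeterRegistry()  // pass the registry itself, not a wrapper object
);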

b.bind(PageCacheRecycler.class).toInstance(pageCacheRecycler);
b.bind(IngestService.class).toInstance(ingestService);
b.bind(IndexingPressure.class).toInstance(indexingLimits);
b.bind(BulkOperationWaitForChunkMetrics.class).toInstance(bulkOperationWaitForChunkMetrics);
Member

Binding is only necessary if BulkOperationWaitForChunkMetrics is injected through Guice somewhere, and I don't think that is the case here?

import org.elasticsearch.telemetry.metric.MeterRegistry;

public class BulkOperationWaitForChunkMetrics {
public static final String CHUNK_WAIT_TIME_HISTOGRAM = "es.rest.wait.duration.histogram";
Member

I think we should mention incremental bulk request / chunking:
es.rest.incremental_bulk.wait_for_next_chunk.duration.histogram

(or something along those lines)

public static final String CHUNK_WAIT_TIME_HISTOGRAM = "es.rest.wait.duration.histogram";

/* Capture in milliseconds because the APM histogram only has a range of 100,000 */
private final LongHistogram chunkWaitTimeMillisHistogram;
Member

Suggested change
private final LongHistogram chunkWaitTimeMillisHistogram;
private final LongHistogram chunkWaitTimeInMillisHistogram;

meterRegistry.registerLongHistogram(
CHUNK_WAIT_TIME_HISTOGRAM,
"Total time in millis spent waiting for next chunk of a bulk request",
"centis"
Member

Suggested change
"centis"
"ms"

this.restChannel = restChannel;
this.handler = handlerSupplier.get();
request.contentStream().next();
requestNextChunkTime = System.nanoTime();
Member

We often pass a LongSupplier instead of calling the real time-measurement method directly; it makes testing easier.
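
A small self-contained sketch of that pattern; the class and method names are illustrative, not from the PR:

import java.util.function.LongSupplier;

class ChunkWaitTimer {
    private final LongSupplier relativeTimeInNanos;  // System::nanoTime in production
    private long requestNextChunkTime;

    ChunkWaitTimer(LongSupplier relativeTimeInNanos) {
        this.relativeTimeInNanos = relativeTimeInNanos;
    }

    // Capture the moment the next chunk was requested.
    void markNextChunkRequested() {
        requestNextChunkTime = relativeTimeInNanos.getAsLong();
    }

    // Elapsed time since the last request was issued.
    long elapsedSinceRequestInNanos() {
        return relativeTimeInNanos.getAsLong() - requestNextChunkTime;
    }
}

// In production: new ChunkWaitTimer(System::nanoTime)
// In a test, a fake clock makes elapsed-time assertions deterministic:
//   AtomicLong fakeClock = new AtomicLong();
//   ChunkWaitTimer timer = new ChunkWaitTimer(fakeClock::get);
//   timer.markNextChunkRequested();
//   fakeClock.addAndGet(5_000_000L);  // advance 5 ms
//   assert timer.elapsedSinceRequestInNanos() == 5_000_000L;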

}
totalChunkWaitTime = TimeUnit.NANOSECONDS.toMillis(totalChunkWaitTime);
handler.updateWaitForChunkMetrics(totalChunkWaitTime);
totalChunkWaitTime = 0L;
Member

Suggested change
totalChunkWaitTime = 0L;
totalChunkWaitTime = -1L;

and then assert totalChunkWaitTime >= 0L in the handleChunk method?
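
A self-contained sketch of that sentinel-plus-assert idea, under the assumption that the accounting is re-armed at the start of each request (names are illustrative):

import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

class WaitTimeAccounting {
    private long totalChunkWaitTime = 0L;

    // Accumulate wait time for a chunk; the assert is the check suggested above.
    void onChunk(long elapsedNanos) {
        assert totalChunkWaitTime >= 0L : "chunk handled after reporting without re-arming";
        totalChunkWaitTime += elapsedNanos;
    }

    // Report the accumulated time and disarm, so a stray handleChunk trips the assert.
    void report(LongConsumer metricsSink) {
        metricsSink.accept(TimeUnit.NANOSECONDS.toMillis(totalChunkWaitTime));
        totalChunkWaitTime = -1L;
    }

    // Re-arm before the next bulk request starts accumulating.
    void reset() {
        totalChunkWaitTime = 0L;
    }
}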

handler.addItems(toPass, () -> Releasables.close(releasables), () -> request.contentStream().next());
handler.addItems(toPass, () -> Releasables.close(releasables), () -> {
request.contentStream().next();
requestNextChunkTime = System.nanoTime();
Member

Do you know if request.contentStream().next(); immediately calls handleChunk if data is available? Otherwise I wonder if we want to capture the time before calling next.

Contributor

I think we should do it prior, seems more correct regardless.
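
In terms of the hunk above, the reordering would look roughly like this (keeping System.nanoTime() as in the hunk; swapping in the LongSupplier suggested earlier would not change the ordering):

handler.addItems(toPass, () -> Releasables.close(releasables), () -> {
    // capture the start time before asking for the next chunk, in case
    // request.contentStream().next() invokes handleChunk synchronously
    requestNextChunkTime = System.nanoTime();
    request.contentStream().next();
});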

Contributor

@henningandersen henningandersen left a comment

LGTM.

long elapsedTime = System.nanoTime() - requestNextChunkTime;
if (elapsedTime > 0) {
totalChunkWaitTime += elapsedTime;
requestNextChunkTime = 0L;
Contributor

Should we instead reset this to what System.nanoTime() gave above? 0 is not a special value and we do not seem to guard against it.
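
A sketch of the hunk with that change applied, reading System.nanoTime() once and reusing it (names follow the diff):

long now = System.nanoTime();
long elapsedTime = now - requestNextChunkTime;
if (elapsedTime > 0) {
    totalChunkWaitTime += elapsedTime;
}
// reset to the measurement just taken rather than to 0, which is not a guarded sentinel
requestNextChunkTime = now;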

handler.addItems(toPass, () -> Releasables.close(releasables), () -> request.contentStream().next());
handler.addItems(toPass, () -> Releasables.close(releasables), () -> {
request.contentStream().next();
requestNextChunkTime = System.nanoTime();
Contributor

I think we should do it prior, seems more correct regardless.

} else {
Releasables.close(releasables);
request.contentStream().next();
requestNextChunkTime = System.nanoTime();
Contributor

Also set this before calling next here.

private boolean bulkInProgress = false;
private Exception bulkActionLevelFailure = null;
private BulkRequest bulkRequest = null;
private final BulkOperationWaitForChunkMetrics bulkOperationWaitForChunkMetrics;
Contributor

Can we comment that the only reason this lives in this class is that it is simpler to inject here than into RestBulkAction?
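
One possible wording for such a comment (mine, not from the PR):

/* Only lives in IncrementalBulkService because it is simpler to inject here than into
 * RestBulkAction, which is where the wait-for-next-chunk time is actually measured. */
private final BulkOperationWaitForChunkMetrics bulkOperationWaitForChunkMetrics;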

}

public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) {
if (bulkOperationWaitForChunkMetrics != null) {
Contributor

I wonder if we can assert that it is not null instead here?
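
A sketch of the null check turned into an assertion, assuming the metrics object is always wired in production; the delegate method name is illustrative:

public void recordWaitForNextChunkTime(long waitForNextChunkTimeInMillis) {
    assert bulkOperationWaitForChunkMetrics != null : "wait-for-chunk metrics not wired";
    bulkOperationWaitForChunkMetrics.recordChunkWaitTime(waitForNextChunkTimeInMillis);
}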

@fcofdez
Contributor

fcofdez commented Jun 19, 2025

@tlrx I think that I went through all the comments, maybe you can take a look before we merge this? thanks!

Member

@tlrx tlrx left a comment

LGTM

@fcofdez
Contributor

fcofdez commented Jun 19, 2025

@elasticmachine test this (unrelated test failure GetSnapshotsIT > testFilterByState FAILED)

@fcofdez fcofdez merged commit 9e19b85 into elastic:main Jun 20, 2025
27 checks passed
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025

Co-authored-by: Francisco Fernández Castaño <[email protected]>
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025

Co-authored-by: Francisco Fernández Castaño <[email protected]>

Labels

:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >non-issue Team:Distributed Indexing Meta label for Distributed Indexing team v9.1.0
