-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Metrics to account for time spent waiting for next chunk #129469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nkMetric Refresh branch
…sticsearch into 06142025/WaitForChunkMetric git pull
…nkMetric refresh branch
|
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing) |
tlrx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I left some comments
| public IncrementalBulkService( | ||
| Client client, | ||
| IndexingPressure indexingPressure, | ||
| BulkOperationWaitForChunkMetrics bulkOperationWaitForChunkMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what are the advantages of using a dedicated BulkOperationWaitForChunkMetrics object here? Maube just inject the MeterRegistry and declare the histogram metric in IncrementalBulkService would be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later on, calling updateWaitForChunkMetrics would update the metric directly instead of delegating to BulkOperationWaitForChunkMetrics too.
| return incrementalOperation; | ||
| } | ||
|
|
||
| public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this?
| public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) { | |
| public void recordWaitForNextChunkTime(long waitForNextChunkTimeInMillis) { |
| final IncrementalBulkService incrementalBulkService = new IncrementalBulkService( | ||
| client, | ||
| indexingLimits, | ||
| bulkOperationWaitForChunkMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would inject telemetryProvider.getMeterRegistry() directly
| b.bind(PageCacheRecycler.class).toInstance(pageCacheRecycler); | ||
| b.bind(IngestService.class).toInstance(ingestService); | ||
| b.bind(IndexingPressure.class).toInstance(indexingLimits); | ||
| b.bind(BulkOperationWaitForChunkMetrics.class).toInstance(bulkOperationWaitForChunkMetrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Binding is necessary if the BulkOperationWaitForChunkMetrics is injected through Guice somewhere, I don't think it is the case?
| import org.elasticsearch.telemetry.metric.MeterRegistry; | ||
|
|
||
| public class BulkOperationWaitForChunkMetrics { | ||
| public static final String CHUNK_WAIT_TIME_HISTOGRAM = "es.rest.wait.duration.histogram"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should mention incremental bulk request / chunking:
es.rest.incremental_bulk.wait_for_next_chunk.duration.histogram
(or something along those lines)
| public static final String CHUNK_WAIT_TIME_HISTOGRAM = "es.rest.wait.duration.histogram"; | ||
|
|
||
| /* Capture in milliseconds because the APM histogram only has a range of 100,000 */ | ||
| private final LongHistogram chunkWaitTimeMillisHistogram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private final LongHistogram chunkWaitTimeMillisHistogram; | |
| private final LongHistogram chunkWaitTimeInMillisHistogram; |
| meterRegistry.registerLongHistogram( | ||
| CHUNK_WAIT_TIME_HISTOGRAM, | ||
| "Total time in millis spent waiting for next chunk of a bulk request", | ||
| "centis" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "centis" | |
| "ms" |
| this.restChannel = restChannel; | ||
| this.handler = handlerSupplier.get(); | ||
| request.contentStream().next(); | ||
| requestNextChunkTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We often pass LongSupplier instead of the real time measurement method, it makes testing easier.
| } | ||
| totalChunkWaitTime = TimeUnit.NANOSECONDS.toMillis(totalChunkWaitTime); | ||
| handler.updateWaitForChunkMetrics(totalChunkWaitTime); | ||
| totalChunkWaitTime = 0L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| totalChunkWaitTime = 0L; | |
| totalChunkWaitTime = -1L; |
and then assert totalChunkWaitTime>= 0L in the handleChunk method?
| handler.addItems(toPass, () -> Releasables.close(releasables), () -> request.contentStream().next()); | ||
| handler.addItems(toPass, () -> Releasables.close(releasables), () -> { | ||
| request.contentStream().next(); | ||
| requestNextChunkTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if request.contentStream().next(); immediately calls handleChunk if data is available? Otherwise I wonder if we want to capture the time before calling next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do it prior, seems more correct regardless.
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| long elapsedTime = System.nanoTime() - requestNextChunkTime; | ||
| if (elapsedTime > 0) { | ||
| totalChunkWaitTime += elapsedTime; | ||
| requestNextChunkTime = 0L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead reset this to what System.nanoTime() gave above? 0 is not a special value and we do not seem to guard against it.
| handler.addItems(toPass, () -> Releasables.close(releasables), () -> request.contentStream().next()); | ||
| handler.addItems(toPass, () -> Releasables.close(releasables), () -> { | ||
| request.contentStream().next(); | ||
| requestNextChunkTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do it prior, seems more correct regardless.
| } else { | ||
| Releasables.close(releasables); | ||
| request.contentStream().next(); | ||
| requestNextChunkTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also set this before calling next here.
| private boolean bulkInProgress = false; | ||
| private Exception bulkActionLevelFailure = null; | ||
| private BulkRequest bulkRequest = null; | ||
| private final BulkOperationWaitForChunkMetrics bulkOperationWaitForChunkMetrics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we comment that the only reason this lives in this class is that it is simpler to inject here than into RestBulkAction?
| } | ||
|
|
||
| public void updateWaitForChunkMetrics(long chunkWaitTimeCentis) { | ||
| if (bulkOperationWaitForChunkMetrics != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can assert that it is not null instead here?
|
@tlrx I think that I went through all the comments, maybe you can take a look before we merge this? thanks! |
tlrx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@elasticmachine test this (unrelated test failure |
) This PR addresses ES-12071. We want to collect metrics for the time that is spent waiting for the next chunk of a bulk request. This can help with diagnosing high bulk latency in case the latency is attributable to external factors such as network connection. Co-authored-by: Francisco Fernández Castaño <[email protected]>
) This PR addresses ES-12071. We want to collect metrics for the time that is spent waiting for the next chunk of a bulk request. This can help with diagnosing high bulk latency in case the latency is attributable to external factors such as network connection. Co-authored-by: Francisco Fernández Castaño <[email protected]>
This PR addresses ES-12071.
We want to collect metrics for the time that is spent waiting for the next chunk of a bulk request. This can help with diagnosing high bulk latency in case the latency is attributable to external factors such as network connection.