
Conversation


@gianm gianm commented Nov 21, 2025

Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics task/run/time, task/success/count, and task/failed/count only for tasks that complete due to an attached runner callback (from attachCallbacks). This patch causes metrics to be emitted whenever notifyStatus successfully marks a task as completed, which covers a wider variety of scenarios.

The prior behavior missed scenarios where the shutdown API is used on a task that the runner is aware of but that has not yet been added to the queue. This can happen during Overlord startup, while the queue is initializing.

This patch also fixes a bug in TaskQueue#getTaskStatus, which was using the status from the taskRunner rather than from activeTasks. The status from activeTasks is more authoritative and should be preferred. The bug caused flakiness in MSQWorkerFaultToleranceTest, which was exacerbated by the metrics changes above: metrics were emitted slightly earlier, so faultyIndexer.stop() was called slightly earlier. Fixing the TaskQueue#getTaskStatus bug appears to have resolved the flakiness in the test.
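
As a rough illustration of the idea (this is not the actual TaskQueue code; markTaskCompleted is an assumed helper standing in for the real bookkeeping), the change amounts to something like:

    // Hypothetical, heavily simplified sketch; the real TaskQueue logic is more involved.
    private void notifyStatus(final Task task, final TaskStatus status)
    {
      // Persist the completed status and update in-memory bookkeeping.
      // (markTaskCompleted is an assumed helper, not the real method name.)
      final boolean markedCompleted = markTaskCompleted(task, status);

      if (markedCompleted) {
        // Emit task/run/time, task/success/count, and task/failed/count here,
        // rather than only from the runner callback registered via attachCallbacks.
        // This also covers tasks shut down before they were added to the queue,
        // e.g. during Overlord startup while the queue is initializing.
        emitTaskCompletionLogsAndMetrics(task, status);
      }
    }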

Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics
task/run/time, task/success/count, and task/failed/count only for tasks
that complete due to an attached runner callback (from attachCallbacks).
This patch causes metrics to be emitted whenever notifyStatus successfully
marks a task as completed.

The prior behavior missed scenarios where the shutdown API is used on a
task that the runner is aware of but has not yet been added to the queue.
It could happen during Overlord startup, while the queue is initializing.

@kfaraz kfaraz left a comment


LGTM, the test failures seem genuine though.

@gianm gianm changed the title from "Emit task metrics on all task completions." to "Emit metrics on all task completions." on Jan 7, 2026

gianm commented Jan 8, 2026

I believe the Docker test failure is due to a new race in IngestionSmokeTest#test_streamLogs_ofCancelledTask. This code now fails because the streamOptional is empty:

    eventCollector.latchableEmitter().waitForEvent(
        event -> event.hasMetricName("task/run/time")
                      .hasDimension(DruidMetrics.TASK_ID, taskId)
                      .hasDimension(DruidMetrics.TASK_STATUS, "FAILED")
    );

    final Optional<InputStream> streamOptional =
        overlord.bindings()
                .getInstance(TaskLogStreamer.class)
                .streamTaskLog(taskId, 0);

    Assertions.assertTrue(streamOptional.isPresent());

It happens because, with this patch, the task/run/time metric is emitted in notifyStatus, which for a canceled task runs somewhat before the task actually stops running. Formerly, the metric in this scenario would be emitted when the task runner reports the task as finished and fires its callback. So now there's a window between when task/run/time is emitted and when the log is written to S3.

I believe there has always been a small window after a task is canceled during which its logs are not available. I have seen it myself in production: sometimes you get a 404 on the task log for a recently-canceled task. But now the test can see it, due to the timing of metric emission.

I do believe that emitting metrics in notifyStatus, as this patch does, is better than what we were doing before. notifyStatus is the method that updates the metadata store, which is canonical, so tying metrics to that makes the metrics more accurate. (That was the bug this patch is fixing.)

@kfaraz I'm wondering if you have a better idea to deal with the timing issue, beyond adding a sleep here. That's the best thing I can think of right now.


kfaraz commented Jan 8, 2026

@kfaraz I'm wondering if you have a better idea to deal with the timing issue, beyond adding a sleep here. That's the best thing I can think of right now.

Thanks for the clarification, @gianm!
The only alternative I can think of would be to emit a metric in S3TaskLogs.pushTaskLog(), but if that seems like overkill, we can proceed with a sleep to unblock this PR.

I think there are certain APIs for which there is no option but to use a test-sleep-repeat pattern.
We should consider adding something similar to ITRetryUtil.retryUntil() to EmbeddedClusterApis for this purpose.
It would give us uniform logic and make it easier to fix up those call sites later, if applicable.
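
As a rough illustration of the kind of helper being suggested (the class name and signature below are hypothetical, not the actual ITRetryUtil or EmbeddedClusterApis API):

    import java.util.function.Predicate;
    import java.util.function.Supplier;

    public final class TestRetryUtils
    {
      // Polls "call" until "condition" holds or the timeout elapses.
      // Sketch only; the real utility would presumably live in EmbeddedClusterApis.
      public static <T> T retryUntil(
          final Supplier<T> call,
          final Predicate<T> condition,
          final long timeoutMillis,
          final long pollMillis
      ) throws InterruptedException
      {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
          final T result = call.get();
          if (condition.test(result)) {
            return result;
          }
          if (System.currentTimeMillis() >= deadline) {
            throw new IllegalStateException("Condition not met within " + timeoutMillis + " ms");
          }
          Thread.sleep(pollMillis);
        }
      }
    }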


gianm commented Jan 8, 2026

IMO it's overkill to add a metric to the task log pusher, so I went the route of adding a new utility. The call site looks like this:

    final Optional<InputStream> streamOptional =
        cluster.callApi().waitForResult(
            () -> overlord.bindings()
                          .getInstance(TaskLogStreamer.class)
                          .streamTaskLog(taskId, 0),
            Optional::isPresent
        ).go();

The reason for the extra .go() is that the object returned by waitForResult is builder-like, with methods such as withTimeoutMillis.
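
For reference, here is a hypothetical sketch of what such a builder-like ResultWaiter could look like; only the waitForResult, withTimeoutMillis, and go names come from the discussion above, and everything else (fields, defaults, exception behavior) is illustrative:

    import java.util.function.Predicate;
    import java.util.function.Supplier;

    public class ResultWaiter<T>
    {
      private final Supplier<T> call;
      private final Predicate<T> condition;
      private long timeoutMillis = 60_000;
      private long pollMillis = 100;

      public ResultWaiter(final Supplier<T> call, final Predicate<T> condition)
      {
        this.call = call;
        this.condition = condition;
      }

      // Builder-style setter; defaults are illustrative.
      public ResultWaiter<T> withTimeoutMillis(final long timeoutMillis)
      {
        this.timeoutMillis = timeoutMillis;
        return this;
      }

      // Polls the call until the condition holds, then returns the matching result.
      public T go() throws InterruptedException
      {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        T result = call.get();
        while (!condition.test(result)) {
          if (System.currentTimeMillis() >= deadline) {
            throw new IllegalStateException("Result did not satisfy condition within " + timeoutMillis + " ms");
          }
          Thread.sleep(pollMillis);
          result = call.get();
        }
        return result;
      }
    }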

     * retry loops and is therefore both more responsive, and better at catching race conditions. Use this method
     * when there is no metric to wait on, and you believe that adding one would be overkill.
     */
    public <T> ResultWaiter<T> waitForResult(
Contributor


Thanks for adding this and for the javadocs!

@gianm gianm merged commit 9c55da6 into apache:master Jan 8, 2026
110 of 114 checks passed
@gianm gianm deleted the ol-missing-metrics branch January 8, 2026 15:55