
Conversation


@gianm gianm commented Nov 21, 2025

Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics task/run/time, task/success/count, and task/failed/count only for tasks that complete due to an attached runner callback (from attachCallbacks). This patch causes metrics to be emitted whenever notifyStatus successfully marks a task as completed, which covers a wider variety of scenarios.

The prior behavior missed scenarios where the shutdown API is used on a task that the runner is aware of but that has not yet been added to the queue. This can happen during Overlord startup, while the queue is initializing.

This patch also fixes a bug in TaskQueue#getTaskStatus, which was using the status from the taskRunner rather than from activeTasks. The status from activeTasks is more authoritative and should be preferred. The bug caused flakiness in MSQWorkerFaultToleranceTest, which was exacerbated by the metrics changes above: metrics were emitted slightly earlier, so faultyIndexer.stop() was called slightly earlier. Fixing the TaskQueue#getTaskStatus bug appears to have resolved the flakiness in the test.
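
As a rough illustration of the idea (this is not the actual TaskQueue code; markTaskCompleted is an assumed helper standing in for the real bookkeeping), the change amounts to something like:

    // Hypothetical, heavily simplified sketch; the real TaskQueue logic is more involved.
    private void notifyStatus(final Task task, final TaskStatus status)
    {
      // Persist the completed status and update in-memory bookkeeping.
      // (markTaskCompleted is an assumed helper, not the real method name.)
      final boolean markedCompleted = markTaskCompleted(task, status);

      if (markedCompleted) {
        // Emit task/run/time, task/success/count, and task/failed/count here,
        // rather than only from the runner callback registered via attachCallbacks.
        // This also covers tasks shut down before they were added to the queue,
        // e.g. during Overlord startup while the queue is initializing.
        emitTaskCompletionLogsAndMetrics(task, status);
      }
    }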

Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics
task/run/time, task/success/count, and task/failed/count only for tasks
that complete due to an attached runner callback (from attachCallbacks).
This patch causes metrics to be emitted whenever notifyStatus successfully
marks a task as completed.

The prior behavior missed scenarios where the shutdown API is used on a
task that the runner is aware of but has not yet been added to the queue.
It could happen during Overlord startup, while the queue is initializing.

@kfaraz kfaraz left a comment


LGTM, the test failures seem genuine though.

@gianm gianm changed the title from "Emit task metrics on all task completions." to "Emit metrics on all task completions." on Jan 7, 2026

gianm commented Jan 8, 2026

I believe the Docker test failure is due to a new race in IngestionSmokeTest#test_streamLogs_ofCancelledTask. This code now fails because the streamOptional is empty:

    eventCollector.latchableEmitter().waitForEvent(
        event -> event.hasMetricName("task/run/time")
                      .hasDimension(DruidMetrics.TASK_ID, taskId)
                      .hasDimension(DruidMetrics.TASK_STATUS, "FAILED")
    );

    final Optional<InputStream> streamOptional =
        overlord.bindings()
                .getInstance(TaskLogStreamer.class)
                .streamTaskLog(taskId, 0);

    Assertions.assertTrue(streamOptional.isPresent());

It happens because, with this patch, the task/run/time metric is emitted in notifyStatus, which for a canceled task runs somewhat before the task actually stops running. Formerly, the metric in this scenario would be emitted when the task runner reports the task as finished and fires its callback. So now there's a window between when task/run/time is emitted and when the log is written to S3.

I believe there has always been a small window after a task is canceled during which its logs are not available. I have seen it myself in production: sometimes you get a 404 on the task log for a recently-canceled task. But now the test can see it, due to the timing of metric emission.

I do believe that emitting metrics in notifyStatus, as this patch does, is better than what we were doing before. notifyStatus is the method that updates the metadata store, which is canonical, so tying metrics to that makes the metrics more accurate. (That was the bug this patch is fixing.)

@kfaraz I'm wondering if you have a better idea to deal with the timing issue, beyond adding a sleep here. That's the best thing I can think of right now.


kfaraz commented Jan 8, 2026

@kfaraz I'm wondering if you have a better idea to deal with the timing issue, beyond adding a sleep here. That's the best thing I can think of right now.

Thanks for the clarification, @gianm!
The only alternative I can think of would be to emit a metric in S3TaskLogs.pushTaskLog(), but if that seems like overkill, we can proceed with a sleep to unblock this PR.

I think there are certain APIs for which there is no option but to use a test-sleep-repeat pattern.
We should consider adding something similar to ITRetryUtil.retryUntil() to EmbeddedClusterApis for this purpose.
It would give us uniform logic and make it easier to fix up those call sites later, if applicable.
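
As a rough illustration of the kind of helper being suggested (the class name and signature below are hypothetical, not the actual ITRetryUtil or EmbeddedClusterApis API):

    import java.util.function.Predicate;
    import java.util.function.Supplier;

    public final class TestRetryUtils
    {
      // Polls "call" until "condition" holds or the timeout elapses.
      // Sketch only; the real utility would presumably live in EmbeddedClusterApis.
      public static <T> T retryUntil(
          final Supplier<T> call,
          final Predicate<T> condition,
          final long timeoutMillis,
          final long pollMillis
      ) throws InterruptedException
      {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
          final T result = call.get();
          if (condition.test(result)) {
            return result;
          }
          if (System.currentTimeMillis() >= deadline) {
            throw new IllegalStateException("Condition not met within " + timeoutMillis + " ms");
          }
          Thread.sleep(pollMillis);
        }
      }
    }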


gianm commented Jan 8, 2026

IMO it's overkill to add a metric to the task log pusher, so I went the route of adding a new utility. The call site looks like this:

    final Optional<InputStream> streamOptional =
        cluster.callApi().waitForResult(
            () -> overlord.bindings()
                          .getInstance(TaskLogStreamer.class)
                          .streamTaskLog(taskId, 0),
            Optional::isPresent
        ).go();

The reason for the extra .go() is that the object returned by waitForResult is builder-like, with methods such as withTimeoutMillis.
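
For reference, here is a hypothetical sketch of what such a builder-like ResultWaiter could look like; only the waitForResult, withTimeoutMillis, and go names come from the discussion above, and everything else (fields, defaults, exception behavior) is illustrative:

    import java.util.function.Predicate;
    import java.util.function.Supplier;

    public class ResultWaiter<T>
    {
      private final Supplier<T> call;
      private final Predicate<T> condition;
      private long timeoutMillis = 60_000;
      private long pollMillis = 100;

      public ResultWaiter(final Supplier<T> call, final Predicate<T> condition)
      {
        this.call = call;
        this.condition = condition;
      }

      // Builder-style setter; defaults are illustrative.
      public ResultWaiter<T> withTimeoutMillis(final long timeoutMillis)
      {
        this.timeoutMillis = timeoutMillis;
        return this;
      }

      // Polls the call until the condition holds, then returns the matching result.
      public T go() throws InterruptedException
      {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        T result = call.get();
        while (!condition.test(result)) {
          if (System.currentTimeMillis() >= deadline) {
            throw new IllegalStateException("Result did not satisfy condition within " + timeoutMillis + " ms");
          }
          Thread.sleep(pollMillis);
          result = call.get();
        }
        return result;
      }
    }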

     * retry loops and is therefore both more responsive, and better at catching race conditions. Use this method
     * when there is no metric to wait on, and you believe that adding one would be overkill.
     */
    public <T> ResultWaiter<T> waitForResult(
Contributor


Thanks for adding this and for the javadocs!

@gianm gianm merged commit 9c55da6 into apache:master Jan 8, 2026
110 of 114 checks passed
@gianm gianm deleted the ol-missing-metrics branch January 8, 2026 15:55