-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Emit metrics on all task completions. #18766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics task/run/time, task/success/count, and task/failed/count only for tasks that complete due to an attached runner callback (from attachCallbacks). This patch causes metrics to be emitted whenever notifyStatus successfully marks a task as completed. The prior behavior missed scenarios where the shutdown API is used on a task that the runner is aware of but has not yet been added to the queue. It could happen during Overlord startup, while the queue is initializing.
kfaraz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the test failures seem genuine though.
… race with its finishing.
…ve than runner status.
|
I believe the Docker test failure is due to a new race in eventCollector.latchableEmitter().waitForEvent(
event -> event.hasMetricName("task/run/time")
.hasDimension(DruidMetrics.TASK_ID, taskId)
.hasDimension(DruidMetrics.TASK_STATUS, "FAILED")
);
final Optional<InputStream> streamOptional =
overlord.bindings()
.getInstance(TaskLogStreamer.class)
.streamTaskLog(taskId, 0);
Assertions.assertTrue(streamOptional.isPresent());It happens because, with this patch, the I believe a little window where logs are not available shortly after tasks are canceled has always existed. I have seen it myself in production: sometimes you get a 404 on the task log for a recently-canceled task. But now it's visible to the test due to the timing of metric emission. I do believe that emitting metrics in @kfaraz I'm wondering if you have a better idea to deal with the timing issue, beyond adding a sleep here. That's the best thing I can think of right now. |
Thanks for the clarification, @gianm ! I think there are certain APIs for which there is no option but to have a test-sleep-repeat pattern. |
|
IMO it's overkill to add a metric to the task log pusher, so I went the route of adding a new utility. The call site looks like this: The reason for the extra |
| * retry loops and is therefore both more responsive, and better at catching race conditions. Use this method | ||
| * when there is no metric to wait on, and you believe that adding one would be overkill. | ||
| */ | ||
| public <T> ResultWaiter<T> waitForResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this and for the javadocs!
Previously, "emitTaskCompletionLogsAndMetrics" would emit the metrics task/run/time, task/success/count, and task/failed/count only for tasks that complete due to an attached runner callback (from attachCallbacks). This patch causes metrics to be emitted whenever notifyStatus successfully marks a task as completed, which covers a wider variety of scenarios.
The prior behavior missed scenarios where the shutdown API is used on a task that the runner is aware of but has not yet been added to the queue. It could happen during Overlord startup, while the queue is initializing.
This patch also fixes a bug in
TaskQueue#getTaskStatus, where it was using status from thetaskRunnerrather thanactiveTasks. The status fromactiveTasksis more authoritative and should be preferred. This caused flakiness inMSQWorkerFaultToleranceTest, which was being exacerbated after the metrics changes above, by the fact that metrics were emitted slightly earlier and thereforefaultyIndexer.stop()was called slightly earlier. Fixing theTaskQueue#getTaskStatusbug seems to have resolved the flakiness in the test.