Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Feb 7, 2025

The aggs timeout test waits for the agg to return and then double checks that the agg is stopped using the tasks API. We're seeing some failures where the tasks API reports that the agg is still running. I can't reproduce them because computers. This adds two things to hopefully fix the bug:

  1. Logs the hot_threads so we can see if the query is indeed still running.
  2. Retries the _tasks API for a minute. If it goes away soon after the _search returns that's fine. If it sticks around for more than a few seconds then the cancel isn't working. We wait for a minute because CI can't be trusted to do anything quickly.

Closes #121993

The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
@nik9000 nik9000 added >test Issues or PRs that are addressing/adding tests :Analytics/ES|QL AKA ESQL v9.0.0 v8.18.0 v9.1.0 v8.16.5 v8.17.3 labels Feb 7, 2025
@nik9000 nik9000 requested a review from not-napoleon February 7, 2025 14:53
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 7, 2025
Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* it still does. So long as it stops <strong>eventually</strong> that's
* still indicative of the interrupt code working.
* </p>
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ good clarification

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I had this method and thought "in three weeks I'm not going to know what it's ok to wait here."

@nik9000 nik9000 enabled auto-merge (squash) February 7, 2025 15:01
@nik9000 nik9000 merged commit cff329e into elastic:main Feb 7, 2025
17 checks passed
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
@nik9000 nik9000 added the v8.19.0 label Feb 7, 2025
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes elastic#121993
@nik9000
Copy link
Member Author

nik9000 commented Feb 7, 2025

Backports:
9.0: #122076
8.19: #122079
8.18: #122080
8.17: #122081
8.16: #122082

elasticsearchmachine pushed a commit that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes #121993
elasticsearchmachine pushed a commit that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes #121993
elasticsearchmachine pushed a commit that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes #121993
elasticsearchmachine pushed a commit that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes #121993
elasticsearchmachine pushed a commit that referenced this pull request Feb 7, 2025
The aggs timeout test waits for the agg to return and then double checks
that the agg is stopped using the tasks API. We're seeing some failures
where the tasks API reports that the agg is still running. I can't
reproduce them because computers. This adds two things:
1. Logs the hot_threads so we can see if the query is indeed still
   running.
2. Retries the _tasks API for a minute. If it goes away soon after the
   _search returns that's *fine*. If it sticks around for more than a
   few seconds then the cancel isn't working. We wait for a minute
   because CI can't be trusted to do anything quickly.

Closes #121993
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v8.16.5 v8.17.3 v8.18.0 v8.19.0 v9.0.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] AggsTimeoutIT testTerms failing

3 participants