
Conversation

@samxbr (Contributor) commented Dec 20, 2024

The entire test suite was muted due to a bunch of failures of the form java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE], possibly caused by a transient network issue. Unmuting seems fine.

Closes #118215
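
If the mute was done via the repository's muted-tests.yml mechanism, the unmute amounts to deleting an entry along these lines. This is only a sketch: the package name and exact wording are illustrative, not copied from the actual diff.

```yaml
# Hypothetical muted-tests.yml entry; unmuting the suite means removing it.
tests:
- class: org.elasticsearch.ingest.common.IngestCommonClientYamlTestSuiteIT  # package assumed
  issue: https://github.com/elastic/elasticsearch/issues/118215
```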

@samxbr added the >test (Issues or PRs that are addressing/adding tests) and :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP) labels Dec 23, 2024
@mattc58 (Contributor) left a comment


+1 LGTM

@dakrone (Member) left a comment


LGTM assuming CI is happy

@samxbr marked this pull request as ready for review December 23, 2024 16:04
@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Dec 23, 2024
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-data-management (Team:Data Management)

@samxbr merged commit d92233c into elastic:main Dec 23, 2024
16 checks passed
@nielsbauman (Contributor) commented

FWIW, timeouts in tests are often caused by test clusters crashing, so they're more often than not caused by actual issues rather than infrastructure blips. For instance, in the linked test issue, the first build failed due to "Failure running machine learning native code", which we've seen before. Infra blips definitely happen from time to time, but it's usually worth double-checking whether it was a blip or whether something else happened.

The other builds are from PRs, which can be tricky because a PR might have introduced breaking code (i.e. it was still a work in progress). For instance, the last and second-to-last builds contain a bunch of HTTP timeout exceptions, but those are builds of #117787, which changes a bunch of networking code - and I am going to guess that that is no coincidence.

The second and third builds are even trickier because they actually contain an ingest test failure: Failure at [ingest/310_reroute_processor:705]: field [hits.hits] doesn't have length [2]. Both builds originate from the same PR (#118143), which makes it more likely that both of them were caused by the PR itself, but it could (theoretically) just be a coincidence, so it's not a guarantee. If we look at all the occurrences in the last 30 days, we see one more failed build, but that turns out to be from the same PR too - it's the failed serverless check for that PR. Seeing as there are no other occurrences, we can say with reasonable confidence that these ingest failures were caused by the PR.
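
Side note for readers unfamiliar with the YAML REST test format: that message comes from a length assertion in the test file. A purely illustrative sketch - not the actual 310_reroute_processor test - of the kind of step that produces it:

```yaml
# Illustrative YAML REST test snippet (test name and index are hypothetical).
---
"reroute search returns both documents":
  - do:
      search:
        index: logs-illustrative-default
  # If the search returns anything other than 2 hits, the runner fails with
  # "field [hits.hits] doesn't have length [2]".
  - length: { hits.hits: 2 }
```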

All that to say: I agree with unmuting the test suite 😄. I hope this was still somewhat valuable - and if not, I at least had some fun investigating it.


Labels

:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
Team:Data Management (Meta label for data/management team)
>test (Issues or PRs that are addressing/adding tests)
v9.0.0


Development

Successfully merging this pull request may close these issues.

[CI] IngestCommonClientYamlTestSuiteIT class failing

5 participants