mosche
Contributor

@mosche mosche commented Sep 9, 2025

This test was failing very rarely due to a race condition.

Cluster state changes are applied to all nodes prior to being published on the master node itself.
However, the cluster state listener was previously attached to the data node, leaving a very short window in which the state update was not yet visible on the master node when `assertClusterStateSaveOK` performed its check.

This changes the test to attach the listener to the master node itself, preventing the above condition.
I was initially worried it might be attached too late in some cases, but I couldn't reproduce any more issues this way.
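
To make the timing window concrete, here is a minimal, deterministic sketch in plain Java. It is a toy model and not Elasticsearch code: `ToyNode`, `versionSeenOnMaster`, and the version counter are hypothetical names introduced for illustration. It assumes the ordering described above (the data node sees the new state before the master does) and collapses the real, very short window into a synchronous read inside the listener callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Toy model only: none of these types are Elasticsearch APIs.
class ListenerPlacementSketch {

    // Hypothetical stand-in for a cluster node that tracks the last state version it has seen.
    static class ToyNode {
        long appliedVersion = 0;
        final List<LongConsumer> listeners = new ArrayList<>();

        void apply(long version) {
            appliedVersion = version;
            // Listeners fire only after the state is visible on this node.
            for (LongConsumer listener : listeners) {
                listener.accept(version);
            }
        }
    }

    // Returns the state version visible on the master at the moment the listener fires.
    static long versionSeenOnMaster(boolean listenOnMaster) {
        ToyNode master = new ToyNode();
        ToyNode data = new ToyNode();
        long[] seen = { -1 };

        ToyNode listenedNode = listenOnMaster ? master : data;
        listenedNode.listeners.add(version -> seen[0] = master.appliedVersion);

        // Assumed ordering from the description: the data node first, the master last.
        data.apply(1);
        master.apply(1);
        return seen[0];
    }

    public static void main(String[] args) {
        // Listener on the data node: the master has not caught up yet (prints 0).
        System.out.println("listener on data node: master at version " + versionSeenOnMaster(false));
        // Listener on the master: the update is already visible there (prints 1).
        System.out.println("listener on master: master at version " + versionSeenOnMaster(true));
    }
}
```

In the real test the stale read only happens when the check on the master wins the race, which is why the failure was so rare; the sketch makes the losing interleaving deterministic purely for illustration.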

> According to the dashboard, this started to fail on Monday (13/07). It definitely does not look like a test failure, so I'm assigning a medium priority, which we could raise if we discover this is a new bug.

I couldn't find any related commit that might have caused this. Still wondering why this started failing around that time 🤔

Fixes #131210

@mosche mosche requested a review from a team September 9, 2025 13:14
@mosche mosche added >test (Issues or PRs that are addressing/adding tests) and :Core/Infra/Settings (Settings infrastructure and APIs) labels Sep 9, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra (Meta label for core/infra team) and v9.2.0 labels Sep 9, 2025
Contributor

@alexey-ivanov-es alexey-ivanov-es left a comment

LGTM

@mosche mosche added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label Sep 11, 2025
@elasticsearchmachine elasticsearchmachine merged commit a29392c into elastic:main Sep 19, 2025
34 checks passed
@mosche mosche deleted the tests/131210_FileSettingsServiceIT_testSettingsAppliedOnStart branch September 19, 2025 13:59
@rjernst
Member

rjernst commented Sep 19, 2025

@mosche should/can this be backported?

szybia added a commit to szybia/elasticsearch that referenced this pull request Sep 19, 2025
* upstream/main:
  Turn NumericValues into functional interface (elastic#135068)
  Improve block loader for source only runtime fields of type keyword (elastic#135026)
  Mute org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT test {csv-spec:stats.StdDeviationGroupedAllTypes} elastic#135103
  Mute org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT test {csv-spec:stats.StdDeviationWithLongs} elastic#135102
  Mute org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT test {csv-spec:inlinestats.StdDevFilter} elastic#135101
  Mute org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT test {csv-spec:stats.StdDevFilter} elastic#135100
  Remove track_live_docs_in_memory_bytes feature flag (elastic#134900)
  Create SPI to allow prohibiting certain top-level mappings (elastic#132360)
  Only validate primary ids on release branches (elastic#135044)
  Added no-op support for project_routing query param to REST endpoints that will support cross-project search (elastic#134741)
  Fix race in FileSettingsServiceIT.testSettingsAppliedOnStart (elastic#134368)
@mosche mosche added auto-backport (Automatically create backport pull requests when merged), v8.19.5, v9.1.5, v9.0.8, and v8.18.9 labels Sep 22, 2025
mosche added a commit to mosche/elasticsearch that referenced this pull request Sep 22, 2025
…#134368)

(cherry picked from commit a29392c)

# Conflicts:
#	muted-tests.yml
@mosche
Contributor Author

mosche commented Sep 22, 2025

💔 Some backports could not be created

| Status | Branch | Result |
|--------|--------|--------|
|        | 9.1    |        |
|        | 9.0    | Conflict resolution was aborted by the user |
|        | 8.19   | Conflict resolution was aborted by the user |
|        | 8.18   | An unhandled error occurred. Please see the logs for details |

Manual backport

To create the backport manually run:

backport --pr 134368

Questions ?

Please refer to the Backport tool documentation

@mosche
Contributor Author

mosche commented Sep 22, 2025

@rjernst I've backported to 9.1; older branches don't contain an earlier fix this is based on. Anyway, this fails very rarely and was only ever observed on main.

elasticsearchmachine pushed a commit that referenced this pull request Sep 22, 2025
#135196)

(cherry picked from commit a29392c)

# Conflicts:
#	muted-tests.yml
@rjernst
Member

rjernst commented Sep 22, 2025

> older branches don't contain an earlier fix this is based on

Can you figure out which change wasn't backported? We should keep the test as in sync across branches as possible, so that fixes for failures that do occur in older branches are easier to backport.

Labels

auto-backport: Automatically create backport pull requests when merged
auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
:Core/Infra/Settings: Settings infrastructure and APIs
Team:Core/Infra: Meta label for core/infra team
>test: Issues or PRs that are addressing/adding tests
v9.1.5
v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] FileSettingsServiceIT testSettingsAppliedOnStart failing

4 participants