
Conversation

@nielsbauman
Contributor

These tests had the potential to fail when two consecutive GET data streams requests would hit two different nodes, where one node already had the cluster state that contained the new backing index and the other node didn't yet.

Caused by #122852

Fixes #124846
Fixes #124950
Fixes #124999

@nielsbauman added the >test, :Data Management/Data streams, Team:Data Management, and v9.1.0 labels on Mar 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Contributor Author

I'm a little bit on the fence about backporting these changes or not. These fixes are only necessary on v9.1.0 because the GET data streams API only runs on the local node starting with that version, but I'm thinking about backporting test fixes in the future to versions before v9.1.0. I think I'm leaning towards not backporting, but other thoughts are welcome.

*/
public static List<String> waitForDataStreamIndices(String dataStreamName, int expectedSize, boolean failureStore) {
    final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
    final var listener = ClusterServiceUtils.addTemporaryStateListener(clusterService, clusterState -> {
Contributor Author

David Turner suggested using a temporary cluster state listener instead of an assertBusy that uses the GET data streams transport action. This has improved performance as we can react to new cluster state updates immediately instead of waiting for the exponential backoff of assertBusy. The only "drawback" is that an assertBusy would implicitly cover the GET data streams transport action. I don't think we should rely on tests like these to cover that action - i.e. we should have dedicated tests that sufficiently cover the action, and I think we do.
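For reference, the resulting wait is shaped roughly like this (the predicate body below is illustrative and skips the failure-store case; exact metadata accessors may differ between branches):

public static List<String> waitForDataStreamIndices(String dataStreamName, int expectedSize, boolean failureStore) {
    // Listen on the elected master's cluster service so we react to each state update immediately.
    final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
    final var listener = ClusterServiceUtils.addTemporaryStateListener(clusterService, clusterState -> {
        // Illustrative predicate: the real check also covers the failure store when failureStore == true.
        DataStream dataStream = clusterState.metadata().dataStreams().get(dataStreamName);
        return dataStream != null && dataStream.getIndices().size() == expectedSize;
    });
    safeAwait(listener);
    // Completing the listener only tells us a matching state was applied; the names are then
    // fetched through the regular API.
    return getDataStreamBackingIndexNames(dataStreamName, failureStore);
}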

Contributor

I would rather not add the override that returns the cluster state like this. Instead, just await for the cluster to reach a suitable state and then use the regular APIs and transport actions to query what you need. That way you get the little bit of extra coverage from calling the transport action, and you also get some confidence that the state you were waiting for doesn't immediately vanish in the next update since the transport action may see a future state.

Contributor Author

Hm, that reduces the performance benefit because we'll have to do another request, but only by a little bit I think. I agree that it does bring back some extra coverage. I'll revert the changes to addTemporaryStateListener and use the API to fetch the backing indices.

Contributor

The performance benefits are all about avoiding the Thread.sleep in the assertBusy call, oversleeping past the point where the cluster state is ready for you. The cost of the transport action should be very low.

Contributor

@DaveCTurner DaveCTurner left a comment

General pattern of changes looks good but I'll leave a detailed review to someone else if that's ok

    });
    safeAwait(listener);
    final var backingIndexNames = getDataStreamBackingIndexNames(dataStreamName, failureStore);
    assertEquals("Retrieved number of data stream indices doesn't match expectation", expectedSize, backingIndexNames.size());
Contributor

suggest mentioning the data stream name in the message, and also using assertThat("...", backingIndexNames, hasSize(expectedSize)) so you get to see more detail if it fails

Contributor Author

I'll add the data stream name in the message, but I don't think the assertion mechanism changes much:
The current implementation gives:

Retrieved number of data stream indices doesn't match expectation
Expected :1
Actual   :2
<Click to see difference>

java.lang.AssertionError: Retrieved number of data stream indices doesn't match expectation expected:<1> but was:<2>
	at __randomizedtesting.SeedInfo.seed([779A73C770529FF8:DDD333CAFBC6DD08]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.elasticsearch.test.ESIntegTestCase.waitForDataStreamIndices(ESIntegTestCase.java:880)
	at org.elasticsearch.test.ESIntegTestCase.waitForDataStreamBackingIndices(ESIntegTestCase.java:861)
	at org.elasticsearch.datastreams.lifecycle.DataStreamLifecycleServiceIT.testErrorRecordingOnRetention(DataStreamLifecycleServiceIT.java:643)

Your suggestion gives:

Retrieved number of data stream indices doesn't match expectation
Expected: a collection with size <1>
     but: collection size was <2>
java.lang.AssertionError: Retrieved number of data stream indices doesn't match expectation
Expected: a collection with size <1>
     but: collection size was <2>
	at __randomizedtesting.SeedInfo.seed([779A73C770529FF8:DDD333CAFBC6DD08]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2674)
	at org.elasticsearch.test.ESIntegTestCase.waitForDataStreamIndices(ESIntegTestCase.java:881)
	at org.elasticsearch.test.ESIntegTestCase.waitForDataStreamBackingIndices(ESIntegTestCase.java:862)
	at org.elasticsearch.datastreams.lifecycle.DataStreamLifecycleServiceIT.testErrorRecordingOnRetention(DataStreamLifecycleServiceIT.java:643)

I'll add the retrieved index names to the message/reason, that should probably help debugging.
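Concretely, something like this (the exact message wording is just a sketch):

assertEquals(
    "Retrieved number of indices for data stream [" + dataStreamName + "] doesn't match expectation, got: " + backingIndexNames,
    expectedSize,
    backingIndexNames.size()
);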

Contributor

oh hm that's disappointing, I thought hasSize reported the contents of the collection too

Contributor

@gmarouli gmarouli left a comment

LGTM, I added some comments but this is already an improvement :). Thanks for fixing it.

assertThat(getDataStreamResponse.getDataStreams().get(0).getDataStream().getName(), equalTo(SYSTEM_DATA_STREAM_NAME));
List<Index> backingIndices = getDataStreamResponse.getDataStreams().get(0).getDataStream().getIndices();
assertThat(backingIndices.size(), equalTo(1)); // global retention is ignored
List<Index> currentBackingIndices = getDataStreamResponse.getDataStreams().get(0).getDataStream().getIndices();
Contributor

We could use getDataStreamBackingIndexNames(SYSTEM_DATA_STREAM_NAME) here, right?

Contributor Author

We're serializing the getDataStreamResponse to XContent a few lines below, so I preferred this over doing two API calls.

break;
}
}
DataStreamLifecycleService lifecycleService = internalCluster().getCurrentMasterNodeInstance(DataStreamLifecycleService.class);
Contributor

Nice!

/**
* Waits for the specified data stream to have the expected number of backing or failure indices.
*/
public static List<String> waitForDataStreamIndices(String dataStreamName, int expectedSize, boolean failureStore) {
Contributor

Food for thought: what if, instead of expectedSize, we added a predicate here? This would change the nature of this method into something like waiting for a condition, but it might make it more reusable and extend the benefit of avoiding a Thread.sleep to other places too. What do you think?
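e.g. something along these lines (name and signature are hypothetical):

// Hypothetical generalisation: wait until the data stream satisfies an arbitrary condition.
public static void awaitDataStreamCondition(String dataStreamName, Predicate<DataStream> condition) {
    final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
    safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService, state -> {
        DataStream dataStream = state.metadata().dataStreams().get(dataStreamName);
        return dataStream != null && condition.test(dataStream);
    }));
}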

Contributor Author

Yeah, I thought about that too. I decided not to do that (yet) as I didn't see any use cases where it would be relevant. On second thought, I think sections like these could make use of it:

assertBusy(() -> {
    List<String> backingIndices = getDataStreamBackingIndexNames(dataStreamName);
    assertThat(backingIndices.size(), equalTo(1));
    // we expect the data stream to have only one backing index, the write one, with generation 2
    // as generation 1 would've been deleted by the data stream lifecycle given the configuration
    String writeIndex = backingIndices.get(0);
    assertThat(writeIndex, backingIndexEqualTo(dataStreamName, 2));
});

However, in that case, it doesn't really make sense to retrieve the backing indices. Then, having a different dedicated method just for data stream predicates feels a little over-specific, so a generic cluster-state predicate one is probably more sensible. However, then we're getting pretty close to the addTemporaryStateListener already (and the awaitClusterState). I would still see value in adding a wrapper in ESIntegTestCase that fetches the ClusterService instance from the master node, to avoid having to do that in every test. What do you think?

Contributor

Hm, I would say it depends on how much this is going to be used, for example:

  • If it's used only in this file, the helper method could reside in this file.
  • If it's used only in this module, then the helper could be part of a fixture that all the other tests in the module use. If I'm not mistaken, we have more methods that could go there.
  • If it's used by many tests scattered all over the place, then I agree that it's worth putting it in ESIntegTestCase

If you agree with the above, where do you think this should reside?

Contributor Author

I'll respond to those questions later, but I'm not sure how they're related to this discussion? I was referring to the implementation of the additional helper method.

Contributor Author

I'm going to merge this for now. There are other similar test suites that need to be updated. When I get to those, I'll automatically see what potential other use cases there are and we can adjust the helper method accordingly.

Contributor

ah, I misunderstood. What I had in mind was that we can share the method that you said would be rather specific:

Then, having a different dedicated method just for data stream predicates feels a little over-specific

Let me try to understand better what you meant with the following:

I would still see value in adding a wrapper in ESIntegTestCase that fetches the ClusterService instance from the master node, to avoid having to do that in every test

You mean to avoid doing this: final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);? Something like:

public static SubscribableListener<Void> addTemporaryStateListener(Predicate<ClusterState> predicate) {
     final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
     return ClusterServiceUtils.addTemporaryStateListener(clusterService, predicate);
}

Not sure if this is worth it, honestly, considering it's only one line of code. Do you think this is too much mental load for a developer?

Contributor

Note that we already have org.elasticsearch.test.ESIntegTestCase#clusterService to access the ClusterService of some random node, which is often what you want. But here it's important that it's the cluster service on the currently elected master node, since we need to be sure we've applied the state everywhere else before proceeding.

Not that I'm against a utility for this somewhere in the test framework (suggest ClusterServiceUtils over ESIntegTestCase). I see other places where it would be useful. Just that I think it should mention masterNode somewhere in its name.

Contributor

See #125648

Contributor Author

Thanks, David! My only reason for

I would still see value in adding a wrapper in ESIntegTestCase that fetches the ClusterService instance from the master node, to avoid having to do that in every test.

was based on my assumption that ESIntegTestCase#internalCluster was not publicly available. Your PR is exactly what I had in mind.

Contributor Author

@gmarouli, FTR, I agree that the mental load is minimal. I have a personal vendetta with boilerplate/duplicate code. In general, I'm also a fan of reducing test method sizes, as I feel a lot of our tests are hard to read, so any line reduction is a win in my eyes.

@nielsbauman nielsbauman enabled auto-merge (squash) March 24, 2025 14:12
@nielsbauman nielsbauman merged commit 542a3b6 into elastic:main Mar 24, 2025
17 of 18 checks passed
@nielsbauman nielsbauman deleted the fix-data-stream-lifecycle-service-it branch March 24, 2025 15:43
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025