ILM: Force merge on zero-replica cloned index before snapshot #133954
Conversation
When performing a searchable snapshot action with force merge enabled, if the source index has one or more replicas, ILM now clones the index with zero replicas and performs the force merge on the clone. The snapshot is then taken from the force-merged clone instead of the source index, ensuring only primary shards are force-merged. The cloned index is deleted after the snapshot is mounted, and all references and step logic have been updated accordingly. Test coverage was added for the new flow, including handling retries and cleanup of failed clones.

Key changes:
- Execution state: Track the force-merged clone index in ILM state and propagate it through the relevant APIs.
- SearchableSnapshotAction: Add conditional steps to clone the index with 0 replicas, force-merge it, and delete the clone as needed.
- Steps: Update the ForceMerge, SegmentCount, Snapshot, and Delete steps to operate on the correct index (source or clone).
- Tests/QA: Add and enhance tests to verify force-merge and snapshot behavior with and without replicas, including retry/cleanup paths and configuration for stable force-merges.

Resolves elastic#75478
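As a rough sketch of the decision and flow described above (illustrative only: the method name below is hypothetical, and the real step wiring lives in `SearchableSnapshotAction`):

```java
// Illustrative sketch, not the actual SearchableSnapshotAction code. The clone-based path is
// only taken when the source index has replicas, since the snapshot only captures primaries.
// Flow, mirroring the PR description (names are assumptions):
//   1. clone the source index with index.number_of_replicas: 0
//   2. wait for the clone to become green, force-merge it, and check the segment count
//   3. generate the snapshot name and snapshot the force-merged clone
//   4. mount the snapshot under the usual restored-/partial- name of the original index
//   5. delete the clone
static boolean needsZeroReplicaClone(IndexMetadata indexMetadata) {
    // With zero replicas we can force-merge the source index directly; otherwise replicas
    // would be merged too, wasting work the snapshot never benefits from.
    return indexMetadata.getNumberOfReplicas() > 0;
}
```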
Pull Request Overview
This PR implements a new ILM behavior for searchable snapshot actions with force merge enabled. When the source index has one or more replicas, ILM now clones the index with zero replicas and performs the force merge on the clone instead of the original index. This optimization avoids unnecessarily force-merging replica shards since snapshots only capture primary shards.
- Force merge optimization by cloning indices with replicas to avoid merging unnecessary replica shards
- New execution state tracking for force-merged clone indices throughout the ILM lifecycle
- Enhanced test coverage including failure scenarios and retry mechanisms
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
TransportExplainLifecycleAction.java | Adds force merge index name to ILM explain response |
SearchableSnapshotActionIT.java | Comprehensive test coverage for new clone-based force merge behavior |
TimeSeriesRestDriver.java | Utility method for moving indices between ILM steps in tests |
build.gradle | Test cluster configuration to prevent shard rebalancing during force merges |
SearchableSnapshotActionTests.java | Unit tests validating clone step configuration and replica settings |
IndexLifecycleExplainResponse*.java | Response model updates to include force merge index tracking |
SegmentCountStep.java | Updated to operate on cloned index when available |
SearchableSnapshotAction.java | Core logic implementing conditional clone steps and cleanup |
MountSnapshotStep.java | Enhanced snapshot mounting logic for cloned indices |
GenerateSnapshotNameStep.java | Updated to use cloned index name for snapshot generation |
ForceMergeStep.java | Modified to target cloned index when present |
DeleteStep.java | Enhanced with configurable target index deletion capability |
CreateSnapshotStep.java | Updated to snapshot the force-merged clone instead of original |
ESRestTestCase.java | Test framework utility for waiting on index deletion |
LifecycleExecutionState.java | Core state tracking for force merge index names |
Pinging @elastic/es-data-management (Team:Data Management)

Hi @nielsbauman, I've created a changelog YAML for you.
```java
        ClusterHealthStatus.GREEN,
        FORCE_MERGE_INDEX_NAME_SUPPLIER
    ),
    cleanupClonedIndexKey
```
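For context, a hedged reconstruction of the step wiring this fragment belongs to (the inner step class, variable names, and step keys are assumptions inferred from this thread, not the verbatim PR code): the wait for the clone to go green is wrapped in a `ClusterStateWaitUntilThresholdStep` whose rewind target is `cleanupClonedIndexKey`, so a clone that never becomes healthy gets deleted and the operation is retried.

```java
// Assumed reconstruction: if the clone does not reach GREEN health before the configured
// threshold elapses, ILM rewinds to cleanupClonedIndexKey, deletes the failed clone, and retries.
ClusterStateWaitUntilThresholdStep waitForCloneGreenStep = new ClusterStateWaitUntilThresholdStep(
    new WaitForIndexColorStep(
        waitForGreenKey,                 // hypothetical step key for this wait step
        forceMergeKey,                   // hypothetical next step: force-merge the clone
        ClusterHealthStatus.GREEN,
        FORCE_MERGE_INDEX_NAME_SUPPLIER  // resolves the clone's name from the execution state
    ),
    cleanupClonedIndexKey                // on timeout, go back and delete the clone
);
```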
Hm, I thought I copied the approach of the `ShrinkAction` here by going back to the cleanup step if the threshold/timeout is passed. But it looks like that's not the case:

elasticsearch/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/ShrinkAction.java
Lines 267 to 270 in 1ff6608

```java
ClusterStateWaitUntilThresholdStep checkShrinkReadyStep = new ClusterStateWaitUntilThresholdStep(
    new CheckShrinkReadyStep(allocationRoutedKey, shrinkKey),
    setSingleNodeKey
);
```
The `ShrinkAction` just goes back to `SetSingleNodeAllocateStep`. I'm inclined to think my current approach is safer, but I'm also a fan of consistency. Anyone else have any thoughts?
The one you linked is the "wait for single node allocation bit", where the new shrunken index hasn't been created yet (so there's nothing to clean up). You use the same behavior to go back to the cleanup later on in the file:
elasticsearch/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/ShrinkAction.java
Lines 289 to 295 in 1ff6608
```java
// wait until the shrunk index is recovered. we again wait until the configured threshold is breached and if the shrunk index has
// not successfully recovered until then, we rewind to the "cleanup-shrink-index" step to delete this unsuccessful shrunk index
// and retry the operation by generating a new shrink index name and attempting to shrink again
ClusterStateWaitUntilThresholdStep allocated = new ClusterStateWaitUntilThresholdStep(
    new ShrunkShardsAllocatedStep(enoughShardsKey, copyMetadataKey),
    cleanupShrinkIndexKey
);
```
Which matches the behavior here, so I believe it is consistent.
Thanks for working on this Niels! I left some comments but they're mostly cosmetic.
docs/changelog/133954.yaml

```yaml
pr: 133954
summary: "ILM: Force merge on zero-replica cloned index before snapshot"
```
Perhaps mention that this is for the searchable snapshot step here?
I changed it to `ILM: Force merge on zero-replica cloned index before snapshotting for searchable snapshots`. Let me know if that matches what you had in mind.
```java
IndexMetadata indexMetadata = project.index(index);
assert indexMetadata != null : "index " + index.getName() + " must exist in the cluster state";
String cloneIndexName = indexMetadata.getLifecycleExecutionState().forceMergeIndexName();
return cloneIndexName != null && project.index(cloneIndexName) != null;
```
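For readers less familiar with the ILM step machinery, a minimal sketch of how a predicate like the one quoted above could be wired up (`BranchingStep` is an existing ILM step class, but the step keys and the exact constructor signature used here are assumptions, not the PR's actual code):

```java
// Sketch only: branch to a delete-clone step when the execution state records a clone that
// still exists in the cluster state; otherwise skip straight to the normal next step.
BranchingStep conditionalDeleteForceMergedIndexStep = new BranchingStep(
    conditionalDeleteForceMergedIndexKey,
    dataStreamCheckBranchingKey,     // predicate is false: no clone to delete
    deleteForceMergedIndexKey,       // predicate is true: delete the force-merged clone (hypothetical key)
    (index, project) -> {
        IndexMetadata indexMetadata = project.index(index);
        assert indexMetadata != null : "index " + index.getName() + " must exist in the cluster state";
        String cloneIndexName = indexMetadata.getLifecycleExecutionState().forceMergeIndexName();
        return cloneIndexName != null && project.index(cloneIndexName) != null;
    }
);
```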
What happens here if a user manually removes the cloned index after it was created, so that `project.index(cloneIndexName)` returns null? Wouldn't we erroneously assume we're on the no-clone path?
If a user manually removes the cloned index after it was created and we then run the `DeleteStep` on the force-merged index, it would fail, right? So that assumption doesn't sound "erroneous" to me. Or am I misinterpreting your comment?
This mostly looks good to me, but I have one concern (EDIT: see the bottom).
I can't comment on the code lines directly (because they're outside the reviewed diff), but in the `SearchableSnapshotAction` steps where we copy over the execution state and the lifecycle name setting:
```java
CopyExecutionStateStep copyMetadataStep = new CopyExecutionStateStep(
    copyMetadataKey,
    copyLifecyclePolicySettingKey,
    (index, executionState) -> getRestoredIndexPrefix(copyMetadataKey) + index,
    keyForReplicateForOrContinue
);
CopySettingsStep copySettingsStep = new CopySettingsStep(
    copyLifecyclePolicySettingKey,
    dataStreamCheckBranchingKey,
    forceMergeIndex ? conditionalDeleteForceMergedIndexKey : dataStreamCheckBranchingKey,
    (index, lifecycleState) -> getRestoredIndexPrefix(copyLifecyclePolicySettingKey) + index,
    LifecycleSettings.LIFECYCLE_NAME
);
```
It appears that we're using `getRestoredIndexPrefix(copyLifecyclePolicySettingKey) + index` for the name of the index into which we should copy the execution state settings. This would normally be fine, because it would be:

* `my-backing-index`, which is snapshotted
* mounted as either `partial-my-backing-index` or `restored-my-backing-index`
* the state copied to the `partial` or `restored` version depending on which prefix was used.

However, in this case, `my-backing`…
RECORD SCRATCH.
It was at this point in my thinking and typing this comment out that I realized that we still control the mounted index name, so even if we snapshot `fm-clone-my-backing-index` we still mount it as `partial-my-backing-index`, NOT `partial-fm-clone-my-backing-index`, which means that the concern above is not valid! However, I've decided to leave my initially-erroneous assessment in for posterity for anyone else that might come looking or have the same concern.
Feel free to ignore the above, as the PR looks good to me, thanks Niels. 😄
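To make the naming concrete, a small illustrative snippet (the `fm-clone-` prefix and index names come from the example above; they are not the actual constants in the PR):

```java
// The mount step derives the mounted index name from the ORIGINAL index name, so the clone
// prefix never leaks into the restored-/partial- name even though the snapshot was taken
// from the force-merged clone.
String sourceIndex  = "my-backing-index";
String cloneIndex   = "fm-clone-" + sourceIndex;                              // force-merged, then snapshotted
String mountedIndex = getRestoredIndexPrefix(copyMetadataKey) + sourceIndex;  // e.g. "partial-my-backing-index"
// The execution state and lifecycle setting are then copied from sourceIndex to mountedIndex,
// as in the CopyExecutionStateStep/CopySettingsStep quoted above.
```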
```java
// With multiple primary shards, the segments are more spread out, so it's even less likely that we'll get more than 1 segment
// in one shard, and some shards might even be empty.
assertThat(preLifecycleBackingIndexSegments, greaterThanOrEqualTo(0));
```
Even if there are multiple primary shards and some shards may be empty, shouldn't the total still be >= 1, since we've indexed at least one document? I'm not sure I understand how having one primary means we have >= 1 segment, but having more than one primary means that we may get 0 segments.
Yeah, I understand your confusion. I had a similar confusion the first time I looked at it, but forgot to mention that somewhere, sorry. The caveat here is that `TimeSeriesRestDriver#getNumberOfPrimarySegments(Client, String)` returns the number of segments for the "first" primary shard, i.e. it just gets shard `0`:

Line 394 in 7af81af

```java
List<Map<String, Object>> shards = (List<Map<String, Object>>) responseEntity.get("0");
```

That means it will essentially return the number of segments of a random primary shard if there are multiple primary shards. There is only one other usage of that method, which runs a test with only one primary shard, so I think we can change the implementation of the method to return the sum of segments across all primary shards. What do you think?
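A rough sketch of what that change might look like, assuming the usual shape of the `_segments` API response (the method body and parameter are illustrative, not the actual `TimeSeriesRestDriver` implementation):

```java
// Sum segment counts over every primary shard copy instead of reading only shard "0".
// `shardsById` is the "shards" map from the _segments response: shard id -> list of shard copies,
// where each copy carries a "routing" object ("primary": true/false) and a "segments" map.
@SuppressWarnings("unchecked")
private static int getNumberOfPrimarySegments(Map<String, Object> shardsById) {
    int total = 0;
    for (Object copies : shardsById.values()) {
        for (Map<String, Object> shardCopy : (List<Map<String, Object>>) copies) {
            Map<String, Object> routing = (Map<String, Object>) shardCopy.get("routing");
            if (Boolean.TRUE.equals(routing.get("primary"))) {
                total += ((Map<String, Object>) shardCopy.get("segments")).size();
            }
        }
    }
    return total;
}
```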
Ahhh okay, the name definitely doesn't make that clear. I'd be in favor of making it return the sum of segments across all the primaries as well.
Fixed in 5806e64. I'm running …
As a follow-up of elastic#133954, this class could use a cleanup: deduplicating code, replacing some `assertBusy`s with `awaitIndexExists`, and more.
…5834)

In this PR we move the force-merge operation from the downsampling request to the ILM action. Our goal is to decouple the downsampling operation from the force-merge operation. With this change, the downsampling request is responsible for ensuring that the downsampled index is refreshed and flushed, but not for force-merging it. We believe that most of the time this is not necessary, and executing the force-merge operation unnecessarily can increase the load on the cluster.

To preserve backwards compatibility, we move the responsibility for executing the existing force merge to the downsample ILM action and make it configurable. By default it will run, but a user can disable it just as they can with a searchable snapshot.

```
"downsample": {
  "fixed_interval": "1h",
  "force_merge_index": false
}
```

**Update**

As a follow-up of this PR, we pose the question: is the force merge in the downsample action intentional and useful? To answer it, we extend time series telemetry. We define the force merge step in the downsample ILM action as useful if it is the only force merge operation before a searchable snapshot. Effectively, by this definition, we argue that the force merge in downsampling is not an operation the user has intentionally requested but only a result of the implementation. We identify the biggest impact of removing it to be on searchable snapshots, but if the searchable snapshot performs its own force merge (and a more performant one, per #133954), then we could skip this operation in the downsample action altogether.

Fixes: #135618