
Conversation

@nielsbauman
Contributor

When performing a searchable snapshot action with force merge enabled, if the source index has one or more replicas, ILM now clones the index with zero replicas and performs the force merge on the clone. The snapshot is then taken from the force-merged clone instead of the source index, ensuring only primary shards are force-merged. The cloned index is deleted after the snapshot is mounted, and all references and step logic have been updated accordingly. Test coverage was added for the new flow, including handling retries and cleanup of failed clones.

Key changes:

  • Execution state: Track the force-merged clone index in ILM state and propagate through relevant APIs.
  • SearchableSnapshotAction: Add conditional steps to clone the index with 0 replicas, force-merge, and delete the clone as needed.
  • Steps: Update ForceMerge, SegmentCount, Snapshot, and Delete steps to operate on the correct index (source or clone).
  • Tests/QA: Add and enhance tests to verify force-merge and snapshot behavior with and without replicas, including retry/cleanup paths and configuration for stable force-merges.

Resolves #75478
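
For context, the flow described above is triggered by a `searchable_snapshot` action with force merge enabled (the default). A minimal policy that exercises it looks roughly like this; the policy and repository names are placeholders:

```
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my-snapshot-repo",
            "force_merge_index": true
          }
        }
      }
    }
  }
}
```

With one or more replicas on the source index, ILM now routes this action through the clone-based force-merge steps described above; with zero replicas, it force-merges the source index directly, as before.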

@nielsbauman requested a review from Copilot September 1, 2025 17:34
@elasticsearchmachine added the v9.2.0 and needs:triage labels Sep 1, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a new ILM behavior for searchable snapshot actions with force merge enabled. When the source index has one or more replicas, ILM now clones the index with zero replicas and performs the force merge on the clone instead of the original index. This optimization avoids unnecessarily force-merging replica shards since snapshots only capture primary shards.

  • Force merge optimization by cloning indices with replicas to avoid merging unnecessary replica shards
  • New execution state tracking for force-merged clone indices throughout the ILM lifecycle
  • Enhanced test coverage including failure scenarios and retry mechanisms

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Summary per file:

  • TransportExplainLifecycleAction.java: Adds force merge index name to ILM explain response
  • SearchableSnapshotActionIT.java: Comprehensive test coverage for new clone-based force merge behavior
  • TimeSeriesRestDriver.java: Utility method for moving indices between ILM steps in tests
  • build.gradle: Test cluster configuration to prevent shard rebalancing during force merges
  • SearchableSnapshotActionTests.java: Unit tests validating clone step configuration and replica settings
  • IndexLifecycleExplainResponse*.java: Response model updates to include force merge index tracking
  • SegmentCountStep.java: Updated to operate on cloned index when available
  • SearchableSnapshotAction.java: Core logic implementing conditional clone steps and cleanup
  • MountSnapshotStep.java: Enhanced snapshot mounting logic for cloned indices
  • GenerateSnapshotNameStep.java: Updated to use cloned index name for snapshot generation
  • ForceMergeStep.java: Modified to target cloned index when present
  • DeleteStep.java: Enhanced with configurable target index deletion capability
  • CreateSnapshotStep.java: Updated to snapshot the force-merged clone instead of original
  • ESRestTestCase.java: Test framework utility for waiting on index deletion
  • LifecycleExecutionState.java: Core state tracking for force merge index names


@nielsbauman added the >enhancement and :Data Management/ILM+SLM labels and removed the needs:triage label Sep 1, 2025
@nielsbauman requested a review from dakrone September 1, 2025 17:38
@elasticsearchmachine added the Team:Data Management label Sep 1, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @nielsbauman, I've created a changelog YAML for you.

```
ClusterHealthStatus.GREEN,
FORCE_MERGE_INDEX_NAME_SUPPLIER
),
cleanupClonedIndexKey
```
Contributor Author

Hm, I thought I copied the approach of the ShrinkAction here by going back to the cleanup step if the threshold/timeout is passed. But it looks like that's not the case:

```
ClusterStateWaitUntilThresholdStep checkShrinkReadyStep = new ClusterStateWaitUntilThresholdStep(
    new CheckShrinkReadyStep(allocationRoutedKey, shrinkKey),
    setSingleNodeKey
);
```

The ShrinkAction just goes back to SetSingleNodeAllocateStep. I'm inclined to think my current approach is safer, but I'm also a fan of consistency. Anyone else have any thoughts?

Member

The one you linked is the "wait for single node allocation" bit, where the new shrunken index hasn't been created yet (so there's nothing to clean up). You use the same behavior to go back to the cleanup later on in the file:

```
// wait until the shrunk index is recovered. we again wait until the configured threshold is breached and if the shrunk index has
// not successfully recovered until then, we rewind to the "cleanup-shrink-index" step to delete this unsuccessful shrunk index
// and retry the operation by generating a new shrink index name and attempting to shrink again
ClusterStateWaitUntilThresholdStep allocated = new ClusterStateWaitUntilThresholdStep(
    new ShrunkShardsAllocatedStep(enoughShardsKey, copyMetadataKey),
    cleanupShrinkIndexKey
);
```

Which matches the behavior here, so I believe it is consistent.

@dakrone (Member) left a comment

Thanks for working on this Niels! I left some comments but they're mostly cosmetic.

```
@@ -0,0 +1,6 @@
pr: 133954
summary: "ILM: Force merge on zero-replica cloned index before snapshot"
```
Member

Perhaps mention that this is for the searchable snapshot step here?

Contributor Author

I changed it to `ILM: Force merge on zero-replica cloned index before snapshotting for searchable snapshots`. Let me know if that matches what you had in mind.

```
IndexMetadata indexMetadata = project.index(index);
assert indexMetadata != null : "index " + index.getName() + " must exist in the cluster state";
String cloneIndexName = indexMetadata.getLifecycleExecutionState().forceMergeIndexName();
return cloneIndexName != null && project.index(cloneIndexName) != null;
```
Member

What happens here if a user manually removes the cloned index after it was created, so that `project.index(cloneIndexName)` returns null? Wouldn't we erroneously assume we're on the no-clone path?

Contributor Author

If a user manually removes the cloned index after it was created and we then run the `DeleteStep` on the force-merged index, it would fail, right? So that assumption doesn't sound "erroneous" to me. Or am I misinterpreting your comment?

@nielsbauman requested a review from dakrone September 4, 2025 12:22
@dakrone (Member) left a comment

This mostly looks good to me, but I have one concern (EDIT: see the bottom).

I can't comment on the code lines directly (because they're outside the review diff), but in the `SearchableSnapshotAction` steps where we copy over the execution state and the lifecycle name setting:

        CopyExecutionStateStep copyMetadataStep = new CopyExecutionStateStep(
            copyMetadataKey,
            copyLifecyclePolicySettingKey,
            (index, executionState) -> getRestoredIndexPrefix(copyMetadataKey) + index,
            keyForReplicateForOrContinue
        );
        CopySettingsStep copySettingsStep = new CopySettingsStep(
            copyLifecyclePolicySettingKey,
            dataStreamCheckBranchingKey,
            forceMergeIndex ? conditionalDeleteForceMergedIndexKey : dataStreamCheckBranchingKey,
            (index, lifecycleState) -> getRestoredIndexPrefix(copyLifecyclePolicySettingKey) + index,
            LifecycleSettings.LIFECYCLE_NAME
        );

It appears that we're using `getRestoredIndexPrefix(copyLifecyclePolicySettingKey) + index` for the name of the index into which we should copy the execution state settings. This would normally be fine, because it would be:

* `my-backing-index` which is snapshotted
* mount as either `partial-my-backing-index` or `restored-my-backing-index`
* copy the state to the partial or restored version depending on which prefix was used.

However, in this case, my-backing

RECORD SCRATCH.

It was at this point in my thinking and typing this comment out that I realized that we still control the mounted index name, so even if we snapshot fm-clone-my-backing-index we still mount it as partial-my-backing-index, NOT partial-fm-clone-my-backing-index, which means that the concern above is not valid! However, I've decided to leave my initially-erroneous assessment in for posterity for anyone else that might come looking or have the same concern.

Feel free to ignore the above, as the PR looks good to me, thanks Niels. 😄

Comment on lines +1180 to +1182
```
// With multiple primary shards, the segments are more spread out, so it's even less likely that we'll get more than 1 segment
// in one shard, and some shards might even be empty.
assertThat(preLifecycleBackingIndexSegments, greaterThanOrEqualTo(0));
```
Member

Even if there are multiple primary shards and some shards may be empty, shouldn't the total still be >= 1, since we've indexed at least one document? I'm not sure I understand how having one primary means we have >= 1 segment, but having more than one primary means that we may get 0 segments.

@nielsbauman (Contributor Author) Sep 4, 2025

Yeah, I understand your confusion. I had the same confusion the first time I looked at it, but forgot to mention it anywhere, sorry. The caveat here is that `TimeSeriesRestDriver#getNumberOfPrimarySegments(Client, String)` returns the number of segments for the "first" primary shard, i.e. it just gets shard 0:

```
List<Map<String, Object>> shards = (List<Map<String, Object>>) responseEntity.get("0");
```

That means that it will essentially return the number of segments of a random primary shard if there are multiple primary shards. There is only one other usage of that method, which runs a test with only one primary shard, so I think we can change the implementation of the method to return the sum of segments across all primary shards. What do you think?

Member

Ahhh okay, the name definitely doesn't make that clear. I'd be in favor of making it return the sum of segments across all the primaries as well.

Contributor Author

Fixed in 5806e64.
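
For readers following along, here's a rough sketch of what a summed-across-primaries implementation could look like. This is not necessarily the code in 5806e64: the class and method names are made up for illustration, and the field names (`routing`, `primary`, `segments`) assume the shape of the `_segments` API response that the existing driver code already parses.

```
import java.util.List;
import java.util.Map;

class PrimarySegmentCounter {
    /**
     * Sums segment counts over all primary shard copies, assuming {@code shardsMap}
     * is the already-parsed "shards" object of the _segments API response,
     * i.e. a map from shard id ("0", "1", ...) to a list of shard copies.
     */
    @SuppressWarnings("unchecked")
    static int sumPrimarySegments(Map<String, Object> shardsMap) {
        int total = 0;
        for (Object copiesObj : shardsMap.values()) {
            for (Map<String, Object> copy : (List<Map<String, Object>>) copiesObj) {
                Map<String, Object> routing = (Map<String, Object>) copy.get("routing");
                if (routing != null && Boolean.TRUE.equals(routing.get("primary"))) {
                    // each shard copy lists its segments keyed by segment name
                    Map<String, Object> segments = (Map<String, Object>) copy.get("segments");
                    total += segments == null ? 0 : segments.size();
                }
            }
        }
        return total;
    }
}
```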

elasticsearchmachine pushed a commit that referenced this pull request Oct 14, 2025
…5834)

In this PR we move the force-merge operation from the downsampling
request to the ILM action. 

Our goal is to decouple the downsampling operation from the force-merge
operation. With this change, the downsampling request is responsible for
ensuring that the downsampled index is refreshed and flushed, but not for
force-merging it.

We believe that most of the time this is not necessary, and executing
the force-merge operation unnecessarily can increase the load on the
cluster.

To preserve backwards compatibility, we move the responsibility for
executing the existing force merge to the downsample ILM action and
make it configurable. By default, it will run, but a user can disable it
just as they can with a searchable snapshot.

```
"downsample": {
  "fixed_interval": "1h",
  "force_merge_index": false
}
```

**Update**

As a follow-up to this PR, we pose the question: is the force merge in
the downsample action intentional and useful?

To answer this question, we extend the time series telemetry. We define the
force merge step in the downsample ILM action as useful if it is the only
force-merge operation before a searchable snapshot.

Effectively, by this definition, we argue that the force merge in
downsampling is not an operation the user has intentionally requested but
only a side effect of the implementation. We identify the biggest impact of
removing it to be on searchable snapshots, but if the searchable snapshot
performs its own (and more performant, per #133954) force merge,
then we could skip this operation in the downsample action altogether.

Fixes: #135618
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Oct 16, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 24, 2025
nielsbauman added a commit that referenced this pull request Oct 24, 2025
If an index is in either `logsdb` or `time_series` mode and specifies a non-default `@timestamp` type mapping (e.g. `date_nanos`), using the clone, split, or shrink API will result in shards that are unable to initialize/recover due to a mapping conflict.
As of #133954, the `searchable_snapshot` ILM action makes use of the clone API by default - if the index has more than `0` replicas - and will thus also run into this issue.
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 24, 2025
nielsbauman added a commit that referenced this pull request Oct 24, 2025
…le_snapshot` (#137111)

In #133954, we modified the `searchable_snapshot` ILM action to clone
the index with 0 replicas before performing the force-merge. We didn't
take the `index.auto_expand_replicas` setting into account, which could
result in the clone having replicas after all. That's harmless, as it
merely nullifies the optimization of that PR, but we should remove the
setting to ensure we achieve the intended optimizations.
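
Illustratively, the fix amounts to making sure the clone is created with settings along these lines. The index names are placeholders, and ILM performs the clone internally rather than via a manual REST call, but the effect on the clone's settings is roughly this:

```
POST /my-index/_clone/fm-clone-my-index
{
  "settings": {
    "index.number_of_replicas": 0,
    "index.auto_expand_replicas": false
  }
}
```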
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 24, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 24, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 24, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 25, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 25, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 30, 2025
nielsbauman added a commit that referenced this pull request Oct 30, 2025
…le snapshot action (#137375)

In #133954, we modified ILM's searchable snapshot action to perform the
force-merge on a clone of the index with 0 replicas. This optimization
avoids performing the force-merge redundantly on replicas, as the
subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index caused issues; there
was a bug in the clone API that caused shards to remain initializing
indefinitely under specific circumstances (fixed by #137096), and cloned
shards cannot be assigned if their source lives on a node that is
close to or past the low disk watermark threshold (to be fixed soon by the
Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the
`searchable_snapshot` action of their ILM policy if they don't want to
clone the index with 0 replicas before performing the force-merge. We
implement an opt-out instead of an opt-in, as we believe these issues to
be rather specific (and soon resolved), and the clone is worth doing by
default.
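
The exact option name isn't quoted in this thread, so the fragment below uses an angle-bracket placeholder; it only illustrates where the opt-out lives in a policy (inside the `searchable_snapshot` action), not the real field name:

```
"searchable_snapshot": {
  "snapshot_repository": "my-snapshot-repo",
  "<opt-out-option>": false
}
```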
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 31, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 31, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 31, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Nov 3, 2025
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025
nielsbauman added a commit that referenced this pull request Nov 3, 2025
As of #133954, we clone indices before performing the force-merge step in the `searchable_snapshot` action. On slow CI servers, 10 seconds for the index to go through the whole `searchable_snapshot` action isn't enough, so we bump the timeout to 20 seconds.

I looked at the logs of a few test failures, and ILM was clearly still progressing when the test timed out. I didn't identify any particular step that was taking extraordinarily long; there were always just a few steps that took a bit longer. I would love to make these tests faster rather than bumping the timeout, but the `searchable_snapshot` action is simply one of the largest ILM actions and ILM itself isn't particularly fast.

That being said, if a timeout of 20 seconds proves to be insufficient (i.e. test failures come back), I do think it's worth having a look at reducing the runtime of the tests somehow first before we increase the timeout further.

Closes #137149
Closes #137151
Closes #137152
Closes #137153
Closes #137156
Closes #137166
Closes #137167
Closes #137192
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Nov 3, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 3, 2025

Labels

:Data Management/ILM+SLM, >enhancement, Team:Data Management, v9.2.0

Development

Successfully merging this pull request may close these issues:

Can we avoid force-merging all shard copies?