Skip to content

Conversation

@nielsbauman
Copy link
Contributor

In #133954, we modified ILM's searchable snapshot action to perform the force-merge on a clone of the index with 0 replicas. This optimization avoids performing the force-merge redundantly on replicas, as the subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index resulted in issues; there was a bug in the clone API that caused shards to be initializing permanently under specific circumstances (fixed by #137096), and cloned shards are unable to be assigned if their source lives on a node that is close/past the low watermark disk threshold (will be fixed soon by the Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the searchable_snapshot action of their ILM policy if they don't want to clone the index with 0 replicas before performing the force-merge. We implement an opt-out instead of an opt-in, as we believe these issues to be rather specific (and soon resolved), and the clone is worth doing by default.

…le snapshot action

In elastic#133954, we modified ILM's searchable snapshot action to perform the
force-merge on a clone of the index with 0 replicas. This optimization
avoids performing the force-merge redundantly on replicas, as the
subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index resulted in issues; there
was a bug in the clone API that caused shards to be initializing
permanently under specific circumstances (fixed by elastic#137096), and cloned
shards are unable to be assigned if their source lives on a node that is
close/past the low watermark disk threshold (will be fixed soon by the
Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the
`searchable_snapshot` action of their ILM policy if they don't want to
clone the index with 0 replicas before performing the force-merge. We
implement an opt-out instead of an opt-in, as we believe these issues to
be rather specific (and soon resolved), and the clone is worth doing by
default.
@nielsbauman nielsbauman added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management auto-backport Automatically create backport pull requests when merged branch:9.2 labels Oct 30, 2025
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v9.3.0 v9.2.1 labels Oct 30, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @nielsbauman, I've created a changelog YAML for you.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 30, 2025

@github-actions
Copy link
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

@nielsbauman nielsbauman requested a review from Copilot October 30, 2025 12:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces an opt-out mechanism for the force-merge-on-clone optimization in ILM's searchable snapshot action. The optimization, which clones an index with 0 replicas before force-merging to avoid redundant work on replicas, can now be disabled via a new force_merge_on_clone field when issues arise with cloning (such as shard assignment problems near disk watermarks).

Key Changes:

  • Added a new force_merge_on_clone boolean field to SearchableSnapshotAction that defaults to true (maintaining current behavior)
  • Modified the step execution logic to conditionally skip clone-related steps when force_merge_on_clone is false
  • Added validation to ensure force_merge_on_clone is only set when force_merge_index is true

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
SearchableSnapshotAction.java Core implementation adding the force_merge_on_clone field with validation and step execution logic
SearchableSnapshotActionTests.java Updated unit tests to cover the new field in serialization, mutation, and step key generation
SearchableSnapshotActionIT.java Added integration test validating behavior when opting out of force-merge-on-clone
RestPutLifecycleAction.java Added capability flag for the new feature
90_searchable_snapshot.yml Added YAML REST test verifying the new field can be set and retrieved
ilm-searchable-snapshot.md Added documentation for the new force_merge_on_clone parameter
TimeseriesLifecycleTypeTests.java, LifecyclePolicyTests.java Updated test utilities to pass null for the new parameter
transport version files Added new transport version for backward compatibility
137375.yaml Added changelog entry

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
this.replicateFor = replicateFor;

if (forceMergeIndex == false && forceMergeOnClone != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't ready through the whole PR yet so I might be missing something, but what if forceMergeOnClone is false? We don't want to throw an exception here do we? Is there any reason for forceMergeOnClone to be a Boolean rather than a boolean? Is there any real value in it being able to have 3 states instead of 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was a bit on the fence here. My reason for going with a Boolean was that it better represents what the user actually does (null, false, true). However, we don't currently make a lot of use of that null state, so I don't have super strong feelings on this. If you strongly prefer having a boolean, I'm ok with changing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense with the toXcontent not returning it when null. But is this a bug here that you can't say force_merge_on_clone: false and force_merge_index: false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I forgot to comment on that. I intentionally went with that behavior, as I thought it was cleaner and less confusing if force_merge_on_clone is simply not supported when force_merge_index: false. It wouldn't have any effect if you were able to specify it, but I think it's confusing to allow configuring fields that are ignored. This way, it's clear that the force_merge_on_clone field is not relevant/supported. What are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit odd to me, but it sort of makes sense.

if (replicateFor != null) {
builder.field(REPLICATE_FOR.getPreferredName(), replicateFor);
}
if (forceMergeOnClone != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this answers my question from above -- we don't want it in the xcontent if it has not been set explicitly.

Copy link
Contributor

@szybia szybia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 🚀

Copy link
Member

@masseyke masseyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nielsbauman nielsbauman enabled auto-merge (squash) October 30, 2025 20:36
@nielsbauman nielsbauman merged commit 0ab3240 into elastic:main Oct 30, 2025
34 checks passed
@nielsbauman nielsbauman deleted the allow-opt-out-ilm branch October 30, 2025 22:07
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 137375

@nielsbauman
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
9.2

Questions ?

Please refer to the Backport tool documentation

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 31, 2025
…le snapshot action (elastic#137375)

In elastic#133954, we modified ILM's searchable snapshot action to perform the
force-merge on a clone of the index with 0 replicas. This optimization
avoids performing the force-merge redundantly on replicas, as the
subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index resulted in issues; there
was a bug in the clone API that caused shards to be initializing
permanently under specific circumstances (fixed by elastic#137096), and cloned
shards are unable to be assigned if their source lives on a node that is
close/past the low watermark disk threshold (will be fixed soon by the
Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the
`searchable_snapshot` action of their ILM policy if they don't want to
clone the index with 0 replicas before performing the force-merge. We
implement an opt-out instead of an opt-in, as we believe these issues to
be rather specific (and soon resolved), and the clone is worth doing by
default.

(cherry picked from commit 0ab3240)

# Conflicts:
#	server/src/main/resources/transport/upper_bounds/9.2.csv
#	server/src/main/resources/transport/upper_bounds/9.3.csv
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/RestPutLifecycleAction.java
@nielsbauman
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
9.2

Questions ?

Please refer to the Backport tool documentation

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 31, 2025
…le snapshot action (elastic#137375)

In elastic#133954, we modified ILM's searchable snapshot action to perform the
force-merge on a clone of the index with 0 replicas. This optimization
avoids performing the force-merge redundantly on replicas, as the
subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index resulted in issues; there
was a bug in the clone API that caused shards to be initializing
permanently under specific circumstances (fixed by elastic#137096), and cloned
shards are unable to be assigned if their source lives on a node that is
close/past the low watermark disk threshold (will be fixed soon by the
Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the
`searchable_snapshot` action of their ILM policy if they don't want to
clone the index with 0 replicas before performing the force-merge. We
implement an opt-out instead of an opt-in, as we believe these issues to
be rather specific (and soon resolved), and the clone is worth doing by
default.

(cherry picked from commit 0ab3240)

# Conflicts:
#	server/src/main/resources/transport/upper_bounds/9.3.csv
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/RestPutLifecycleAction.java
elasticsearchmachine pushed a commit that referenced this pull request Oct 31, 2025
…archable snapshot action (#137375) (#137463)

* Allow opting out of force-merging on a cloned index in ILM's searchable snapshot action (#137375)

In #133954, we modified ILM's searchable snapshot action to perform the
force-merge on a clone of the index with 0 replicas. This optimization
avoids performing the force-merge redundantly on replicas, as the
subsequent snapshot operation only looks at primary shards.

We've seen some cases where cloning the index resulted in issues; there
was a bug in the clone API that caused shards to be initializing
permanently under specific circumstances (fixed by #137096), and cloned
shards are unable to be assigned if their source lives on a node that is
close/past the low watermark disk threshold (will be fixed soon by the
Distributed Coordination team).

Therefore, we implement an opt-out flag that users can configure in the
`searchable_snapshot` action of their ILM policy if they don't want to
clone the index with 0 replicas before performing the force-merge. We
implement an opt-out instead of an opt-in, as we believe these issues to
be rather specific (and soon resolved), and the clone is worth doing by
default.

(cherry picked from commit 0ab3240)

# Conflicts:
#	server/src/main/resources/transport/upper_bounds/9.3.csv
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/RestPutLifecycleAction.java

* Fix transport version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v9.2.1 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants