Skip to content

Conversation

@nielsbauman
Copy link
Contributor

Updates the background service of ILM to process all projects in the cluster.

Updates the background service of ILM to process all projects in the
cluster.
@nielsbauman nielsbauman added >non-issue :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Jun 24, 2025
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v9.1.0 labels Jun 24, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Comment on lines +420 to +422
// We're updating the policy registry cache here, which doesn't actually work with multiple projects because the policies from one
// project would overwrite the polices from another project. However, since we're not planning on running ILM in a multi-project
// cluster, we can ignore this.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to add an annotation and use it here before I merge this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we could also assert that there's only one project? I would feel more comfortable knowing that a test would probably fail if we ever used the ILM plugin in a multi-project cluster, as well has having the annotation to help us remember.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of an assertion too, but the IndexTemplateRegistry already installs multiple ILM polices in every project, so an assertion here would already trip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I tried running the ILM Java REST tests and those do fail because of this (although it's not directly apparent from the failure that it's because of this registry cache).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I thought that this entire module was excluded in serverless, so this code simply doesn't exist there.

If the issue is just that we're tripping integration tests which have multiple projects but don't exclude this module like serverless does... we could fix that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought about this deeply and it may not be possible, but I'd quite like to understand why if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the issue is just that we're tripping integration tests which have multiple projects but don't exclude this module like serverless does... we could fix that?

Yep, that's exactly what happens. All tests that make use of the DEFAULT distribution type will trip this assertion. We might not need to run all of these tests in MP mode, but probably a significant portion of them we do - more than we can easily convert for this PR at least.

Comment on lines +420 to +422
// We're updating the policy registry cache here, which doesn't actually work with multiple projects because the policies from one
// project would overwrite the polices from another project. However, since we're not planning on running ILM in a multi-project
// cluster, we can ignore this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we could also assert that there's only one project? I would feel more comfortable knowing that a test would probably fail if we ever used the ILM plugin in a multi-project cluster, as well has having the annotation to help us remember.

}

static Set<String> indicesOnShuttingDownNodesInDangerousStep(ClusterState state, String nodeId) {
static boolean indicesOnShuttingDownNodesInDangerousStep(ClusterState state, String nodeId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this needs a different name, since you're changing the return value. Previously, IIUC, it was returning a set of indices preventing shutdown, and it would consider it safe to shutdown if that was empty. Now it's directly returning whether that set is empty, and therefore whether it's safe to shutdown, right? The method name doesn't indicate this at all.

If it's easier to name the opposite of this method, rather than the method itself, it might even be worth inverting the sense for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, how about areIndicesOnShuttingDownNodesInDangerousStep?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, something like that. Although I find the OnShuttingDown in the middle a bit confusing. (To be honest, I find it a bit confusing in the current name.) Is the point that it would be dangerous to shut down while they're in these steps? Is so, something like hasIndicesInDangerousStepForShutdown might read more naturally???

(It's a private method — or rather, it's visible only for testing — so don't sweat this too much!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the point is that it would be dangerous to shut down the node on which these indices are (that are in a dangerous step). So I went with hasIndicesInDangerousStepForNodeShutdown, to indicate that it's about node shutdown.

Copy link
Member

@PeteGillinElastic PeteGillinElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly LGTM. I'm happy for you to make a call on whether it's feasible to get the tests passing with that assert in or not.

@nielsbauman nielsbauman enabled auto-merge (squash) June 25, 2025 12:29
@nielsbauman nielsbauman disabled auto-merge June 25, 2025 13:37
@nielsbauman nielsbauman enabled auto-merge (squash) June 25, 2025 13:37
@nielsbauman nielsbauman disabled auto-merge June 25, 2025 13:38
@nielsbauman nielsbauman merged commit bc515c4 into elastic:main Jun 25, 2025
30 of 34 checks passed
@nielsbauman nielsbauman deleted the ilm-service-mp branch June 25, 2025 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/ILM+SLM Index and Snapshot lifecycle management >non-issue Team:Data Management Meta label for data/management team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants