Breakdown undesired allocations by shard routing role #132235
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
long unassignedShards,
long totalAllocations,
long undesiredAllocationsExcludingShuttingDownNodes,
Map<ShardRouting.Role, Long> undesiredAllocationsExcludingShuttingDownNodesByRole
If we're going to collect it all in a map of roles, shouldn't the map also replace totalAllocations and undesiredAllocationsExcludingShuttingDownNodes, since those are the values that the "default" role would have in stateful? But then you get into what to do for the values returned from the desired balance API in serverless. There you'd have to sum up index_only and search_only, I guess, and sprinkle in a bunch of asserts, since index/search and default should be mutually exclusive in this map. Another option is to just add the specific breakdowns we need here? index/searchTierAllocations and index/searchTierUndesiredAllocations? I think you'd need both total and undesired, since we're going to need the ratio in the autoscaler.
Also, how do we get these two stats (indexTierAllocations and indexTierUndesiredAllocations?) out of the balancer in the autoscaler? Is there a getter for these?
Another option is to just add the specific breakdowns we need here? index/searchTierAllocations and index/searchTierUndesiredAllocations?
As far as I can tell, we have no concept of "tier" in the ES codebase. There is role, but we seem to stop short of defining that as equivalent to a tier. I don't quite understand why, but I don't want to make assumptions about what role means with regard to tiers (see co.elastic.elasticsearch.stateless.allocation.StatelessAllocationDecider#canAllocateShardToNode; it's not even done there).
If we make two specific fields, we're baking in the assumptions that
- Role.INDEX_ONLY means indexing tier, Role.SEARCH_ONLY means search tier and Role.DEFAULT means the only tier in a stateful deployment
- the existing set of roles is fixed
So we'd probably need to add assertions to trigger if these assumptions were no longer valid. I think having a map of role to counts (possibly Role -> record RoleStats(int total, int undesired)) at least leaves the interpretation of the roles to the context they're used in.
Maybe these are valid assumptions? Perhaps there is some history regarding the modelling?
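To make the "map of role to counts" idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the `Role` enum is a local stand-in for `ShardRouting.Role`, and `RoleStats`, `Shard` and `countByRole` are hypothetical names used only for illustration.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class RoleStatsSketch {
    // Local stand-in for ShardRouting.Role (assumption: these three values).
    enum Role { DEFAULT, INDEX_ONLY, SEARCH_ONLY }

    // Hypothetical per-role counters, following the shape suggested in the comment above.
    record RoleStats(int total, int undesired) {}

    // Hypothetical minimal view of an assigned shard: its role and whether it sits on a non-desired node.
    record Shard(Role role, boolean undesired) {}

    // Counts allocations per role without attaching any meaning to the individual roles.
    static Map<Role, RoleStats> countByRole(List<Shard> shards) {
        Map<Role, RoleStats> stats = new EnumMap<>(Role.class);
        for (Shard shard : shards) {
            stats.merge(
                shard.role(),
                new RoleStats(1, shard.undesired() ? 1 : 0),
                (a, b) -> new RoleStats(a.total() + b.total(), a.undesired() + b.undesired())
            );
        }
        return stats;
    }
}
```

The counting code never names a specific role, so how a role maps to a tier is left to whoever reads the map.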
Also, how do we get these two stats (indexTierAllocations and indexTierUndesiredAllocations?) out of the balancer in the autoscaler? Is there a getter for these?
I added a getter and put up a serverless PR to illustrate that flow.
As far as I can tell we have no concept of "tier" in the ES codebase. There is role, but we seem to stop short of defining that as equivalent to a tier
:))) It feels like we're saying the same thing. index/searchTierAllocations and index/searchTierUndesiredAllocations are not good names, I agree with you. I was suggesting that, instead of the map, we could just add two more variables for the total allocations and undesired allocations of index_only (which could also be called promotable shards), since we don't need the rest. That's the only part of the map that we use for now, for the specific ticket that initiated this change. We could also use the map and ignore the rest; if you prefer that, it's fine.
I think I prefer the map. It's a breakdown of those stats by node role. In this code we never cared about node role previously, and we should continue to not care.
If we pick out specific roles to populate fields, we're baking knowledge of the roles and their meaning into this code. I'd rather we tried to keep that knowledge in the stateless code, which is the only place we should see anything other than DEFAULT, and the only place we're using the per-role stats.
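A rough sketch of what the stateless-side consumer could look like under this approach follows. Everything here is assumed for illustration: the `Role` enum is a local stand-in for `ShardRouting.Role`, the fields of `RoleAllocationStats` are guessed, and `blockScaleDown` and its threshold parameter are hypothetical, not the serverless autoscaler's actual API.

```java
import java.util.Map;

final class IndexingTierScaleDownSketch {
    // Local stand-in for ShardRouting.Role (assumption: these three values).
    enum Role { DEFAULT, INDEX_ONLY, SEARCH_ONLY }

    // Assumed shape of the per-role stats: total and undesired allocation counts.
    record RoleAllocationStats(long totalAllocations, long undesiredAllocations) {
        // Ratio of undesired to total allocations; 0.0 when there are no allocations.
        double undesiredRatio() {
            return totalAllocations == 0 ? 0.0 : (double) undesiredAllocations / totalAllocations;
        }
    }

    // Hypothetical check: only this caller interprets INDEX_ONLY as "the indexing tier".
    static boolean blockScaleDown(Map<Role, RoleAllocationStats> byRole, double maxUndesiredRatio) {
        RoleAllocationStats indexing = byRole.get(Role.INDEX_ONLY);
        return indexing != null && indexing.undesiredRatio() > maxUndesiredRatio;
    }
}
```

The role-to-tier interpretation lives entirely in the caller, which matches the preference stated above for keeping that knowledge out of the shared allocation code.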
 * @param unassignedShards Shards that are not assigned to any node.
 * @param allocationStatsByRole A breakdown of the allocations stats by {@link ShardRouting.Role}
 */
public record AllocationStats(long unassignedShards, Map<ShardRouting.Role, RoleAllocationStats> allocationStatsByRole) {
If you're going with replacing the totals with the role-based map, doesn't the current version implicitly assume that default and search_only/index_only are exclusive? Shouldn't we assert that either the counts for default are empty or the counts for both search_only and index_only are, so that when this assumption is not valid, totalAllocations() and undesiredAllocationsExcludingShuttingDownNodes() don't silently return something wrong? That's what I was saying in my previous comment.
No, as per #132235 (comment) this code was previously agnostic of role and should continue to be so.
There are parts of the code that care about these stats broken down by role (in stateless) and parts that just want the totals (here).
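One way the role-agnostic totals could be derived from the per-role map is sketched below. This is an assumption-laden illustration rather than the merged implementation: the `Role` enum stands in for `ShardRouting.Role`, the fields of `RoleAllocationStats` are guessed, and the accessors simply sum over every entry in the map.

```java
import java.util.Map;

final class AllocationStatsSketch {
    // Local stand-in for ShardRouting.Role (assumption: these three values).
    enum Role { DEFAULT, INDEX_ONLY, SEARCH_ONLY }

    // Assumed fields; the real RoleAllocationStats record may differ.
    record RoleAllocationStats(long totalAllocations, long undesiredAllocationsExcludingShuttingDownNodes) {}

    record AllocationStats(long unassignedShards, Map<Role, RoleAllocationStats> allocationStatsByRole) {
        // Sum over whatever roles are present; no role is treated specially.
        long totalAllocations() {
            return allocationStatsByRole.values().stream()
                .mapToLong(RoleAllocationStats::totalAllocations)
                .sum();
        }

        long undesiredAllocationsExcludingShuttingDownNodes() {
            return allocationStatsByRole.values().stream()
                .mapToLong(RoleAllocationStats::undesiredAllocationsExcludingShuttingDownNodes)
                .sum();
        }
    }
}
```

Because the totals are plain sums over all entries, they do not depend on which roles appear together in the map.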
* upstream/main: (32 commits) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) Add random tests with match_only_text multi-field (elastic#132380) ...
In order that we can prevent scale-down in stateless when there are undesired allocations specifically in the indexing tier Closes: ES-12221 Co-authored-by: Pooya Salehi <[email protected]>
…-stats * upstream/main: (36 commits) Fix reproducability of builds against Java EA versions (elastic#132847) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) ...
Add a count of undesired shard allocations broken down by shard routing role.
This is so that we can check whether there are undesired shard allocations in a specific tier in serverless. We don't have an explicit concept of tier (search/indexing) in the code, but shard routing role is a good enough proxy for that.
Relates: ES-12221