Ideal state instance partitions metadata #17515
Codecov Report
@@ Coverage Diff @@
## master #17515 +/- ##
============================================
- Coverage 63.25% 63.18% -0.07%
- Complexity 1477 1479 +2
============================================
Files 3170 3173 +3
Lines 189469 190072 +603
Branches 28988 29090 +102
============================================
+ Hits 119840 120099 +259
- Misses 60339 60632 +293
- Partials 9290 9341 +51
  public TableRebalancer(HelixManager helixManager) {
-   this(helixManager, null, null, null, null, null);
+   this(helixManager, null, null, null, null, null, true);
Should this be true or false?
This path is mainly used during testing, so I think it makes sense to keep it enabled to help catch any regressions?
    "Cannot rebalance disabled table without downtime", null, null, null, null, null);
}

// Wipe out ideal state instance partitions metadata
We shouldn't wipe it until a rebalance is actually required.
E.g. when segmentAssignmentUnchanged, we should check whether the instance partitions changed, then modify them accordingly.
If we wipe it here and the following part throws an exception, we might end up with an IS without instance partitions.
> If we wipe it here and the following part throws an exception, we might end up with an IS without instance partitions.

That would cause the rebalance to fail, in which case it will be retried anyway, right?

> E.g. when segmentAssignmentUnchanged, we should check whether the instance partitions changed, then modify them accordingly.

We're already updating ideal state instance partitions when segmentAssignmentUnchanged but instance partitions changed. But good point about segmentAssignmentUnchanged: when there's no instance partitions change, I think that would've wiped out the ideal state instance partitions. I've updated the logic.
Map<String, List<String>> idealStateListFields = currentIdealState.getRecord().getListFields();
InstancePartitionsUtils.replaceInstancePartitionsInIdealState(currentIdealState, instancePartitionsList);

return HelixHelper.updateIdealState(_helixManager, tableNameWithType, is -> {
We shouldn't retry here. The update needs to be a version-checked update to ensure consistency of the IS.
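To illustrate the distinction being drawn here, the following is a minimal sketch of a version-checked (compare-and-set) write that fails instead of retrying when the record has moved on. `VersionedStore`, `Snapshot`, and `writeIfVersionMatches` are hypothetical names for an in-memory stand-in; this is not the Helix API and not `HelixHelper.updateIdealState`, whose behavior is not shown in this thread.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a versioned record store (not the Helix API). */
final class VersionedStore<T> {
  /** Immutable snapshot: a value plus the version at which it was written. */
  static final class Snapshot<T> {
    final T value;
    final int version;

    Snapshot(T value, int version) {
      this.value = value;
      this.version = version;
    }
  }

  private final AtomicReference<Snapshot<T>> current;

  VersionedStore(T initial) {
    current = new AtomicReference<>(new Snapshot<>(initial, 0));
  }

  Snapshot<T> read() {
    return current.get();
  }

  /**
   * Applies the updater only if the store is still at expectedVersion.
   * Returns false without retrying when another writer got there first,
   * so the caller can re-read and decide what to do.
   */
  boolean writeIfVersionMatches(int expectedVersion, UnaryOperator<T> updater) {
    Snapshot<T> seen = current.get();
    if (seen.version != expectedVersion) {
      return false;
    }
    Snapshot<T> next = new Snapshot<>(updater.apply(seen.value), seen.version + 1);
    // compareAndSet guards against a concurrent write between the read and the swap.
    return current.compareAndSet(seen, next);
  }
}
```

A caller would read a snapshot, compute the change against it, and pass `snapshot.version` back in; a `false` result signals that the decision was based on stale state and should be recomputed rather than blindly reapplied.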
We should wipe the IP with the first IS change and restore it with the last IS change. Replacing the IP as a separate step can cause inconsistency.
There are three callers for this method:
- When `segmentAssignmentUnchanged` but instance partitions have changed, this method is called to update the ideal state instance partitions just before completing the rebalance. I think this one is safe.
- After the end of the rebalance, when the current assignment matches the target assignment. I think this one is safe too?
- Before starting the actual rebalance. I guess this is the one you're concerned about?

I've updated the wipe-out logic to be performed alongside the IS change, but the restoration part at the end is still separate because it looks like there could be cases where the assignment reaches the target assignment outside of the rebalance-initiated IS updates.
if (_updateIdealStateInstancePartitions) {
  // Rebalance completed successfully, so we can update the instance partitions in the ideal state to reflect
  // the new set of instance partitions.
  List<InstancePartitions> instancePartitionsList = new ArrayList<>(instancePartitionsMap.values());
Consider making the order of this list deterministic, so that we can check if it is identical to the existing one
The existing one should've been wiped out before this point though.
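If a comparison against an existing list were ever needed, the ordering suggestion could be sketched roughly as below: sort by name before comparing so that map iteration order does not affect equality. `NamedPartitions` is a hypothetical, simplified stand-in for illustration only, not Pinot's `InstancePartitions` class.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

final class DeterministicOrder {
  /** Hypothetical simplified stand-in for an instance partitions object. */
  static final class NamedPartitions {
    final String name;
    final List<String> instances;

    NamedPartitions(String name, List<String> instances) {
      this.name = name;
      this.instances = new ArrayList<>(instances);
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof NamedPartitions)) {
        return false;
      }
      NamedPartitions other = (NamedPartitions) o;
      return name.equals(other.name) && instances.equals(other.instances);
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, instances);
    }
  }

  /** Returns a copy sorted by name, so two lists with the same entries compare equal. */
  static List<NamedPartitions> sortedByName(Collection<NamedPartitions> values) {
    List<NamedPartitions> sorted = new ArrayList<>(values);
    sorted.sort(Comparator.comparing((NamedPartitions p) -> p.name));
    return sorted;
  }
}
```

With this, `sortedByName(a).equals(sortedByName(b))` holds whenever `a` and `b` contain equal entries, regardless of the order in which the source map yields them.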
pinot-common/src/main/java/org/apache/pinot/common/assignment/InstancePartitionsUtils.java
Integer replicaGroup = Integer.parseInt(key.substring(separatorIndex + 1));
listFields.getValue().forEach(value -> {
  if (serverToReplicaGroupMap.containsKey(value)) {
    LOGGER.warn("Server {} assigned to multiple replica groups ({}, {})", value, replicaGroup,
Is it possible for one server to be assigned to multiple replica groups? If so, will this break routing?
Should we consider throwing an exception and falling back when this happens?
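The stricter alternative raised in this comment could look roughly like the sketch below, which builds the server-to-replica-group map and throws on a conflicting assignment instead of logging a warning. `ReplicaGroupIndex`, the separator character, and the `<prefix><separator><replicaGroup>` key layout are assumptions for illustration; the actual key format in `InstancePartitionsUtils` is not shown in this thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplicaGroupIndex {
  /**
   * Builds a server -> replica group map from list fields whose keys end in
   * "<separator><replicaGroup>". The key layout is an assumption for this sketch.
   * Throws if a server appears under two different replica groups.
   */
  static Map<String, Integer> build(Map<String, List<String>> listFields, char separator) {
    Map<String, Integer> serverToReplicaGroup = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : listFields.entrySet()) {
      String key = entry.getKey();
      int separatorIndex = key.lastIndexOf(separator);
      if (separatorIndex < 0) {
        continue; // Not a replica group entry under the assumed layout
      }
      int replicaGroup = Integer.parseInt(key.substring(separatorIndex + 1));
      for (String server : entry.getValue()) {
        Integer previous = serverToReplicaGroup.putIfAbsent(server, replicaGroup);
        if (previous != null && previous != replicaGroup) {
          throw new IllegalStateException(
              "Server " + server + " assigned to multiple replica groups (" + previous + ", " + replicaGroup + ")");
        }
      }
    }
    return serverToReplicaGroup;
  }
}
```

A caller could catch the `IllegalStateException` and fall back to routing that does not rely on this metadata; whether that is preferable to the warning in the diff is the open question in this thread.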
...src/main/java/org/apache/pinot/broker/routing/instanceselector/SegmentInstanceCandidate.java
pinot-common/src/main/java/org/apache/pinot/common/assignment/InstancePartitions.java
for (InstancePartitions instancePartitions : instancePartitionsMap.values()) {
  if (!instancePartitions.equals(
      idealStateInstancePartitions.get(instancePartitions.getInstancePartitionsName()))) {
    LOGGER.warn(
We should add a table-level gauge to reflect whether the IP has been wiped for an IP-enabled table.
  // Assign instances
- assignInstances(tableConfig, true);
+ assignInstances(tableConfig, idealState, true);
Should we revert the changes for instance assignment?
We should modify the IS when assigning segments, not instances.
Hm, I can revert it for the update table path (when override is false), which only does instance assignment for new instance partitions and relies on table rebalance to update, as in other paths. But for new tables this is required, since they wouldn't have this ideal state instance partitions metadata otherwise.
MultiStageReplicaGroupSelector (as well as some differences in metadata maintained in inconsistent transient states). This ideal state metadata will be leveraged by future changes for better replica group routing.