System Index Migration Failure Results in a Non-Recoverable State #122326
Conversation
| Pinging @elastic/es-core-infra (Team:Core/Infra) | 
| Hi @JVerwolf, I've created a changelog YAML for you. | 
Force-pushed from bf04c7b to 6fa041c
    | @elasticsearchmachine update branch | 
| Consumer<ClusterState> listener |
| ) { |
| logger.debug("cleaning up previous migration, task state: [{}]", taskState == null ? "null" : Strings.toString(taskState)); |
| if (taskState != null && taskState.getCurrentIndex() != null) { |
The taskState is always null, so the cleanup logic is never invoked here.
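To make the failure mode concrete, here is a minimal, self-contained sketch of why the guard above can never fire when the task state was never persisted. The types and method bodies are simplified stand-ins, not the real SystemIndexMigrator classes.

```java
// Illustrative sketch only: TaskState is a stand-in for the real persistent task state type.
public class CleanupGuardSketch {
    record TaskState(String currentIndex) {}

    static void cleanUpPreviousMigration(TaskState taskState) {
        // On a restarted migration the task state was never persisted, so taskState
        // arrives as null here...
        if (taskState != null && taskState.currentIndex() != null) {
            // ...which makes this branch unreachable: the half-created index from the
            // failed attempt is never deleted.
            System.out.println("deleting leftover index " + taskState.currentIndex());
        }
    }

    public static void main(String[] args) {
        cleanUpPreviousMigration(null); // prints nothing: cleanup is silently skipped
    }
}
```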
| updateTask.submit(clusterService); |
| } |
|  |
| private void prepareNextIndex(ClusterState clusterState, Consumer<ClusterState> listener, String lastFeatureName) { |
The clusterState was never used, so I removed this extraneous param.
| logger.debug("no incomplete index to remove"); |
| clearResults(clusterService, ActionListener.wrap(listener::accept, this::markAsFailed)); |
| } |
| clearResults( |
The purpose of clearResults is to remove the persistent task state left over from prior runs so that we begin the migration with a fresh start. I'm not sure if we ever even store the persistent task state in practice? The call to super.markAsFailed(e); in markAsFailed ultimately removes the task state, though perhaps this wasn't the original intention.
It may be possible that there are other failure scenarios that leave the task state behind (e.g. a node getting abruptly killed), so I left this here for now until I can reason through it with @gwbrown when she's back.
| private <T> void deleteIndex(SystemIndexMigrationInfo migrationInfo, ActionListener<AcknowledgedResponse> listener) { |
| logger.info("removing index [{}] from feature [{}]", migrationInfo.getNextIndexName(), migrationInfo.getFeatureName()); |
| String newIndexName = migrationInfo.getNextIndexName(); |
| baseClient.admin().indices().prepareDelete(newIndexName).execute(ActionListener.wrap(ackedResponse -> { |
This code is identical to the prior cleanUpPreviousMigration code except that I'm using baseClient here, since I don't have access to the migrationInfo needed for migrationInfo.createClient(baseClient). That previous call produced an OriginSettingClient. I don't think we need that here, though I'm not 100% sure.
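For reference, a rough sketch of the difference between the two clients, assuming the current org.elasticsearch.client.internal package layout; the helper names here are mine, not the real SystemIndexMigrationInfo API.

```java
import org.elasticsearch.client.internal.Client;
import org.elasticsearch.client.internal.OriginSettingClient;

final class ClientChoiceSketch {
    // Roughly what migrationInfo.createClient(baseClient) does: wrap the base client so
    // requests are attributed to the owning feature's origin.
    static Client featureOriginClient(Client baseClient, String featureOrigin) {
        return new OriginSettingClient(baseClient, featureOrigin);
    }

    // The new deleteIndex(...) path uses baseClient directly (no origin header), which
    // should be enough to delete the half-created destination index.
    static Client plainClient(Client baseClient) {
        return baseClient;
    }
}
```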
| baseClient.admin().indices().prepareDelete(newIndexName).execute(ActionListener.wrap(ackedResponse -> { |
| if (ackedResponse.isAcknowledged()) { |
| logger.info("successfully removed index [{}]", newIndexName); |
| listener.onResponse(ackedResponse); |
This code is identical to the prior cleanUpPreviousMigration code except that I don't call clearResults(clusterService, ActionListener.wrap(listener::accept, this::markAsFailed)); here, as I've already cleaned up at the beginning of the migration, and I don't want to remove in-progress state since this delete now happens once the migration is already underway.
So many callbacks!

server/src/main/java/org/elasticsearch/upgrades/SystemIndexMigrator.java (Outdated)
        
| createIndex(migrationInfo, ActionListener.wrap(listener::onResponse, e -> { |
| logger.warn("createIndex failed, retrying after removing index [{}] from previous attempt", migrationInfo.getNextIndexName()); |
| deleteIndex(migrationInfo, ActionListener.wrap(cleanupResponse -> createIndex(migrationInfo, listener), e2 -> { |
| logger.warn("createIndex failed after retrying, aborting", e2); |
I got lost in the callbacks, but isn't this the error consumer for deleteIndex?
You're right, I've fixed this now. Nice catch! Also, welcome to callback hell.
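To illustrate the mix-up, here is a self-contained sketch with stand-in types (not the real ActionListener API): the failure callback passed to wrap(...) is paired with the call it wraps, so in the original snippet the "createIndex failed after retrying" handler actually fires when deleteIndex fails.

```java
import java.util.function.Consumer;

public class ListenerPairingSketch {
    interface Listener<T> { void onResponse(T r); void onFailure(Exception e); }

    // Stand-in for ActionListener.wrap(onResponse, onFailure).
    static <T> Listener<T> wrap(Consumer<T> onResponse, Consumer<Exception> onFailure) {
        return new Listener<>() {
            public void onResponse(T r) { onResponse.accept(r); }
            public void onFailure(Exception e) { onFailure.accept(e); }
        };
    }

    // Stand-in deleteIndex that fails, to show which handler runs.
    static void deleteIndex(Listener<String> l) { l.onFailure(new Exception("delete failed")); }

    public static void main(String[] args) {
        // The e2 handler below belongs to deleteIndex, not to the retried createIndex:
        deleteIndex(wrap(
            ok -> System.out.println("retry createIndex here"),
            e2 -> System.out.println("fires on deleteIndex failure: " + e2.getMessage())
        ));
    }
}
```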
| @elasticsearchmachine update branch | 
LGTM
| logger.warn("createIndex failed, retrying after removing index [{}] from previous attempt", migrationInfo.getNextIndexName()); |
| deleteIndex(migrationInfo, ActionListener.wrap(cleanupResponse -> createIndex(migrationInfo, l.delegateResponse((l3, e3) -> { |
| logger.error( |
| "createIndex failed after retrying, aborting; index [{}] will be left in an inconsistent state", |
left in an inconsistent state
I am not sure we can do anything about this, so this is more out of curiosity: would it be deleted if a user retries the migration?
Yes.
If there is an existing index:
Previously:
- We would try to create a new index.
- The create call would fail due to an existing index with the same name.
Now, instead:
1. We try to create an index.
2. The create call will fail due to a preexisting index with the same name.
3. We catch the error and then delete the preexisting index.
4. We retry the create, leaving the index in a broken state if it fails.
Why not delete after step 4? Well, the error could be a cluster problem that prevents deletes as well. We'd still need to have the delete in response to a failed create. This was the original intention for how the code was supposed to work (IIRC after talking to @gwbrown), so I left the overall high-level approach as-is.
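A simplified, synchronous sketch of the flow described above; the real code is callback-based via ActionListener, and these interface and method names are stand-ins.

```java
public class CreateRetrySketch {
    interface Indices {
        void create(String index) throws Exception;
        void delete(String index) throws Exception;
    }

    static void createIndexRetryOnFailure(Indices indices, String destIndex) throws Exception {
        try {
            indices.create(destIndex);        // step 1: attempt the create
        } catch (Exception firstFailure) {    // step 2: e.g. the index already exists from a failed run
            indices.delete(destIndex);        // step 3: delete the leftover index
            indices.create(destIndex);        // step 4: retry once; if this also fails, the index
                                              // may be left in an inconsistent state
        }
    }
}
```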
| private void createIndexRetryOnFailure(SystemIndexMigrationInfo migrationInfo, ActionListener<ShardsAcknowledgedResponse> listener) { |
| createIndex(migrationInfo, listener.delegateResponse((l, e) -> { |
| logger.warn("createIndex failed, retrying after removing index [{}] from previous attempt", migrationInfo.getNextIndexName()); |
| deleteIndex(migrationInfo, ActionListener.wrap(cleanupResponse -> createIndex(migrationInfo, l.delegateResponse((l3, e3) -> { |
Here, we try to delete the index and retry if there is any error with the prior create call. I could instead only catch the specific "resource already exists" exception. However, I'd rather be more broad here in case there are other valid exceptions that get thrown, where deleting the existing resource is the right thing to do. Otherwise, we may end up in a non-recoverable state as a result of some scenario we haven't predicted.
Agree that it is better to be more broad here
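For comparison, a sketch of the narrower alternative discussed above, i.e. only treating "resource already exists" as the recoverable case. This is not what the PR does; the PR deliberately retries on any create failure.

```java
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.ResourceAlreadyExistsException;

final class NarrowRetrySketch {
    // Returns true only when the create failed because the destination index already exists.
    static boolean shouldDeleteAndRetry(Exception e) {
        return ExceptionsHelper.unwrapCause(e) instanceof ResourceAlreadyExistsException;
    }
}
```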
| @elasticsearchmachine update branch | 
…ix-migration-blocked-from-previous-failure
| 💔 Backport failed. The backport operation could not be completed due to the following error: You can use sqren/backport to manually backport by running |
| 💚 All backports created successfully. Questions? Please refer to the Backport tool documentation |
…astic#122326) This PR changes the code to no longer rely on the persistent task state for the cleanup logic of existing indices. (cherry picked from commit 9076ac4)
Jira: ES-10666
In #120566 I wrote a test that uncovered a pre-existing issue whereby the system indices could end up in a non-recoverable state if the migration failed.
Specifically, broken system indices from previous migration attempts will not be cleaned up (although there is code to do so) because the persistent task state is missing on subsequent runs. When the system index migration logic tries to create a new index and a broken index from the earlier attempt already exists, an exception occurs. This appears to be caused by the task state not being persisted correctly during the original failure.
This PR changes the code to no longer rely on the persistent task state for the cleanup logic of existing indices.