Reduce Data Loss in System Indices Migration 8x #120566
…nt/es-9724-reduce-data-loss-system-indices-8x
Hi @JVerwolf, I've created a changelog YAML for you.
…nt/es-9724-reduce-data-loss-system-indices-8x
… of github.com:JVerwolf/elasticsearch into enhancement/es-9724-reduce-data-loss-system-indices-8x
),
e
);
removeReadOnlyBlockOnReindexFailure(oldIndex, delegate2, e);
This error-handling function (the second parameter to `ActionListener.wrap`) will be called if the aliases request in `setAliasAndRemoveOldIndex` returns an error, or if the happy-path function that is the first parameter to `ActionListener.wrap` throws an error.
This function logs the error and then removes the `WRITE` block on the original index, in an attempt to not leave things in a broken state.
I'm not sure how to test this path, or even if this path is needed. I'd be happy to get feedback here from reviewers - thanks!
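To make the two failure routes described above concrete, here is a minimal, self-contained sketch. The `Listener` and `CheckedConsumer` interfaces and the `wrap` method are simplified stand-ins written for this example, not Elasticsearch's actual `ActionListener`; the assumption is only that `wrap` routes both an explicit failure and an exception thrown by the success handler to the same error handler. The event strings are hypothetical placeholders for the real logging and block removal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class WrapSketch {
    /** Hypothetical stand-in: a consumer that is allowed to throw. */
    interface CheckedConsumer<T> {
        void accept(T value) throws Exception;
    }

    /** Hypothetical stand-in: a listener with success and failure callbacks. */
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    /** Builds a listener whose error handler also catches exceptions from the happy path. */
    static <T> Listener<T> wrap(CheckedConsumer<T> happyPath, Consumer<Exception> errorHandler) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    happyPath.accept(response);
                } catch (Exception e) {
                    // Route 2: the happy-path function itself threw.
                    errorHandler.accept(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                // Route 1: the request reported a failure.
                errorHandler.accept(e);
            }
        };
    }

    /** Simulates one aliases request and records which handler ran. */
    static List<String> demo(boolean requestFails, boolean happyPathThrows) {
        List<String> events = new ArrayList<>();
        Listener<String> listener = wrap(response -> {
            if (happyPathThrows) {
                throw new IllegalStateException("happy path threw");
            }
            events.add("migrated: " + response);
        }, e -> events.add("remove write block after: " + e.getMessage()));

        if (requestFails) {
            listener.onFailure(new RuntimeException("aliases request failed"));
        } else {
            listener.onResponse("acknowledged");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo(false, false)); // success
        System.out.println(demo(true, false));  // request failure
        System.out.println(demo(false, true));  // happy path throws
    }
}
```

In this sketch both failure routes end in the same cleanup step, which is why a test that forces either one should be enough to exercise the handler.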
Just for local testing: would adding code to `TransportIndicesAliasesAction.masterOperation` that always throws an exception help to test this path?
Hmm, good idea. I'll look into how to do that - I think I remember seeing examples of that type of thing somewhere.
Pinging @elastic/es-core-infra (Team:Core/Infra)
Side note: if this is a backport (looks like a backport?) could you add the tag please?
… of github.com:JVerwolf/elasticsearch into enhancement/es-9724-reduce-data-loss-system-indices-8x
…nt/es-9724-reduce-data-loss-system-indices-8x
// Retry the migration
client().execute(PostFeatureUpgradeAction.INSTANCE, new PostFeatureUpgradeRequest(TEST_REQUEST_TIMEOUT)).get();

// Ensure that the migration is successful after the alias request is unblocked
assertBusy(() -> {
    GetFeatureUpgradeStatusResponse statusResp = client().execute(
        GetFeatureUpgradeStatusAction.INSTANCE,
        new GetFeatureUpgradeStatusRequest(TEST_REQUEST_TIMEOUT)
    ).get();
    logger.info(Strings.toString(statusResp));
    assertThat(statusResp.getUpgradeStatus(), equalTo(GetFeatureUpgradeStatusResponse.UpgradeStatus.NO_MIGRATION_NEEDED));
This is failing. When I set breakpoints, the `taskState` in `org.elasticsearch.upgrades.SystemIndexMigrator#cleanUpPreviousMigration` is null, which prevents the previous "new" index from being cleaned up. I then get an exception upon trying to create an index that already exists.
@gwbrown Do you know why this might be happening? Thanks!
…nt/es-9724-reduce-data-loss-system-indices-8x
The new test I added breaks the original code (without my changes) as well as my PR. It seems the task state is not being restored in subsequent migration runs, which prevents the new index from being cleaned up. @rjernst and I spent a while debugging this, but haven't been able to locate the cause yet. I'll disable the test for now and revisit it in a future PR.
LGTM
…nt/es-9724-reduce-data-loss-system-indices-8x
Reverts #120566
The original PR is causing the following exception to be thrown when security is enabled:
```
system-indices-testing-es01-1 | org.elasticsearch.ElasticsearchSecurityException: action [indices:admin/block/add] is unauthorized for user [_system] with effective roles [_system], this action is granted by the index privileges [manage,all]
```
Jira: ES-9724
This PR removes a potential cause of data loss when migrating system indices. It does this by changing the way we set a "write-block" on the system index being migrated: we now use a dedicated transport request rather than a settings update. Furthermore, we no longer remove the write-block before deleting the index, as this was another source of potential data loss. Additionally, we now remove the block if the migration fails.
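As a rough illustration of the ordering described above, the following sketch models the flow with an in-memory stand-in. Everything in it (`FakeIndex`, `migrate`, the `-migrated` suffix, the simulated failure flag) is hypothetical and written for this example; it is not the code in `SystemIndexMigrator`. It assumes the goal is that the old index never accepts writes between the copy and its deletion, and that a failed migration leaves the old index writable again.

```java
import java.util.ArrayList;
import java.util.List;

class MigrationFlowSketch {
    /** Hypothetical in-memory stand-in for an index with a write block. */
    static final class FakeIndex {
        final String name;
        final List<String> docs = new ArrayList<>();
        boolean writeBlocked;
        boolean deleted;

        FakeIndex(String name) {
            this.name = name;
        }

        void write(String doc) {
            if (writeBlocked) {
                throw new IllegalStateException("index [" + name + "] is write-blocked");
            }
            docs.add(doc);
        }
    }

    /** Copies the old index into a new one, keeping the old index blocked until it is deleted. */
    static FakeIndex migrate(FakeIndex oldIndex, boolean failReindex) {
        // Step 1: block writes on the old index before copying anything.
        oldIndex.writeBlocked = true;
        try {
            FakeIndex newIndex = new FakeIndex(oldIndex.name + "-migrated");
            if (failReindex) {
                throw new IllegalStateException("simulated reindex failure");
            }
            // Step 2: copy the documents.
            newIndex.docs.addAll(oldIndex.docs);
            // Step 3: delete the old index while the block is still in place,
            // so no write can land on it after the copy.
            oldIndex.deleted = true;
            return newIndex;
        } catch (RuntimeException e) {
            // On failure, lift the block so the old index stays usable.
            oldIndex.writeBlocked = false;
            throw e;
        }
    }

    public static void main(String[] args) {
        FakeIndex old = new FakeIndex(".example");
        old.write("doc-1");
        try {
            migrate(old, true);
        } catch (IllegalStateException e) {
            System.out.println("migration failed: " + e.getMessage());
        }
        old.write("doc-2"); // the block was removed, so this write succeeds
        FakeIndex migrated = migrate(old, false);
        System.out.println(migrated.name + " holds " + migrated.docs);
    }
}
```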
main branch PR: #120168