Skip to content

Conversation

JVerwolf
Copy link
Contributor

@JVerwolf JVerwolf commented Jan 21, 2025

Jira: ES-9724

This PR removes a potential cause of data loss when migrating system indices. It does this by changing the way we set a "write-block" on the system index to migrate - now using a dedicated transport request rather than a settings update. Furthermore, we no longer delete the write-block prior to deleting the index, as this was another source of potential data loss. Additionally, we now remove the block if the migration fails.

main branch PR: #120168

@JVerwolf JVerwolf changed the base branch from main to 8.x January 21, 2025 23:26
@elastic elastic deleted a comment from github-actions bot Jan 21, 2025
…nt/es-9724-reduce-data-loss-system-indices-8x
@JVerwolf JVerwolf added v8.18.0 :Core/Infra/Core Core issues without another label and removed v9.0.0 labels Jan 21, 2025
@JVerwolf JVerwolf added >bug Team:Core/Infra Meta label for core/infra team labels Jan 21, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @JVerwolf, I've created a changelog YAML for you.

JVerwolf and others added 5 commits January 21, 2025 15:36
…nt/es-9724-reduce-data-loss-system-indices-8x
… of github.com:JVerwolf/elasticsearch into enhancement/es-9724-reduce-data-loss-system-indices-8x
),
e
);
removeReadOnlyBlockOnReindexFailure(oldIndex, delegate2, e);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error handling function (second param to ActionListener.wrap) will be called If there is an error returned by the aliases request in setAliasAndRemoveOldIndex, or if there is an an error thrown by the happy-path function that's the first parameter to ActionListener.wrap.

This function will log the error and then remove the WRITE block on the original index, in an attempt to not leave things in a broken state.

I'm not sure how to test this path, or even if this path is needed. I'd be happy to get feedback here from reviewers - thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for local testing - would introducing code that always throws an exception to TransportIndicesAliasesAction.masterOperation help to test this path?

Copy link
Contributor Author

@JVerwolf JVerwolf Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good idea. I'll look into how to do that - I think I remember seeing examples of that type of thing somewhere.

@JVerwolf JVerwolf marked this pull request as ready for review January 23, 2025 00:10
@JVerwolf JVerwolf requested review from a team and alexey-ivanov-es January 23, 2025 00:10
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ldematte
Copy link
Contributor

Side note: if this is a backport (looks like a backport?) could you add the tag please?

),
e
);
removeReadOnlyBlockOnReindexFailure(oldIndex, delegate2, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for local testing - would introducing code that always throws an exception to TransportIndicesAliasesAction.masterOperation help to test this path?

… of github.com:JVerwolf/elasticsearch into enhancement/es-9724-reduce-data-loss-system-indices-8x
…nt/es-9724-reduce-data-loss-system-indices-8x
Comment on lines +306 to +316
// Retry the migration
client().execute(PostFeatureUpgradeAction.INSTANCE, new PostFeatureUpgradeRequest(TEST_REQUEST_TIMEOUT)).get();

// Ensure that the migration is successful after the alias request is unblocked
assertBusy(() -> {
GetFeatureUpgradeStatusResponse statusResp = client().execute(
GetFeatureUpgradeStatusAction.INSTANCE,
new GetFeatureUpgradeStatusRequest(TEST_REQUEST_TIMEOUT)
).get();
logger.info(Strings.toString(statusResp));
assertThat(statusResp.getUpgradeStatus(), equalTo(GetFeatureUpgradeStatusResponse.UpgradeStatus.NO_MIGRATION_NEEDED));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing. When I set breakpoints, the taskState in org.elasticsearch.upgrades.SystemIndexMigrator#cleanUpPreviousMigration is null, which prevents the previous "new" index from being cleaned up. I then get an exception upon trying to create an index that already exists.

@gwbrown Do you know why this might be happening? Thanks!

@JVerwolf
Copy link
Contributor Author

The new test I added breaks the original code (without my changes) as well as my PR. It seems the task state is not being restored in the subsequent migration runs, which prevents the new index from being cleaned-up. @rjernst and I spent a while debugging this, but weren't able to locate the cause as of yet. I'll disable the test for now, and revisit it in a future PR.

Copy link
Contributor

@alexey-ivanov-es alexey-ivanov-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JVerwolf JVerwolf merged commit a3919b0 into elastic:8.x Jan 28, 2025
15 checks passed
JVerwolf added a commit that referenced this pull request Jan 29, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jan 29, 2025
Reverts #120566

The original PR is causing the following exception to be thrown when
security is enabled:

```
system-indices-testing-es01-1    | org.elasticsearch.ElasticsearchSecurityException: action [indices:admin/block/add] is unauthorized for user [_system] with effective roles [_system], this action is granted by the index privileges [manage,all]
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport >bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v8.18.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants