
Conversation

@DaveCTurner
Contributor

As of #133030 Elasticsearch uses S3's support for conditional writes to
protect against repository corruption. It is possible that some other
storage system may claim to be fully S3-compatible despite rejecting
such requests.

This commit adds a repository setting that allows making all writes
unconditional, unsafely disabling the corruption protection but
allowing users some time to work with their storage supplier to address
the incompatibility.
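For illustration, a sketch of how the escape hatch might be applied when registering an S3 repository. The repository and bucket names are placeholders, and supplying the setting in the `settings` block of the registration request is an assumption based on how other S3 repository settings work:

```
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "unsafely_incompatible_with_s3_conditional_writes": true
  }
}
```

With the flag set to `true`, Elasticsearch would stop sending conditional-write preconditions to the storage system, trading away the corruption protection until the storage supplier fixes the incompatibility.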

@DaveCTurner DaveCTurner requested a review from ywangd October 27, 2025 11:20
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.3.0 labels Oct 27, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 27, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Comment on lines +636 to +637
// REPOSITORY_ANALYSIS is a strict check for 100% S3 compatibility, including conditional write support
return supportsConditionalWrites || purpose == OperationPurpose.REPOSITORY_ANALYSIS;
Contributor Author

NB no tests for the REPOSITORY_ANALYSIS condition here; they'll be added when we resolve #134632.

Member

Does this mean repo analysis will always fail for storage services that do not support conditional writes? I assume that's intentional even if this means existing compatible ones could start failing? IIUC, this work has something to do with AWS's MPU guarantee. Do we consider a storage system to pass the analysis if it supports the MPU guarantees but not conditional writes?

Contributor Author

Yes that's right, this is covered in the docs:

NOTE: Different versions of Elasticsearch may perform different checks for repository compatibility, with newer versions typically being stricter than older ones. A storage system that passes repository analysis with one version of Elasticsearch may fail with a different version. This indicates it behaves incorrectly in ways that the former version did not detect.

This doesn't yet matter for the non-linearizable MPUs but when we move to using conditional writes for CAS operations we will only do so if the storage supports conditional writes. If the user indicates that their storage doesn't support conditional writes yet then we'll fall back to today's MPU-based CAS implementation.

);

static final Setting<Boolean> UNSAFELY_INCOMPATIBLE_WITH_S3_WRITES = Setting.boolSetting(
"unsafely_incompatible_with_s3_conditional_writes",
@mhl-b (Contributor) Oct 27, 2025

Do you have other names for the setting in mind? I assume with conditional ETag writes there would be another parameter, since it's a different feature from the existence check, and an S3-like storage might support only one of them. Maybe supports_s3_write_existence_check, and later supports_s3_write_etag_check?

Contributor

They could be prefixed with unsafe_.

Contributor Author

I wasn't planning on distinguishing conditional-on-object-exists from conditional-on-etag writes. Do you see a need to do that? My thinking is that "conditional writes" covers both cases, and although these things were announced by AWS as separate features they're now documented together.
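For reference (not code from this PR), the two kinds of conditional write being discussed look roughly like this at the HTTP level; S3 rejects a plain PutObject that fails either precondition with 412 Precondition Failed:

```http
# Conditional on non-existence: rejected if an object already exists at this key
PUT /bucket/index-5 HTTP/1.1
If-None-Match: *

# Conditional on ETag: rejected unless the current object's ETag matches
# ("<current-etag>" is a placeholder)
PUT /bucket/index-5 HTTP/1.1
If-Match: "<current-etag>"
```

Both forms were released by AWS at different times but are now documented together as S3 conditional writes, which is the basis for treating them as a single compatibility switch.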

Contributor

The only argument I have is that these features came at different times and use different implementations, so S3-like repos might have only one of them.

But the feature parity is blurry, and it's unclear how much effort we should put into featuring these knobs. I think it's an S3-like repo's responsibility to provide its own S3-like client rather than using a not-fully-compatible AWS client.

Member

I think it might be worth naming this setting more precisely. If a storage does support etag but not the other, it is not obvious whether it should set this setting to true or false.

I also wonder whether it is better named more positively, e.g. supports_s3_conditional_writes, which seems more straightforward. But maybe you prefer the more negative form to also serve as a warning?

Contributor Author

If a storage does support etag but not the other, it is not obvious

I think it's clear in that case that the storage does not properly support conditional writes so the user needs to disable them until they can upgrade their storage. I don't think it means much to the user to distinguish the different kinds of conditional writes.

I also wonder whether it is better named more positively

Yeah I'd prefer that but it's also important to get the unsafe bit into the name so that users know they need to work on their storage upgrade ASAP.

@ywangd (Member) left a comment

The code changes look fine to me. My question is about how we want to roll this out.

Once it is released, a 3rd party storage that previously passed the repo analysis check can start failing. I assume customers do not really run repo analysis as a regular check on upgrade? So they may run with the code assuming full compatibility (the default) for a while before it is caught by running repo analysis again. What would be our recommendation at this point? Since a storage server upgrade might not be feasible, do we suggest configuring the setting to true and ignoring the rather severe warning? It also brings up the question of whether their existing data is already corrupted because the system has been running with a false assumption?


@DaveCTurner
Contributor Author

Once it is released, a 3rd party storage that previously passed the repo analysis check can start failing. I assume customers do not really run repo analysis as a regular check on upgrade?

Generally that's what we'd recommend. See e.g. these docs which say to verify snapshot repositories with repo analysis as part of the test-cluster upgrade.

So they may run with the code assuming full compatibility (the default) for a while before it is caught by running repo analysis again. What would be our recommendation at this point? Since a storage server upgrade might not be feasible, do we suggest configuring the setting to true and ignoring the rather severe warning? It also brings up the question of whether their existing data is already corrupted because the system has been running with a false assumption?

If after an upgrade they don't notice that their storage doesn't support conditional writes then it must be ignoring the If-None-Match header we added in #133030, in which case it doesn't really matter if we continue to send it or not, no action is needed. My immediate concern is more that some storage might actively reject these requests because of the new If-None-Match header, and we need a way to turn that header off temporarily.

@DaveCTurner DaveCTurner requested review from mhl-b and ywangd October 28, 2025 14:37
@ywangd (Member) left a comment

Overall I am quite satisfied since running repository analysis is in fact a suggested step for upgrade.

If after an upgrade they don't notice that their storage doesn't support conditional writes then it must be ignoring the If-None-Match header we added in #133030, in which case it doesn't really matter if we continue to send it or not, no action is needed.

My question is about the case where they run repo analysis after the upgrade and encounter the failure. What would be our suggestion? I see 3 options:

  1. Ignore the repo analysis failure
  2. Set unsafely_incompatible_with_s3_conditional_writes to true, which sounds rather uncomfortable on paper due to the name and associated warning.
  3. Ask their storage provider for feature parity

I guess in practice we should take option 2. In that case, I wonder whether the wording should be a bit softer since "may lead to repository corruption" sounds quite concerning. Or maybe my understanding is off and repo analysis does not fail if the storage server ignores the header?

@DaveCTurner
Contributor Author

I see 3 options:

We'd generally expect users to do both option 2 (short-term workaround) and option 3 (long-term fix).

I wonder whether the wording should be a bit softer since "may lead to repository corruption" sounds quite concerning.

It's true though: e.g. if you accidentally set up two clusters to write to the repository at the same time then, without overwrite protection, they might both write the same index-${N} root blob and then each start deleting blobs they believe to be unreferenced, and some of those deletes will be incorrect.
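The race described above can be sketched with a toy in-memory "bucket". The names here are invented for illustration and are not Elasticsearch code: an unconditional PUT lets the second cluster silently overwrite the first cluster's root blob, whereas a conditional PUT (modelling the `If-None-Match: *` behaviour) makes the second writer fail, much as S3 would answer 412 Precondition Failed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of two clusters racing to write the same index-N root blob.
public class RootBlobRace {
    private final ConcurrentMap<String, String> bucket = new ConcurrentHashMap<>();

    // Unconditional PUT: last writer wins, the first cluster's root blob is silently lost.
    public void putUnconditional(String key, String body) {
        bucket.put(key, body);
    }

    // Conditional PUT, modelling "If-None-Match: *": returns false
    // (S3 would answer 412 Precondition Failed) if the blob already exists.
    public boolean putIfNoneMatch(String key, String body) {
        return bucket.putIfAbsent(key, body) == null;
    }

    public String get(String key) {
        return bucket.get(key);
    }

    public static void main(String[] args) {
        RootBlobRace repo = new RootBlobRace();

        // With conditional writes, only one cluster succeeds in creating index-5,
        // so the race is caught before any incorrect deletes can happen.
        boolean clusterA = repo.putIfNoneMatch("index-5", "cluster-A state");
        boolean clusterB = repo.putIfNoneMatch("index-5", "cluster-B state");
        System.out.println(clusterA + " " + clusterB + " " + repo.get("index-5"));
        // prints: true false cluster-A state

        // Without overwrite protection, cluster B clobbers cluster A's root blob
        // and neither cluster notices.
        repo.putUnconditional("index-5", "cluster-B state");
        System.out.println(repo.get("index-5"));
        // prints: cluster-B state
    }
}
```

The second writer's failed conditional PUT is exactly the "benign error message" mentioned later in the thread: it surfaces the misconfiguration before any blobs are deleted.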

@DaveCTurner
Contributor Author

Or maybe my understanding is off and repo analysis does not fail if the storage server ignores the header?

Actually this is currently true: we do not verify that failIfAlreadyExists actually fails on objects that already exist. But we'd like to address that too; see #134632.

@DaveCTurner DaveCTurner requested a review from ywangd October 29, 2025 07:20
@ywangd (Member) left a comment

LGTM

Thanks for explaining.

if you accidentally set up two clusters to write to the repository at the same time

We don't support such a setup, right? Conditional writes will help us better protect against it, but having such a setup at all is already on the disaster path.

@DaveCTurner
Contributor Author

We don't support such a setup, right? Conditional writes will help us better protect against it, but having such a setup at all is already on the disaster path.

That's correct, this isn't expected to work, but it seems to be an easy mistake to make, and we'd much rather have it fail with a benign error message before it harms the repository contents :)

@DaveCTurner DaveCTurner merged commit dc8afd2 into elastic:main Oct 29, 2025
34 checks passed
@DaveCTurner DaveCTurner deleted the 2025/10/27/s3-conditional-writes-option branch October 29, 2025 07:59
DaveCTurner added a commit to elastic/docs-content that referenced this pull request Oct 29, 2025
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 21, 2025
In elastic#137185 we introduced a new S3 repository setting called
`unsafely_incompatible_with_s3_conditional_writes` but named the code
symbol `UNSAFELY_INCOMPATIBLE_WITH_S3_WRITES` omitting the "conditional"
bit. That makes the dependent code confusing to read, so this commit
fixes the symbol name.
DaveCTurner added a commit that referenced this pull request Nov 21, 2025
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
…ic#138406)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 27, 2025

Backport of elastic#137185 and elastic#138406 to 9.2
DaveCTurner added a commit that referenced this pull request Nov 28, 2025

Backport of #137185 and #138406 to 9.2