
Conversation

@joshua-adams-1 (Contributor) commented Oct 27, 2025

Replaces the existing ChecksumBlobStoreFormat INDEX_METADATA_FORMAT inside BlobStoreRepository with a new INDEX_SHARD_COUNT_FORMAT. This removes the need to write an entire IndexMetadata object to, and read it back from, heap memory. Instead, we write to and read from heap only the shard count for an index, reducing the total heap memory needed for snapshotting and the likelihood that a node will go OOMe.

Relates to #131822
Relates to ES-12539
Closes ES-12538

@joshua-adams-1 self-assigned this Oct 27, 2025
@joshua-adams-1 added the >bug and :Distributed Coordination/Snapshot/Restore labels Oct 27, 2025
@joshua-adams-1 changed the title from "New index metadata format" to "[WIP] Introduce INDEX_SHARD_COUNT_FORMAT" Oct 27, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @joshua-adams-1, I've created a changelog YAML for you.

@joshua-adams-1 (Contributor, Author) commented:

This is a WIP PR experimenting with an alternative solution to #134441 and #136952. In this PR, instead of reading and writing the entire IndexMetadata object on the heap, I read and write only the shard count. This will hopefully remove the need to limit concurrency for smaller nodes during snapshotting, as they are loading less into memory in parallel.

As a note, I have left the existing INDEX_METADATA_FORMAT since this is in use elsewhere in the codebase, and can be deprecated afterwards.

As a second note, if this is approved, I would push this change and then merge it into #134441, so that we are still only loading the shard count once per index. I would then close #136952, as there are no concurrency concerns.

@DaveCTurner (Contributor) left a comment:

> As a note, I have left the existing INDEX_METADATA_FORMAT since this is in use elsewhere in the codebase, and can be deprecated afterwards.

I don't think we can deprecate this, no, it's important to be able to read the entire index metadata sometimes (e.g. when restoring an index from snapshot).

* Name prefix for the blobs that hold the global {@link IndexShardCount} for each index
* @see #SHARD_COUNT_NAME_FORMAT
*/
public static final String SHARD_COUNT_PREFIX = "shard-count-";
Contributor:

We can't really introduce a new blob here (it won't exist in any older repositories) but nor is there a need to do so. Instead, let's just read the existing meta-${UUID}.dat blob with a different reader that ignores everything except the shard count.
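
For illustration, such a reader might look roughly like this - a minimal sketch only, not the PR's actual code, and it assumes IndexMetadata's XContent layout (a top-level index-name key wrapping an object that contains "settings"):

    // Sketch: pull only the shard count out of a meta-${UUID}.dat blob,
    // skipping every other field instead of materialising the whole IndexMetadata.
    public static int readShardCountOnly(XContentParser parser) throws IOException {
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.nextToken(), parser);
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.FIELD_NAME, parser.nextToken(), parser); // index name
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.nextToken(), parser);
        int shardCount = -1; // placeholder if the setting is absent
        while (parser.nextToken() != XContentParser.Token.END_OBJECT) {
            String fieldName = parser.currentName();
            parser.nextToken(); // advance to the field's value
            if ("settings".equals(fieldName)) {
                Settings settings = Settings.fromXContent(parser);
                shardCount = settings.getAsInt(IndexMetadata.SETTING_NUMBER_OF_SHARDS, -1);
            } else {
                parser.skipChildren(); // no-op for scalar values
            }
        }
        return shardCount;
    }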

@joshua-adams-1 marked this pull request as ready for review October 28, 2025 18:54
@elasticsearchmachine added the Team:Distributed Coordination label Oct 28, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner (Contributor) left a comment:

Would it be possible to isolate the changes within o.e.repositories.blobstore rather than also adding these things to IndexMetadata? This change is very blobstore-specific; I'd rather it wasn't so visible to non-blobstore users of IndexMetadata.

public static final ChecksumBlobStoreFormat<IndexMetadata.IndexShardCount> INDEX_SHARD_COUNT_FORMAT = new ChecksumBlobStoreFormat<>(
"index-metadata",
METADATA_NAME_FORMAT,
(repoName, parser) -> IndexMetadata.IndexShardCount.fromLegacyIndexMetaData(parser),
Contributor:

Interesting, TIL, I didn't know we had this notion of a "fallback" parser. Apparently we added it in #78244 to add back support for repositories dating back to 5.0. Any idea what's changed? I'm not sure we need this just to get hold of the settings.

Contributor (Author):

It was originally to preserve compatibility with the existing INDEX_METADATA_FORMAT. Since I was still technically parsing the IndexMetadata (even if just for one field), I assumed I would also need a legacy and a non-legacy parsing method. However, since I am only parsing the shard count and ignoring everything else, there is no difference between versions and this can be removed. Good catch!

Comment on lines 3306 to 3308
if (legacy) {
checkSettingIndexVersionCompatibility(settings);
}
Contributor:

Hmm I don't think it makes sense to reject older index metadata like this - we're only trying to get the shard count, we don't need to be strict.

Contributor (Author):

As per my response to this comment, I can remove this.

Comment on lines 1916 to 1917
// Write the index metadata and the shard count to memory for each index in the snapshot, so that it persists
// even if the repository is deleted
Contributor:

"write to memory" is kinda odd phrasing - writes are supposed to be persistent here. Maybe this is leftover from an earlier change?

Comment on lines 3255 to 3258
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.field(KEY_SHARD_COUNT, shardCount);
return builder;
}
Contributor:

I don't think we should have a toXContent here; we're only using this to read the shard count, not to write it out again.

Contributor (Author):

I originally defined the INDEX_SHARD_COUNT_FORMAT variable as:

public static final ChecksumBlobStoreFormat<Integer> INDEX_SHARD_COUNT_FORMAT =

However, attempting to create a ChecksumBlobStoreFormat<T> where T does not implement the ToXContent interface results in the below error:

no instance(s) of type variable(s) exist so that T conforms to ToXContent

To fix this, I needed to define a separate class IndexShardCount, and ensure it implemented the ToXContent interface (hence this method's inclusion).

If you know a workaround for this then I am happy to change!

Contributor:

Hmm this seems to work for me:

    public static final ChecksumBlobStoreFormat<Integer> SHARD_COUNT_FORMAT = new ChecksumBlobStoreFormat<>(
        "shard-count",
        METADATA_NAME_FORMAT,
        (repoName, parser) -> Integer.valueOf(0) /* TODO */,
        (ignored) -> {
            assert false;
            throw new UnsupportedOperationException();
        }
    );

I mean I'd prefer to have a separate class (record?) for IndexShardCount rather than just having a bare Integer but I don't think this class needs to support ToXContent unless I'm missing something.
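
Such a record could be as small as this (sketch; the UNKNOWN placeholder is illustrative):

    // A bare value type: no Builder, and no ToXContent needed when the
    // format is only ever used for reading.
    public record IndexShardCount(int count) {
        public static final IndexShardCount UNKNOWN = new IndexShardCount(-1);
    }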

Comment on lines 3264 to 3265
public static class Builder {
private int count;
Contributor:

This also seems kinda excessive - we don't need a builder for a class with just one int field.

Contributor (Author):

Happy to remove!

assertEquals(numberOfShards, count.getCount());
}

public void testFromIndexMetaDataWithoutNumberOfShardsSettingReturnsNegativeOne() throws IOException {
Contributor:

For these legacy versions, I'd rather we had an actual copy of the index metadata blob written by the older version and verified that we read it as expected. Otherwise if we make a change to IndexMetadata serialization and a parallel change to the IndexShardCount deserialization then we may not notice a problem with deserializing actual blobs until it's too late. I mean it's ok to synthesize the older versions like this too, but using a real copy of some ancient metadata would be a valuable addition.

@joshua-adams-1 (Contributor, Author) commented Oct 29, 2025:

Do you have any advice on how to do this? I've looked into writing a QA test similar to repository-multi-version here, but it seems quite convoluted to write a build.gradle file to get an old version of Elasticsearch running just to serialise and then deserialise a blob. I imagine it is a common problem to want to run different versions of Elasticsearch, and that this is abstracted into a test base class somewhere.

If not, I can extend the above test to create a snapshot in an older version and then delete it in a newer version - this will then invoke the snapshot deletion logic that will eventually call INDEX_SHARD_COUNT_FORMAT.read(... and the custom parsing logic.

Contributor:

Ah that's a good point, we could add something to org.elasticsearch.upgrades.MultiVersionRepositoryAccessIT to check that we can load the shard count from the blob actually written by the old cluster. We almost cover this already since e.g. testUpgradeMovesRepoToNewMetaVersion uses the current version to delete snapshots created by a much older version, except that I think we delete all the snapshots for the old index in one go so we never actually look at its shard count.

However those tests only go back as far as 8.0.0 AFAICT. That's actually a testing gap, I think we need to be able to deal with snapshot repositories going back as far as 5.0.0 (via the old-lucene-versions plugin for pre-8.0.0 indices).

Comment on lines 3242 to 3244
public static class IndexShardCount implements ToXContentFragment {
private static final String KEY_SHARD_COUNT = "shard_count";
private final int shardCount;
Member:

I'd prefer all these changes to live in the blobstore package instead of here. It is kinda confusing for people who do not care about how snapshots work.

Contributor (Author):

I agree! Will move them

Comment on lines 3304 to 3305
if (currentFieldName.equals(KEY_SETTINGS)) {
Settings settings = Settings.fromXContent(parser);
@ywangd (Member) commented Oct 29, 2025:

Since we still need to read the settings, I wonder whether it is necessary to have this as a separate format. The following "inclusive filtering parser" seems to work for me

try (
    XContentParser parser = createParser(
        XContentParserConfiguration.EMPTY.withDeprecationHandler(LoggingDeprecationHandler.INSTANCE)
            .withFiltering(Set.of("*.settings", "*.mapping_version", "*.settings_version"), null, false),
        JsonXContent.jsonXContent,
        BytesReference.bytes(builder)
    )
) {
    final var indexMetadata = IndexMetadata.fromXContent(parser);
}

It builds an IndexMetadata with the minimal amount of required information, which is not much besides settings.

@joshua-adams-1 (Contributor, Author) commented Oct 29, 2025:

This looks promising - let me have a go locally and see what I can do! As a note, if it is still loading some IndexMetadata object of size N, where at this time I don't know the value of N, we may still need to implement concurrency limits if having 10 threads loading this in parallel would cause a small node to OOMe.

Contributor:

I worry that trying to construct an almost-empty IndexMetadata instance will break some invariant somewhere (or make it hard to add invariants in future). I think I prefer a separate reader, plus testing to verify that the separate reader can read all possible formats of IndexMetadata.

Contributor (Author):

I have a WIP PR #137335 if this matches your expectations. If not, I am also working on a new revision of this PR with the above requested changes.

Member:

> if it is still loading some IndexMetadata object with N size, since at this time I don't know the value of N, we may still need to implement concurrency if having 10 threads loading this in parallel would cause a small node to OOMe.

It should just load what is specified in the config, which is settings and a few longs. Since the new Format reads settings as well, I think the memory footprint should be quite close.

> make it hard to add invariants in future

This is true. I was actually a little surprised at how few invariants we have today for the almost-empty IndexMetadata.

> I have a WIP PR #137335 if this matches your expectations

Thanks for testing it out. I personally quite like its simplicity. But I am ok with a new reader as well. Also the latest trimmed down version looks simpler.

Contributor:

I think I'd prefer a new ChecksumBlobStoreFormat rather than passing the filter directly to INDEX_METADATA_FORMAT. However, it would be nice for the new format to use a filtering XContentParserConfiguration somehow, because I believe this can skip over all the filtered-out junk without allocating anything. Possibly we could achieve something similar with skipChildren(), but the filter seems clearer IMO.

Contributor (Author):

The ChecksumBlobStoreFormat deserialization is here. The parser is defined here as:

XContentParser parser = XContentHelper.createParserNotCompressed(
    XContentParserConfiguration.EMPTY.withRegistry(namedXContentRegistry)
        .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE),
    bytesReference,
    XContentType.SMILE
)

There is currently no way to configure this. I extended this method in #137335 to have an XContentParserConfiguration passed in as a parameter, but I think it would be redundant to integrate that change into this PR (as opposed to just having it by itself anyway). If I were defining the parser using an XContentParserConfiguration to skip over all the fields other than the shard count, would the reader not just become a wrapper around invoking the parser and returning the result?
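
For concreteness, the filtered variant from #137335 amounts to something like this (sketch; the include pattern is illustrative, and ChecksumBlobStoreFormat does not expose this today):

    // The same parser creation, but with an inclusive filter so that
    // everything outside the settings subtree is skipped without allocation.
    XContentParser parser = XContentHelper.createParserNotCompressed(
        XContentParserConfiguration.EMPTY.withRegistry(namedXContentRegistry)
            .withDeprecationHandler(LoggingDeprecationHandler.INSTANCE)
            .withFiltering(Set.of("*.settings"), null, false),
        bytesReference,
        XContentType.SMILE
    );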

@joshua-adams-1 changed the title from "[WIP] Introduce INDEX_SHARD_COUNT_FORMAT" to "Introduce INDEX_SHARD_COUNT_FORMAT" Oct 29, 2025
Comment on lines 57 to 58
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
}
Member:

It should be performant to call parser.skipChildren() and avoid the while loops.
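
That is, roughly (sketch):

    // Before: consume the unwanted object token by token.
    while (parser.nextToken() != XContentParser.Token.END_OBJECT) {
    }

    // After: positioned on the START_OBJECT token, skip its entire subtree in one call.
    parser.skipChildren();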

@joshua-adams-1 (Contributor, Author) commented Oct 30, 2025:

As an update to reviewers, I have:

  1. Introduced a new format INDEX_SHARD_COUNT_FORMAT and have closed Pass XContentParserConfiguration into ChecksumBlobStoreFormat #137335
  2. Written unit tests. I then wrote a test testUpgradeMovesRepoToNewMetaVersion that explicitly tests whether snapshots created in an old version can be deleted in a newer version. However, this only tests from version 8 onwards, and I can't easily see how to extend it to test from 5.x onwards (I'm struggling to get the old-lucene-versions plugin working). Since 1) changing MultiVersionRepositoryAccessIT to run from 5.x -> 9.x may affect the success rate of the existing test suite, and 2) this change blocks Skip unnecessary loading of IndexMetadata during snapshot deletion #134441, I will extend the ES test versions in a follow-up PR with its own review cycle and, importantly, its own release into the Elasticsearch build pipeline (since I'm changing BWC test infrastructure, there may be a large blast radius from such a change). Does that make sense to do?

@DaveCTurner (Contributor) left a comment:

I like it, much nicer than inventing some arbitrary concurrency limits. I left some comments.

* @return Returns an {@link IndexShardCount} containing the shard count for the index
* @throws IOException Thrown if the {@link IndexMetadata} object cannot be parsed correctly
*/
public static IndexShardCount fromIndexMetaData(XContentParser parser) throws IOException {
Contributor:

nit: metadata is one word

Suggested change
public static IndexShardCount fromIndexMetaData(XContentParser parser) throws IOException {
public static IndexShardCount fromIndexMetadata(XContentParser parser) throws IOException {

Comment on lines 36 to 41
if (parser.currentToken() == null) { // fresh parser? move to the first token
parser.nextToken();
}
if (parser.currentToken() == XContentParser.Token.START_OBJECT) { // on a start object move to next token
parser.nextToken();
}
Contributor:

I think this leniency is needed in IndexMetadata because it gets parsed from several different sources (at least, it probably used to, maybe not any more). Here we know we're reading from the blob in the snapshot repository so I think we can tighten this up.

Contributor (Author):

I ran BlobStoreRepositoryDeleteThrottlingTests in debug mode and confirmed that both of these code paths are executed during the runtime of that test, and removing them fails the test.

Contributor:

Sure but do we ever not run these branches?

Comment on lines 56 to 57
// Iterate through the object, but we don't care for it's contents
parser.skipChildren();
Contributor:

nit: I don't think we need a comment describing what skipChildren() does

Comment on lines 53 to 54
Settings settings = Settings.fromXContent(parser);
indexShardCount = new IndexShardCount(settings.getAsInt(SETTING_NUMBER_OF_SHARDS, -1));
Contributor:

Could we push the skipping behaviour down even further here? There's no need to load a whole Settings. It's not as big as the mappings but can still be sizeable (e.g. index.query.default_field could list thousands of field names).
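
One possible shape for that - a sketch only, and it assumes the settings are serialised as flat "index.number_of_shards"-style keys, which would need verifying against the actual blob format:

    // Scan the settings object directly for the one key we care about,
    // without ever building a Settings instance.
    private static int readShardCountFromSettings(XContentParser parser) throws IOException {
        int shardCount = -1;
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.currentToken(), parser);
        while (parser.nextToken() != XContentParser.Token.END_OBJECT) {
            String key = parser.currentName();
            parser.nextToken();
            if ("index.number_of_shards".equals(key)) {
                shardCount = Integer.parseInt(parser.text());
            } else {
                parser.skipChildren(); // no-op for scalar values
            }
        }
        return shardCount;
    }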

// longer exist) and may contain dangling blobs too. A subsequent delete that hits this index may repair the state if
// the metadata read error is transient, but if not then the stale indices cleanup will eventually remove this index
// and all its extra data anyway.
// shard count from a subsequent indexMetaGeneration, or we might just not process these shards. If we skip these shards
Contributor:

I think I preferred "another metadata blob" here. "Subsequent" implies some kind of ordering but we're loading all this stuff in parallel, and might instead have got the shard count from an earlier blob. Also indexMetaGeneration is really the name of the blob rather than the blob itself, so metadata blob is more accurate.

XContentParser.Token token = parser.nextToken();
XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, token, parser);

IndexShardCount indexShardCount = new IndexShardCount(-1);
Contributor:

Do we need a new dummy object here? We could use null to avoid the allocation. Or maybe we can return from this whole method early as soon as we've read the shard count? No need to consume this parser all the way to the end IIRC.

Contributor (Author):

Unfortunately we do need to consume the parser to the end in order to satisfy this constraint, here:

result = reader.apply(projectRepo, parser);
XContentParserUtils.ensureExpectedToken(null, parser.nextToken(), parser);

My original solution returned early as an optimisation and threw an error for this reason.

However, I can initialise the variable to null if you believe that to be better. By setting it to -1, I was protecting us from returning null in the case of a malformed IndexMetadata object. If that now occurs, this code here:

try {
    updateShardCount(
        INDEX_SHARD_COUNT_FORMAT.read(getProjectRepo(), indexContainer, indexMetaGeneration, namedXContentRegistry).count()
    );
} catch (Exception ex) {
    logger.warn(() -> format("[%s] [%s] failed to read shard count for index", indexMetaGeneration, indexId.getName()), ex);
    // Definitely indicates something fairly badly wrong with the repo, but not immediately fatal here: we might get the
    // shard count from another metadata blob, or we might just not process these shards. If we skip these shards
    // then the repository will technically enter an invalid state (these shards' index-XXX blobs will refer to snapshots
    // that no longer exist) and may contain dangling blobs too. A subsequent delete that hits this index may repair
    // the state if the metadata read error is transient, but if not then the stale indices cleanup will eventually
    // remove this index and all its extra data anyway.
    // TODO: Should we fail the delete here? See https://github.com/elastic/elasticsearch/issues/100569.
}

will throw an NPE, since null.count() isn't possible. The error will be caught and ignored, and we will move on to the next blob. I am happy with this behaviour, just wanted it documented.

Contributor:

I see ok thanks for checking. I'd rather we convert a null into a non-null placeholder value at the end of the method (since it should basically never occur absent corruption)
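
i.e. something like (sketch):

    // Track null while parsing, then substitute the placeholder exactly once at the end.
    return indexShardCount != null ? indexShardCount : new IndexShardCount(-1);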

// Delete a bulk number of snapshots, avoiding the case where we delete all snapshots since this invokes
// cleanup code and bulk snapshot deletion logic which is tested in testUpgradeMovesRepoToNewMetaVersion
final List<String> snapshotsToDeleteInBulk = randomSubsetOf(randomIntBetween(1, snapshotNames.size() - 1), snapshotNames);
deleteSnapshot(repoName, snapshotsToDeleteInBulk.toArray(String[]::new));
Contributor:

Seems a bit convoluted to convert the list to an array only to then convert it again into a string with comma separators. Maybe make an overload that takes a List<String> and uses org.elasticsearch.common.Strings#collectionToCommaDelimitedString to go directly from list to string.
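
A sketch of such an overload (hypothetical; it assumes the test's existing REST-client helper style):

    // Go straight from List<String> to the comma-delimited path parameter.
    private void deleteSnapshot(String repoName, List<String> snapshotNames) throws IOException {
        final Request request = new Request(
            "DELETE",
            "/_snapshot/" + repoName + "/" + Strings.collectionToCommaDelimitedString(snapshotNames)
        );
        assertOK(client().performRequest(request));
    }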
