Conversation

Contributor

@mhl-b mhl-b commented Oct 3, 2024

Introduce a new S3 repository setting, `max_multipart_parts`. Currently, with default settings, to upload large files to S3 we split them into 5TB chunks and then split each chunk into parts of up to 100MB. That means a single chunk may require up to 50k parts, while the S3 limit is 10k, so attempts to upload files larger than 1TB fail with default settings.

With the new setting we split files by the smaller of `chunk_size` and `buffer_size x max_multipart_parts`. With the default values of `buffer_size=100MB` and `max_multipart_parts=10_000`, the maximum "actual" chunk size becomes ~1TB rather than 5TB.
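
For illustration, a minimal sketch of that computation in plain Java, assuming the default values above; the variable names are hypothetical and not taken from the implementation:

    public class EffectiveChunkSizeDemo {
        public static void main(String[] args) {
            // Defaults described above; names are illustrative only.
            long bufferSizeBytes = 100L * 1024 * 1024;             // buffer_size = 100MB
            int maxMultipartParts = 10_000;                        // max_multipart_parts = 10,000
            long chunkSizeBytes = 5L * 1024 * 1024 * 1024 * 1024;  // chunk_size = 5TB (S3 object size limit)

            // Effective chunk size: the smaller of chunk_size and buffer_size * max_multipart_parts.
            long effectiveChunkSize = Math.min(chunkSizeBytes, bufferSizeBytes * maxMultipartParts);

            // Prints 1048576000000 (~1TB), which keeps each chunk within S3's 10,000-part multipart limit.
            System.out.println(effectiveChunkSize);
        }
    }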

This PR is a successor to #112708

@mhl-b mhl-b added the >enhancement, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), and v9.0.0 labels Oct 3, 2024
Contributor

github-actions bot commented Oct 3, 2024

Documentation preview:

@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

@mhl-b mhl-b marked this pull request as ready for review October 3, 2024 01:36
@mhl-b mhl-b requested review from DaveCTurner and ywangd October 3, 2024 01:36
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mhl-b mhl-b requested review from nicktindall and removed request for ywangd October 3, 2024 01:38
Contributor

@nicktindall nicktindall left a comment

LGTM

Contributor

@DaveCTurner DaveCTurner left a comment

I'll take a deeper look later but I'd definitely like a different name for this setting. Could we call it max_multipart_parts?

@mhl-b mhl-b changed the title from "Add parts_number setting to S3 repository" to "Add max_multipart_parts setting to S3 repository" Oct 3, 2024
@mhl-b mhl-b requested a review from DaveCTurner October 3, 2024 18:07
Contributor

@DaveCTurner DaveCTurner left a comment

LGTM except some suggestions for the docs and a naming nit. We can iterate on the docs separately if you'd prefer.

Comment on lines 264 to 267
(<<byte-units,byte value>>) Big files can be broken down into chunks during snapshotting if needed.
Specify the chunk size as a value and unit, for example:
When large file split into chunks, the chunk size will be defined by smallest of `chunk_size`
or `buffer_size * max_multipart_parts`. Specify the chunk size as a value and unit, for example:
`1TB`, `1GB`, `10MB`. Defaults to the maximum size of a blob in the S3 which is `5TB`.
Contributor

I never really liked the wording here to start with :) How about something like this?

    (<<byte-units,byte value>>) The maximum size of object that {es} will write to the repository when creating a snapshot. Files which are
    larger than `chunk_size` will be chunked into several smaller objects. {es} may also split a file across multiple objects to satisfy
    other constraints such as the `max_multipart_parts` limit. Defaults to `5TB` which is the
    https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html[maximum size of an object in AWS S3].

Contributor Author

@mhl-b mhl-b Oct 3, 2024

++ on doc changes, thank you. Will include in this PR

Comment on lines 298 to 302
(<<number,numeric>>) Maximum number of parts for multipart upload. When large file split into
chunks, the chunk size will be defined by smallest of `chunk_size` or `buffer_size * max_multipart_parts`.
Default value is 10,000, also see https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html[S3 multipart upload limits].
For example, with `buffer_size=100MB` and `max_multipart_parts=10,000` summation of all parts is about 1TB.
If chunk_size is set to 5TB then smallest between two would be 1TB.
Contributor

I'd be inclined not to dwell too much on the calculations here, how about something like this?

Suggested change
(<<number,numeric>>) Maximum number of parts for multipart upload. When large file split into
chunks, the chunk size will be defined by smallest of `chunk_size` or `buffer_size * max_multipart_parts`.
Default value is 10,000, also see https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html[S3 multipart upload limits].
For example, with `buffer_size=100MB` and `max_multipart_parts=10,000` summation of all parts is about 1TB.
If chunk_size is set to 5TB then smallest between two would be 1TB.
(<<number,integer>>) The maximum number of parts that {es} will write during a multipart upload of a single object. Files which are
larger than `buffer_size × max_multipart_parts` will be chunked into several smaller objects. {es} may also split a file across multiple
objects to satisfy other constraints such as the `chunk_size` limit. Defaults to `10000` which is the
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html[maximum number of parts in a multipart upload in AWS S3].

* @param partSize part size in s3 or buffer_size
* @param partsNum number of parts(buffers)
*/
static ByteSizeValue objectSizeLimit(ByteSizeValue objectSize, ByteSizeValue partSize, int partsNum) {
Contributor

Nit: could we align the argument names with the names of the variables passed in? I know they're kinda weird names but we can't change the settings' names for legacy reasons and we're following that naming elsewhere.
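
For reference, a hedged sketch of how the helper could read with the argument names aligned to the settings; `chunkSize`, `bufferSize` and `maxPartsNum` are illustrative choices rather than the merged code, and `ByteSizeValue` is assumed to be Elasticsearch's byte-size unit class:

    /**
     * Illustrative sketch only, not the merged implementation: the effective per-object
     * size limit is the smaller of the configured chunk_size and bufferSize * maxPartsNum.
     *
     * @param chunkSize   the configured chunk_size
     * @param bufferSize  the configured buffer_size (multipart part size)
     * @param maxPartsNum the configured max_multipart_parts
     */
    static ByteSizeValue objectSizeLimit(ByteSizeValue chunkSize, ByteSizeValue bufferSize, int maxPartsNum) {
        long partsLimitBytes = bufferSize.getBytes() * maxPartsNum;
        return chunkSize.getBytes() <= partsLimitBytes ? chunkSize : ByteSizeValue.ofBytes(partsLimitBytes);
    }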

@mhl-b mhl-b merged commit fc1bee2 into elastic:main Oct 4, 2024
16 checks passed
@DaveCTurner
Contributor

As discussed elsewhere, it'd be good to backport this to 8.x too. I'm adding the appropriate labels.

mhl-b added a commit to mhl-b/elasticsearch that referenced this pull request Oct 4, 2024