Introduce a bulk format that uses a prefix length #135506

iverase · 2025-09-26T07:30:16Z

The bulk API currently uses a suffix marker to signal the end of a document. This way of delimiting documents has two drawbacks:

It requires parsing every byte of a document in order to find the marker.
It is only properly supported for the JSON content type. In theory it can be used for the SMILE format too, but it breaks because the document itself might contain the marker.

This change proposes an alternative format for the bulk API in which each document is preceded by a 4-byte prefix containing the length of the document in big-endian order. This makes the bulk request easier to parse, because we don't need to read every byte to find where a document ends, and it works with any content type.
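A minimal sketch of the framing described above, assuming nothing about the actual BulkRequestParser changes in this PR: each document is written as a 4-byte big-endian length followed by the document bytes, and the reader slices documents out by length without inspecting their contents. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of length-prefixed framing; not the code from this PR. */
public class LengthPrefixedFraming {

    /** Writes each document as a 4-byte big-endian length followed by its bytes. */
    public static byte[] encode(List<byte[]> documents) {
        int total = 0;
        for (byte[] doc : documents) {
            total += Integer.BYTES + doc.length;
        }
        // ByteBuffer is big-endian by default, so putInt writes the most significant byte first.
        ByteBuffer out = ByteBuffer.allocate(total);
        for (byte[] doc : documents) {
            out.putInt(doc.length);
            out.put(doc);
        }
        return out.array();
    }

    /** Splits a body into documents by reading each prefix and copying that many bytes. */
    public static List<byte[]> decode(byte[] body) {
        ByteBuffer in = ByteBuffer.wrap(body);
        List<byte[]> documents = new ArrayList<>();
        while (in.hasRemaining()) {
            if (in.remaining() < Integer.BYTES) {
                throw new IllegalArgumentException("truncated length prefix");
            }
            int length = in.getInt();
            if (length < 0 || length > in.remaining()) {
                throw new IllegalArgumentException("invalid document length: " + length);
            }
            byte[] doc = new byte[length];
            in.get(doc);
            documents.add(doc);
        }
        return documents;
    }

    public static void main(String[] args) {
        List<byte[]> docs = List.of(
            "{\"index\":{}}".getBytes(StandardCharsets.UTF_8),
            "{\"field\":\"line1\\nline2\"}".getBytes(StandardCharsets.UTF_8)
        );
        for (byte[] doc : decode(encode(docs))) {
            System.out.println(new String(doc, StandardCharsets.UTF_8));
        }
    }
}
```

Because the reader never looks for a marker inside the payload, the same loop works whether the documents are JSON, SMILE, or any other encoding.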

fixes #94319

elasticsearchmachine · 2025-09-26T07:30:59Z

Hi @iverase, I've created a changelog YAML for you.

server/src/main/java/org/elasticsearch/action/bulk/BulkRequestParser.java

elasticsearchmachine · 2025-09-29T08:45:58Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine · 2025-09-29T08:45:59Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

elasticsearchmachine · 2025-09-29T08:46:37Z

Hi @iverase, I've updated the changelog YAML for you.

benwtrent

I really like how simple this is! I am all for it :)

Some comments on the header name, and it looks like there is some leftover debugging code?

benwtrent · 2025-09-30T17:52:42Z

server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java

    public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status";
+    public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format";
+
+    private static final String BULK_FORMAT_HEADER = "Bulk-Format";


I have been debating whether this should be called X-Bulk-Format, but RFC 6648 (https://datatracker.ietf.org/doc/html/rfc6648) indicates that we shouldn't rely on the X- prefix convention.

I am fine with either.

I defer to @elastic/es-core-infra on this decision.

I changed to X-Bulk-Format but I am fine either way.

server/src/test/java/org/elasticsearch/action/bulk/BulkRequestParserTestCase.java

iverase · 2025-10-01T05:28:37Z

I want to point out that the only formats currently accepted are JSON and SMILE, as the RestController has code to reject the request otherwise:

elasticsearch/server/src/main/java/org/elasticsearch/rest/RestController.java

Line 426 in eeca493

if (handler.supportsBulkContent()

With this change we can in theory support any format, so we might need to change this check if we want to support other formats.

DaveCTurner

This is a great idea, thanks for tackling it. I have one concern, see inline comment.

DaveCTurner · 2025-10-02T09:16:38Z

server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java

    public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status";
+    public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format";
+
+    private static final String BULK_FORMAT_HEADER = "X-Bulk-Format";


I do not think we should introduce a separate header for this feature. Instead, we should let clients say that the request body is a length-prefixed bulk request using an appropriate value for the existing Content-type header. We already define a few vendor-specific content types with the application/vnd.elasticsearch+ prefix. Let's come up with a way to add some more of them.

@DaveCTurner it was your idea actually. I did start adding a new XContentType but I was unsure how to extend it. It would be great if we could do that and remove the need for a new header.

I cannot claim to have invented length prefixes 😁

I can't say exactly how best to plumb this through the XContentType framework. I suspect that yes we should add enum values for all 4 new length-prefixed-bulk types. We won't have to support rendering any values into these new types, only parsing, and indeed only parsing in the particular case of a bulk request.

Hmm but maybe we don't want this to be an XContentType at all. We can't really either parse or render this data using the regular XContent utils which all expect something very much like JSON. It seems it'd make more sense to me to treat it completely separately. Can we instead use org.elasticsearch.rest.RestRequest#getParsedContentType to access the Content-type header value directly for bulk REST requests, leaving org.elasticsearch.rest.RestRequest#xContentType as null?

I agree XContent(Type) should remain representing a single json blob.

As for a separate content type, there doesn't seem to be any standard, although I found many past ideas in forums. Since this is our own content type, it seems we can do whatever we want. My suggestion is something that makes it clear this is a set of (multiple) xcontent documents, with the particular framing given as a parameter to that content type, e.g.:

Content-Type=application/x-json-stream; format=length-prefixed
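As a rough sketch of how a server could recognise such a value, the snippet below splits a Content-Type header into its media type and parameters and checks for the suggested combination. It is a simplified parser written for illustration (it does not handle quoted parameter values), the class and method names are hypothetical, and the media type is only the one proposed in this comment, not something the PR has settled on.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, simplified Content-Type inspection; not Elasticsearch's own media type parser. */
public class ContentTypeSketch {

    /** Returns the media type without parameters, lower-cased. */
    public static String mediaType(String headerValue) {
        return headerValue.split(";")[0].trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the parameters that follow the media type, with lower-cased names. */
    public static Map<String, String> parameters(String headerValue) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = headerValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] keyValue = parts[i].split("=", 2);
            if (keyValue.length == 2) {
                params.put(keyValue[0].trim().toLowerCase(Locale.ROOT), keyValue[1].trim());
            }
        }
        return params;
    }

    /** True if the header names the proposed stream type with format=length-prefixed. */
    public static boolean isLengthPrefixed(String headerValue) {
        return "application/x-json-stream".equals(mediaType(headerValue))
            && "length-prefixed".equals(parameters(headerValue).get("format"));
    }

    public static void main(String[] args) {
        System.out.println(isLengthPrefixed("application/x-json-stream; format=length-prefixed"));
        System.out.println(isLengthPrefixed("application/json"));
    }
}
```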

Also, regarding the format itself: for JSON I think we should stick with an ASCII-encoded length, so that it's easy to write a request with a text editor. Binary formats like SMILE can have a binary length prefix, but text-based formats like JSON should stick with text IMO.
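The comment does not spell out what a text-based prefix would look like, so the following is only one possible reading, with every detail an assumption: the length is written as decimal ASCII digits terminated by a newline, followed by exactly that many bytes of document and a trailing newline for readability. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical ASCII length-prefix framing: "<decimal length>\n<document bytes>\n". */
public class AsciiLengthPrefix {

    public static byte[] encode(List<String> jsonDocuments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String json : jsonDocuments) {
            byte[] doc = json.getBytes(StandardCharsets.UTF_8);
            // The prefix counts bytes, not characters, so multi-byte UTF-8 is measured correctly.
            byte[] prefix = (doc.length + "\n").getBytes(StandardCharsets.US_ASCII);
            out.write(prefix, 0, prefix.length);
            out.write(doc, 0, doc.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    public static List<String> decode(byte[] body) {
        List<String> documents = new ArrayList<>();
        int pos = 0;
        while (pos < body.length) {
            // Read decimal digits up to the newline that ends the prefix.
            int length = 0;
            int digits = 0;
            while (pos < body.length && body[pos] != '\n') {
                int digit = body[pos] - '0';
                if (digit < 0 || digit > 9) {
                    throw new IllegalArgumentException("non-digit in length prefix at offset " + pos);
                }
                length = Math.addExact(Math.multiplyExact(length, 10), digit);
                digits++;
                pos++;
            }
            if (digits == 0 || pos >= body.length) {
                throw new IllegalArgumentException("missing or unterminated length prefix");
            }
            pos++; // skip the newline after the prefix
            if (length >= body.length - pos) {
                throw new IllegalArgumentException("truncated document");
            }
            if (body[pos + length] != '\n') {
                throw new IllegalArgumentException("missing newline after document");
            }
            documents.add(new String(body, pos, length, StandardCharsets.UTF_8));
            pos += length + 1; // skip the document and its trailing newline
        }
        return documents;
    }

    public static void main(String[] args) {
        byte[] body = encode(List.of("{\"index\":{}}", "{\"a\":1}"));
        System.out.print(new String(body, StandardCharsets.UTF_8));
        System.out.println(decode(body));
    }
}
```

The trade-off against the binary prefix is that the reader has to scan a few digits per document, but it still never scans the document body itself.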

Having Content-Type headers that do not resolve to an XContentType adds an incredible amount of complexity to how REST requests are handled. I'm not sure how to proceed. Do you have any suggestions?

This sort of thing seems to be sufficient (at least to get as far as BulkRequestParser):

diff --git a/server/src/main/java/org/elasticsearch/rest/RestController.java b/server/src/main/java/org/elasticsearch/rest/RestController.java
index 66532026fc1c..d9354b9ecf82 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestController.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestController.java
@@ -425,7 +425,8 @@ public class RestController implements HttpServerTransport.Dispatcher {
             // TODO consider refactoring to handler.supportsContentStream(xContentType). It is only used with JSON and SMILE
             if (handler.supportsBulkContent()
                 && XContentType.JSON != xContentType.canonical()
-                && XContentType.SMILE != xContentType.canonical()) {
+                && XContentType.SMILE != xContentType.canonical()
+                && request.hasLengthPrefixedStreamingContent() == false) {
                 channel.sendResponse(
                     RestResponse.createSimpleErrorResponse(
                         channel,
diff --git a/server/src/main/java/org/elasticsearch/rest/RestRequest.java b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
index 92e83fb9701a..51329ac5fbe7 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestRequest.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
@@ -332,11 +332,16 @@ public class RestRequest implements ToXContent.Params, Traceable {
     public void ensureContent() {
         if (hasContent() == false) {
             throw new ElasticsearchParseException("request body is required");
-        } else if (xContentType.get() == null) {
+        } else if (xContentType.get() == null && hasLengthPrefixedStreamingContent() == false) {
             throwValidationException("unknown content type");
         }
     }
 
+    public boolean hasLengthPrefixedStreamingContent() {
+        return parsedContentType != null
+            && "application/vnd.elasticsearch+test-type".equals(parsedContentType.mediaTypeWithoutParameters());
+    }
+
     /**
      * Returns reference to the network buffer of HTTP content or throw an exception if the body or content type is missing.
      * See {@link #content()}.
diff --git a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
index 20a9fb7ed23a..e943e4f2032f 100644
--- a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
+++ b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
@@ -94,6 +94,11 @@ public class RestBulkAction extends BaseRestHandler {
         return incrementalEnabled.get();
     }
 
+    @Override
+    public boolean mediaTypesValid(RestRequest request) {
+        return request.hasLengthPrefixedStreamingContent() || super.mediaTypesValid(request);
+    }
+
     @Override
     public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
         if (request.isStreamedContent() == false) {

The issue with this approach (leaving XContentType as null) is that the response will always be in JSON, whereas I would expect the response to use the same XContent format as the request.

A first attempt, adding an XContentLengthPrefixedStreamingType, is in fc7513a.

the response will always be in JSON

The client can specify the response type using the Accept header, independently of the request content type. Given that this change requires client-side support too, I think that's sufficient. We should really consider whether any of the XContent response types are really appropriate for the bulk API - they're all significantly chattier (harder to parse) than what's required in most cases.
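To illustrate the point about the two headers being independent, here is a small client-side sketch using the JDK's java.net.http API. The request is only built, not sent. The Content-Type value is the one floated earlier in this thread and the localhost URL is a placeholder, so both are assumptions rather than anything this PR defines.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client sketch: request framing and response format are chosen separately. */
public class BulkRequestHeaders {

    // Assumed request media type, taken from the suggestion in this thread.
    static final String REQUEST_TYPE = "application/x-json-stream; format=length-prefixed";

    public static HttpRequest build(byte[] framedBody, String responseType) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9200/_bulk"))
            .header("Content-Type", REQUEST_TYPE) // how the request body is framed
            .header("Accept", responseType)       // what the client wants back
            .POST(HttpRequest.BodyPublishers.ofByteArray(framedBody))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build(new byte[0], "application/json");
        System.out.println(request.headers().map());
    }
}
```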

DaveCTurner · 2025-10-06T09:37:45Z

The conflict created by #135900 was deliberate - we should be doing all the Content-type validation within org.elasticsearch.rest.RestHandler#mediaTypesValid rather than having some slightly weird extra custom logic based on supportsBulkContent() to which this PR was adding.

iverase added 2 commits September 26, 2025 07:52

[Draft] Introduce a bulk format that uses a prefix length

c0cd23c

iter

5fe2f98

iverase added the >enhancement, :Distributed Indexing/Store, and v9.2.0 labels Sep 26, 2025

iverase marked this pull request as draft September 26, 2025 07:30

iverase and others added 4 commits September 26, 2025 09:30

Update docs/changelog/135506.yaml

fd1d01a

iter

9a76c9c

Merge branch 'prefix-length-bulk' of github.com:iverase/elasticsearch…

b457285

… into prefix-length-bulk

iter

a5b22bf

ChrisHegarty reviewed Sep 26, 2025

server/src/main/java/org/elasticsearch/action/bulk/BulkRequestParser.java (resolved)

server/src/main/java/org/elasticsearch/action/bulk/BulkRequestParser.java (resolved)

benwtrent added the :Core/Infra/Core label Sep 26, 2025

iverase added 3 commits September 27, 2025 19:05

improve test

e1ac118

Merge branch 'main' into prefix-length-bulk

5ae85ef

Integration tests

f48ff2b

iverase changed the title ~~[Draft] Introduce a bulk format that uses a prefix length~~ Introduce a bulk format that uses a prefix length Sep 29, 2025

iverase marked this pull request as ready for review September 29, 2025 08:45

elasticsearchmachine added the Team:Core/Infra and Team:Distributed Indexing labels Sep 29, 2025

iverase and others added 2 commits September 29, 2025 10:46

Update docs/changelog/135506.yaml

60c2877

Merge branch 'main' into prefix-length-bulk

fc02961

benwtrent reviewed Sep 30, 2025

iverase added 3 commits October 1, 2025 07:06

Merge branch 'main' into prefix-length-bulk

5759b1e

iter

e7c6ebe

Merge branch 'prefix-length-bulk' of github.com:iverase/elasticsearch…

eb50ecb

… into prefix-length-bulk

Merge branch 'main' into prefix-length-bulk

476df29

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

DaveCTurner reviewed Oct 2, 2025

iverase added 5 commits October 3, 2025 12:02

Add XContentLengthPrefixedStreamingType

fc7513a

Merge branch 'main' into prefix-length-bulk

e9998e6

iter

1cb3c97

iter

bb7db43

renames

71b1081

iter

23d9f65

iverase commented Sep 26, 2025 (edited)

iverase, Oct 2, 2025 (edited)

rjernst, Oct 2, 2025 (edited)

iverase, Oct 3, 2025 (edited)