Skip to content

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Sep 26, 2025

The bulk API currently uses a suffix marker to signal the end of a document. This way of breaking the documents has a few drawbacks:

  1. It requires to parse all the bytes of a document in order to find the mark
  2. It s only properly supported for Json content type. It can in theory be used for Smile format too but it breaks as the document might contain the marker

This change proposes to add an alternative format for the bulk API that uses a 4 byte prefix for each document containing the length of the document in big endian order. This makes the bulk easier to parse as we don't need to read every byte to know the length of the document and it supports any type of content type.

fixes #94319

@iverase iverase added >enhancement :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v9.2.0 labels Sep 26, 2025
@iverase iverase marked this pull request as draft September 26, 2025 07:30
@elasticsearchmachine
Copy link
Collaborator

Hi @iverase, I've created a changelog YAML for you.

@benwtrent benwtrent added the :Core/Infra/Core Core issues without another label label Sep 26, 2025
@iverase iverase changed the title [Draft] Introduce a bulk format that uses a prefix length Introduce a bulk format that uses a prefix length Sep 29, 2025
@iverase iverase marked this pull request as ready for review September 29, 2025 08:45
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed Indexing Meta label for Distributed Indexing team labels Sep 29, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine
Copy link
Collaborator

Hi @iverase, I've updated the changelog YAML for you.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how simple this is! I am all for it :)

Some comments on the header name, and looks like some left over debugging stuff?

public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status";
public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format";

private static final String BULK_FORMAT_HEADER = "Bulk-Format";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been debating on if this should be called X-Bulk-Format, but this RFC (https://datatracker.ietf.org/doc/html/rfc6648) indicates that we shouldn't rely on that convention.

I am fine with either.

I defer to @elastic/es-core-infra on this decision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to X-Bulk-Format but I am fine either way.

@iverase
Copy link
Contributor Author

iverase commented Oct 1, 2025

I wan to to point out that the only formats currently accepted are JSON and SMILE as the RestController has code to reject the request otherwise:

We can in theory support any format with this change so we might need to change this if we want to support them.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea, thanks for tackling it. I have one concern, see inline comment.

public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status";
public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format";

private static final String BULK_FORMAT_HEADER = "X-Bulk-Format";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should introduce a separate header for this feature. Instead, we should let clients say that the request body is a length-prefixed bulk request using an appropriate value for the existing Content-type header. We already define a few vendor-specific content types with the application/vnd.elasticsearch+ prefix. Let's come up with a way to add some more of them.

Copy link
Contributor Author

@iverase iverase Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner it was your idea actually. I did start adding a new XContentType but I was unsure how to extend it. It will be great if we can do that and remove the need of a new header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot claim to have invented length prefixes 😁

I can't say exactly how best to plumb this through the XContentType framework. I suspect that yes we should add enum values for all 4 new length-prefixed-bulk types. We won't have to support rendering any values into these new types, only parsing, and indeed only parsing in the particular case of a bulk request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but maybe we don't want this to be an XContentType at all. We can't really either parse or render this data using the regular XContent utils which all expect something very much like JSON. It seems it'd make more sense to me to treat it completely separately. Can we instead use org.elasticsearch.rest.RestRequest#getParsedContentType to access the Content-type header value directly for bulk REST requests, leaving org.elasticsearch.rest.RestRequest#xContentType as null?

Copy link
Member

@rjernst rjernst Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree XContent(Type) should remain representing a single json blob.

For another content type, it doesn't seem there is any standard, but I find many past ideas in forums. Since this is our own content type it seems we can do whatever we want. My suggestion is something that makes it clear this is a set of (multiple) xcontent, and the particular formatting of that as a parameter to that content type. eg:

Content-Type=application/x-json-stream; format=length-prefixed

Also, regarding the format itself, for json I think we should stick with ascii encoding, so that it's easy to write this with a text editor. Binary formats like smile can have a binary length prefix, but text based formats like json should stick with text IMO.

Copy link
Contributor Author

@iverase iverase Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having Content-Type headers that do not resolve on XContentType adds an incredible amount of complexity in how rest requests are handled. Not sure how to proceed, do you have any suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of thing seems to be sufficient (at least to get as far as BulkRequestParser):

diff --git a/server/src/main/java/org/elasticsearch/rest/RestController.java b/server/src/main/java/org/elasticsearch/rest/RestController.java
index 66532026fc1c..d9354b9ecf82 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestController.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestController.java
@@ -425,7 +425,8 @@ public class RestController implements HttpServerTransport.Dispatcher {
             // TODO consider refactoring to handler.supportsContentStream(xContentType). It is only used with JSON and SMILE
             if (handler.supportsBulkContent()
                 && XContentType.JSON != xContentType.canonical()
-                && XContentType.SMILE != xContentType.canonical()) {
+                && XContentType.SMILE != xContentType.canonical()
+                && request.hasLengthPrefixedStreamingContent() == false) {
                 channel.sendResponse(
                     RestResponse.createSimpleErrorResponse(
                         channel,
diff --git a/server/src/main/java/org/elasticsearch/rest/RestRequest.java b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
index 92e83fb9701a..51329ac5fbe7 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestRequest.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
@@ -332,11 +332,16 @@ public class RestRequest implements ToXContent.Params, Traceable {
     public void ensureContent() {
         if (hasContent() == false) {
             throw new ElasticsearchParseException("request body is required");
-        } else if (xContentType.get() == null) {
+        } else if (xContentType.get() == null && hasLengthPrefixedStreamingContent() == false) {
             throwValidationException("unknown content type");
         }
     }

+    public boolean hasLengthPrefixedStreamingContent() {
+        return parsedContentType != null
+            && "application/vnd.elasticsearch+test-type".equals(parsedContentType.mediaTypeWithoutParameters());
+    }
+
     /**
      * Returns reference to the network buffer of HTTP content or throw an exception if the body or content type is missing.
      * See {@link #content()}.
diff --git a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
index 20a9fb7ed23a..e943e4f2032f 100644
--- a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
+++ b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
@@ -94,6 +94,11 @@ public class RestBulkAction extends BaseRestHandler {
         return incrementalEnabled.get();
     }

+    @Override
+    public boolean mediaTypesValid(RestRequest request) {
+        return request.hasLengthPrefixedStreamingContent() || super.mediaTypesValid(request);
+    }
+
     @Override
     public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
         if (request.isStreamedContent() == false) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with this approach (having XContentType as null) is that the response will always be in JSON, where I would expect the response to be in the same XContent as the request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First try by adding a XContentLengthPrefixedStreamingType in fc7513a

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the response will always be in JSON

The client can specify the response type using the Accept header, independently of the request content type. Given that this change requires client-side support too, I think that's sufficient. We should really consider whether any of the XContent response types are really appropriate for the bulk API - they're all significantly chattier (harder to parse) than what's required in most cases.

@DaveCTurner
Copy link
Contributor

The conflict created by #135900 was deliberate - we should be doing all the Content-type validation within org.elasticsearch.rest.RestHandler#mediaTypesValid rather than having some slightly weird extra custom logic based on supportsBulkContent() to which this PR was adding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed Indexing Meta label for Distributed Indexing team v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SMILE support for the _bulk API
6 participants