-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Introduce a bulk format that uses a prefix length #135506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @iverase, I've created a changelog YAML for you. |
server/src/main/java/org/elasticsearch/action/bulk/BulkRequestParser.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/bulk/BulkRequestParser.java
Show resolved
Hide resolved
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing) |
Hi @iverase, I've updated the changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how simple this is! I am all for it :)
Some comments on the header name, and looks like some left over debugging stuff?
public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status"; | ||
public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format"; | ||
|
||
private static final String BULK_FORMAT_HEADER = "Bulk-Format"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been debating on if this should be called X-Bulk-Format
, but this RFC (https://datatracker.ietf.org/doc/html/rfc6648) indicates that we shouldn't rely on that convention.
I am fine with either.
I defer to @elastic/es-core-infra on this decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed to X-Bulk-Format
but I am fine either way.
server/src/test/java/org/elasticsearch/action/bulk/BulkRequestParserTestCase.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/bulk/BulkRequestParserTestCase.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/bulk/BulkRequestParserTestCase.java
Outdated
Show resolved
Hide resolved
I wan to to point out that the only formats currently accepted are JSON and SMILE as the RestController has code to reject the request otherwise:
We can in theory support any format with this change so we might need to change this if we want to support them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea, thanks for tackling it. I have one concern, see inline comment.
public static final String FAILURE_STORE_STATUS_CAPABILITY = "failure_store_status"; | ||
public static final String PREFIX_LENGTH_FORMAT_CAPABILITY = "prefix_length_format"; | ||
|
||
private static final String BULK_FORMAT_HEADER = "X-Bulk-Format"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should introduce a separate header for this feature. Instead, we should let clients say that the request body is a length-prefixed bulk request using an appropriate value for the existing Content-type
header. We already define a few vendor-specific content types with the application/vnd.elasticsearch+
prefix. Let's come up with a way to add some more of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner it was your idea actually. I did start adding a new XContentType but I was unsure how to extend it. It will be great if we can do that and remove the need of a new header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot claim to have invented length prefixes 😁
I can't say exactly how best to plumb this through the XContentType
framework. I suspect that yes we should add enum values for all 4 new length-prefixed-bulk types. We won't have to support rendering any values into these new types, only parsing, and indeed only parsing in the particular case of a bulk request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm but maybe we don't want this to be an XContentType
at all. We can't really either parse or render this data using the regular XContent
utils which all expect something very much like JSON. It seems it'd make more sense to me to treat it completely separately. Can we instead use org.elasticsearch.rest.RestRequest#getParsedContentType
to access the Content-type
header value directly for bulk REST requests, leaving org.elasticsearch.rest.RestRequest#xContentType
as null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree XContent(Type) should remain representing a single json blob.
For another content type, it doesn't seem there is any standard, but I find many past ideas in forums. Since this is our own content type it seems we can do whatever we want. My suggestion is something that makes it clear this is a set of (multiple) xcontent, and the particular formatting of that as a parameter to that content type. eg:
Content-Type=application/x-json-stream; format=length-prefixed
Also, regarding the format itself, for json I think we should stick with ascii encoding, so that it's easy to write this with a text editor. Binary formats like smile can have a binary length prefix, but text based formats like json should stick with text IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having Content-Type headers that do not resolve on XContentType adds an incredible amount of complexity in how rest requests are handled. Not sure how to proceed, do you have any suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sort of thing seems to be sufficient (at least to get as far as BulkRequestParser
):
diff --git a/server/src/main/java/org/elasticsearch/rest/RestController.java b/server/src/main/java/org/elasticsearch/rest/RestController.java
index 66532026fc1c..d9354b9ecf82 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestController.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestController.java
@@ -425,7 +425,8 @@ public class RestController implements HttpServerTransport.Dispatcher {
// TODO consider refactoring to handler.supportsContentStream(xContentType). It is only used with JSON and SMILE
if (handler.supportsBulkContent()
&& XContentType.JSON != xContentType.canonical()
- && XContentType.SMILE != xContentType.canonical()) {
+ && XContentType.SMILE != xContentType.canonical()
+ && request.hasLengthPrefixedStreamingContent() == false) {
channel.sendResponse(
RestResponse.createSimpleErrorResponse(
channel,
diff --git a/server/src/main/java/org/elasticsearch/rest/RestRequest.java b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
index 92e83fb9701a..51329ac5fbe7 100644
--- a/server/src/main/java/org/elasticsearch/rest/RestRequest.java
+++ b/server/src/main/java/org/elasticsearch/rest/RestRequest.java
@@ -332,11 +332,16 @@ public class RestRequest implements ToXContent.Params, Traceable {
public void ensureContent() {
if (hasContent() == false) {
throw new ElasticsearchParseException("request body is required");
- } else if (xContentType.get() == null) {
+ } else if (xContentType.get() == null && hasLengthPrefixedStreamingContent() == false) {
throwValidationException("unknown content type");
}
}
+ public boolean hasLengthPrefixedStreamingContent() {
+ return parsedContentType != null
+ && "application/vnd.elasticsearch+test-type".equals(parsedContentType.mediaTypeWithoutParameters());
+ }
+
/**
* Returns reference to the network buffer of HTTP content or throw an exception if the body or content type is missing.
* See {@link #content()}.
diff --git a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
index 20a9fb7ed23a..e943e4f2032f 100644
--- a/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
+++ b/server/src/main/java/org/elasticsearch/rest/action/document/RestBulkAction.java
@@ -94,6 +94,11 @@ public class RestBulkAction extends BaseRestHandler {
return incrementalEnabled.get();
}
+ @Override
+ public boolean mediaTypesValid(RestRequest request) {
+ return request.hasLengthPrefixedStreamingContent() || super.mediaTypesValid(request);
+ }
+
@Override
public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
if (request.isStreamedContent() == false) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with this approach (having XContentType as null) is that the response will always be in JSON, where I would expect the response to be in the same XContent as the request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First try by adding a XContentLengthPrefixedStreamingType in fc7513a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the response will always be in JSON
The client can specify the response type using the Accept
header, independently of the request content type. Given that this change requires client-side support too, I think that's sufficient. We should really consider whether any of the XContent
response types are really appropriate for the bulk API - they're all significantly chattier (harder to parse) than what's required in most cases.
The conflict created by #135900 was deliberate - we should be doing all the |
The bulk API currently uses a suffix marker to signal the end of a document. This way of breaking the documents has a few drawbacks:
This change proposes to add an alternative format for the bulk API that uses a 4 byte prefix for each document containing the length of the document in big endian order. This makes the bulk easier to parse as we don't need to read every byte to know the length of the document and it supports any type of content type.
fixes #94319