-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Encode args in a info doc value rather than using placeholders #132782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encode args in a info doc value rather than using placeholders #132782
Conversation
dataInput.writeVInt(arguments.size()); | ||
for (var arg : arguments) { | ||
dataInput.writeVInt(arg.type.toCode()); | ||
dataInput.writeVInt(arg.offsetInTemplate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the offset? If we keep the generic placeholder in the template, we can get the offsets from the latter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did test adding the placeholder to the template, then scanning the template before message reconstruction to find offsets. This was quite fast in a micro-benchmark, only about 2x slower that the stored offsets in this PR. But this method does not deal with the other issue which this PR addresses: handling messages which contain the placeholder value. There are certainly ways we could deal with this, but those seems complicated.
x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good change that sets us up for the future.
x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java
Show resolved
Hide resolved
x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java
Outdated
Show resolved
Hide resolved
|
||
// Add args schema | ||
String argsSchemaEncoded = Arg.encodeSchema(parts.schemas()); | ||
context.doc().add(new SortedSetDocValuesField(fieldType().argsSchemaFieldName(), new BytesRef(argsSchemaEncoded))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think schema field name can be stored using SortedDocValuesField
? Given that encodeSchema (...)
stores the schemas as one value so store only one value per document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, they should be able to be stored as regular SortedDocValues. This is true for all the doc values columns in the patterned_text type. But I ran into an issue where a mapper test class defined in Lucene did not handle SortedDocValues correctly. I submitted this fix: apache/lucene#14839, and it has been merged. If we're using 10.3, I could go ahead and update all doc values in this type to SortedDocValues. But I'm inclined to do it in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, and thanks for fixing this in Lucene 🚀
int prevArgOffset = 0; | ||
for (String token : tokens) { | ||
if (token.isEmpty()) { | ||
// add the previous delimiter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we can use
(double space) to track the presence of an arg, and replace all other whitespace with
(single space) in the template. That will help avoid tracking offsets in arg schema and further compress the template, reducing the storage footprint. The reconstructed msg will look slightly different, but afaict whitespaces are barely used in term and phrase queries.
That may impact reconstruction performance, though. The logic is very streamlined, currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am inclined to reconstruct the original message exactly, at least for the time being. This simplify testing as no changes need to be taken into account when testing for equality between synthetic and stored source.
x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java
Outdated
Show resolved
Hide resolved
...test/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextValueProcessorTests.java
Outdated
Show resolved
Hide resolved
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
…ic#132782) The existing method of determining where to insert arguments in the template is by replacing known placeholder values which were inserted in the template string. For example, the message found 5 errors would be separated into the argument 5, and the template found %W errors, where %W is the placeholder. There are a few problems with this method. First, we need special handling if the original message contains a placeholder string. We could handle this with some sort of escape, but this adds complexity, and costs time during ingestion. The second issue is that scanning for placeholders within the template string is slow: it is much faster to reconstruct the original message if we already know the location of the arguments in the template string. This PR adds a new doc value column which encodes the location of all arguments in the template. For each argument, it stores the offset in the template string and the type of the argument. There is currently only one GENERIC argument type. These values are encoded in a base64 encoded string stored as SortedSetDocValues. Since messages with the same template will have arguments at the same location, and indices are sorted by template_id, this field compresses very well.
The existing method of determining where to insert arguments in the template is by replacing known placeholder values which were inserted in the template string. For example, the message
found 5 errors
would be separated into the argument5
, and the templatefound %W errors
, where%W
is the placeholder. There are a few problems with this method. First, we need special handling if the original message contains a placeholder string. We could handle this with some sort of escape, but this adds complexity, and costs time during ingestion. The second issue is that scanning for placeholders within the template string is slow: it is much faster to reconstruct the original message if we already know the location of the arguments in the template string.This PR adds a new doc value column which encodes the location of all arguments in the template. For each argument, it stores the offset in the template string and the type of the argument. There is currently only one
GENERIC
argument type. These values are encoded in a base64 encoded string stored as SortedSetDocValues. Since messages with the same template will have arguments at the same location, and indices are sorted by template_id, this field compresses very well.