-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Simple version of patterned_text with a single doc value for arguments #129292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple version of patterned_text with a single doc value for arguments #129292
Conversation
Keep tests but treat all arguments as text args
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @parkertimmins! I did a first pass, just storing all arguments in a one sorted set doc values field is a good first approach.
...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java
Show resolved
Hide resolved
...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextIndexFieldData.java
Outdated
Show resolved
Hide resolved
.../main/java/org/elasticsearch/xpack/patternedtext/PatternedTextSyntheticFieldLoaderLayer.java
Outdated
Show resolved
Hide resolved
...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextValueProcessor.java
Outdated
Show resolved
Hide resolved
...rned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextMapperPlugin.java
Outdated
Show resolved
Hide resolved
PatternedDocValues implemented SortedSetDocValues, but this does not work since there is no ordinal for the full patterned text. Instead use binary doc values which don't use a term dictionary
|
|
||
| public class PatternedTextDocValues extends SortedSetDocValues { | ||
| private final SortedSetDocValues templateDocValues; | ||
| private final SortedSetDocValues argsDocValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be SortedDocValues. Since there is only args value per document? (all values are concatenated?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, there should only be one template and one concatenated args per doc
...erned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldMapper.java
Show resolved
Hide resolved
x-pack/plugin/mapper-patterned-text/src/yamlRestTest/resources/rest-api-spec/test/10_basic.yml
Outdated
Show resolved
Hide resolved
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more comments.
...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldType.java
Outdated
Show resolved
Hide resolved
...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java
Show resolved
Hide resolved
| // Add args doc_values | ||
| if (parts.args().isEmpty() == false) { | ||
| String remainingArgs = PatternedTextValueProcessor.encodeRemainingArgs(parts); | ||
| context.doc().add(new SortedSetDocValuesField(fieldType().argsFieldName(), new BytesRef(remainingArgs))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like an earlier comment, both SortedSetDocValuesField usages can be replaced here with SortedDocValuesField
test/framework/src/main/java/org/elasticsearch/index/mapper/FieldTypeTestCase.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/mapper-patterned-text/src/yamlRestTest/resources/rest-api-spec/test/30_sort.yml
Outdated
Show resolved
Hide resolved
...ed-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextIndexFieldData.java
Show resolved
Hide resolved
...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldType.java
Outdated
Show resolved
Hide resolved
...tterned-text/src/main/java/org/elasticsearch/xpack/patternedtext/PatternedTextFieldType.java
Show resolved
Hide resolved
| assertEquals(new ConstantScoreQuery(new TermQuery(new Term("field", "foo"))), ft.termQuery("foo", null)); | ||
| assertEquals(AutomatonQueries.caseInsensitiveTermQuery(new Term("field", "fOo")), ft.termQueryCaseInsensitive("fOo", null)); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing the test testFetchDocValues: cd2f9aa#diff-88edd76ca94733a91a1061049481ab3e6e321b534c4d783d517dbb0186d9c30dL110
I was having trouble accessing the doc values without going to fielddata. But since the message should be accessed through source (presumably synthetic) it doesn't seem the message should be accessible as a doc value.
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @parkertimmins, I've created a changelog YAML for you. |
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @parkertimmins, a first step to get this new field type in! LGTM
Also we need to label this PR as non-issue, given that the new field type is gated with a feature flag and will not be available in released versions. The PR that removes the feature flag should have the right labels for the release notes.
| this.indexAnalyzer = builder.analyzers.getIndexAnalyzer(); | ||
| this.positionIncrementGap = builder.analyzers.positionIncrementGap.getValue(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow up we should overwrite the iterator() method (from FieldMapper) and expose template field as keyword field mapper. This will allow us to make the template field a real sub field of the patterned_text field mapper, which allows it to be sortable, queryable and aggregateble.
...ogsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextDocValues.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public ValueFetcher valueFetcher(SearchExecutionContext context, String format) { | ||
| return SourceValueFetcher.toString(name(), context, format); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we use this here, instead of fetching from doc values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to what happens in match only text field mapper. The idea was to make this change as small as possible.
| return new SourceValueFetcherSortedBinaryIndexFieldData.Builder( | ||
| name(), | ||
| CoreValuesSourceType.KEYWORD, | ||
| SourceValueFetcher.toString(fieldDataContext.sourcePathsLookup().apply(name())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnvg does this fetch all synthetic source before filtering for this particular field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does generate the complete source from all fields, so this is slow. This is similar to match_only_text field type, its field data can only be used in script contexts (runtime fields).
For patterned_text, I don't think we should use its field data. But rely on field data of its sub fields (e.g. template field).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there's no reason to fall back to the source here, since we have docvalues. We can synthesize the message from them and check - as done in the prototype. Same applies to phrase queries etc.
This is fine as a first iteration, but I think it needs to be addressed shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is now the same as in MatchOnlyTextFieldType. This field data will only be used in context of runtime fields. Field data usage of the message field directly is rare. In search api, sorting and aggregations on field are not allowed, and esql doesn't use field data. This can be improved, however I don't it is necessary.
In the PoC phase, index sorting was on the message field, but index sorting by the template field is sufficient (or template id field, when we add that). This is possible when template field is real sub field of patterned_text fields.
|
|
||
| @Override | ||
| public SortField sortField(Object missingValue, MultiValueMode sortMode, XFieldComparatorSource.Nested nested, boolean reverse) { | ||
| throw new IllegalArgumentException("not supported for source patterned text field type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're adding this to get you started, but this needs to be addressed asap. There's no point in using this field mapper without sorting..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no point in using this field mapper without sorting..
Typically logs are sorted by timestamp? I think we do want to support sorting by the template at some point, hence the comment about overwriting the iterator() method. Also I don't think match_only_text support sorting in the _search api? (field data is only used in search api)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't, but we won't get the storage improvements we measured without it.. I tested with sorting on the message (template) first, then on timestamp.
This can be addressed in a follow-up as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so index sorting needs to be configured on the template sub field. This why I made this comment: #129292 (comment)
This will then allow us the configure index sorting on message.template and @timestamp fields. That way we get the same savings you observed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's possible but I think we'd be exposing too much of the internal implementation. For instance, say that, for some configuration, we also want to sort by some arg fields after the template field (e.g. because we store doubles separately). If we sort directly on the message field, we can apply that implicitly, without changing index settings. Otherwise, it'll be very tricky to apply.
I'm also on the fence about exposing the .template vs .template_id, but that's somewhat orthogonal.
kkrik-es
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done, Parker. I've a few comments but they can be addressed in follow-up changes.
Initial version of
patterned_textmapper. This will behavior similarly tomatch_only_text. This version uses a single SortedSetDocValues for a template and another for arguments. It splits the message by delimiters, the classifies a token as an argument if it contains a digit. All arguments are concatenated and inserted as a single doc value. A single inverted index is used, without positions. Phrase queries are still possible, using theSourceConfirmedTextQuery, but are not fast.Part of #128932