- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.6k
Don't store keyword multi fields when they trip ignore_above #132962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Accidentally nuked the previous PR while messing with branches. I did address all of the comments besides this one. | 
1f84854    to
    53ded56      
    Compare
  
    | storedFieldInBinaryFormat | ||
| isWithinMultiField, | ||
| storedFieldInBinaryFormat, | ||
| TextFieldMapper.SyntheticSourceHelper.syntheticSourceDelegate(getFieldType(), multiFields) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was somewhat of a bug - the syntheticSourceDelegate was missing despite MatchOnlyTextMapper using it indirectly. With this change, we're now passing the delegate directly to TextFieldMapper.
| "Field [" + name() + "] of type [" + CONTENT_TYPE + "] cannot run positional queries since [_source] is disabled." | ||
| ); | ||
| } | ||
| if (searchExecutionContext.isSourceSynthetic() && withinMultiField) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block of code was broken down into three smaller functions for readability:
- parentFieldFetcher
- delegateFieldFetcher
- sourceFieldFetcher
| /** | ||
| * Returns a new {@link CompositeSyntheticFieldLoader} that merges this field loader with the given one. | ||
| */ | ||
| public CompositeSyntheticFieldLoader mergedWith(CompositeSyntheticFieldLoader other) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed to merge two field loaders: one for loading values stored by the parent field (when ignore_above is tripped), and the second for loading values stored by the keyword multi field (when ignore_above isn't tripped). Since the keyword multi field already produces a CompositeFieldLoader, I'm just extending that class.
| return new BlockSourceReader.BytesRefsBlockLoader(fetcher, sourceBlockLoaderLookup(blContext)); | ||
| } | ||
|  | ||
| public boolean isIgnoreAboveSet() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helper function for better readability
| public MatchOnlyTextFieldMapper build(MapperBuilderContext context) { | ||
| BuilderParams builderParams = builderParams(this, context); | ||
| MatchOnlyTextFieldType tft = buildFieldType(context, builderParams.multiFields()); | ||
| final boolean storeSource; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic has been replaced by the new logic in parseCreateField()
| * for synthetic source. | ||
| */ | ||
| public String originalName() { | ||
| return originalName; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has been moved up to TextFamilyFieldType, which KeywordFieldType now extends
| private boolean normsDefault() { | ||
| if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) { | ||
| // don't enable norms by default if the index is LOGSDB or TSDB based | ||
| return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES; | ||
| } | ||
| // bwc - historically, norms were enabled by default on text fields regardless of which index mode was used | ||
| return true; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this outside of the constructor was bad practice, hence I moved it into the constructor. See the 313-320 below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it access fields that have been initialized the original method shouldn't be problematic. But I get what you say.
| Hi @Kubik42, I've created a changelog YAML for you. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good. I had a couple of nits, and a bwc question in the TextFieldMapper. This should also definitely be reviewed by Martijn.
        
          
                ...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/CompositeSyntheticFieldLoader.java
          
            Show resolved
            Hide resolved
        
              
          
                ...rc/test/java/org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextFieldMapperTests.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | // _ignored_source field will contain entries for this field if it is not stored | ||
| // and there is no syntheticSourceDelegate. | ||
| // See #syntheticSourceSupport(). | ||
| // But if a text field is a multi field it won't have an entry in _ignored_source. | ||
| // The parent might, but we don't have enough context here to figure this out. | ||
| // So we bail. | ||
| if (isSyntheticSource && syntheticSourceDelegate == null && parentField == null) { | ||
| return fallbackSyntheticSourceBlockLoader(blContext); | ||
| } | ||
|  | ||
| SourceValueFetcher fetcher = SourceValueFetcher.toString(blContext.sourcePaths(name())); | ||
| return new BlockSourceReader.BytesRefsBlockLoader(fetcher, blockReaderDisiLookup(blContext)); | ||
| } | ||
|  | ||
| FallbackSyntheticSourceBlockLoader fallbackSyntheticSourceBlockLoader(BlockLoaderContext blContext) { | ||
| var reader = new FallbackSyntheticSourceBlockLoader.SingleValueReader<BytesRef>(null) { | ||
| @Override | ||
| public void convertValue(Object value, List<BytesRef> accumulator) { | ||
| if (value != null) { | ||
| accumulator.add(new BytesRef(value.toString())); | ||
| } | ||
| } | ||
|  | ||
| @Override | ||
| protected void parseNonNullValue(XContentParser parser, List<BytesRef> accumulator) throws IOException { | ||
| var text = parser.textOrNull(); | ||
|  | ||
| if (text != null) { | ||
| accumulator.add(new BytesRef(text)); | ||
| } | ||
| } | ||
|  | ||
| @Override | ||
| public void writeToBlock(List<BytesRef> values, BlockLoader.Builder blockBuilder) { | ||
| var bytesRefBuilder = (BlockLoader.BytesRefBuilder) blockBuilder; | ||
|  | ||
| for (var value : values) { | ||
| bytesRefBuilder.appendBytesRef(value); | ||
| } | ||
| } | ||
| }; | ||
|  | ||
| return new FallbackSyntheticSourceBlockLoader( | ||
| reader, | ||
| name(), | ||
| IgnoredSourceFieldMapper.ignoredSourceFormat(blContext.indexSettings().getIndexVersionCreated()) | ||
| ) { | ||
| @Override | ||
| public Builder builder(BlockFactory factory, int expectedCount) { | ||
| return factory.bytesRefs(expectedCount); | ||
| } | ||
| }; | ||
| } | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, in the old implementation, sometimes the synthetic source support would fall back to storing values in ignored_source. This is no longer the case with the new implementation.
However, we need to consider BWC. There will be some indices that have had data written before this change and thus might have data stored in ignored_source.
We still need to support loading values from ignored source in this case.
I suspect the reason the bwc tests have not caught this is because using the FallbackSyntheticSourceBlockLoader is just an optimization. If we use a BlockSourceReader, it will still work. However, this means it will be constructing the entire _source of the document from the synthetic source, then parsing it to determine the value. Better to use the FallbackSyntheticSourceBlockLoader which knows how to extract only the individual field we're interested in out of the ignored source.
        
          
                server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/IndexVersions.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | @@ -1 +1 @@ | |||
| initial_elasticsearch_8_18_8,8840010 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk why these files are popping up, I'm pulling them directly from main. When I rebase before merging, I hope they'll disappear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is occurring because you are out of date with upstream main. If you sync the upstream branch (ie the "Update Branch" button in Github) the diff should disappear. But you can also ignore for now, the transport version generation task is pulling in changes from upstream main.
This behavior isn't intentional; it's a bug in transport version generation, and should go away soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Kubik42 - LGTM
| private boolean normsDefault() { | ||
| if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) { | ||
| // don't enable norms by default if the index is LOGSDB or TSDB based | ||
| return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES; | ||
| } | ||
| // bwc - historically, norms were enabled by default on text fields regardless of which index mode was used | ||
| return true; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it access fields that have been initialized the original method shouldn't be problematic. But I get what you say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Dmitry!
… test names to camel case
This is a small refactor + bug for fix 131282.
The refactor changes how
text,match_only_text, andannotated_textfields usekeywordmulti fields for synthetic source. Currently, this is done via the hasSyntheticSourceCompatibleKeywordField argument, where we set a boolean flag to indicate whether there is a keyword multi field that is either stored or has doc values. This is not a good approach for addressing 131282 because we want to disable the following logic for multi fields. With that disabled, the parent fields will no longer have a multi field to use for synthetic source.We could designate one of the keyword fields as some kind of "synthetic source provider" for the parent. This way the field will always create a
StoredFieldwhenignore_aboveis tripped. However, this is a poor approach since it exposes how text fields are implemented to the keyword field. If the parent field decides how and what is stored, it'll be a lot clearer in the code.This is where this PR comes in. It aims to remove
hasSyntheticSourceCompatibleKeywordField(although kept for now for bwc) and instead relies on thesyntheticSourceDelegate. With the addition of a new methodcanUseSyntheticSourceDelegateForSyntheticSource(), which is called during indexing, we can determine whether a particular keyword multi field is a valid supporter of synthetic source. If it isn't, then the parent field will explicitly create aStoredFieldfor that.Note: there are a lot of changed files, that said, most of them are just constructor changes. The actual changes are pretty limited.