Skip to content
Closed
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
2a878cf
add tests with normalizer mappings
not-napoleon Apr 7, 2025
0720fe7
fix KeywordFieldBlockLoaderTests
not-napoleon Apr 10, 2025
625938f
Update docs/changelog/126623.yaml
not-napoleon Apr 10, 2025
e26e543
proof of concept
not-napoleon Apr 10, 2025
c8a0025
better, but not done
not-napoleon Apr 10, 2025
f8194ae
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Apr 10, 2025
6b20735
[CI] Auto commit changes from spotless
Apr 10, 2025
e829823
account for normalizing null values
not-napoleon Apr 11, 2025
d2efb0b
remove code I'd only commened out
not-napoleon Apr 11, 2025
c2a11e3
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Apr 11, 2025
417b084
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Apr 11, 2025
949f42f
fix yaml tests
not-napoleon Apr 11, 2025
0ed6a2b
add test for opting out via keep all
not-napoleon Apr 15, 2025
8d29695
Update docs/changelog/126623.yaml
not-napoleon May 27, 2025
4478a76
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon May 27, 2025
e1f5958
fix merge semantic confilct
not-napoleon May 27, 2025
70dd8e7
changelog
not-napoleon May 27, 2025
0984db0
add test feature to gate the tests
not-napoleon May 28, 2025
63338d7
fix changelog, again
not-napoleon May 28, 2025
8c91da5
fix yaml
not-napoleon May 28, 2025
9110abe
remove no commit
not-napoleon May 28, 2025
c747219
skip the rest compatbility test for the breaking change
not-napoleon May 28, 2025
35153c1
unvar
not-napoleon Jun 3, 2025
aed9477
Feedback from MVG on fields with both keyword and text subfields
not-napoleon Jun 4, 2025
5c9f989
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 4, 2025
270f066
fix exact subfield tests
not-napoleon Jun 4, 2025
5039f3c
skip BWC for tests we expect to break
not-napoleon Jun 4, 2025
2fda1a4
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 5, 2025
a739e87
put the Rest Compatibility skip in the right place
not-napoleon Jun 5, 2025
030bb9c
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Jun 5, 2025
10cf2d1
Update docs/changelog/126623.yaml
not-napoleon Jun 5, 2025
d28b168
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 5, 2025
4ed5e3e
Update changelog
not-napoleon Jun 5, 2025
f158c00
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 6, 2025
fbe4adb
Update docs/changelog/126623.yaml
not-napoleon Jun 6, 2025
0ca54e9
fix changelog, again
not-napoleon Jun 6, 2025
86e5b22
response to PR feedback
not-napoleon Jun 6, 2025
63bf7d1
Revert "Feedback from MVG on fields with both keyword and text subfie…
not-napoleon Jun 10, 2025
43b1075
Revert "fix exact subfield tests"
not-napoleon Jun 10, 2025
2b9d384
leaving a note for future investigation
not-napoleon Jun 11, 2025
3b33b15
Update docs/changelog/126623.yaml
not-napoleon Jun 11, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions docs/changelog/126623.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
pr: 126623
summary: Enable synthetic source on normalized keyword mappings
area: Mapping
type: breaking
issues:
- 124369
- 121358
breaking:
title: Enable synthetic source on normalized keyword mappings
area: Mapping
details: |-
This changes the default behavior for Synthetic Source on keyword fields using normalizers. Prior to this change, normalized keywords were always stored to allow returning the non-normalized values. Under this change, such field will NOT be stored (i.e they will be synthesized from the index when returning source, like all other synthetic source fields). This should result in considerable space improvement for this use case.
Users can opt out of this behavior on a per-field basis by setting `synthetic_source_keep` to `all` on the field.
impact: "By default, normalized keyword fields in synthetic source indices will\
\ no longer return the non-normalized value in the source."
notable: false
1 change: 1 addition & 0 deletions rest-api-spec/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -89,4 +89,5 @@ tasks.named("yamlRestCompatTestTransform").configure ({ task ->
task.skipTest("indices.create/21_synthetic_source_stored/field param - keep root array", "Synthetic source keep arrays now stores leaf arrays natively")
task.skipTest("cluster.info/30_info_thread_pool/Cluster HTTP Info", "The search_throttled thread pool has been removed")
task.skipTest("synonyms/80_synonyms_from_index/Fail loading synonyms from index if synonyms_set doesn't exist", "Synonyms do no longer fail if the synonyms_set doesn't exist")
task.skipTest("mget/90_synthetic_source/keyword with normalizer", "Normalized keywords now use synthetic source")
})
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ setup:
---
keyword:
- requires:
cluster_features: ["gte_v8.4.0"]
cluster_features: [ "gte_v8.4.0" ]
reason: introduced in 8.4.0

- do:
Expand All @@ -23,15 +23,15 @@ keyword:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
kwd: foo

- do:
index:
index: test
id: 2
index: test
id: 2
body:
kwd: bar

Expand All @@ -40,21 +40,24 @@ keyword:
mget:
index: test
body:
ids: [1, 2]
- match: {docs.0._index: "test"}
- match: {docs.0._id: "1"}
ids: [ 1, 2 ]
- match: { docs.0._index: "test" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
kwd: foo

- match: {docs.1._index: "test"}
- match: {docs.1._id: "2"}
- match: { docs.1._index: "test" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
kwd: bar

---
keyword with normalizer:
- requires:
cluster_features: "mapper.keyword.keyword_normalizer_synthetic_source"
reason: "Behavior changed in #126623"
- do:
indices.create:
index: test-keyword-with-normalizer
Expand Down Expand Up @@ -113,7 +116,94 @@ keyword with normalizer:
mget:
index: test-keyword-with-normalizer
body:
ids: [ 1, 2, 3 ]
ids: [ 1, 2, 3 ]
- match: { docs.0._index: "test-keyword-with-normalizer" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
keyword: "the quick brown fox jumps over the lazy dog"
keyword_with_ignore_above: "the Quick Brown Fox jumps over the lazy Dog"
keyword_without_doc_values: "the Quick Brown Fox jumps over the lazy Dog"

- match: { docs.1._index: "test-keyword-with-normalizer" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
keyword: "the five boxing wizards jump quickly"
keyword_with_ignore_above: "The five BOXING wizards jump Quickly"
keyword_without_doc_values: "The five BOXING wizards jump Quickly"

- match: { docs.2._index: "test-keyword-with-normalizer" }
- match: { docs.2._id: "3" }
- match:
docs.2._source:
keyword: [ "do or do not, there is no try", "may the force be with you!" ]
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]

---
keyword with normalizer, source keep mode all:
- do:
indices.create:
index: test-keyword-with-normalizer
body:
settings:
analysis:
normalizer:
lowercase:
type: custom
filter:
- lowercase
index:
mapping.source.mode: synthetic

mappings:
properties:
keyword:
type: keyword
normalizer: lowercase
synthetic_source_keep: all
keyword_with_ignore_above:
type: keyword
normalizer: lowercase
ignore_above: 10
keyword_without_doc_values:
type: keyword
normalizer: lowercase
doc_values: false

- do:
index:
index: test-keyword-with-normalizer
id: 1
body:
keyword: "the Quick Brown Fox jumps over the lazy Dog"
keyword_with_ignore_above: "the Quick Brown Fox jumps over the lazy Dog"
keyword_without_doc_values: "the Quick Brown Fox jumps over the lazy Dog"

- do:
index:
index: test-keyword-with-normalizer
id: 2
body:
keyword: "The five BOXING wizards jump Quickly"
keyword_with_ignore_above: "The five BOXING wizards jump Quickly"
keyword_without_doc_values: "The five BOXING wizards jump Quickly"

- do:
index:
index: test-keyword-with-normalizer
id: 3
body:
keyword: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]

- do:
mget:
index: test-keyword-with-normalizer
body:
ids: [ 1, 2, 3 ]
- match: { docs.0._index: "test-keyword-with-normalizer" }
- match: { docs.0._id: "1" }
- match:
Expand Down Expand Up @@ -141,7 +231,7 @@ keyword with normalizer:
---
stored text:
- requires:
cluster_features: ["gte_v8.5.0"]
cluster_features: [ "gte_v8.5.0" ]
reason: introduced in 8.5.0

- do:
Expand All @@ -159,39 +249,39 @@ stored text:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
text: the quick brown fox

- do:
index:
index: test
id: 2
index: test
id: 2
body:
text: jumped over the lazy dog

- do:
mget:
index: test
body:
ids: [1, 2]
- match: {docs.0._index: "test"}
- match: {docs.0._id: "1"}
ids: [ 1, 2 ]
- match: { docs.0._index: "test" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
text: the quick brown fox

- match: {docs.1._index: "test"}
- match: {docs.1._id: "2"}
- match: { docs.1._index: "test" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
text: jumped over the lazy dog

---
force_synthetic_source_ok:
- requires:
cluster_features: ["gte_v8.4.0"]
cluster_features: [ "gte_v8.4.0" ]
reason: introduced in 8.4.0

- do:
Expand All @@ -210,15 +300,15 @@ force_synthetic_source_ok:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
obj.kwd: foo

- do:
index:
index: test
id: 2
index: test
id: 2
body:
obj:
kwd: bar
Expand All @@ -228,7 +318,7 @@ force_synthetic_source_ok:
mget:
index: test
body:
ids: [1, 2]
ids: [ 1, 2 ]
- match:
docs.0._source:
obj.kwd: foo
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@
import org.elasticsearch.common.lucene.search.AutomatonQueries;
import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.core.Nullable;
import org.elasticsearch.features.NodeFeature;
import org.elasticsearch.index.IndexMode;
import org.elasticsearch.index.IndexSortConfig;
import org.elasticsearch.index.IndexVersion;
Expand Down Expand Up @@ -94,6 +95,10 @@
*/
public final class KeywordFieldMapper extends FieldMapper {

public static final NodeFeature KEYWORD_NORMALIZER_SYNTHETIC_SOURCE = new NodeFeature(
"mapper.keyword.keyword_normalizer_synthetic_source"
);

private static final Logger logger = LogManager.getLogger(KeywordFieldMapper.class);

public static final String CONTENT_TYPE = "keyword";
Expand Down Expand Up @@ -1276,11 +1281,9 @@ private String originalName() {

@Override
protected SyntheticSourceSupport syntheticSourceSupport() {
if (hasNormalizer()) {
// NOTE: no matter if we have doc values or not we use fallback synthetic source
// to store the original value whose doc values would be altered by the normalizer
return SyntheticSourceSupport.FALLBACK;
}
/* NOTE: we allow enabling synthetic source on Keyword fields with a Normalizer, even though the returned synthetic value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

may not perfectly match the original, pre-normalization, value.
*/

if (fieldType.stored() || hasDocValues) {
return new SyntheticSourceSupport.Native(() -> syntheticFieldLoader(fullPath(), leafName()));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,8 @@ public Set<NodeFeature> getTestFeatures() {
NPE_ON_DIMS_UPDATE_FIX,
RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING,
USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
IVF_FORMAT_CLUSTER_FEATURE
IVF_FORMAT_CLUSTER_FEATURE,
KeywordFieldMapper.KEYWORD_NORMALIZER_SYNTHETIC_SOURCE
);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -1008,17 +1008,21 @@ protected String delegatingTo() {
}
};
}
if (isStored()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this move? There is a comment below that says that it is more efficient to use the parent keyword field. I don't really understand how is that the case if the keyword field is stored though but there must be some reasoning here. Maybe let's check with someone from ESQL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this addresses the case where both the text field is stored. Without this change, we were falling back to the keyword field, which is "less exact" because of normalization, when we had a non-normalized stored text available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be moved all the way to the top then? Synthetic source delegate is not exact now either due to the change in getKeywordFieldMapperForSyntheticSource below.

Copy link
Contributor

@lkts lkts Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back a little bit - are we sure we want to make changes to text at all? As you point out the idea is indeed to apply these optimizations when "parent is exact". So the check for normalizer here still serves its purpose because it protects from non-exactness.

return new BlockStoredFieldsReader.BytesFromStringsBlockLoader(name());
}
/*
* If this is a sub-text field try and return the parent's loader. Text
* fields will always be slow to load and if the parent is exact then we
* should use that instead.
*/
// TODO: should this be removed? I think SyntheticSourceHelper already does this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SyntheticSourceHelper is concerned with multi fields of the text field. This logic is for when text field is the multii field of a keyword field.

{
  "t": {
    "type": "text",
    "fields": {
      "k": {
        "type": "keyword"
      }
  }
}

vs

{
  "k": {
    "type": "keyword",
    "fields": {
      "t": {
        "type": "text"
      }
    }
  }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could move this logic into SyntheticSourceHelper just to be close to similar things.

String parentField = blContext.parentField(name());
if (parentField != null) {
MappedFieldType parent = blContext.lookup().fieldType(parentField);
if (parent.typeName().equals(KeywordFieldMapper.CONTENT_TYPE)) {
KeywordFieldMapper.KeywordFieldType kwd = (KeywordFieldMapper.KeywordFieldType) parent;
if (kwd.hasNormalizer() == false && (kwd.hasDocValues() || kwd.isStored())) {
if (kwd.hasDocValues() || kwd.isStored()) {
return new BlockLoader.Delegating(kwd.blockLoader(blContext)) {
@Override
protected String delegatingTo() {
Expand All @@ -1028,9 +1032,6 @@ protected String delegatingTo() {
}
}
}
if (isStored()) {
return new BlockStoredFieldsReader.BytesFromStringsBlockLoader(name());
}

// _ignored_source field will contain entries for this field if it is not stored
// and there is no syntheticSourceDelegate.
Expand Down Expand Up @@ -1579,7 +1580,7 @@ public static KeywordFieldMapper getKeywordFieldMapperForSyntheticSource(Iterabl
for (Mapper sub : multiFields) {
if (sub.typeName().equals(KeywordFieldMapper.CONTENT_TYPE)) {
KeywordFieldMapper kwd = (KeywordFieldMapper) sub;
if (kwd.hasNormalizer() == false && (kwd.fieldType().hasDocValues() || kwd.fieldType().isStored())) {
if (kwd.fieldType().hasDocValues() || kwd.fieldType().isStored()) {
return kwd;
}
}
Expand Down
Loading
Loading