Introduce FallbackSyntheticSourceBlockLoader and apply it to keyword fields #119546

lkts · 2025-01-03T22:40:05Z

This PR contains a prototype of FallbackSyntheticSourceBlockLoader. This is an alternative implementation of block loading for fields that use fallback synthetic source. Currently the approach is to synthesize the entire _source and extract the relevant part of it. This implementation skips synthesizing and directly reads underlying _ignored_source field and parses its value.

This implementation should mostly remove cases when _source is synthesized to be used in block loader. (Most) Block loaders already use doc_values or stored fields directly, meaning in most cases _source is not synthesized. There are some exceptions like geo fields that only ever use source and therefore synthetic source. However geo fields also use fallback synthetic source so will be covered by implementation similar to this PR.

See #115394. Note that the issue mentions working with ignore_above and ignore_malformed but (at least for keyword fields) the logic is actually the opposite - ignore_above and normalization identical to parsing are applied to values from source before being appended to block. So currently the "expected" result of block loader is to never have ignored values. This PR intends to follow the same approach but it of course can be discussed.

Includes #119415 for now.

kkrik-es · 2025-01-06T08:49:40Z

server/src/main/java/org/elasticsearch/index/mapper/FallbackSyntheticSourceBlockLoader.java

+ * and we would fall back to (potentially synthetic) _source. However, in case of synthetic source, there is actually no need to
+ * construct the entire _source. We know that there is no doc_values and stored fields, and therefore we will be using fallback synthetic
+ * source. That is equivalent to just reading _ignored_source stored field directly and doing an in-place synthetic source just
+ * for this field.


I'm still trying to understand if this is necessary. #113827 introduced filtering in source construction. If we set SourceFilter in SourceLoader.Synthetic initialization to filter for the requested fields only, do we get a more generic solution that covers all uses of synthetic source? I assume the latter is used in SourceBlockLoader.

I had a conversation with @martijnvg about this. There is a more general problem here. SourceProvider/SourceLoader infra we are currently using is designed for _search and get use cases and just doesn't fit into ESQL. See #117792. So this is not solely an optimization but also allows to remove workarounds like #117792.

lkts · 2025-01-09T16:43:27Z

server/src/main/java/org/elasticsearch/index/mapper/FallbackSyntheticSourceBlockLoader.java

+        }
+
+        @Override
+        public boolean canReuse(int startingDocID) {


I don't know what to do here.

Like BlockStoredFieldsReader this should return true, since this implementation doesn't keep references to data structures that can't be reused between threads.

martijnvg

I think this is in the right direction @lkts!

I have a comment about ignore_above, which I think currently isn't handled? And the new block loader implementation will be used when a keyword field isn't stored and doesn't have doc values or normalizer is configured?

martijnvg · 2025-01-10T09:35:30Z

server/src/main/java/org/elasticsearch/index/mapper/FallbackSyntheticSourceBlockLoader.java

+        }
+
+        @Override
+        public boolean canReuse(int startingDocID) {


Like BlockStoredFieldsReader this should return true, since this implementation doesn't keep references to data structures that can't be reused between threads.

martijnvg · 2025-01-10T09:39:02Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+                var reader = new FallbackSyntheticSourceBlockLoader.Reader<BytesRef>() {
+                    @Override
+                    public void convertValue(Object value, List<BytesRef> accumulator) {
+                        assert value instanceof BytesRef;


maybe remove assert? The cast below would always fail if value is something else than BytesRef.

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

test/framework/src/main/java/org/elasticsearch/index/mapper/BlockLoaderTestCase.java

elasticsearchmachine · 2025-02-03T20:52:06Z

Hi @lkts, I've created a changelog YAML for you.

elasticsearchmachine · 2025-02-04T01:26:40Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg

I left two questions, but LGTM!

martijnvg · 2025-02-04T14:51:46Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

            };
        }

+        private String applyIgnoreAboveAndNormalizer(String value) {


This got me thinking, but I think is expected in the context of es|ql? You would never see un-normalized or ignored values. With synthetic source (and normal source), you just get what got stored.

Yes, this is how reading from stored source works as well.

martijnvg · 2025-02-04T14:55:56Z

server/src/main/java/org/elasticsearch/index/mapper/FallbackSyntheticSourceBlockLoader.java

+                }
+            }
+
+            // TODO figure out how to handle XContentDataHelper#voidValue()


iirc voidValue means to ignore ignored source, because we do have values in doc values for a field? I'm trying to remember in what cases we use void value.

voidValue means "don't trust doc values for this field when doing synthetic source". It's to avoid copied values from showing up in synthetic source.
I think it's meaningless in this context because we need all values for this field regardless how they are represented in source. So if there are any other values for this field we'll decode them and if not then there was nothing copied to it. I'll remove this comment in a follow-up.

dnhatn

LGTM. Thanks @lkts

elasticsearchmachine · 2025-02-05T00:14:28Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 119546

lkts · 2025-02-05T17:37:15Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Questions ?

Please refer to the Backport tool documentation

…fields (elastic#119546) (cherry picked from commit e885da1) # Conflicts: # server/src/test/java/org/elasticsearch/index/mapper/blockloader/KeywordFieldBlockLoaderTests.java # test/framework/src/main/java/org/elasticsearch/index/mapper/BlockLoaderTestCase.java

…yword fields (#119546) (#121800) * Introduce FallbackSyntheticSourceBlockLoader and apply it to keyword fields (#119546) (cherry picked from commit e885da1) # Conflicts: # server/src/test/java/org/elasticsearch/index/mapper/blockloader/KeywordFieldBlockLoaderTests.java # test/framework/src/main/java/org/elasticsearch/index/mapper/BlockLoaderTestCase.java * Add non-backported piece

We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes #128959 Relates #119546

We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546

We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes #128959 Relates #119546

commit e6b8a64 Author: Lorenzo Dematte <[email protected]> Date: Tue Jun 10 12:02:16 2025 +0200 PR comments commit ad0902e Author: Moritz Mack <[email protected]> Date: Fri Jun 6 13:33:37 2025 +0200 Update ReproduceInfoPrinter to correctly print a reproduction for lucene / BC upgrade tests. Relates to ES-12005 commit 78b4168 Author: Alexander Spies <[email protected]> Date: Fri Jun 6 11:37:53 2025 +0200 ESQL: Throw ISE instead of IAE for illegal block in page (elastic#128960) IAE gets reported as a 400 status code, but that's inappropriate when inconsistent pages are always bugs, and should be reported with a 500. Throw ISE instead. commit 29e68bd Author: Aurélien FOUCRET <[email protected]> Date: Fri Jun 6 11:24:23 2025 +0200 [ES|QL] Fix test releases for telemetry. (elastic#128971) commit 1a76bc2 Author: Bogdan Pintea <[email protected]> Date: Fri Jun 6 11:01:49 2025 +0200 ESQL: Workaround for RLike handling of empty lang pattern (elastic#128895) Lucene's `org.apache.lucene.util.automaton.Operations#getSingleton` fails with an Automaton for a `REGEXP_EMPTY` `RegExp`. This adds a workaround for that, to check the type of automaton before calling into that failing method. Closes elastic#128813 commit e24fd32 Author: Aurélien FOUCRET <[email protected]> Date: Fri Jun 6 10:25:28 2025 +0200 [ES|QL] Enable the completion command as a tech preview feature (elastic#128948) commit 3f03775 Author: Niels Bauman <[email protected]> Date: Fri Jun 6 09:00:24 2025 +0200 Remove non-test usages of `Metadata.Builder#putCustom` (elastic#128801) This removes all non-test usages of ``` Metadata.Builder.putCustom(String type, ProjectCustom custom) ``` And replaces it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use. commit 330d127 Author: Niels Bauman <[email protected]> Date: Fri Jun 6 08:07:59 2025 +0200 Make utility methods in `IndexLifecycleTransition` project-aware (elastic#128930) Modifies the methods to work with a project scope rather than a cluster scope. This is part of an iterative process to make ILM project-aware. commit 1b5720d Author: Aurélien FOUCRET <[email protected]> Date: Fri Jun 6 07:50:52 2025 +0200 [ES|QL] Fix test releases for LookupJoinTypesIT. (elastic#128985) commit 40cf2d3 Author: Tim Vernum <[email protected]> Date: Fri Jun 6 13:09:31 2025 +1000 Add "extension" attribute validation to IdP SPs (elastic#128805) This extends the change from elastic#128176 to validate the "custom attributes" on a per Service Provider basis. Each Service Provider (whether registered or wildcard based) has a field "attributes.extensions" which is a list of attribute names that may be provided by the caller of "/_idp/saml/init". Service Providers that have not be configured with extension attributes will reject any custom attributes in SAML init. This necessitates a new field in the service provider index (but only if the new `extensions` attribute is set). The template has been updated, but there is no data migration because the `saml-service-provider` index does not exist in any of the environments into which we wish to deploy this change. commit 496fb2d Author: Jordan Powers <[email protected]> Date: Thu Jun 5 19:50:09 2025 -0700 Skip UTF8 to UTF16 conversion during document indexing (elastic#126492) When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible. commit c34f8b6 Author: Tim Vernum <[email protected]> Date: Fri Jun 6 12:03:25 2025 +1000 Improve cache invalidation in IdP SP cache (elastic#128890) The Identity Provider's Service Provider cache had two issues: 1. It checked for identity based on sequence numbers, but didn't include the `seq_no_primary_term` parameter on searches, which means the sequence would always by `-2` 2. It didn't track whether the index was deleted, which means it could be caching values from an old version of the index This commit fixes both of these issues. In practice neither issue was a problem because there are no deployments that use index-based service providers, however the 2nd issue did cause some challenges for testing. commit 923f029 Author: Nhat Nguyen <[email protected]> Date: Thu Jun 5 18:09:58 2025 -0700 Fix block loader with missing ignored source (elastic#129006) We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546 commit 0f8178a Author: Bogdan Pintea <[email protected]> Date: Fri Jun 6 02:02:24 2025 +0200 ESQL: Forward port 8.19 RegexMatch serialization change version (elastic#128979) Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19. Related: elastic#128919. commit ee716f1 Author: Simon Chase <[email protected]> Date: Thu Jun 5 15:20:08 2025 -0700 transport: edit TransportConnectionListener for close exceptions (elastic#129015) The TransportConnectionListener interface has previously included the Transport.Connection being closed and unregistered in its onNodeDisconnected callback. This is not in use, and can be removed as it is also available in the onConnectionClosed callback. It is being replaced with a Nullable exception that caused the close. This is being used in pending work (ES-11448) to differentiate network issues from node restarts. Closes ES-12007 commit aceaf23 Merge: f18f4ee 159c57f Author: elasticsearchmachine <[email protected]> Date: Thu Jun 5 19:27:54 2025 +0000 Merge patch/serverless-fix into main commit 159c57f Author: Rene Groeschke <[email protected]> Date: Tue Jun 3 11:56:27 2025 +0200 Prepare serverless patch including elastic#128784 elastic#128740 (elastic#128807) * Change default for vector.rescoring.directio to false (elastic#128784) On serverless (and potentially elsewhere), direct IO is not available, which can cause BBQ shards to fail to read with org.apache.lucene.CorruptIndexException when this setting is true. * Optimize sparse vector stats collection (elastic#128740) This change improves the performance of sparse vector statistics gathering by using the document count of terms directly, rather than relying on the field name field to compute stats. By avoiding per-term disk/network reads and instead leveraging statistics already loaded into leaf readers at index opening, we expect to significantly reduce overhead. Relates to elastic#128583 --------- Co-authored-by: Dave Pifke <[email protected]> Co-authored-by: Jim Ferenczi <[email protected]>

We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546

lkts requested review from dnhatn, kkrik-es and martijnvg January 3, 2025 22:40

lkts changed the title ~~Synthetic source block loader~~ Prototype FallbackSyntheticSourceBlockLoader Jan 3, 2025

elasticsearchmachine added the v9.0.0 label Jan 3, 2025

kkrik-es reviewed Jan 6, 2025

View reviewed changes

lkts and others added 3 commits January 6, 2025 12:06

Introduce BlockLoaderTestCase

f967b21

Prototype FallbackSyntheticSourceBlockLoader

db00265

[CI] Auto commit changes from spotless

ada0166

lkts force-pushed the synthetic_source_block_loader branch from d048615 to ada0166 Compare January 6, 2025 20:06

Getting closer to reading from parent object

4e2bff8

lkts force-pushed the synthetic_source_block_loader branch from 0a33390 to 4e2bff8 Compare January 7, 2025 00:16

elasticsearchmachine and others added 8 commits January 7, 2025 00:37

[CI] Auto commit changes from spotless

42839e2

iter

bfcf484

[CI] Auto commit changes from spotless

5fed411

iter

5d170ba

Merge branch 'main' into synthetic_source_block_loader

c936d4c

[CI] Auto commit changes from spotless

0a51e05

iter

af4614f

Merge branch 'main' into synthetic_source_block_loader

e3ec594

lkts commented Jan 9, 2025

View reviewed changes

Iterate based on new tests

91956b5

martijnvg reviewed Jan 10, 2025

View reviewed changes

lkts added 7 commits January 10, 2025 09:07

Merge branch 'main' into synthetic_source_block_loader

9209744

iter

39007f9

Rework tests

2d7fe91

Merge branch 'main' into synthetic_source_block_loader

e62d3ea

Merge branch 'main' into synthetic_source_block_loader

42841e1

fix compilation

bc2876f

fix tests

dbb082b

lkts added >enhancement :StorageEngine/Mapping The storage related side of mappings auto-backport Automatically create backport pull requests when merged v8.19.0 labels Feb 3, 2025

lkts and others added 2 commits February 3, 2025 12:52

Update docs/changelog/119546.yaml

8aa717b

Handle null_value for keywords in data generation tests

82d7cca

lkts requested a review from limotova February 4, 2025 00:47

lkts marked this pull request as ready for review February 4, 2025 01:26

elasticsearchmachine added the Team:StorageEngine label Feb 4, 2025

martijnvg approved these changes Feb 4, 2025

View reviewed changes

lkts added 2 commits February 4, 2025 14:24

Merge branch 'main' into synthetic_source_block_loader

60d5bfe

Merge branch 'main' into synthetic_source_block_loader

c6dff1d

dnhatn approved these changes Feb 4, 2025

View reviewed changes

lkts merged commit e885da1 into elastic:main Feb 5, 2025
17 checks passed

lkts deleted the synthetic_source_block_loader branch February 5, 2025 00:13

elasticsearchmachine added the backport pending label Feb 5, 2025

lkts mentioned this pull request Feb 5, 2025

[8.x] Introduce FallbackSyntheticSourceBlockLoader and apply it to keyword fields (#119546) #121800

Merged

martijnvg removed the backport pending label Feb 6, 2025

dnhatn mentioned this pull request Jun 5, 2025

Fix block loader with missing ignored source #129006

Merged

Introduce FallbackSyntheticSourceBlockLoader and apply it to keyword fields #119546

Introduce FallbackSyntheticSourceBlockLoader and apply it to keyword fields #119546

Uh oh!

Conversation

lkts commented Jan 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

elasticsearchmachine commented Feb 3, 2025

Uh oh!

elasticsearchmachine commented Feb 4, 2025

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dnhatn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

elasticsearchmachine commented Feb 5, 2025

💔 Backport failed

Uh oh!

lkts commented Feb 5, 2025

💚 All backports created successfully

Questions ?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants