Add an option to replace _source field values with synthetic references #113036
Hi @jimczi, I've created a changelog YAML for you.
cc @martijnvg @kkrik-es, could you please review the approach proposed in da38044? As we discussed earlier, this option is meant to provide an alternative to a full synthetic source. For scenarios where quick retrieval of the source is crucial, it lets us remove the values of a few fields that are expensive to store while preserving the original source for all other fields.
cc @benwtrent, the goal is to expose or apply this option transparently on dense vector fields with a large number of dimensions.
cc @carlosdelest @Mikep86, for the semantic text field, the goal is to use this option transparently to avoid paying the extra storage cost.
Hey Jim, I may be wrong but this seems very similar to the existing mechanism for synthetic source using |
server/src/main/java/org/elasticsearch/index/mapper/PatchSourceUtils.java
I made some modifications to the branch to improve the speed of patching. However, I still need to address the code duplication between synthetic loading and patch loading. Before tackling this, I wanted to assess the impact of my changes with a benchmark. The benchmark focuses on two use cases:
The former case is intended for scenarios where dense vectors are necessary (e.g., reindexing or debugging), which should be relatively rare. The latter case is for search operations, where avoiding the cost of transferring large vectors over the network and disk is important.

Source Filters Addition
To enhance filtering efficiency, I added an option to apply source filters directly during source loading. This allows each strategy (stored, synthetic, and patch) to choose the most efficient way to apply filters.

Benchmark Details
The benchmark uses three Wikipedia documents of varying sizes:
The test measures performance by loading each document once from disk using the different strategies.

Results for entire source loading
When retrieving dense vectors:

Results for filtered source loading (removing dense vectors)
For cases involving source filtering:

Conclusion and Next Steps
Given that the synthetic strategy's performance is close to the patch strategy, the question becomes: is the patch mode necessary? Although full reliance on the synthetic mode for search use cases might be excessive, I still support maintaining the alternative patch strategy, as proposed in this PR.
Spinoff of elastic#113036. This change introduces optional source filtering directly within source loaders (both synthetic and stored). The main benefit is seen in synthetic source loaders, as synthetic fields are stored independently. By filtering while loading the synthetic source, generating the source becomes linear in the number of fields that match the filter. This update also modifies the get document API to apply source filters earlier—directly through the source loader. The search API, however, is not affected in this change, since the loaded source is still used by other features (e.g., highlighting, fields, nested hits), and source filtering is always applied as the final step. A follow-up will be required to ensure careful handling of all search-related scenarios.
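To illustrate why filtering inside the loader helps for synthetic source, here is a minimal Python sketch. It is not the Elasticsearch implementation: the SyntheticSourceLoader class, its load method, and the flat field paths are hypothetical names chosen for illustration. The point is only that fields rejected by the filter are never loaded, so the expensive work is proportional to the number of matching fields.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical stand-in for a per-field synthetic loader: a callable that
# rebuilds one field's value for a document from that field's own storage.
FieldLoader = Callable[[int], object]


class SyntheticSourceLoader:
    """Illustrative loader that applies include/exclude filters while loading."""

    def __init__(self, field_loaders: Dict[str, FieldLoader]):
        self.field_loaders = field_loaders
        self.loads = 0  # number of field values actually loaded

    def load(
        self,
        doc_id: int,
        includes: Optional[Iterable[str]] = None,
        excludes: Iterable[str] = (),
    ) -> Dict[str, object]:
        include_set = None if includes is None else set(includes)
        exclude_set = set(excludes)
        source: Dict[str, object] = {}
        for path, loader in self.field_loaders.items():
            # Skip filtered-out fields before touching their storage.
            if include_set is not None and path not in include_set:
                continue
            if path in exclude_set:
                continue
            self.loads += 1
            source[path] = loader(doc_id)
        return source


# Usage: the large vector field is excluded and therefore never loaded.
loader = SyntheticSourceLoader(
    {
        "title": lambda doc_id: f"doc {doc_id}",
        "body": lambda doc_id: "some text",
        "vector": lambda doc_id: [0.0] * 1024,
    }
)
filtered = loader.load(7, excludes=["vector"])
```

A stored-source loader cannot skip work in the same way, because the whole _source is read as one blob before any filter can be applied.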
I am opening this PR to test a solution for removing field values from the stored _source. The goal is to let mapped fields remove their values from the original _source. Instead, a synthetic version of the source field will restore the original values during search and retrieval.
This branch presents a working solution with a new option. This option is available through DocumentParserContext#addSourceFieldPatch at indexing time and Mapping#getPatchFieldLoader at search time.
The current implementation replaces patched fields with an ID. This ID is stored as a numeric doc value linked to the patch field. Each field can only have one patch per document. Multi-valued patches are accessible only through a parent nested field.
At indexing time, if the field with the full path foo.bar has a registered patch, the original _source is rewritten so that the value of foo.bar is replaced with the reference patch ID for the field, for example 0. This patch ID is indexed as a numeric doc value in a metadata field named after the full path of the field.
At search time, the process reverses. Reference patch IDs are replaced with the original values retrieved by field mappers outside of the _source. Field mappers, such as those using synthetic field capabilities, handle this retrieval.
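The indexing-time rewrite and the search-time restore can be sketched as a round trip over a nested dictionary. This is a simplified Python model, not the code in this branch: IndexedDoc, index_doc, and load_source are hypothetical names, the per-document sequential patch IDs are an assumption, and the field_values dictionary stands in for whatever storage the field mapper uses outside of _source.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class IndexedDoc:
    stored_source: Dict[str, Any]  # what ends up in the stored _source
    patch_ids: Dict[str, int]      # full path -> patch ID (a numeric doc value in the PR)
    field_values: Dict[str, Any]   # full path -> value kept outside _source (assumption)


def _parent_and_leaf(source: Dict[str, Any], path: str) -> Tuple[Dict[str, Any], str]:
    """Walk a dotted path and return the parent object and the last key."""
    *parents, leaf = path.split(".")
    node = source
    for key in parents:
        node = node[key]
    return node, leaf


def index_doc(source: Dict[str, Any], patched_paths: List[str]) -> IndexedDoc:
    """Replace each patched field with a reference ID (one patch per field)."""
    stored = copy.deepcopy(source)
    patch_ids: Dict[str, int] = {}
    field_values: Dict[str, Any] = {}
    # Assumption: IDs are assigned sequentially per document, starting at 0.
    for patch_id, path in enumerate(patched_paths):
        node, leaf = _parent_and_leaf(stored, path)
        field_values[path] = node[leaf]
        node[leaf] = patch_id
        patch_ids[path] = patch_id
    return IndexedDoc(stored, patch_ids, field_values)


def load_source(doc: IndexedDoc) -> Dict[str, Any]:
    """Reverse the rewrite: swap reference IDs back for the original values."""
    restored = copy.deepcopy(doc.stored_source)
    for path, patch_id in doc.patch_ids.items():
        node, leaf = _parent_and_leaf(restored, path)
        if node[leaf] == patch_id:  # only replace a genuine reference
            node[leaf] = doc.field_values[path]
    return restored


# Usage with a hypothetical document whose foo.bar field is patched.
original = {"foo": {"bar": [0.1, 0.2, 0.3]}, "title": "hello"}
doc = index_doc(original, ["foo.bar"])
restored = load_source(doc)
```

In this model the stored source keeps only the small integer reference for foo.bar, while untouched fields such as title are stored and returned verbatim.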
If this approach is approved, I will open smaller PRs to introduce the feature more gradually.
Here are the changes that could be made progressively:
semantic_text field to always use this option on dense vector fields.