
Conversation

@jimczi commented Sep 17, 2024

I am opening this PR to test a solution for removing field values from the stored _source. The goal is to let mapped fields strip their values from the original _source, with a synthetic version of those fields restoring the original values at search and retrieval time.

This branch presents a working solution built around a new option, exposed through DocumentParserContext#addSourceFieldPatch at indexing time and Mapping#getPatchFieldLoader at search time.

The current implementation replaces patched fields with an ID. This ID is stored as a numeric doc value linked to the patch field. Each field can only have one patch per document. Multi-valued patches are accessible only through a parent nested field.

At indexing time, if the field with the full path foo.bar has a registered patch, the original _source:

{
  "foo": {
    "bar": [0, 1, 2, 3]
  },
  "field": "value"
}

is rewritten to:

{
  "foo": {
    "bar": 0
  },
  "field": "value"
}

Here, 0 is the reference patch ID for the field. This patch ID is indexed as a numeric doc value in a metadata field named after the full path of the field.

At search time, the process is reversed: reference patch IDs are replaced with the original values, which the field mappers retrieve from outside the stored _source (for example through their synthetic source capabilities).
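
To make the round trip concrete, here is a minimal sketch of the patch/restore flow. It is illustrative only: the class and method names are hypothetical, the source is modeled as a flat map with a single patched field, and nothing here reflects the actual signatures of DocumentParserContext#addSourceFieldPatch or Mapping#getPatchFieldLoader.

import java.util.Map;

// Hypothetical sketch of the patch round trip; not the PR's actual classes.
class SourcePatchSketch {

    // Indexing time: the patched field's value is replaced by a reference patch ID.
    // In the PR, the ID is also indexed as a numeric doc value in a metadata field
    // named after the full path of the patched field.
    static void patch(Map<String, Object> source, String fieldPath, long patchId) {
        source.put(fieldPath, patchId);
    }

    // Search time: the patch ID is resolved back to the original value, which the
    // field mapper loads from outside the stored _source (e.g. from doc values).
    static void restore(Map<String, Object> source, String fieldPath, PatchFieldLoader loader) {
        long patchId = ((Number) source.get(fieldPath)).longValue();
        source.put(fieldPath, loader.load(patchId));
    }

    // Stand-in for the per-field loader the mapping would provide at search time.
    interface PatchFieldLoader {
        Object load(long patchId);
    }
}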

If this approach is approved, I will open smaller PRs to introduce the feature more gradually.
Here are the changes that could be made progressively:

  • Always create the source provider from the mapping (85dad44)
  • Add an option to remove field values from the stored source and provide a way to reconstruct the original source at search/retrieval time (da38044)
    • Handle includes and excludes
    • Handle peer recovery
  • Apply the option to the dense vector field mapper (and discuss whether it should be exposed to users or not).
  • Adjust the semantic_text field to always use this option on dense vector fields.

@jimczi added the >feature and :StorageEngine/Mapping (The storage related side of mappings) labels on Sep 17, 2024
@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

jimczi commented Sep 17, 2024

cc @martijnvg @kkrik-es, could you please review the approach proposed in da38044? As we discussed earlier, this option is meant to provide an alternative to a full synthetic source. For scenarios where quick retrieval of the source is crucial, this option allows us to remove values from a few fields that are expensive to store while preserving the original source for other fields.

cc @benwtrent, the goal is to expose or apply this option transparently on dense vector fields with a large number of dimensions.

cc @carlosdelest @Mikep86, for the semantic text field, the goal is to use this option transparently to avoid paying the extra storage cost.

@jimczi force-pushed the patch_source_mapping branch from c673ed3 to 1bb7c4d on September 17, 2024 17:00
@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

@jimczi force-pushed the patch_source_mapping branch from e5a2fbe to d96cb8d on September 18, 2024 08:43
@kkrik-es

Hey Jim, I may be wrong but this seems very similar to the existing mechanism for synthetic source using _ignored_source. I was wondering if we can reuse the same mechanism instead of introducing a new one with very similar functionality. If there are shortcomings around _ignored_source, maybe we can address these instead?

jimczi commented Sep 24, 2024

I made some modifications to the branch to improve the speed of patching. However, I still need to address the code duplication between synthetic loading and patch loading. Before tackling this, I wanted to assess the impact of my changes with a benchmark.

The benchmark focuses on two use cases:

  • Loading the source with dense vectors (SourceProviderBenchmark.loadDoc)
  • Loading the source without dense vectors (SourceProviderBenchmark.loadFilterDoc)

The former case is intended for scenarios where dense vectors are necessary (e.g., reindexing or debugging), which should be relatively rare. The latter case is for search operations, where avoiding the cost of reading large vectors from disk and transferring them over the network is important.

Source Filters Addition

To enhance filtering efficiency, I added an option to apply source filters directly during source loading. This allows each strategy—stored, synthetic, and patch—to choose the most efficient way to apply filters.
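
As an illustration of why filtering during loading helps, the sketch below shows a synthetic-style loader that simply skips excluded fields instead of materializing the full source and filtering afterwards. The names and the flat field model are assumptions made for the example; the actual change threads the filter through the existing source loader implementations.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: excluded fields are skipped while building the source, so the
// cost is linear in the number of fields that survive the filter.
class FilteredSyntheticLoadingSketch {
    static Map<String, Object> load(Set<String> syntheticFields,
                                    Set<String> excludedFields,
                                    Function<String, Object> fieldReader) {
        Map<String, Object> source = new LinkedHashMap<>();
        for (String field : syntheticFields) {
            if (excludedFields.contains(field)) {
                continue; // excluded fields are never read from disk at all
            }
            source.put(field, fieldReader.apply(field));
        }
        return source;
    }
}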

Benchmark Details

The benchmark uses three Wikipedia documents of varying sizes:

  • Small: 1 chunk

  • Medium: 50 chunks

  • Large: 200 chunks

Each chunk contains the text and a vector of 1024 float dimensions, indexed as a nested field.
We compare different strategies to store and retrieve the source:

  • stored: The default, storing the original source as is.

  • synthetic: Source in synthetic mode.

  • patch: Filters the dense vectors out of the stored source and adds them back at search time.

  • exclude: Filters the dense vectors out of the stored source permanently (they cannot be reconstructed at search time).

The test measures performance by loading each document once from disk using the different strategies.
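
For reference, the results below come from a JMH average-time benchmark, so the harness presumably follows a standard shape along the lines of the skeleton here; the parameter values are taken from the results tables, the class name is invented, and the loading logic is only a placeholder.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SourceProviderBenchmarkSketch {

    @Param({ "small", "medium", "large" })
    public String docSize;

    @Param({ "stored", "synthetic", "patch", "exclude" })
    public String mode;

    @Benchmark
    public Object loadDoc() {
        // Load the entire _source of the document with the selected strategy.
        return loadSource(docSize, mode, false);
    }

    @Benchmark
    public Object loadFilterDoc() {
        // Load the _source with the dense vector fields filtered out.
        return loadSource(docSize, mode, true);
    }

    private Object loadSource(String docSize, String mode, boolean filterVectors) {
        // Placeholder for the actual source provider under test.
        return null;
    }
}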

Results for entire source loading

Benchmark                              (docSize)     (mode)  Mode  Cnt      Score      Error  Units
SourceProviderBenchmark.loadDoc            small      patch  avgt    5     54.143 ±    1.290  us/op
SourceProviderBenchmark.loadDoc            small  synthetic  avgt    5     62.015 ±    1.216  us/op
SourceProviderBenchmark.loadDoc            small     stored  avgt    5     26.264 ±    0.511  us/op
SourceProviderBenchmark.loadDoc            small    exclude  avgt    5      3.961 ±    0.044  us/op
SourceProviderBenchmark.loadDoc           medium      patch  avgt    5   2613.940 ±   34.037  us/op
SourceProviderBenchmark.loadDoc           medium  synthetic  avgt    5   3807.634 ±   19.907  us/op
SourceProviderBenchmark.loadDoc           medium     stored  avgt    5   1620.184 ±  109.972  us/op
SourceProviderBenchmark.loadDoc           medium    exclude  avgt    5     97.087 ±    1.235  us/op
SourceProviderBenchmark.loadDoc            large      patch  avgt    5   7794.028 ±   31.632  us/op
SourceProviderBenchmark.loadDoc            large  synthetic  avgt    5  11948.199 ±  802.587  us/op
SourceProviderBenchmark.loadDoc            large     stored  avgt    5   4891.576 ±  136.645  us/op
SourceProviderBenchmark.loadDoc            large    exclude  avgt    5    319.531 ±   22.638  us/op

When retrieving dense vectors:

  • Stored strategy: This was the fastest, as expected, since there is no need to parse the _source. The dense vector can be returned directly.
  • Patch strategy: This was the second fastest. While parsing is required, it operates on a _source without dense vectors. The patching process adds about 50% to the overall time compared to the stored strategy.
  • Synthetic strategy: This was the slowest, but not by a significant margin. However, the actual cost of synthetic processing is hard to gauge since the number of fields in this case is low.

Results for filtered source loading (removing dense vectors)

Benchmark                              (docSize)     (mode)  Mode  Cnt      Score      Error  Units
SourceProviderBenchmark.loadFilterDoc      small      patch  avgt    5      9.619 ±    0.281  us/op
SourceProviderBenchmark.loadFilterDoc      small  synthetic  avgt    5     13.405 ±    0.457  us/op
SourceProviderBenchmark.loadFilterDoc      small     stored  avgt    5     58.580 ±    0.220  us/op
SourceProviderBenchmark.loadFilterDoc      small    exclude  avgt    5      3.926 ±    0.152  us/op
SourceProviderBenchmark.loadFilterDoc     medium      patch  avgt    5    189.760 ±    3.209  us/op
SourceProviderBenchmark.loadFilterDoc     medium  synthetic  avgt    5    130.188 ±    4.486  us/op
SourceProviderBenchmark.loadFilterDoc     medium     stored  avgt    5   3080.400 ±   32.025  us/op
SourceProviderBenchmark.loadFilterDoc     medium    exclude  avgt    5     97.398 ±    1.343  us/op
SourceProviderBenchmark.loadFilterDoc      large      patch  avgt    5    604.829 ±   32.721  us/op
SourceProviderBenchmark.loadFilterDoc      large  synthetic  avgt    5    321.268 ±    6.403  us/op
SourceProviderBenchmark.loadFilterDoc      large     stored  avgt    5  10832.213 ± 2073.713  us/op
SourceProviderBenchmark.loadFilterDoc      large    exclude  avgt    5    398.775 ±  216.675  us/op

For cases involving source filtering:

  • Patch strategy: This became dramatically faster, outperforming the stored strategy by an order of magnitude.
  • Synthetic strategy: This was also highly efficient, given the low number of fields and the absence of original source parsing.

Conclusion and Next Steps

Given that the synthetic strategy's performance is close to the patch strategy, the question becomes: is the patch mode necessary? Although full reliance on the synthetic mode for search use cases might be excessive, I still support maintaining the alternative patch strategy, as proposed in this PR.
I’m interested in hearing feedback from others on this approach.

@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Sep 30, 2024
Spinoff of elastic#113036.

This change introduces optional source filtering directly within source loaders (both synthetic and stored).
The main benefit is seen in synthetic source loaders, as synthetic fields are stored independently.
By filtering while loading the synthetic source, generating the source becomes linear in the number of fields that match the filter.

This update also modifies the get document API to apply source filters earlier—directly through the source loader.
The search API, however, is not affected in this change, since the loaded source is still used by other features (e.g., highlighting, fields, nested hits),
and source filtering is always applied as the final step.
A follow-up will be required to ensure careful handling of all search-related scenarios.