
Conversation

@jimczi commented Sep 17, 2024

I am opening this PR to test a solution for removing field values from the stored _source. The goal is to let mapped fields strip their values from the original _source, with a synthetic version of those fields restoring the original values at search and retrieval time.

This branch presents a working solution built around a new option, exposed through DocumentParserContext#addSourceFieldPatch at indexing time and Mapping#getPatchFieldLoader at search time.

The current implementation replaces patched fields with an ID. This ID is stored as a numeric doc value linked to the patch field. Each field can only have one patch per document. Multi-valued patches are accessible only through a parent nested field.

At indexing time, if the field with the full path foo.bar has a registered patch, the original _source:

{
  "foo": {
    "bar": [0, 1, 2, 3]
  },
  "field": "value"
}

is rewritten to:

{
  "foo": {
    "bar": 0
  },
  "field": "value"
}

Here, 0 is the reference patch ID for the field. This patch ID is indexed as a numeric doc value in a metadata field named after the full path of the field.

At search time, the process is reversed: reference patch IDs are replaced with the original values, which the field mappers retrieve from outside the stored _source (for example through their synthetic source capabilities).
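
To make the round trip concrete, here is a minimal sketch of the patch/restore flow. It is illustrative only: the class and method names are hypothetical, the source is modeled as a flat map with a single patched field, and nothing here reflects the actual signatures of DocumentParserContext#addSourceFieldPatch or Mapping#getPatchFieldLoader.

import java.util.Map;

// Hypothetical sketch of the patch round trip; not the PR's actual classes.
class SourcePatchSketch {

    // Indexing time: the patched field's value is replaced by a reference patch ID.
    // In the PR, the ID is also indexed as a numeric doc value in a metadata field
    // named after the full path of the patched field.
    static void patch(Map<String, Object> source, String fieldPath, long patchId) {
        source.put(fieldPath, patchId);
    }

    // Search time: the patch ID is resolved back to the original value, which the
    // field mapper loads from outside the stored _source (e.g. from doc values).
    static void restore(Map<String, Object> source, String fieldPath, PatchFieldLoader loader) {
        long patchId = ((Number) source.get(fieldPath)).longValue();
        source.put(fieldPath, loader.load(patchId));
    }

    // Stand-in for the per-field loader the mapping would provide at search time.
    interface PatchFieldLoader {
        Object load(long patchId);
    }
}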

If this approach is approved, I will open smaller PRs to introduce the feature more gradually.
Here are the changes that could be made progressively:

  • Always create the source provider from the mapping (85dad44)
  • Add an option to remove field values from the stored source and provide a way to reconstruct the original source at search/retrieval time (da38044)
    • Handle includes and excludes
    • Handle peer recovery
  • Apply the option to the dense vector field mapper (and discuss whether it should be exposed to users or not).
  • Adjust the semantic_text field to always use this option on dense vector fields.

@jimczi added the >feature and :StorageEngine/Mapping (The storage related side of mappings) labels on Sep 17, 2024
@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

jimczi commented Sep 17, 2024

cc @martijnvg @kkrik-es, could you please review the approach proposed in da38044? As we discussed earlier, this option is meant to provide an alternative to a full synthetic source. For scenarios where quick retrieval of the source is crucial, this option allows us to remove values from a few fields that are expensive to store while preserving the original source for other fields.

cc @benwtrent, the goal is to expose or apply this option transparently on dense vector fields with a large number of dimensions.

cc @carlosdelest @Mikep86, for the semantic text field, the goal is to use this option transparently to avoid paying the extra storage cost.

@jimczi force-pushed the patch_source_mapping branch from c673ed3 to 1bb7c4d on September 17, 2024 17:00
@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

@jimczi force-pushed the patch_source_mapping branch from e5a2fbe to d96cb8d on September 18, 2024 08:43
@kkrik-es

Hey Jim, I may be wrong but this seems very similar to the existing mechanism for synthetic source using _ignored_source. I was wondering if we can reuse the same mechanism instead of introducing a new one with very similar functionality. If there are shortcomings around _ignored_source, maybe we can address these instead?

jimczi commented Sep 24, 2024

I made some modifications to the branch to improve the speed of patching. However, I still need to address the code duplication between synthetic loading and patch loading. Before tackling this, I wanted to assess the impact of my changes with a benchmark.

The benchmark focuses on two use cases:

  • Loading the source with dense vectors (SourceProviderBenchmark.loadDoc)
  • Loading the source without dense vectors (SourceProviderBenchmark.loadFilterDoc)

The former case is intended for scenarios where dense vectors are necessary (e.g., reindexing or debugging), which should be relatively rare. The latter case is for search operations, where avoiding the cost of reading large vectors from disk and transferring them over the network is important.

Source Filters Addition

To enhance filtering efficiency, I added an option to apply source filters directly during source loading. This allows each strategy—stored, synthetic, and patch—to choose the most efficient way to apply filters.
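
As an illustration of why filtering during loading helps, the sketch below shows a synthetic-style loader that simply skips excluded fields instead of materializing the full source and filtering afterwards. The names and the flat field model are assumptions made for the example; the actual change threads the filter through the existing source loader implementations.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: excluded fields are skipped while building the source, so the
// cost is linear in the number of fields that survive the filter.
class FilteredSyntheticLoadingSketch {
    static Map<String, Object> load(Set<String> syntheticFields,
                                    Set<String> excludedFields,
                                    Function<String, Object> fieldReader) {
        Map<String, Object> source = new LinkedHashMap<>();
        for (String field : syntheticFields) {
            if (excludedFields.contains(field)) {
                continue; // excluded fields are never read from disk at all
            }
            source.put(field, fieldReader.apply(field));
        }
        return source;
    }
}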

Benchmark Details

The benchmark uses three Wikipedia documents of varying sizes:

  • Small: 1 chunk

  • Medium: 50 chunks

  • Large: 200 chunks

Each chunk contains the text and a vector of 1024 float dimensions, indexed as a nested field.
We compare different strategies to store and retrieve the source:

  • stored: The default, storing the original source as is.

  • synthetic: Source in synthetic mode.

  • patch: Filters the dense vectors out of the stored source and adds them back at search time.

  • exclude: Filters the dense vectors out of the stored source permanently (they cannot be reconstructed at search time).

The test measures performance by loading each document once from disk using the different strategies.
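
For reference, the results below come from a JMH average-time benchmark, so the harness presumably follows a standard shape along the lines of the skeleton here; the parameter values are taken from the results tables, the class name is invented, and the loading logic is only a placeholder.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SourceProviderBenchmarkSketch {

    @Param({ "small", "medium", "large" })
    public String docSize;

    @Param({ "stored", "synthetic", "patch", "exclude" })
    public String mode;

    @Benchmark
    public Object loadDoc() {
        // Load the entire _source of the document with the selected strategy.
        return loadSource(docSize, mode, false);
    }

    @Benchmark
    public Object loadFilterDoc() {
        // Load the _source with the dense vector fields filtered out.
        return loadSource(docSize, mode, true);
    }

    private Object loadSource(String docSize, String mode, boolean filterVectors) {
        // Placeholder for the actual source provider under test.
        return null;
    }
}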

Results for entire source loading

Benchmark                              (docSize)     (mode)  Mode  Cnt      Score      Error  Units
SourceProviderBenchmark.loadDoc            small      patch  avgt    5     54.143 ±    1.290  us/op
SourceProviderBenchmark.loadDoc            small  synthetic  avgt    5     62.015 ±    1.216  us/op
SourceProviderBenchmark.loadDoc            small     stored  avgt    5     26.264 ±    0.511  us/op
SourceProviderBenchmark.loadDoc            small    exclude  avgt    5      3.961 ±    0.044  us/op
SourceProviderBenchmark.loadDoc           medium      patch  avgt    5   2613.940 ±   34.037  us/op
SourceProviderBenchmark.loadDoc           medium  synthetic  avgt    5   3807.634 ±   19.907  us/op
SourceProviderBenchmark.loadDoc           medium     stored  avgt    5   1620.184 ±  109.972  us/op
SourceProviderBenchmark.loadDoc           medium    exclude  avgt    5     97.087 ±    1.235  us/op
SourceProviderBenchmark.loadDoc            large      patch  avgt    5   7794.028 ±   31.632  us/op
SourceProviderBenchmark.loadDoc            large  synthetic  avgt    5  11948.199 ±  802.587  us/op
SourceProviderBenchmark.loadDoc            large     stored  avgt    5   4891.576 ±  136.645  us/op
SourceProviderBenchmark.loadDoc            large    exclude  avgt    5    319.531 ±   22.638  us/op

When retrieving dense vectors:

  • Stored strategy: This was the fastest, as expected, since there is no need to parse the _source. The dense vector can be returned directly.
  • Patch strategy: This was the second fastest. While parsing is required, it operates on a _source without dense vectors. The patching process adds about 50% to the overall time compared to the stored strategy.
  • Synthetic strategy: This was the slowest, but not by a significant margin. However, the actual cost of synthetic processing is hard to gauge since the number of fields in this case is low.

Results for filtered source loading (removing dense vectors)

Benchmark                              (docSize)     (mode)  Mode  Cnt      Score      Error  Units
SourceProviderBenchmark.loadFilterDoc      small      patch  avgt    5      9.619 ±    0.281  us/op
SourceProviderBenchmark.loadFilterDoc      small  synthetic  avgt    5     13.405 ±    0.457  us/op
SourceProviderBenchmark.loadFilterDoc      small     stored  avgt    5     58.580 ±    0.220  us/op
SourceProviderBenchmark.loadFilterDoc      small    exclude  avgt    5      3.926 ±    0.152  us/op
SourceProviderBenchmark.loadFilterDoc     medium      patch  avgt    5    189.760 ±    3.209  us/op
SourceProviderBenchmark.loadFilterDoc     medium  synthetic  avgt    5    130.188 ±    4.486  us/op
SourceProviderBenchmark.loadFilterDoc     medium     stored  avgt    5   3080.400 ±   32.025  us/op
SourceProviderBenchmark.loadFilterDoc     medium    exclude  avgt    5     97.398 ±    1.343  us/op
SourceProviderBenchmark.loadFilterDoc      large      patch  avgt    5    604.829 ±   32.721  us/op
SourceProviderBenchmark.loadFilterDoc      large  synthetic  avgt    5    321.268 ±    6.403  us/op
SourceProviderBenchmark.loadFilterDoc      large     stored  avgt    5  10832.213 ± 2073.713  us/op
SourceProviderBenchmark.loadFilterDoc      large    exclude  avgt    5    398.775 ±  216.675  us/op

For cases involving source filtering:

  • Patch strategy: This became dramatically faster, outperforming the stored strategy by an order of magnitude.
  • Synthetic strategy: This was also highly efficient, given the low number of fields and the absence of original source parsing.

Conclusion and Next Steps

Given that the synthetic strategy's performance is close to the patch strategy, the question becomes: is the patch mode necessary? Although full reliance on the synthetic mode for search use cases might be excessive, I still support maintaining the alternative patch strategy, as proposed in this PR.
I’m interested in hearing feedback from others on this approach.

@elasticsearchmachine

Hi @jimczi, I've created a changelog YAML for you.

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Sep 30, 2024
Spinoff of elastic#113036.

This change introduces optional source filtering directly within source loaders (both synthetic and stored).
The main benefit is seen in synthetic source loaders, as synthetic fields are stored independently.
By filtering while loading the synthetic source, generating the source becomes linear in the number of fields that match the filter.

This update also modifies the get document API to apply source filters earlier—directly through the source loader.
The search API, however, is not affected in this change, since the loaded source is still used by other features (e.g., highlighting, fields, nested hits),
and source filtering is always applied as the final step.
A follow-up will be required to ensure careful handling of all search-related scenarios.