Proposal: Adaptive Hybrid Synthetic Source for Pruned _source Fields #133203

@Rassyan

Description

Hi Elasticsearch team,

First, kudos to @jimczi for the elegant work on synthetic vectors (#130382) and reindex handling (#130834)! 🙌 These PRs inspired me to explore a generalized solution for _source pruning.

Current Limitation

When include/exclude rules are set on _source (docs), the excluded fields are physically removed from the stored _source, which can cause irreversible data loss. Operations such as reindex, update, and update_by_query rely on an intact _source, so they silently discard the pruned fields. While #130834 patched this for vector fields, the problem persists for every other field type. Users reasonably expect that document sources can be reconstructed from doc values (as is now done for vectors).
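To make the failure mode concrete, here is a minimal toy model in Python. It is not Elasticsearch code: `prune_source` and `reindex` are hypothetical helpers, documents are flattened to dotted field paths, and `fnmatch` wildcards only approximate the real pattern matching.

```python
from fnmatch import fnmatchcase


def prune_source(doc, includes=(), excludes=()):
    """Toy model of _source filtering on a flat doc keyed by dotted field paths."""
    kept = {}
    for field, value in doc.items():
        # An empty include list means "include everything".
        included = not includes or any(fnmatchcase(field, p) for p in includes)
        excluded = any(fnmatchcase(field, p) for p in excludes)
        if included and not excluded:
            kept[field] = value
    return kept


def reindex(stored_source):
    """Reindex only sees the stored _source, so it copies that and nothing else."""
    return dict(stored_source)


original = {"title": "quarterly report", "views": 42, "tags": ["finance", "q3"]}
stored = prune_source(original, excludes=["views", "tags"])
copied = reindex(stored)

print(copied)                               # {'title': 'quarterly report'}
print(sorted(set(original) - set(copied)))  # ['tags', 'views']
```

The excluded fields are still searchable in the source index, but the reindexed copy never receives them.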

The synthetic_vectors workaround shows that the problem is solvable, but it only moves the technical debt elsewhere by creating:

  • Field-specific switches that need constant maintenance
  • Special-case logic that must be reimplemented per data type

Proposed Solution

Extend the hybrid model from #130382 to all field types by:

  1. Automatic mode detection

    • If _source is intact → use the traditional stored _source (Mode.STORED).
    • If _source is in synthetic mode → use synthetic source (Mode.SYNTHETIC).
    • If _source has include/exclude rules → auto-enable hybrid source (Mode.HYBRID).
    graph LR
    A[_source config] -->|has includes/excludes| B[Hybrid Mode]
    B --> C[Stored: included fields]
    B --> D[Synthetic: excluded fields from doc_values]
    
  2. Universal field support
    Reconstruct ANY pruned field (not just vectors) using existing doc_values:

    • Numerics, dates, keywords → Direct from doc_values
    • Vectors → Existing vector reconstruction
    • Geos/nested → New reconstruction handlers
  3. Zero-config upgrade
    Fully backward compatible. Users get hybrid behavior automatically when pruning _source.
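The three steps above can be sketched as a small Python model. This is only an illustration under stated assumptions: `detect_mode` and `load_source` are hypothetical names, the `"mode"` key stands in for however synthetic source is actually configured, the precedence between synthetic and hybrid is a guess, and doc_values are modeled as a plain dict rather than real per-field readers.

```python
from enum import Enum


class Mode(Enum):
    STORED = "stored"
    SYNTHETIC = "synthetic"
    HYBRID = "hybrid"


def detect_mode(source_config):
    """Pick a source mode from a _source config dict (assumed decision order)."""
    if source_config.get("mode") == "synthetic":
        return Mode.SYNTHETIC
    if source_config.get("includes") or source_config.get("excludes"):
        return Mode.HYBRID
    return Mode.STORED


def load_source(mode, stored_source, doc_values):
    """Return the document source according to the selected mode."""
    if mode is Mode.STORED:
        return dict(stored_source)
    if mode is Mode.SYNTHETIC:
        return dict(doc_values)
    # HYBRID: keep what was stored, fill only the pruned fields from doc_values.
    merged = dict(stored_source)
    for field, value in doc_values.items():
        merged.setdefault(field, value)
    return merged


config = {"excludes": ["views", "tags"]}
stored = {"title": "quarterly report"}
doc_values = {"views": 42, "tags": ["finance", "q3"]}

mode = detect_mode(config)
rebuilt = load_source(mode, stored, doc_values)

print(mode.name)  # HYBRID
print(rebuilt)    # {'title': 'quarterly report', 'views': 42, 'tags': ['finance', 'q3']}
```

Stored values win over reconstructed ones here, so a field present in both places is returned exactly as it was indexed.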

Advantages

  1. Generality: It solves the problem of missing pruned fields in reindex/update operations for any field type, not just vectors.
  2. Transparency: Users don't need to enable a separate setting (like synthetic_vectors) for specific fields. The behavior is automatically triggered by the _source pruning configuration.
  3. Consistency: It unifies the handling of pruned fields and synthetic source.

Benefits

| Operation | Current Behavior (with pruned _source) | Proposed Hybrid Behavior |
| --- | --- | --- |
| reindex | ❌ Loses excluded fields | ✅ Reconstructs via doc_values |
| update | ❌ Cannot update pruned fields | ✅ Full field access |
| update_by_query | ❌ Cannot update pruned fields | ✅ Full field access |

  • Solves reindex/update issues for all field types (generalizes #130834, "Ensure vectors are always included in reindex actions")
  • Eliminates need for field-specific switches like synthetic_vectors
  • Unlocks new use cases:
    // Keep small metadata in _source, reconstruct heavy fields on demand
    "_source": {
      "includes": ["meta.*"],
      "excludes": ["embeddings", "logs"]
    }
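For context, the fragment above would sit under `mappings` in an index-creation request. A minimal sketch of that body, built as a Python dict: it uses a dot-separated pattern (`meta.*`), which is the form the _source include/exclude docs use, and the field names are the illustrative ones from this proposal.

```python
import json

# Index-creation body with a pruned _source. Under this proposal, such a
# config would switch the index to hybrid source automatically.
index_body = {
    "mappings": {
        "_source": {
            "includes": ["meta.*"],
            "excludes": ["embeddings", "logs"],
        }
    }
}

print(json.dumps(index_body, indent=2))
```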

Next Steps

I would like to get feedback from the team (@jimczi and others involved in the related PRs - cc @benwtrent @martijnvg ) on the feasibility and desirability of this approach.

If the team agrees this aligns with Elasticsearch's direction, I am willing to contribute. I plan to start by writing a unit test that demonstrates the problem (reindex fails for pruned non-vector fields) and then propose a solution following the hybrid model.
Looking forward to your thoughts!
