Description
Hi Elasticsearch team,
First, kudos to @jimczi for the elegant work on synthetic vectors (#130382) and reindex handling (#130834)! 🙌 These PRs inspired me to explore a generalized solution for _source pruning.
Current Limitation
When _source is configured with includes/excludes (docs), the excluded fields are physically removed from the stored _source. This can cause irreversible data loss: operations like reindex, update, and update_by_query rely on an intact _source, so they silently drop pruned fields. While #130834 patched this for vector fields, the problem persists for every other field type. Users reasonably expect that a document's source can be reconstructed from doc values (as we now do for vectors).
The synthetic_vectors workaround demonstrates that the problem is solvable, but it only shifts the cost into hidden technical debt by creating:
- Field-specific switches that need constant maintenance
- Special-case logic that must be reimplemented per data type
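To make the data loss concrete, here is a toy Python model of the behavior described above. It is not Elasticsearch code: `prune_source` and `reindex` are hypothetical helpers, the field names are made up, and `fnmatchcase` only approximates Elasticsearch's wildcard matching on top-level field names.

```python
from fnmatch import fnmatchcase


def prune_source(doc, includes=(), excludes=()):
    """Toy model of index-time _source filtering on top-level field names."""
    kept = {}
    for field, value in doc.items():
        # With an include list, a field must match at least one include pattern.
        if includes and not any(fnmatchcase(field, p) for p in includes):
            continue
        # Any match against an exclude pattern drops the field.
        if any(fnmatchcase(field, p) for p in excludes):
            continue
        kept[field] = value
    return kept


def reindex(stored_sources):
    """Reindex can only copy what the stored _source still contains."""
    return [dict(src) for src in stored_sources]


original = {"title": "doc-1", "views": 42, "embeddings": [0.1, 0.2]}
stored = prune_source(original, excludes=["embeddings", "views"])
copied = reindex([stored])[0]

# Fields that never reach the destination index.
missing = sorted(set(original) - set(copied))
print(missing)
```

In this model both a vector field and a plain numeric field are lost in the copy, which is the gap that remains for non-vector fields.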
Proposed Solution
Extend the hybrid model from #130382 to all field types by:
- Automatic mode detection
  - If _source is intact → use the traditional stored _source (Mode.STORED).
  - If _source is under synthetic mode → use synthetic source (Mode.SYNTHETIC).
  - If _source has include/exclude rules → automatically enable hybrid source (Mode.HYBRID).

  Flow: _source config → (has includes/excludes) → Hybrid Mode, which serves included fields from the stored _source and excluded fields synthetically from doc_values.
- Universal field support
  Reconstruct ANY pruned field (not just vectors) using existing doc_values:
  - Numerics, dates, keywords → Direct from doc_values
  - Vectors → Existing vector reconstruction
  - Geos/nested → New reconstruction handlers
- Zero-config upgrade
  Fully backward compatible. Users get hybrid behavior automatically when pruning _source.
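The mode detection and hybrid loading steps above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `Mode` names come from this proposal, while `detect_mode`, `load_source`, the shape of the config dict, and the precedence of synthetic mode over include/exclude rules are all hypothetical and not existing Elasticsearch APIs.

```python
from enum import Enum


class Mode(Enum):
    STORED = "stored"
    SYNTHETIC = "synthetic"
    HYBRID = "hybrid"


def detect_mode(source_config):
    """Pick a source mode from a simplified _source mapping dict (assumed shape)."""
    # Assumption: an explicit synthetic mode takes precedence over pruning rules.
    if source_config.get("mode") == "synthetic":
        return Mode.SYNTHETIC
    if source_config.get("includes") or source_config.get("excludes"):
        return Mode.HYBRID
    return Mode.STORED


def load_source(mode, stored_source, doc_values):
    """Build the source returned to reindex/update for one document."""
    if mode is Mode.STORED:
        return dict(stored_source)
    if mode is Mode.SYNTHETIC:
        return dict(doc_values)
    # HYBRID: keep stored fields as-is, fill pruned fields from doc values.
    merged = dict(stored_source)
    for field, value in doc_values.items():
        merged.setdefault(field, value)
    return merged


config = {"includes": ["meta.*"], "excludes": ["embeddings"]}
mode = detect_mode(config)
source = load_source(
    mode,
    stored_source={"meta.id": 7},
    doc_values={"meta.id": 7, "embeddings": [0.1, 0.2]},
)
print(mode.name, source)
```

Stored values win over reconstructed ones in this sketch, since values rebuilt from doc values may not match the originally indexed form exactly.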
Advantages
- Generality: It solves the problem of missing pruned fields in reindex/update operations for any field type, not just vectors.
- Transparency: Users don't need to enable a separate setting (like synthetic_vectors) for specific fields. The behavior is automatically triggered by the _source pruning configuration.
- Consistency: It unifies the handling of pruned fields and synthetic source.
Benefits
| Operation | Current Behavior (with pruned _source) | Proposed Hybrid Behavior |
|---|---|---|
| reindex | ❌ Loses excluded fields | ✅ Reconstructs via doc_values |
| update | ❌ Cannot update pruned fields | ✅ Full field access |
| update_by_query | ❌ Cannot update pruned fields | ✅ Full field access |
- Solves reindex/update issues for all field types (generalizes #130834, "Ensure vectors are always included in reindex actions")
- Eliminates the need for field-specific switches like synthetic_vectors
- Unlocks new use cases:

      // Keep small metadata in _source, reconstruct heavy fields on demand
      "_source": {
        "includes": ["meta.*"],
        "excludes": ["embeddings", "logs"]
      }
Next Steps
I would like to get feedback from the team (@jimczi and others involved in the related PRs - cc @benwtrent @martijnvg ) on the feasibility and desirability of this approach.
If the team agrees this aligns with Elasticsearch's direction, I am willing to contribute. I plan to start by writing a unit test that demonstrates the problem (reindex fails for pruned non-vector fields) and then propose a solution following the hybrid model.
Looking forward to your thoughts!