-
Notifications
You must be signed in to change notification settings - Fork 189
[FEATURE] Add Union command in PPL #5110
Copy link
Copy link
Open
Labels
PPLPiped processing languagePiped processing languageRFCRequest For CommentsRequest For CommentsenhancementNew feature or requestNew feature or requestfeature
Description
Proposal — Command Syntax, Options, and Precise Semantics
Command
union <dataset> [, <dataset>...]
Arguments
dataset (required, repeatable)
The dataset(s) whose results will be merged. Each dataset can be one of the following OpenSearch-supported sources:
1. Index / index pattern / alias
- Examples: logs-*, my-index, security-events-alias
2. Subsearch result set
- Example:
[search index=a | where status=200 | fields user action]
Multiple datasets must be separated by commas.
Syntax examples:
| union dataset_a, dataset_b, dataset_c
| union [search index=a], [search index=b]
| union logs-*, security-events-alias
Out of scope
- Saved searches as datasets
- Semantic data models
- Lookup tables as first-class dataset sources
Constraints
- At least one dataset must be specified
- Datasets must be resolvable at analysis time
Position in pipeline
As first command:
| union dataset_a, dataset_b
All datasets must be explicit arguments.
Mid-pipeline:
search index=a | where status=200 | union dataset_b
The upstream result set is implicitly included as the first dataset.
This is equivalent to:
| union [search index=a | where status=200], dataset_b
---
Precise Semantics
Dataset Processing
Input
- First command: only the explicitly specified datasets
- Mid-pipeline: upstream result set (implicit) + explicit datasets
Output
- A single merged result set containing all rows from all inputs
Schema Merging
Output schema is the union of all field names across all datasets
For each output row:
- Fields present in the originating dataset retain their values
- Fields absent in the originating dataset are set to null
Type resolution
- If types are compatible, coerce to a common supertype (e.g., int + float → float)
- Otherwise, coerce to string
Ordering Semantics
- No implicit ordering guarantees
- The output order is determined by dataset evaluation order
- Deterministic ordering requires an explicit sort command
Value Preservation
UNION ALL semantics
- No deduplication
- All rows from all inputs are preserved exactly once
- Output cardinality = sum of input cardinalities
Error Handling
The command fails if:
- No datasets are provided
- Any dataset cannot be resolved or accessed
- Schema resolution fails due to unrecoverable type conflicts
---
Examples
1. Basic union (same schema)
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
Query
... | union dataset_b
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
---
2. Union as first command
Input
Index A:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
Query
| union [search index=a | fields user, action], [search index=b | fields user, action]
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
---
3. Non-overlapping schemas
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ host │ status │
├──────┼────────┤
│ web1 │ 200 │
├──────┼────────┤
│ web2 │ 404 │
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┬──────┬────────┐
│ user │ action │ host │ status │
├──────┼────────┼──────┼────────┤
│ joe │ login │ null │ null │
├──────┼────────┼──────┼────────┤
│ sam │ logout │ null │ null │
├──────┼────────┼──────┼────────┤
│ null │ null │ web1 │ 200 │
├──────┼────────┼──────┼────────┤
│ null │ null │ web2 │ 404 │
└──────┴────────┴──────┴────────┘
---
4. Partial schema overlap
Input
Dataset A (upstream):
┌──────┬────────┬───────────┐
│ user │ action │ timestamp │
├──────┼────────┼───────────┤
│ joe │ login │ 100 │
└──────┴────────┴───────────┘
Dataset B:
┌──────┬──────┬────────┐
│ user │ host │ status │
├──────┼──────┼────────┤
│ joe │ web1 │ 200 │
└──────┴──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┬───────────┬──────┬────────┐
│ user │ action │ timestamp │ host │ status │
├──────┼────────┼───────────┼──────┼────────┤
│ joe │ login │ 100 │ null │ null │
├──────┼────────┼───────────┼──────┼────────┤
│ joe │ null │ null │ web1 │ 200 │
└──────┴────────┴───────────┴──────┴────────┘
---
5. Subsearch as dataset
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┬────────┐
│ user │ action │ status │
├───────┼────────┼────────┤
│ alice │ login │ 200 │
├───────┼────────┼────────┤
│ bob │ logout │ 404 │
└───────┴────────┴────────┘
Query
... | union [search index=b | where status=200 | fields user, action]
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
└───────┴────────┘
---
6. Multiple datasets
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ logout │
└───────┴────────┘
Dataset C:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ bob │ login │
└──────┴────────┘
Query
... | union dataset_b, dataset_c
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ alice │ logout │
├───────┼────────┤
│ bob │ login │
└───────┴────────┘
---
7. Duplicate rows preserved
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ bob │ logout │
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ bob │ logout │
└──────┴────────┘
Note: Duplicates remain; no deduplication is performed (row joe | login appears twice).
---
8. Empty dataset
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B (empty):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Note: If a dataset returns zero rows, it contributes nothing to the output.
---
9. Mid-pipeline union behavior
Input
Index A:
┌──────┬────────┬────────┐
│ user │ action │ status │
├──────┼────────┼────────┤
│ joe │ login │ 200 │
├──────┼────────┼────────┤
│ sam │ logout │ 200 │
├──────┼────────┼────────┤
│ bob │ login │ 404 │
└──────┴────────┴────────┘
Index B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ carol │ logout │
└───────┴────────┘
Query
search index=a | where status=200 | union [search index=b]
Explanation: The filtered result from index A (rows with status=200) is implicitly the first dataset, then unioned with all rows from index B.
Output
┌───────┬────────┬────────┐
│ user │ action │ status │
├───────┼────────┼────────┤
│ joe │ login │ 200 │
├───────┼────────┼────────┤
│ sam │ logout │ 200 │
├───────┼────────┼────────┤
│ alice │ login │ null │
├───────┼────────┼────────┤
│ carol │ logout │ null │
└───────┴────────┴────────┘
---
10. Union with explicit ordering
Input
Dataset A:
┌──────┬───────────┐
│ user │ timestamp │
├──────┼───────────┤
│ joe │ 200 │
├──────┼───────────┤
│ sam │ 100 │
└──────┴───────────┘
Dataset B:
┌───────┬───────────┐
│ user │ timestamp │
├───────┼───────────┤
│ alice │ 150 │
├───────┼───────────┤
│ bob │ 250 │
└───────┴───────────┘
Query
| union dataset_a, dataset_b | sort timestamp desc
Output
┌───────┬───────────┐
│ user │ timestamp │
├───────┼───────────┤
│ bob │ 250 │
├───────┼───────────┤
│ joe │ 200 │
├───────┼───────────┤
│ alice │ 150 │
├───────┼───────────┤
│ sam │ 100 │
└───────┴───────────┘
Note: Without explicit sort, output order is not guaranteed. The sort command provides deterministic ordering.
---
Implementation Notes (OpenSearch-specific)
Execution model
- Implemented as a logical UNION ALL operator in the Calcite plan
- No special execution routing or streaming classification
Ordering
- Ordering is not preserved or inferred
- Explicit sort is required for deterministic results
Performance
- Dataset queries are evaluated independently
- Results are appended in evaluation order
Field alignment
- NULL projection is applied to align schemas before union
Delimiter syntax
- Comma (,) is the standard delimiter between datasets
- Whitespace around commas is optional but recommended for readability
---
Non-Goals (Explicitly Out of Scope)
- Time-based result interleaving
- Implicit _time or @timestamp ordering
- Saved search datasets
- Semantic data models
- Lookup tables as dataset sources
- Subsearch-specific execution knobs (maxout, maxtime, caching)
These may be explored in future iterations.
---
Comparison
OpenSearch PPL union vs Splunk SPL union
┌───────────────────┬─────────────────┬─────────────────────┐
│ Feature │ Splunk SPL │ OpenSearch PPL │
├───────────────────┼─────────────────┼─────────────────────┤
│ Deduplication │ UNION ALL │ UNION ALL │
├───────────────────┼─────────────────┼─────────────────────┤
│ Schema handling │ Union of fields │ Union of fields │
├───────────────────┼─────────────────┼─────────────────────┤
│ Time interleaving │ Yes (streaming) │ No │
├───────────────────┼─────────────────┼─────────────────────┤
│ Implicit ordering │ _time DESC │ None │
├───────────────────┼─────────────────┼─────────────────────┤
│ Delimiter syntax │ Space or comma │ Comma only │
├───────────────────┼─────────────────┼─────────────────────┤
│ Data models │ Supported │ Not supported │
├───────────────────┼─────────────────┼─────────────────────┤
│ Saved searches │ Supported │ Not supported │
└───────────────────┴─────────────────┴─────────────────────┘
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
PPLPiped processing languagePiped processing languageRFCRequest For CommentsRequest For CommentsenhancementNew feature or requestNew feature or requestfeature
Type
Projects
Status
New
Status
Not Started
Status
In progress