[RFC] PPL `cluster` Command

## Problem Statement

Log analysis frequently requires identifying patterns in event data like what types of errors are occurring, which messages are common vs rare, and what distinct categories of events exist. Manually reviewing thousands of log lines is impractical. Users need a way to automatically group similar log events together to quickly triage issues and understand the shape of their data. Expected use cases involve filtering before clustering like clustering error logs, grouping similar events by sourcetype to understand what patterns exist, or finding rare events during an incident time window.
 
## Syntax

```
cluster [field] [t=threshold] [match=mode] [labelfield=name] [countfield=name] [labelonly=bool] [showcount=bool] [delims=chars]
```

**Parameters:**

- **field**: Field to cluster on (required, no default field to fall back to)
- **t**: Similarity threshold between 0.0-1.0 (default: 0.8)
- **match**: Vectorization mode - `termlist` (positional), `termset` (bag-of-words), or `ngramset` (character trigrams) (default: `termlist`)
- **labelfield**: Name for cluster label output field (default: `cluster_label`)
- **countfield**: Name for cluster count output field (default: `cluster_count`)
- **labelonly**: If true, only output cluster labels, not counts (default: true)
- **showcount**: If true, include cluster counts in output (default: true)
- **delims**: Delimiter characters for tokenization (default: everything except [0-9A-Za-z_])

## Text Similarity Approach

Uses lexical (token-based) similarity rather than semantic similarity. Logs are structurally predictable, making lexical matching sufficient and cheaper. Semantic can also be counterproductive , logs about similar issues on different services would embed closely but may be unrelated.

### Vectorization Modes
- termlist (default): Ordered token comparison. Fastest. Two events must share tokens in the same positions.
- termset: Unordered token comparison. Catches similarity when word order varies.
- ngramset: Character trigram comparison. Significantly slower.

## Clustering Algorithm

Greedy single-pass clustering. Each event is compared against existing cluster representatives. If similarity exceeds threshold `t`, the event joins that cluster. Otherwise a new cluster is created.

- Time: O(n×k), Space: O(k×d) where k=clusters, d=dimensions
- Order-dependent results (same data in different order can produce different clusters)
- Only stores cluster representatives, not full similarity matrix

### Processing Flow

1. Collect events in batches (following LogPatternAggFunction pattern)
2. Process each batch against existing global cluster state
3. Compare new events against cluster representatives only
4. Assign to existing cluster or create new one based on threshold

### Other Algorithm Options Considered

**Union-Find Connected Components**
- Compare all document pairs, union those above threshold
- Find globally optimal connected components
- Time: O(n²×α(n)), Space: O(n)
- Order-independent, but much slower for large datasets

**Hierarchical Agglomerative Clustering**
- Build distance matrix, iteratively merge closest clusters
- High quality results but O(n²) space and O(n³) time
- Impractical for large log datasets

## Implementation Plan

### Key Components

**1. AST Node (`core/src/main/java/org/opensearch/sql/ast/tree/Cluster.java`)**
- Store command parameters: field, threshold, match mode, output options
- Integrate with visitor pattern

**2. Clustering Engine (`common/src/main/java/org/opensearch/sql/common/cluster/TextSimilarityClustering.java`)**
- Three vectorization modes: termlist (positional), termset (bag-of-words), ngramset (character trigrams)
- Cosine similarity computation using `commons-text`
- Vector caching for performance

**3. Aggregate Function (`core/src/main/java/org/opensearch/sql/calcite/udf/udaf/ClusterLabelAggFunction.java`)**
- Buffered processing following LogPatternAggFunction pattern
- Thread-safe accumulator with synchronized collections
- Incremental cluster state management across batches

**4. Grammar Integration (`ppl/src/main/antlr/OpenSearchPPLParser.g4`, `OpenSearchPPLLexer.g4`)**
- Add CLUSTER_CMD token and clusterCommand rule
- Support parameters: t, match, field, labelfield, countfield, labelonly, showcount, delims

## Risks and Considerations

**Performance**: O(n×k) algorithm may struggle with datasets having many small clusters. Monitor cluster count growth.

**Memory**: Buffering strategy prevents unbounded growth, but large buffers still require significant memory.

**Thread Safety**: Synchronization overhead may impact performance in highly concurrent scenarios.

**Order Dependency**: Results depend on event processing order, which may surprise users expecting deterministic clustering.

## Limitations 

### Required Field
The `field` parameter is required. There is no default _raw text field in OpenSearch to fall back to like SPL, so users must always specify which field to cluster on.

### High Cardinality Performance
The clustering algorithm is O(n×k) where k is the number of distinct clusters. When k is small relative to n, performance scales linearly . 1M events with ~15 clusters completes in ~8s. When k approaches n (most events are unique), performance degrades: 10K events with 10K clusters takes ~23s, 20K takes ~4 minutes. A max cluster cap of 10,000 is enforced to bound memory usage.

poc: https://github.com/ritvibhatt/sql/commit/8f2804ed825c72b02d7d16a934af3114a0c10088

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] PPL `cluster` Command #5255

Problem Statement

Syntax

Text Similarity Approach

Vectorization Modes

Clustering Algorithm

Processing Flow

Other Algorithm Options Considered

Implementation Plan

Key Components

Risks and Considerations

Limitations

Required Field

High Cardinality Performance

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] PPL cluster Command #5255

Description

Problem Statement

Syntax

Text Similarity Approach

Vectorization Modes

Clustering Algorithm

Processing Flow

Other Algorithm Options Considered

Implementation Plan

Key Components

Risks and Considerations

Limitations

Required Field

High Cardinality Performance

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[RFC] PPL `cluster` Command #5255