-
Notifications
You must be signed in to change notification settings - Fork 189
[RFC] PPL cluster Command #5255
Description
Problem Statement
Log analysis frequently requires identifying patterns in event data like what types of errors are occurring, which messages are common vs rare, and what distinct categories of events exist. Manually reviewing thousands of log lines is impractical. Users need a way to automatically group similar log events together to quickly triage issues and understand the shape of their data. Expected use cases involve filtering before clustering like clustering error logs, grouping similar events by sourcetype to understand what patterns exist, or finding rare events during an incident time window.
Syntax
cluster [field] [t=threshold] [match=mode] [labelfield=name] [countfield=name] [labelonly=bool] [showcount=bool] [delims=chars]
Parameters:
- field: Field to cluster on (required, no default field to fall back to)
- t: Similarity threshold between 0.0-1.0 (default: 0.8)
- match: Vectorization mode -
termlist(positional),termset(bag-of-words), orngramset(character trigrams) (default:termlist) - labelfield: Name for cluster label output field (default:
cluster_label) - countfield: Name for cluster count output field (default:
cluster_count) - labelonly: If true, only output cluster labels, not counts (default: true)
- showcount: If true, include cluster counts in output (default: true)
- delims: Delimiter characters for tokenization (default: everything except [0-9A-Za-z_])
Text Similarity Approach
Uses lexical (token-based) similarity rather than semantic similarity. Logs are structurally predictable, making lexical matching sufficient and cheaper. Semantic can also be counterproductive , logs about similar issues on different services would embed closely but may be unrelated.
Vectorization Modes
- termlist (default): Ordered token comparison. Fastest. Two events must share tokens in the same positions.
- termset: Unordered token comparison. Catches similarity when word order varies.
- ngramset: Character trigram comparison. Significantly slower.
Clustering Algorithm
Greedy single-pass clustering. Each event is compared against existing cluster representatives. If similarity exceeds threshold t, the event joins that cluster. Otherwise a new cluster is created.
- Time: O(n×k), Space: O(k×d) where k=clusters, d=dimensions
- Order-dependent results (same data in different order can produce different clusters)
- Only stores cluster representatives, not full similarity matrix
Processing Flow
- Collect events in batches (following LogPatternAggFunction pattern)
- Process each batch against existing global cluster state
- Compare new events against cluster representatives only
- Assign to existing cluster or create new one based on threshold
Other Algorithm Options Considered
Union-Find Connected Components
- Compare all document pairs, union those above threshold
- Find globally optimal connected components
- Time: O(n²×α(n)), Space: O(n)
- Order-independent, but much slower for large datasets
Hierarchical Agglomerative Clustering
- Build distance matrix, iteratively merge closest clusters
- High quality results but O(n²) space and O(n³) time
- Impractical for large log datasets
Implementation Plan
Key Components
1. AST Node (core/src/main/java/org/opensearch/sql/ast/tree/Cluster.java)
- Store command parameters: field, threshold, match mode, output options
- Integrate with visitor pattern
2. Clustering Engine (common/src/main/java/org/opensearch/sql/common/cluster/TextSimilarityClustering.java)
- Three vectorization modes: termlist (positional), termset (bag-of-words), ngramset (character trigrams)
- Cosine similarity computation using
commons-text - Vector caching for performance
3. Aggregate Function (core/src/main/java/org/opensearch/sql/calcite/udf/udaf/ClusterLabelAggFunction.java)
- Buffered processing following LogPatternAggFunction pattern
- Thread-safe accumulator with synchronized collections
- Incremental cluster state management across batches
4. Grammar Integration (ppl/src/main/antlr/OpenSearchPPLParser.g4, OpenSearchPPLLexer.g4)
- Add CLUSTER_CMD token and clusterCommand rule
- Support parameters: t, match, field, labelfield, countfield, labelonly, showcount, delims
Risks and Considerations
Performance: O(n×k) algorithm may struggle with datasets having many small clusters. Monitor cluster count growth.
Memory: Buffering strategy prevents unbounded growth, but large buffers still require significant memory.
Thread Safety: Synchronization overhead may impact performance in highly concurrent scenarios.
Order Dependency: Results depend on event processing order, which may surprise users expecting deterministic clustering.
Limitations
Required Field
The field parameter is required. There is no default _raw text field in OpenSearch to fall back to like SPL, so users must always specify which field to cluster on.
High Cardinality Performance
The clustering algorithm is O(n×k) where k is the number of distinct clusters. When k is small relative to n, performance scales linearly . 1M events with ~15 clusters completes in ~8s. When k approaches n (most events are unique), performance degrades: 10K events with 10K clusters takes ~23s, 20K takes ~4 minutes. A max cluster cap of 10,000 is enforced to bound memory usage.
poc: ritvibhatt@8f2804e
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status