Skip to content

[RFC] PPL cluster Command #5255

@ritvibhatt

Description

@ritvibhatt

Problem Statement

Log analysis frequently requires identifying patterns in event data like what types of errors are occurring, which messages are common vs rare, and what distinct categories of events exist. Manually reviewing thousands of log lines is impractical. Users need a way to automatically group similar log events together to quickly triage issues and understand the shape of their data. Expected use cases involve filtering before clustering like clustering error logs, grouping similar events by sourcetype to understand what patterns exist, or finding rare events during an incident time window.

Syntax

cluster [field] [t=threshold] [match=mode] [labelfield=name] [countfield=name] [labelonly=bool] [showcount=bool] [delims=chars]

Parameters:

  • field: Field to cluster on (required, no default field to fall back to)
  • t: Similarity threshold between 0.0-1.0 (default: 0.8)
  • match: Vectorization mode - termlist (positional), termset (bag-of-words), or ngramset (character trigrams) (default: termlist)
  • labelfield: Name for cluster label output field (default: cluster_label)
  • countfield: Name for cluster count output field (default: cluster_count)
  • labelonly: If true, only output cluster labels, not counts (default: true)
  • showcount: If true, include cluster counts in output (default: true)
  • delims: Delimiter characters for tokenization (default: everything except [0-9A-Za-z_])

Text Similarity Approach

Uses lexical (token-based) similarity rather than semantic similarity. Logs are structurally predictable, making lexical matching sufficient and cheaper. Semantic can also be counterproductive , logs about similar issues on different services would embed closely but may be unrelated.

Vectorization Modes

  • termlist (default): Ordered token comparison. Fastest. Two events must share tokens in the same positions.
  • termset: Unordered token comparison. Catches similarity when word order varies.
  • ngramset: Character trigram comparison. Significantly slower.

Clustering Algorithm

Greedy single-pass clustering. Each event is compared against existing cluster representatives. If similarity exceeds threshold t, the event joins that cluster. Otherwise a new cluster is created.

  • Time: O(n×k), Space: O(k×d) where k=clusters, d=dimensions
  • Order-dependent results (same data in different order can produce different clusters)
  • Only stores cluster representatives, not full similarity matrix

Processing Flow

  1. Collect events in batches (following LogPatternAggFunction pattern)
  2. Process each batch against existing global cluster state
  3. Compare new events against cluster representatives only
  4. Assign to existing cluster or create new one based on threshold

Other Algorithm Options Considered

Union-Find Connected Components

  • Compare all document pairs, union those above threshold
  • Find globally optimal connected components
  • Time: O(n²×α(n)), Space: O(n)
  • Order-independent, but much slower for large datasets

Hierarchical Agglomerative Clustering

  • Build distance matrix, iteratively merge closest clusters
  • High quality results but O(n²) space and O(n³) time
  • Impractical for large log datasets

Implementation Plan

Key Components

1. AST Node (core/src/main/java/org/opensearch/sql/ast/tree/Cluster.java)

  • Store command parameters: field, threshold, match mode, output options
  • Integrate with visitor pattern

2. Clustering Engine (common/src/main/java/org/opensearch/sql/common/cluster/TextSimilarityClustering.java)

  • Three vectorization modes: termlist (positional), termset (bag-of-words), ngramset (character trigrams)
  • Cosine similarity computation using commons-text
  • Vector caching for performance

3. Aggregate Function (core/src/main/java/org/opensearch/sql/calcite/udf/udaf/ClusterLabelAggFunction.java)

  • Buffered processing following LogPatternAggFunction pattern
  • Thread-safe accumulator with synchronized collections
  • Incremental cluster state management across batches

4. Grammar Integration (ppl/src/main/antlr/OpenSearchPPLParser.g4, OpenSearchPPLLexer.g4)

  • Add CLUSTER_CMD token and clusterCommand rule
  • Support parameters: t, match, field, labelfield, countfield, labelonly, showcount, delims

Risks and Considerations

Performance: O(n×k) algorithm may struggle with datasets having many small clusters. Monitor cluster count growth.

Memory: Buffering strategy prevents unbounded growth, but large buffers still require significant memory.

Thread Safety: Synchronization overhead may impact performance in highly concurrent scenarios.

Order Dependency: Results depend on event processing order, which may surprise users expecting deterministic clustering.

Limitations

Required Field

The field parameter is required. There is no default _raw text field in OpenSearch to fall back to like SPL, so users must always specify which field to cluster on.

High Cardinality Performance

The clustering algorithm is O(n×k) where k is the number of distinct clusters. When k is small relative to n, performance scales linearly . 1M events with ~15 clusters completes in ~8s. When k approaches n (most events are unique), performance degrades: 10K events with 10K clusters takes ~23s, 20K takes ~4 minutes. A max cluster cap of 10,000 is enforced to bound memory usage.

poc: ritvibhatt@8f2804e

Metadata

Metadata

Assignees

No one assigned

    Labels

    PPLPiped processing languageRFCRequest For CommentsenhancementNew feature or request

    Type

    No type

    Projects

    Status

    New

    Status

    Not Started

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions