Merged
Changes from 2 commits
142 changes: 136 additions & 6 deletions docs/development/extensions-contrib/spectator-histogram.md
@@ -79,8 +79,6 @@ Also see the [limitations](#limitations] of this extension.
* Supports positive long integer values within the range of [0, 2^53). Negatives are
coerced to 0.
* Does not support decimals.
* Does not support Druid SQL queries, only native queries.
* Does not support vectorized queries.
* Generates 276 fixed buckets with increasing bucket widths. In practice, the observed error of computed percentiles ranges from 0.1% to 3%, exclusive. See [Bucket boundaries](#histogram-bucket-boundaries) for the full list of bucket boundaries.

:::tip
@@ -134,7 +132,11 @@ To use SpectatorHistogram, make sure you [include](../../configuration/extension
druid.extensions.loadList=["druid-spectator-histogram"]
```

## Aggregators
## Native Query Components

The following sections describe the aggregators and post-aggregators for use with [native Druid queries](../../querying/querying.md).

### Aggregators

The result of the aggregation is a histogram that is built by ingesting numeric values from
the raw data, or from combining pre-aggregated histograms. The result is represented in
@@ -207,9 +209,9 @@ To get the population size (count of events contributing to the histogram):
| name | A String for the output (result) name of the aggregation. | yes |
| fieldName | A String for the name of the input field containing pre-aggregated histograms. | yes |

## Post Aggregators
### Post Aggregators

### Percentile (singular)
#### Percentile (singular)
This returns a single percentile calculation based on the distribution of the values in the aggregated histogram.

```
@@ -231,7 +233,7 @@ This returns a single percentile calculation based on the distribution of the va
| field | A field reference pointing to the aggregated histogram. | yes |
| percentile | A single decimal percentile between 0.0 and 100.0 | yes |

### Percentiles (multiple)
#### Percentiles (multiple)
This returns an array of percentiles corresponding to those requested.

```
@@ -272,6 +274,134 @@ array of percentiles.
| field | A field reference pointing to the aggregated histogram. | yes |
| percentiles | Non-empty array of decimal percentiles between 0.0 and 100.0 | yes |

#### Count Post-Aggregator

This returns the total count of observations (data points) that were recorded in the histogram.
This is useful for understanding the population size without needing a separate count metric.

```json
{
  "type": "countSpectatorHistogram",
  "name": "<output name>",
  "field": {
    "type": "fieldAccess",
    "fieldName": "<name of aggregated SpectatorHistogram>"
  }
}
```

| Property | Description | Required? |
|----------|------------------------------------------------------------|-----------|
| type | This String should always be "countSpectatorHistogram" | yes |
| name | A String for the output (result) name of the calculation. | yes |
| field | A field reference pointing to the aggregated histogram. | yes |

## SQL Functions

In addition to the native query aggregators and post-aggregators, this extension provides SQL functions for easier use in Druid SQL queries.

### SPECTATOR_COUNT

Returns the total count of observations (data points) in a Spectator histogram.

**Syntax:**
```sql
SPECTATOR_COUNT(expr)
```

**Arguments:**
- `expr`: A numeric column to aggregate into a histogram, or a pre-aggregated Spectator histogram column.

**Returns:** BIGINT - the total number of observations.

**Example:**
```sql
SELECT
  SPECTATOR_COUNT(hist_added) AS total_count,
  SPECTATOR_COUNT(added) AS total_count_from_raw
FROM wikipedia
```

### SPECTATOR_PERCENTILE

Computes approximate percentile values from a Spectator histogram. This function supports two forms: a single percentile or multiple percentiles.

#### Single Percentile

**Syntax:**
```sql
SPECTATOR_PERCENTILE(expr, percentile)
```

**Arguments:**
- `expr`: A numeric column to aggregate into a histogram, or a pre-aggregated Spectator histogram column.
- `percentile`: A decimal value between 0 and 100 representing the desired percentile.

**Returns:** DOUBLE - the approximate value at the specified percentile.

**Example:**
```sql
SELECT
  SPECTATOR_PERCENTILE(hist_added, 50) AS median_added,
  SPECTATOR_PERCENTILE(hist_added, 99) AS p99_added,
  SPECTATOR_PERCENTILE(added, 95) AS p95_from_raw
FROM wikipedia
```

#### Multiple Percentiles (Array)

**Syntax:**
```sql
SPECTATOR_PERCENTILE(expr, ARRAY[p1, p2, ...])
```

**Arguments:**
- `expr`: A numeric column to aggregate into a histogram, or a pre-aggregated Spectator histogram column.
- `ARRAY[p1, p2, ...]`: An array of decimal values between 0 and 100 representing the desired percentiles.

**Returns:** DOUBLE ARRAY - an array of approximate values at the specified percentiles, in the same order as requested.

**Example:**
```sql
SELECT
  SPECTATOR_PERCENTILE(hist_added, ARRAY[25, 50, 75, 99]) AS percentiles
FROM wikipedia
```

This returns an array like `[200.5, 341.0, 468.5, 675.9]` representing the 25th, 50th, 75th, and 99th percentiles.

Using the array form is more efficient than calling `SPECTATOR_PERCENTILE` multiple times for different percentiles, as the underlying histogram is only aggregated once.
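
The "aggregate once, read many percentiles" idea can be sketched in Java as follows. This is only an illustration under stated assumptions: `SharedHistogramSketch`, its fixed-width buckets, and the linear interpolation are hypothetical and are not the extension's implementation, which uses 276 buckets of increasing width. Returning `NaN` for an empty histogram is a choice made for this sketch.

```java
import java.util.Arrays;

/**
 * Illustration only: a toy fixed-width histogram, NOT the Spectator bucketing scheme.
 */
final class SharedHistogramSketch
{
  private static final int BUCKET_WIDTH = 10; // hypothetical fixed bucket width
  private static final int BUCKET_COUNT = 10; // covers values in [0, 100)

  private SharedHistogramSketch()
  {
  }

  /** Builds the bucket counts in one pass over the raw values. */
  static long[] aggregate(long[] values)
  {
    long[] counts = new long[BUCKET_COUNT];
    for (long value : values) {
      long clamped = Math.max(0, value); // negatives are coerced to 0
      int index = (int) Math.min(clamped / BUCKET_WIDTH, BUCKET_COUNT - 1);
      counts[index]++;
    }
    return counts;
  }

  /** Reads every requested percentile from the same, already aggregated counts. */
  static double[] percentiles(long[] counts, double[] requested)
  {
    long total = Arrays.stream(counts).sum();
    double[] result = new double[requested.length];
    for (int i = 0; i < requested.length; i++) {
      result[i] = percentile(counts, total, requested[i]);
    }
    return result;
  }

  private static double percentile(long[] counts, long total, double p)
  {
    if (total == 0) {
      return Double.NaN;
    }
    double rank = p / 100.0 * total;
    long seen = 0;
    for (int bucket = 0; bucket < counts.length; bucket++) {
      if (counts[bucket] > 0 && seen + counts[bucket] >= rank) {
        // Interpolate linearly inside the bucket that contains the rank.
        double fraction = (rank - seen) / counts[bucket];
        return (bucket + fraction) * BUCKET_WIDTH;
      }
      seen += counts[bucket];
    }
    return (double) counts.length * BUCKET_WIDTH;
  }

  public static void main(String[] args)
  {
    long[] counts = aggregate(new long[]{5, 15, 25, 35});
    System.out.println(Arrays.toString(percentiles(counts, new double[]{25, 50, 100})));
  }
}
```

In this sketch `aggregate` runs once and `percentiles` only walks the finished bucket counts, which is the cost the array form avoids repeating.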

### Combined Example

You can use both functions together in a single query. Multiple aggregations on the same column share the underlying histogram aggregator for efficiency:

```sql
SELECT
  countryName,
  SPECTATOR_COUNT(hist_added) AS observation_count,
  SPECTATOR_PERCENTILE(hist_added, 50) AS median_added,
  SPECTATOR_PERCENTILE(hist_added, 90) AS p90_added,
  SPECTATOR_PERCENTILE(hist_added, 99) AS p99_added
FROM wikipedia
GROUP BY countryName
ORDER BY observation_count DESC
LIMIT 10
```

Or using the array form to get multiple percentiles in a single column:

```sql
SELECT
  countryName,
  SPECTATOR_COUNT(hist_added) AS observation_count,
  SPECTATOR_PERCENTILE(hist_added, ARRAY[50, 90, 99]) AS percentiles
FROM wikipedia
GROUP BY countryName
ORDER BY observation_count DESC
LIMIT 10
```

## Examples

### Example Ingestion Spec
12 changes: 12 additions & 0 deletions docs/querying/sql-aggregations.md
@@ -157,3 +157,15 @@ Load the T-Digest extension to use the following functions. See the [T-Digest ex
|--------|-----|-------|
|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest sketch on values produced by `expr` and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches.|`Double.NaN`|
|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on values produced by `expr`. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches.|Empty base64 encoded T-Digest sketch STRING|

## Histogram functions

### Spectator Histogram

Load the [Spectator Histogram extension](../development/extensions-contrib/spectator-histogram.md) to use the following functions.

|Function|Notes|Default|
|--------|-----|-------|
|`SPECTATOR_COUNT(expr)`|Counts the total number of observations (data points) in a Spectator histogram. The `expr` can be either a numeric column (which will be aggregated into a histogram) or a pre-aggregated [Spectator histogram](../development/extensions-contrib/spectator-histogram.md) column.|`0`|
|`SPECTATOR_PERCENTILE(expr, percentile)`|Computes an approximate percentile value from a Spectator histogram. The `expr` can be either a numeric column (which will be aggregated into a histogram) or a pre-aggregated [Spectator histogram](../development/extensions-contrib/spectator-histogram.md) column. The `percentile` should be between 0 and 100.|`NaN`|
|`SPECTATOR_PERCENTILE(expr, ARRAY[p1, p2, ...])`|Computes multiple approximate percentile values from a Spectator histogram and returns them as a DOUBLE ARRAY. The `expr` can be either a numeric column (which will be aggregated into a histogram) or a pre-aggregated [Spectator histogram](../development/extensions-contrib/spectator-histogram.md) column. Each percentile value in the array should be between 0 and 100. This is more efficient than calling `SPECTATOR_PERCENTILE` multiple times for different percentiles.|`null`|
30 changes: 30 additions & 0 deletions extensions-contrib/spectator-histogram/pom.xml
@@ -120,6 +120,36 @@
            <artifactId>junit</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-migrationsupport</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-params</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.vintage</groupId>
            <artifactId>junit-vintage-engine</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.reflections</groupId>
            <artifactId>reflections</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.druid</groupId>
            <artifactId>druid-processing</artifactId>
@@ -0,0 +1,143 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.apache.druid.spectator.histogram;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Preconditions;
import com.google.common.primitives.Longs;
import org.apache.druid.query.aggregation.AggregatorFactory;
import org.apache.druid.query.aggregation.PostAggregator;
import org.apache.druid.query.aggregation.post.PostAggregatorIds;
import org.apache.druid.query.cache.CacheKeyBuilder;
import org.apache.druid.segment.ColumnInspector;
import org.apache.druid.segment.column.ColumnType;

import java.util.Comparator;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/**
 * Post-aggregator that returns the total count of observations in a SpectatorHistogram.
 * This is the sum of all bucket counts.
 */
public class SpectatorHistogramCountPostAggregator implements PostAggregator
{
  private final String name;
  private final PostAggregator field;

  public static final String TYPE_NAME = "countSpectatorHistogram";

  @JsonCreator
  public SpectatorHistogramCountPostAggregator(
      @JsonProperty("name") final String name,
      @JsonProperty("field") final PostAggregator field
  )
  {
    this.name = Preconditions.checkNotNull(name, "name is null");
    this.field = Preconditions.checkNotNull(field, "field is null");
  }

  @Override
  @JsonProperty
  public String getName()
  {
    return name;
  }

  @Override
  public ColumnType getType(ColumnInspector signature)
  {
    return ColumnType.LONG;
  }

  @JsonProperty
  public PostAggregator getField()
  {
    return field;
  }

  @Override
  public Object compute(final Map<String, Object> combinedAggregators)
  {
    final SpectatorHistogram sketch = (SpectatorHistogram) field.compute(combinedAggregators);
    if (sketch == null) {
      return null;
    }
    return sketch.getSum();
  }

  @Override
  public Comparator<Long> getComparator()
  {
    return Longs::compare;
  }

  @Override
  public Set<String> getDependentFields()
  {
    return field.getDependentFields();
  }

  @Override
  public String toString()
  {
    return getClass().getSimpleName() + "{" +
           "name='" + name + '\'' +
           ", field=" + field +
           "}";
  }

  @Override
  public byte[] getCacheKey()
  {
    return new CacheKeyBuilder(
        PostAggregatorIds.SPECTATOR_HISTOGRAM_SKETCH_COUNT_CACHE_TYPE_ID)
        .appendCacheable(field)
        .build();
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    SpectatorHistogramCountPostAggregator that = (SpectatorHistogramCountPostAggregator) o;
    return Objects.equals(name, that.name) &&
           Objects.equals(field, that.field);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(name, field);
  }

  @Override
  public PostAggregator decorate(final Map<String, AggregatorFactory> map)
  {
    return this;
  }
}
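
To illustrate what `compute` returns, the standalone sketch below models a histogram as a plain array of per-bucket counts and sums them, returning `null` when no histogram is present. `BucketCountSketch` and `totalCount` are hypothetical names used only for illustration; they are not part of the extension, and the real `SpectatorHistogram.getSum()` may be implemented differently.

```java
import java.util.Arrays;

/**
 * Illustration only: models a histogram as an array of per-bucket counts.
 * This is NOT the extension's SpectatorHistogram class.
 */
final class BucketCountSketch
{
  private BucketCountSketch()
  {
  }

  /**
   * Returns the total number of observations, or null when there is no histogram,
   * mirroring the null handling of the post-aggregator's compute method.
   */
  static Long totalCount(long[] bucketCounts)
  {
    if (bucketCounts == null) {
      return null;
    }
    // The observation count is the sum of all bucket counts.
    return Arrays.stream(bucketCounts).sum();
  }

  public static void main(String[] args)
  {
    System.out.println(totalCount(new long[]{3, 0, 5, 2}));
    System.out.println(totalCount(null));
  }
}
```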