Skip to content
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions docs/ingestion/ingestion-spec.md
Original file line number Diff line number Diff line change
Expand Up @@ -396,6 +396,46 @@ The `filter` conditionally filters input rows during ingestion. Only rows that p
ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.

### Projections

Projections are pre-aggregated segments that can speed up queries by reducing the number of rows that need to be processed. Use the `projectionsSpec` block to define projections for your data during ingestion or [create them afterwards](../querying/projections.md#after-ingestion).

Note that any projections you define becomes a dimension for your datasource. To remove a projection from your datasource, you need to reingest the data with the projection removed. Alternatively, you can use a query context parameter to not use projections for a specific query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wording seems off on this, let me think on it and get back to you

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that any projections you define becomes a dimension for your datasource.

Are we overloading what dimension means here?

To remove a projection from your datasource, you need to reingest the data with the projection removed.

This is confusing and also out of scope of this PR I think: you can add projections as part of compaction, but you cannot remove projections as part of a compaction? Which is not ideal. But we should call it out here.


```json
"projectionsSpec": {
"projections": [
{
"name": "daily_channel_summary",
"dimensions": [
"channel"
],
"granularity": "DAY",
"metrics": [
{
"type": "longSum",
"name": "total_added",
"fieldName": "added"
},
{ "type": "longSum",
"name": "total_deleted",
"fieldName": "deleted"
},
{ "type": "longSum",
"name": "total_delta",
"fieldName": "delta"
},
{
"type": "cardinality",
"name": "distinct_users",
"fieldName": "user"
}
]
}
]
}
```

### Legacy `dataSchema` spec

:::info
Expand Down
181 changes: 181 additions & 0 deletions docs/querying/projections.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,181 @@
---
id: projections
title: Query projections
sidebar_label: Projections
description: .
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

Projections are a type of aggregation that is computed and stored as part of a segment. The pre-aggregated data can speed up queries by reducing the number of rows that need to be processed for any query shape that matches a projection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a noob, I wonder how are they different from rollup? Why would you want to use this instead of rollup?

Suggested change
Projections are a type of aggregation that is computed and stored as part of a segment. The pre-aggregated data can speed up queries by reducing the number of rows that need to be processed for any query shape that matches a projection.
Projections are a type of aggregation that Druid computes and stores as part of a segment. The pre-aggregated data reduces the number of rows for the query engine to process. This can speed up queries for query shapes that match a projection.

What type of "part" of the segment? Like a column? Or its own thing?
What does it mean to "match the projection?" An example might be in order.


## Create a projection

A projection has three components:

- Virtual columns (`spec.projections.virtualColumns`) that are used to compute a projection. The source data for the virtual columns must exist in your datasource.
- Grouping columns (`spec.projections.groupingColumns`) that are used to group a projection. They must either already exist in your datasource or be defined in `virtualColumns`. The order in which you define your grouping columns equates to the order in which data is sorted in the projection, always ascending.
- Aggregators (`spec.projections.aggregators`) that define the columns you want to create projections for and which aggregator to use for that column. They must either already exist in your datasource or be defined in `virtualColumns`.

The aggregators are what Druid attempts to match when you run a query. If an aggregator in a query matches an aggregator you defined in your projection, Druid uses it.

You can either create a projection at ingestion time or after the datasource is created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be at line 34. It follows directly from the heading. This also doesn't make sense b/c ingestion time and after the datasource is created are not exclusive.
Would it make sense to say a new datasource vs/ existing datasource?

You can create a projection:

  • in the ingestion spec/query in an new datasource
  • in the catalalog for an existing data source
  • in the compaction spec in an existing datas source.

Note that the catalog is preferred over compaction spec for existing data sources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new/existing datasource doesn't make sense cause you can technically run the ingestion for a datasource again with the projection defined i think.

What about As part of your ingestion and Manually add a projection


Note that any projection dimension you create becomes part of your datasource. To remove a projection from your datasource, you need to reingest the data. Alternatively, you can use a query context parameter to not use projections for a specific query.



### At ingestion time

To create a projection at ingestion time, use the [`projectionsSpec` block in your ingestion spec](../ingestion/ingestion-spec.md#projections).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just native json ingestion? I don't think of MSQ ingestion having an "ingestion spec"

I feel like both of these need examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, there's an example in the linked page. i can copy it over. an msq example is on my to do list


### After ingestion

:::info

To create a projection for an existing datasource, you must have the `druid-catalog` extension loaded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, this isn't strictly true - while i think in the future we want to recommend the catalog as the way to do things, it is also possible to define the projection specs in an 'inline' compaction spec as well in a projections property (where inline is what we call the class, the non-catalog based compaction spec).

It is also worth mentioning that the catalog compaction spec is not as fully featured as the inline compaction spec in terms of functionality, for example it can not change the schema of the base table like can be done with an inline spec, and some other things too, i forget off the top of my head.

The catalog is required to build projections for MSQ inserts/replaces though, so that should probably be


:::

`POST /druid/coordinator/v1/catalog/schemas/druid/tables/{datasource}` where the payload includes a `properties.projections` block like the following:

<details>
<summary>View the payload</summary>

```json {11,19,39} showLineNumbers
{
"type": "datasource",
"columns": [],
"properties": {
"segmentGranularity": "PT1H",
"projections": [
{
"spec": {
"name": "channel_page_hourly_distinct_user_added_deleted",
"type": "aggregate",
"virtualColumns": [
{
"type": "expression",
"name": "__gran",
"expression": "timestamp_floor(__time, 'PT1H')",
"outputType": "LONG"
}
],
"groupingColumns": [
{
"type": "long",
"name": "__gran",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": false
},
{
"type": "string",
"name": "channel",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": true
},
{
"type": "string",
"name": "page",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": true
}
],
"aggregators": [
{
"type": "HLLSketchBuild",
"name": "distinct_users",
"fieldName": "user",
"lgK": 12,
"tgtHllType": "HLL_4"
},
{
"type": "longSum",
"name": "sum_added",
"fieldName": "added"
},
{
"type": "longSum",
"name": "sum_deleted",
"fieldName": "deleted"
}
]
}
}
]
}
}
```

</details>

In this example, Druid aggregates data into `distinct_user`, `sum_added`, and `sum_deleted` dimensions based on the aggregator that's specified and a source dimension. These aggregations are grouped by the columns you define in `groupingColumns`.

## Use a projection

Druid automatically uses a projection if your query matches a projection you've defined. There are some query context parameters that give you some control on how projections are used and Druid's behavior:

- `useProjection`: The name of a projection you defined. The query engine must use that projection and will fail the query if the projection does not match the query.
- `forceProjections` `true` or `false`. The query engine must use a projection and will fail the query if there isn't a matching projection.
- `noProjections`: `true` or `false`. The query engine won't use any projections.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `noProjections`: `true` or `false`. The query engine won't use any projections.
- `noProjections`: Set to `true` to prevent the query engine from using projections altogether. Defaults to `false`.

seriously this should be useProjections which defaults to true and you could set to false to disable. @clintropolis :/


## Compaction

To use compaction on a datasource that includes projections, you need to set the type to catalog: `spec.type: catalog`:

<Tabs>
<TabItem value="Coordinator duties">

```json
{
"type": "catalog",
"dataSource": YOUR_DATASOURCE,
"engine": "native",
"skipOffsetFromLatest": "PT0H",
"taskPriority": 25,
"inputSegmentSizeBytes": 100000000000000,
"taskContext": null
}
```

</TabItem>
<TabItem value="Supervisors">

```json
{
"type": "autocompact",
"spec": {
"type": "catalog",
"dataSource": YOUR_DATASOURCE,
"engine": "native",
"skipOffsetFromLatest": "PT0H",
"taskPriority": 25,
"inputSegmentSizeBytes": 100000000000000,
"taskContext": null
},
"suspended": true
}
```

</TabItem>
</Tabs>
1 change: 1 addition & 0 deletions website/sidebars.json
Original file line number Diff line number Diff line change
Expand Up @@ -180,6 +180,7 @@
},
"querying/querying",
"querying/query-processing",
"querying/projections",
"querying/query-execution",
"querying/troubleshooting",
{
Expand Down