docs: add docs for projections #18056
docs/querying/projections.md
Outdated
| :::info | ||
| To create a projection for an existing datasource, you must have the `druid-catalog` extension loaded. |
fwiw, this isn't strictly true. While I think in the future we want to recommend the catalog as the way to do things, it is also possible to define the projection specs in the `projections` property of an 'inline' compaction spec (inline is what we call the class for the non-catalog-based compaction spec).
It is also worth mentioning that the catalog compaction spec is not as fully featured as the inline compaction spec. For example, it cannot change the schema of the base table the way an inline spec can, and there are some other differences I forget off the top of my head.
The catalog is required to build projections for MSQ inserts/replaces though, so that should probably be
docs/ingestion/ingestion-spec.md
Outdated
| Projections are pre-aggregated segments that can speed up queries by reducing the number of rows that need to be processed. Use the `projectionsSpec` block to define projections for your data during ingestion or [create them afterwards](../querying/projections.md#after-ingestion). | ||
| Note that any projections you define becomes a dimension for your datasource. To remove a projection from your datasource, you need to reingest the data with the projection removed. Alternatively, you can use a query context parameter to not use projections for a specific query. |
wording seems off on this, let me think on it and get back to you
|
to do:
|
techdocsmith
left a comment
Fair amount of commentary on this one. Shouldn't be too hard to incorporate. Mostly stylistic. I haven't tried the functionality.
docs/querying/projections.md
Outdated
| ~ under the License. | ||
| --> | ||
| Projections are a type of aggregation that is computed and stored as part of a segment. The pre-aggregated data can speed up queries by reducing the number of rows that need to be processed for any query shape that matches a projection. |
As a noob, I wonder how they are different from rollup. Why would you want to use this instead of rollup?
| Projections are a type of aggregation that is computed and stored as part of a segment. The pre-aggregated data can speed up queries by reducing the number of rows that need to be processed for any query shape that matches a projection. | |
| Projections are a type of aggregation that Druid computes and stores as part of a segment. The pre-aggregated data reduces the number of rows for the query engine to process. This can speed up queries for query shapes that match a projection. |
What type of "part" of the segment? Like a column? Or its own thing?
What does it mean to "match the projection?" An example might be in order.
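One way to picture "matching" is as a subset check. The sketch below is a toy model in Python, not Druid's planner: the `Shape` class, the `matches` function, and the column names are all hypothetical, and it assumes a query matches when its grouping columns and its aggregators are both contained in the projection's. Real matching presumably also has to account for things like filters and time granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical summary of a query or projection: grouping columns plus aggregators."""
    group_by: frozenset
    aggregators: frozenset  # each aggregator is a (function, column) pair


def matches(query: Shape, projection: Shape) -> bool:
    """Toy rule (an assumption): every grouping column and aggregator
    the query needs must already exist in the projection."""
    return (
        query.group_by <= projection.group_by
        and query.aggregators <= projection.aggregators
    )


# Hypothetical projection: pre-aggregates SUM(added) by channel and page.
projection = Shape(
    group_by=frozenset({"channel", "page"}),
    aggregators=frozenset({("sum", "added")}),
)

# Groups by a subset of the projection's columns and uses the same aggregator.
by_channel = Shape(frozenset({"channel"}), frozenset({("sum", "added")}))

# Groups by a column the projection does not contain.
by_user = Shape(frozenset({"user"}), frozenset({("sum", "added")}))

print(matches(by_channel, projection))
print(matches(by_user, projection))
```

In this model the first query can be answered from the pre-aggregated rows, while the second has to fall back to the base table because `user` was aggregated away.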
docs/querying/projections.md
Outdated
| The aggregators are what Druid attempts to match when you run a query. If an aggregator in a query matches an aggregator you defined in your projection, Druid uses it. | ||
| You can either create a projection at ingestion time or after the datasource is created. |
This should be at line 34. It follows directly from the heading. It also doesn't make sense because "at ingestion time" and "after the datasource is created" are not mutually exclusive.
Would it make sense to say new datasource vs. existing datasource?
You can create a projection:
- in the ingestion spec/query for a new datasource
- in the catalog for an existing datasource
- in the compaction spec for an existing datasource.
Note that the catalog is preferred over the compaction spec for existing datasources.
new/existing datasource doesn't make sense because you can technically run the ingestion for a datasource again with the projection defined, I think.
What about "As part of your ingestion" and "Manually add a projection"?
docs/querying/projections.md
Outdated
| - `useProjection`: The name of a projection you defined. The query engine must use that projection and will fail the query if the projection does not match the query. | ||
| - `forceProjections` `true` or `false`. The query engine must use a projection and will fail the query if there isn't a matching projection. | ||
| - `noProjections`: `true` or `false`. The query engine won't use any projections. |
| - `noProjections`: `true` or `false`. The query engine won't use any projections. | |
| - `noProjections`: Set to `true` to prevent the query engine from using projections altogether. Defaults to `false`. |
seriously, this should be `useProjections`, which defaults to `true` and which you could set to `false` to disable. @clintropolis :/
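For reference, a sketch of how the three flags above would be passed with a query. It assumes the usual SQL API request body (a `query` string plus a `context` object). The `build_sql_request` helper, the `wikipedia` table, the `channel_added` projection name, and the mutual-exclusion check are illustrative assumptions, not part of Druid.

```python
import json


def build_sql_request(sql, use_projection=None, force_projections=False,
                      no_projections=False):
    """Build a SQL API request body that carries projection-related context flags."""
    # Sanity check added by this sketch only (an assumption, not documented
    # Druid behavior): disabling projections conflicts with requiring one.
    if no_projections and (use_projection is not None or force_projections):
        raise ValueError("noProjections conflicts with useProjection/forceProjections")

    context = {}
    if use_projection is not None:
        # Name of a projection; the query fails if it does not match.
        context["useProjection"] = use_projection
    if force_projections:
        # Require some matching projection; the query fails otherwise.
        context["forceProjections"] = True
    if no_projections:
        # Skip projections for this query.
        context["noProjections"] = True
    return {"query": sql, "context": context}


# Hypothetical datasource and projection names.
sql = "SELECT channel, SUM(added) FROM wikipedia GROUP BY channel"
skip_projections = build_sql_request(sql, no_projections=True)
pinned = build_sql_request(sql, use_projection="channel_added")

print(json.dumps(skip_projections, indent=2))
print(json.dumps(pinned, indent=2))
```

Leaving all three flags unset produces an empty `context`, which leaves the choice of projection to the query engine.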
|
This pull request has been marked as stale due to 60 days of inactivity. |
|
@317brian @clintropolis will you continue working on this PR to get it merged? |
|
@techdocsmith could you please take another look? |
kgyrtkirk
left a comment
seems like this PR is falling off the table.
unfortunately the review cycle has stopped, but I think it adds valuable docs, so it's better to get it in.
I'll wait 24 hours before merging.
SatyaKuppam
left a comment
lgtm, couple of nitpicks. We should include this in 36.0.0 so folks can start using it.
docs/ingestion/ingestion-spec.md
Outdated
| Define projections for a new data source in the `projectionsSpec` block during ingestion. To add projections to an existing data source, see [create them afterwards](../querying/projections.md#manually-add-a-projection). | ||
| Note that any projections you define becomes a dimension for your datasource. To remove a projection from your datasource, you need to reingest the data with the projection removed. Alternatively, you can use a query context parameter to not use projections for a specific query. |
> Note that any projections you define becomes a dimension for your datasource.

Are we overloading what dimension means here?

> To remove a projection from your datasource, you need to reingest the data with the projection removed.

This is confusing and also out of scope of this PR, I think: you can add projections as part of compaction, but you cannot remove projections as part of a compaction? That is not ideal, but we should call it out here.
docs/ingestion/ingestion-spec.md
Outdated
| ### Projections | ||
| Projections are a type of aggregation that Druid computes and stores as part of a segment. The pre-aggregated data reduces the number of rows the query engine needs to process when you run a query. This can speed up queries for query shapes that match a projection. |
I think we should change this:
> Projections are a type of aggregation that Druid computes and stores as part of a segment.

to:
> Projections are ingestion/compaction time aggregations that Druid computes on a subset of dimensions and metrics of a segment. They are stored within a segment.
(cherry picked from commit ab450f2)
* Implement a fingerprinting mechanism to track compaction states in a more efficient manner (apache#18844) * meatadata store bits part 1 * annotate segments with compaction fingerprint before persist * Add ability to generate compaction state fingerprint * add fingerprint to task context and make legacy last compaction state storage configurable * update embedded tests for compaction supervisors to flex fingerprints * checkpoint with persisting compaction states * add duty to clean up unused compaction states * take fingerprints into account in CompactionStatus * Add and improve tests * get rid of some todo comments * fix checkstyle * cleanup some more TODO * Add some docs * update web console * make cache size configurable and fix some spelling * fixup use of deprecated builder * fix checktyle * fix coordinator compactsegments duty and respond to self review comments * fix spellchecker * predates is a word * improve some javadocs * simplify some test assertions based on review * better naming * controller impl cleanup * For compaction supervisors, take persisting pending compaction states out of hot path * use Configs.valueOrDefault helper in data segment * Refactor where fingerprinting happens and how the object mapper is wired up * refactor CompactionStateManager into an interface with a persisted and heap impl * remove fingerprinting support from the coordinator compact segments duty * Move on heap compaction state manager to test sources * CompactionStateManager is now overlord only * Refactor how the compaction state fingerprint cache is wired up * prettify * small changes after self-review * Cleanup CompactionStateCache per review * compactionstatemanager to compactionstatestorage plus refactor * Add compaction state added and deleted metrics * improve queries for compaction state cache sync * clean up doc wording * Miscl. 
cleanup from review * some metadata store code cleanup * refactor id out of the compaction states table as it is superflous * Some CompactionStatus cleanup * Migrate the location of creating a compaction state from config * More refactoring per review * refactor to remove duplicate fingerprint generator code * Do some consolidation of fingerprint related classes to clean up code * minor cleanup * fix fobidden api use * Improvements and cleanup to the fingerprint and state persist + cache * Refactor where in the code compaction fingerprints are generated * Formalize unique constraint exception check in sqlmetadataconnector and db specific impls * some naming cleanup * Migrate the compaction state cleanup duty to the overlord * Blow up the compaction supervisor scheduler if incremental caching is disabled * add some strict input sanitization in upserting compaction fingerprints * cleanup test class * Add pending flag to compaction state to prevent potentially destructive early cleanup * Refactor database naming to use indexingState instead of compactionState * Refactor naming to IndexingState for the metadata cleanup duty * refresh some docs * fixup tests * Refactoring name of CompactionStateCache to IndexingStateCache * Rename CompactionStateStorage to IndexingStateStorage * Refactor compactionStateFingerprint out of the code in favor of indexingStateFingerprint * Refactor FingerprintMapper name to remove compaction for indexing state * refactorings after self review * fixup a few things post merge with master * Cleanup and refactor after code review round Batch marking of indexing states as active to avoid chained updates where only one is needed Build segments table missing columns error column by column refactor how we are configuring and executing the ol metadata cleanup duties. 
fix missed naming refactor Improve readability of upsertIndexingState Fixup SqlIndexingStateStorage constructor drop default impl of isUniqueConstraintViolation Refactor how the deterministic mapper is handled for reindexing * cleanup * use effective state for dimspec and indexspec for reindexing fingerprinting * Only call into running checks if there are unknown states to check * Update milestone on PR close and ensure they are visible for the originally desired milestone (apache#18935) * SegmentLocalCacheManagerConcurrencyTest: Use tempDir for temp files. (apache#18937) The tests should use temporary directories rather than the current working directory. * Update to testcontainers 2.x and update various images. (apache#18945) This patch updates to testcontainers 2.x, which improves compatibility with newer versions of Docker. It also updates most images to the latest versions available. PostgreSQL and MariaDB remain on 16 and 11, however. * Max metrics for group by queries (apache#18934) Added metrics mergeBuffer/maxAcquisitionTimeNs, groupBy/maxSpilledBytes and groupBy/maxMergeDictionarySize to track peak resource usage per query. * fix json column isNumeric check to properly consider array element selector types (apache#18948) * Add sys.queries table. (apache#18923) The sys.queries table provides insight into currently-running queries. It provides the same information as the /druid/v2/sql/queries API. As such, it currently only works with Dart. In this patch the table is documented, but off by default. It can be enabled by setting druid.sql.planner.enableSysQueriesTable = true. This patch additionally adds an "includeComplete" parameter to /druid/v2/sql/queries, which is used by the implementation of the sys.queries table, to allow it to show information for recently-completed queries. 
* Bump kubernetes-client to latest and level vertx with what kubernetes-client uses (apache#18947) * Adjust cost-based autoscaler algorithm (apache#18936) * use includeComplete (apache#18940) * Add configurable option to scale-down during task run time for cost-based autoscaler (apache#18958) * Add configurable option to scale-down during task run time for cost-based autoscaler * Docs * Address review comments, compress tests a bit * remove custom json serde for DataNodeService (apache#18961) * Faster bucket search in ByteBufferHashTable (apache#18952) Adds hash code comparison for large enough keys to ByteBufferHashTable#findBucket(). Also, changes key comparison to use long/int/byte instead of byte-only comparison (thus, the comparison is now closer to HashTableUtils#memoryEquals() used in MemoryOpenHashTable). These changes are aimed to speed-up bucket search in ByteBufferHashTable, especially in high-collision cases. * Allow failing on residual for Iceberg filters on non-partition cols (apache#18953) Currently Iceberg ingest extension may ingest more data than is necessary due to residual data occurring from an Iceberg filter on non-partition columns. This adds an option to ignore + log a warning or fail on filters that result in residual so users are aware of this extra data and can action on it. * Rely on `taskCountMin` in `computeValidTaskCounts`; correct the embedded test for cost-based-autoscaler (apache#18963) This patch fixes a behaviour where computeValidTaskCounts took care of upper bound (taskCountMax), but did not care about taskCountMin. Also it fixes a flaky embedded test. * Web console: Server props dialog (apache#18960) * Init server props table * Add trim starts * reformat * Update `TableInputSpec` to be able to handle specific segments. (apache#18922) * input * format and deprecate * allow non-complete segments * test * SQL: Add rule for merging nested Aggregates. 
(apache#18498) The rule is adapted from Calcite's AggregateMergeRule, with two changes: 1) Includes a workaround for https://issues.apache.org/jira/browse/CALCITE-7162 2) Includes the ability to merge two Aggregate with a Project between them, by pushing the Project below the new merged Aggregate. * Speed up TopNQueryRunnerTest. (apache#18955) Takes the runtime from ~3 minutes to 10 seconds by reducing the number of test runs by 32x. There are two changes: 1) Instead of parameterizing for every possible combination of monomorphic specialization flags, only parameterize for all-on and all-off. The specializations handle different cases anyway, so they wouldn't trigger on the same queries. Reduces number of test runs by 16x. 2) Remove the parameterization on duplicateSingleAggregatorQueries. Only a handful of tests used it. Instead of parameterizing the entire suite, that handful of tests is expanded to include _duplicateAggregators versions. Reduces number of test runs by 2x. * Fix Hadoop multi-value string null value handling to match native batch (apache#18944) Doing some more digging, I found another unfortunate data difference between native batch (on-cluster) and Hadoop batch ingest. Ingesting a multi-value string ["a","b",null] with Hadoop is treated as ["a","b","null"] and in native batch, this correctly ingests to ["a","b",null]. This is difference appears to be a bug in all Druid versions(even latest). While this will not affect the current null handling migration, this will affect the future Hadoop -> native batch ingestion migration that will also need to take place. Hadoop doesn't allow for all-null columns in segments, it simply excludes them from the segment. I've updated the Hadoop job to support running druid.indexer.task.storeEmptyColumns=true, which allows us to store all NULL columns (how native/streaming ingest work today). BREAKING CHANGES 1. 
Hadoop ingests will now process multi-value string inputs like ["a","b",null] -> ["a","b",null] instead of ["a","b","null"] to match native batch ingestion. 2. Hadoop ingests will now by default keep columns with all NULL values, instead of excluding them from the segment. useStringValueOfNullInLists parameter in RowBasedColumnSelectorFactory.java has been removed. * modify ExprEvalBindingVector to use current vector size instead of array length when coercing values, cache coercion arrays (apache#18967) * modify ExprEvalBindingVector to use current vector size instead of array length when coercing values, cache coercsion arrays expression vector binding improvements changes: * split ExpressionEvalBindingVector into ExpressionEvalNumericBindingVector and ExpressionEvalObjectBindingVector * modify ExpressionEvalNumericBindingVector and ExpressionEvalObjectBindingVector to use current vector size instead of input array size when coercing values * modify ExpressionEvalNumericBindingVector and ExpressionEvalObjectBindingVector to use externally managed object array caches for value coercion instead of recreating each time * benchmarks * SQL: Use specialized virtual columns for expression filters. (apache#18965) This patch adjusts planning for expression filters to use specialized virtual columns when they exist. This allows them to take advantage of optimizations, such as the ones that are available for JSON_VALUE, even when the overall expression is complicated. 
* add tier/storage/capacity metric to make actual tier disk size metrics available for historicals in vsf mode (apache#18962) * Adjust costs for burst scaleup during heavy lag for cost-based autoscaler (apache#18969) * udpate copyright year to 2026 (apache#18972) * Bump diff from 4.0.1 to 4.0.4 in /web-console (apache#18933) * docs: add docs for projections (apache#18056) * Better query error classification for user errors (apache#18949) This change checks instanceof before casting RexLiteral.value() to Number in SQL aggregators. When users pass invalid queries (e.g., a string literal '99.99' where numeric literals are expected), InvalidSqlInput exception is thrown, which returns 400 (USER/INVALID_INPUT) instead of 500 (ADMIN/UNCATEGORIZED). This improves error diagnostics for invalid queries. * changes related to 36 release (apache#18975) * add vsf AcquireSegmentResult metrics to ChannelCounters (apache#18971) * Migrate query integration tests to embedded framework (apache#18978) Changes --------- - Move `ITBroadcastJoinQueryTest` to embedded framework - Remove `ITWikipediaQueryTest` - Add `QueryLaningTest` which was the only useful assertion being done in the wikipedia test * Upgrade compiler version to JDK 17 (apache#18977) Upgrade compiler version to JDK 17. This removes compiler compatibility for indexing-hadoop (no longer supported extension). * add storage_size to sys.servers (apache#18979) * bugfix: Fix bug that could lead to illegal k8s label ending in non-alphanumeric (apache#18981) * Remove experimental flag from multi-supervisor docs (apache#18983) Multi-supervisor support has been in 2 major versions (with v36 being the 3rd). I think the implementation is stable enough for marking as non-experimental. * Add groupby max metrics to prometheus config (apache#18970) * Add metrics and improve logging for row signature flapping. 
(apache#18966) Add segment/schemaCache/rowSignature/changed and segment/schemaCache/rowSignature/column/count metrics to get visibility into when the Broker's segment metadata cache's row signature for each datasource is initialized and updated. The rationale for these metrics and logging enhancements is that we noticed row signatures flapping (columns reordered spuriously) that can cause SQL queries to be translated to incorrect native queries because the signatures flapped. This can cause sporadic missing data when the queries are incorrectly planned and is noticeable in environments with high QPS. * bugfix: Create tombstones when needed while doing REPLACE mode with range partitioning plus parallel indexing (apache#18938) * Create tombstones for range and hashed partitioning when everything has been filtered out * MSQ compaction doesn't support hash partitioning * cleanup test file * Cleanup verbose comments in test code * Hashed partitioning doesn't actually need the special handling * fix checkstyle * test coverage * fix vsf load time to be actual load time and not include wait time (apache#18988) * Update guice to 6.0.0 (apache#18986) * Update surefire to 3.5.4 ; upgrade NestedDataScanQueryTest to use junit5 (apache#18847) * Add optional plugins to basic cost function in CostBasedAutoScaler (apache#18976) Changes: - separate the logic of pure cost function, making all additional logic opt-in in config; - `scaleDownBarrier` has been changed to `minScaleDownDelay`, which is now `Duration`; - changes to high lag fast scaleup: logarithmic scaling formula for idle decay on high lag and task boundaries. Details: This change replaces the sqrt-based scaling formula with a logarithmic formula that provides more aggressive emergency recovery at low task counts and millions of lag. Idle decay: ` ln(lagSeverity) / ln(maxSeverity)`. Less aggressive, scales well with lag growth. 
Formula `K = P/(6.4*sqrt(C))` means small task counts get massive K values (emergency recovery), while large task counts get smaller K values (stability). * docs: update zookeeper version (apache#18836) * docs: update zookeeper version * add link to zk release page * Fix MSQ compaction state and native interval locking, add test coverage (apache#18950) * MSQ compaction runner run test * fix test * fix test 2 * lock input interval * test * test coverage * allowNonAlignedInterval and forceDropExisting * fix test * Update indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com> * update * style * drop-existing * Apply suggestion from @kfaraz Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * format * aligned * build * mis-aligned * format * test * lock-interval * lock * test * force drop existing, revert non-aligned, deprecated allowNonAlignedInterval * revert THREE_HOUR * revert format change * test * comment * use-queue * reduce test * batchSegmentAllocation --------- Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update assertj-core for CVE-2026-24400 (apache#18994) Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com> --------- Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Virushade <phuaguanwei99@gmail.com> Co-authored-by: Clint Wylie <cwylie@apache.org> Co-authored-by: Sasha Syrotenko <alexander.syrotenko@imply.io> Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com> Co-authored-by: Andrei Pechkurov <37772591+puzpuzpuz@users.noreply.github.com> Co-authored-by: jtuglu1 <jtuglu@netflix.com> Co-authored-by: Cece 
Mei <yingqian.mei@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: mshahid6 <maryam.shahid1299@gmail.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: aho135 <andrewho135@gmail.com> Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com> Co-authored-by: Ashwin Tumma <ashwin.tumma23@gmail.com> Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com>
Build preview: https://druid-bh3aur83f-317brians-projects.vercel.app/docs/latest/querying/projections
This PR has: