Commit 0959bf8

docs: DH-20056: Update docs for new predicate pushdown operations. (#7720)
Co-authored-by: margaretkennedy <82049573+margaretkennedy@users.noreply.github.com>
1 parent 4687ba4 commit 0959bf8

6 files changed

+54
-12
lines changed


docs/groovy/how-to-guides/predicate-pushdown.md

Lines changed: 23 additions & 4 deletions
@@ -14,7 +14,9 @@ Filters are prioritized for pushdown in the following order (from highest to low
1414

1515
- Filtering [single value column sources](#single-value-column-sources).
1616
- Range and match filtering of Parquet data columns using [row group metadata](#parquet-row-group-metadata).
17-
- Filtering columns with an existing Deephaven [data index](#deephaven-data-indexes).
17+
- Filtering columns with a cached (already loaded in memory) Deephaven [data index](#deephaven-data-indexes).
18+
- Filtering columns with dictionary encoding.
19+
- Filtering columns with an un-cached Deephaven [data index](#deephaven-data-indexes).
1820

1921
> [!IMPORTANT]
2022
> Where multiple filters have the same pushdown priority, the user-supplied order will generally be maintained. Stateful filters and filter barriers are always respected during pushdown operations.
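
The ordering rule above can be sketched in plain Python. This is a conceptual model only, not Deephaven's implementation: the tier names and the `Filter` class are hypothetical, and stateful filters and filter barriers are not modeled. It assumes the five tiers listed above are the complete priority list and relies on a stable sort so that filters in the same tier keep their user-supplied order.

```python
from dataclasses import dataclass

# Pushdown tiers from the list above, highest priority first.
# The tier names are hypothetical labels chosen for this sketch.
PUSHDOWN_PRIORITY = {
    "single_value_column": 0,
    "parquet_row_group_metadata": 1,
    "cached_data_index": 2,
    "parquet_dictionary": 3,
    "uncached_data_index": 4,
}


@dataclass(frozen=True)
class Filter:
    expression: str
    tier: str


def order_for_pushdown(filters):
    """Sort filters by tier; sorted() is stable, so ties keep user order."""
    return sorted(filters, key=lambda f: PUSHDOWN_PRIORITY[f.tier])


user_filters = [
    Filter("Class = `Math`", "parquet_dictionary"),
    Filter("Test1 > 90", "parquet_row_group_metadata"),
    Filter("Name = `Rita`", "parquet_dictionary"),
]
ordered = [f.expression for f in order_for_pushdown(user_filters)]
print(ordered)
```

The row group metadata filter moves to the front, while the two dictionary filters stay in the order the user wrote them.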
@@ -43,11 +45,26 @@ result = source.where("Test1 > 90")
4345

4446
Parquet metadata is optional, and not all Parquet files will have it. If the metadata is not available, Deephaven will fall back to scanning the data in the row groups to apply the filter. If Parquet metadata is malformed or incorrect, it may lead to incorrect filtering results. In such cases, you can [disable](#disabling-predicate-pushdown-features) this optimization.
4547

48+
## Parquet dictionary encoding
49+
50+
When storing `string` data, Parquet may dictionary-encode a column: the unique values are stored in a dictionary, separately from the column data, and the column data holds references (e.g., integer indices) to the dictionary entries instead of the actual string values. This can significantly reduce storage space and improve performance for columns with many repeated values. It also allows efficient filtering, because the engine can evaluate the filter against the unique dictionary values rather than against every row. If any dictionary values match, the engine records their integer indices and filters the column data by index, which is much cheaper than loading each string value and applying the filter. If no dictionary values match, the engine can skip the column without scanning any of the row data.
51+
52+
Nearly all single-column filters can be optimized using the dictionary encoding.
53+
54+
```groovy order=source,result
55+
import io.deephaven.parquet.table.ParquetTools
56+
57+
source = ParquetTools.readTable("/data/examples/ParquetExamples/grades/grades.parquet")
58+
result = source.where("Class = `Math`")
59+
```
60+
61+
If desired, you can [disable](#disabling-predicate-pushdown-features) the use of dictionary encoding during pushdown operations.
62+
4663
## Deephaven data indexes
4764

48-
Deephaven allows users to create data indexes when writing data to storage as Parquet files. These indexes can speed up filtering operations by applying the filter to the index instead of the larger table.
65+
Deephaven allows users to create data indexes for any table. These indexes can be retained in memory or written to storage and can speed up filtering operations significantly by applying the filter to the index instead of the larger table.
4966

50-
Starting in Deephaven v0.40.0, the predicate pushdown framework enables data indexes to be used with most filter types (not just exact matches). When a materialized (in-memory) data index exists, the engine can leverage it during `where` operations. To avoid unexpected memory usage, filter operations do not automatically materialize deferred (disk-based) data indexes.
67+
Starting in Deephaven v0.40.0, the predicate pushdown framework enables data indexes to be used with most filter types (not just exact matches). When a materialized (in-memory) data index for a table exists, the engine can leverage it during `where` operations. To avoid unexpected memory usage, filter operations do not automatically materialize table-level data indexes. However, if an individual file-level data index is available on disk, the engine will use it to filter data without loading the entire index into memory.
5168

5269
This technique is effective even if only a subset of the data files are indexed. The engine will filter non-indexed files using the standard method.
5370

@@ -79,8 +96,10 @@ If desired, you can [disable](#disabling-predicate-pushdown-features) the use of
7996

8097
Under certain circumstances, you may want to disable specific predicate pushdown features. These settings are global and will affect all pushdown operations across the Deephaven engine. The following properties affect pushdown:
8198

99+
- [`QueryTable.useDataIndexForWhere`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#USE_DATA_INDEX_FOR_WHERE) – enables the use of Deephaven table-level data indexes when filtering.
82100
- [`QueryTable.disableWherePushdownParquetRowGroupMetadata`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_PARQUET_ROW_GROUP_METADATA) – disables consideration of Parquet row group metadata when filtering.
83-
- [`QueryTable.disableWherePushdownDataIndex`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_DATA_INDEX) – disables the use of Deephaven data indexes when filtering.
101+
- [`QueryTable.disableWherePushdownDataIndex`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_DATA_INDEX) – disables the use of file-level Deephaven data indexes when filtering.
102+
- [`QueryTable.disableWherePushdownParquetDictionary`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_PARQUET_DICTIONARY) – disables the use of dictionary encoding when filtering.
84103

85104
For more information, see the [Query table configuration](../conceptual/query-table-configuration.md) documentation.
86105

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
{"file":"how-to-guides/predicate-pushdown.md","objects":{"source":{"type":"Table","data":{"columns":[{"name":"Name","type":"java.lang.String"},{"name":"Class","type":"java.lang.String"},{"name":"Test1","type":"int"},{"name":"Test2","type":"int"}],"rows":[[{"value":"Ashley"},{"value":"Math"},{"value":"92"},{"value":"94"}],[{"value":"Jeff"},{"value":"Math"},{"value":"78"},{"value":"88"}],[{"value":"Rita"},{"value":"Math"},{"value":"87"},{"value":"81"}],[{"value":"Zach"},{"value":"Math"},{"value":"74"},{"value":"70"}],[{"value":"Ashley"},{"value":"Science"},{"value":"87"},{"value":"91"}],[{"value":"Jeff"},{"value":"Science"},{"value":"90"},{"value":"83"}],[{"value":"Rita"},{"value":"Science"},{"value":"99"},{"value":"95"}],[{"value":"Zach"},{"value":"Science"},{"value":"80"},{"value":"78"}],[{"value":"Ashley"},{"value":"History"},{"value":"82"},{"value":"88"}],[{"value":"Jeff"},{"value":"History"},{"value":"87"},{"value":"92"}],[{"value":"Rita"},{"value":"History"},{"value":"84"},{"value":"85"}],[{"value":"Zach"},{"value":"History"},{"value":"76"},{"value":"78"}]]}},"result":{"type":"Table","data":{"columns":[{"name":"Name","type":"java.lang.String"},{"name":"Class","type":"java.lang.String"},{"name":"Test1","type":"int"},{"name":"Test2","type":"int"}],"rows":[[{"value":"Ashley"},{"value":"Math"},{"value":"92"},{"value":"94"}],[{"value":"Jeff"},{"value":"Math"},{"value":"78"},{"value":"88"}],[{"value":"Rita"},{"value":"Math"},{"value":"87"},{"value":"81"}],[{"value":"Zach"},{"value":"Math"},{"value":"74"},{"value":"70"}]]}}}}

docs/python/conceptual/query-table-configuration.md

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -75,10 +75,13 @@ A Deephaven [DataIndex](../how-to-guides/data-indexes.md) is an index that can i
7575

7676
Pushdown predicates refer to the mechanism whereby filtering conditions are applied as early as possible, ideally at the data source (e.g., Parquet or other columnar formats), before loading data into the system. By annotating source reads with predicates, the engine pulls in only the rows that satisfy the conditions, significantly reducing I/O and improving performance.
7777

78-
| Property Name | Default Value | Description |
79-
| -------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------- |
80-
| `QueryTable.disableWherePushdownDataIndex` | false | Disables the use of [data index](../how-to-guides/data-indexes.md) within where's pushdown predicates |
81-
| `QueryTable.disableWherePushdownParquetRowGroupMetadata` | false | Disables the usage of Parquet row group metadata during push-down filtering |
78+
| Property Name | Default Value | Description |
79+
| -------------------------------------------------------- | ------------- | --------------------------------------------------------------------------------------------------------- |
80+
| `QueryTable.useDataIndexForWhere`                        | true          | Enables the use of table-level [data index](../how-to-guides/data-indexes.md) during `where` operations.  |
81+
| `QueryTable.disableWherePushdownDataIndex` | false | Disables the use of [data index](../how-to-guides/data-indexes.md) within `where`'s predicate pushdown. |
82+
| `QueryTable.disableWherePushdownParquetRowGroupMetadata` | false         | Disables the use of Parquet row group metadata during pushdown filtering.                                 |
83+
| `QueryTable.disableWherePushdownMergedTables`            | false         | Disables predicate pushdown when filtering merged tables.                                                 |
84+
| `QueryTable.disableWherePushdownParquetDictionary` | false | Disables dictionary-encoding predicate pushdown operations. |
8285

8386
## Parallel processing with `where`
8487

docs/python/how-to-guides/predicate-pushdown.md

Lines changed: 22 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -46,11 +46,27 @@ result = source.where(filters="Test1 > 90")
4646

4747
Parquet metadata is optional, and not all Parquet files will have it. If the metadata is not available, Deephaven will fall back to scanning the data in the row groups to apply the filter. If Parquet metadata is malformed or incorrect, it may lead to incorrect filtering results. In such cases, you can [disable](#disabling-predicate-pushdown-features) this optimization.
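
The fallback behavior described above can be illustrated with a small, self-contained sketch. It is a conceptual model under stated assumptions, not Deephaven's or Parquet's actual code: `RowGroup` and `filter_greater_than` are hypothetical names, the values are made up, and only a single `>` filter on integer min/max statistics is modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroup:
    values: List[int]                # column values stored in this row group
    min_value: Optional[int] = None  # statistics are optional in this model
    max_value: Optional[int] = None


def filter_greater_than(row_groups, threshold):
    """Return (matching values, number of row groups actually scanned)."""
    matches = []
    scanned = 0
    for rg in row_groups:
        # With statistics present, a max at or below the threshold proves
        # that no value in the row group can pass the filter.
        if rg.max_value is not None and rg.max_value <= threshold:
            continue
        # Otherwise (no statistics, or statistics are inconclusive) scan.
        scanned += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, scanned


row_groups = [
    RowGroup([92, 78, 87, 74], min_value=74, max_value=92),
    RowGroup([87, 90, 80], min_value=80, max_value=90),
    RowGroup([99, 82, 76]),  # no statistics: must be scanned
]
matches, scanned = filter_greater_than(row_groups, 90)
print(matches, scanned)
```

The sketch also shows why incorrect metadata is dangerous: if the second row group's recorded maximum were wrong, matching rows in it would be skipped silently.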
4848

49+
## Parquet dictionary encoding
50+
51+
When storing `string` data, Parquet may dictionary-encode a column: the unique values are stored in a dictionary, separately from the column data, and the column data holds references (e.g., integer indices) to the dictionary entries instead of the actual string values. This can significantly reduce storage space and improve performance for columns with many repeated values. It also allows efficient filtering, because the engine can evaluate the filter against the unique dictionary values rather than against every row. If any dictionary values match, the engine records their integer indices and filters the column data by index, which is much cheaper than loading each string value and applying the filter. If no dictionary values match, the engine can skip the column without scanning any of the row data.
52+
53+
Nearly all single-column filters can be optimized using the dictionary encoding.
54+
55+
```python order=source,result
56+
from deephaven import parquet
57+
58+
# pass the path of the local Parquet file to `read`
59+
source = parquet.read(path="/data/examples/ParquetExamples/grades/grades.parquet")
60+
result = source.where(filters="Class = `Math`")
61+
```
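
To make the mechanism concrete, here is a minimal pure-Python sketch of filtering a dictionary-encoded column. It is an illustration of the idea only and not how Deephaven or Parquet implement it: `filter_dictionary_encoded` is a hypothetical helper, and the dictionary and index layout below are assumed to mirror the `Class` column of the grades example.

```python
def filter_dictionary_encoded(dictionary, indices, predicate):
    """Return the row positions whose dictionary value satisfies predicate."""
    # Evaluate the predicate once per distinct value, not once per row.
    matching_ids = {i for i, value in enumerate(dictionary) if predicate(value)}
    if not matching_ids:
        # No dictionary entry matches, so no row can match: skip the column.
        return []
    # Compare small integer indices instead of decoding each string.
    return [row for row, idx in enumerate(indices) if idx in matching_ids]


# Assumed encoding of the `Class` column: three distinct values, twelve rows.
dictionary = ["Math", "Science", "History"]
indices = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

math_rows = filter_dictionary_encoded(dictionary, indices, lambda v: v == "Math")
art_rows = filter_dictionary_encoded(dictionary, indices, lambda v: v == "Art")
print(math_rows, art_rows)
```

In this model the predicate runs three times rather than twelve, and the `Art` filter returns without touching the row data at all.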
62+
63+
If desired, you can [disable](#disabling-predicate-pushdown-features) the use of dictionary encoding during pushdown operations.
64+
4965
## Deephaven data indexes
5066

51-
Deephaven allows users to create data indexes when writing data to storage as Parquet files. These indexes can speed up filtering operations by applying the filter to the index instead of the larger table.
67+
Deephaven allows users to create data indexes for any table. These indexes can be retained in memory or written to storage and can speed up filtering operations significantly by applying the filter to the index instead of the larger table.
5268

53-
Starting in Deephaven v0.40.0, the predicate pushdown framework enables data indexes to be used with most filter types (not just exact matches). When a materialized (in-memory) data index exists, the engine can leverage it during `where` operations. To avoid unexpected memory usage, filter operations do not automatically materialize deferred (disk-based) data indexes.
69+
Starting in Deephaven v0.40.0, the predicate pushdown framework enables data indexes to be used with most filter types (not just exact matches). When a materialized (in-memory) data index for a table exists, the engine can leverage it during `where` operations. To avoid unexpected memory usage, filter operations do not automatically materialize table-level data indexes. However, if an individual file-level data index is available on disk, the engine will use it to filter data without loading the entire index into memory.
5470

5571
This technique is effective even if only a subset of the data files are indexed. The engine will filter non-indexed files using the standard method.
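
The mixed case, where some files have an index and others do not, can be sketched as follows. This is a simplified model and not Deephaven's data index API: `build_index` and `filter_file` are hypothetical helpers, each "file" is a list of dictionaries, and an index is modeled as a mapping from key value to row positions.

```python
def build_index(rows, key_column):
    """Map each distinct key value to the positions of the rows holding it."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[key_column], []).append(pos)
    return index


def filter_file(rows, key_column, predicate, index=None):
    """Filter one file, using its index when one is available."""
    if index is not None:
        # Apply the predicate to the distinct keys only, then fetch rows.
        positions = sorted(
            pos for key, pos_list in index.items() if predicate(key) for pos in pos_list
        )
        return [rows[p] for p in positions]
    # Standard method: test every row.
    return [row for row in rows if predicate(row[key_column])]


indexed_file = [
    {"Class": "Math", "Test1": 92},
    {"Class": "Science", "Test1": 87},
    {"Class": "Math", "Test1": 78},
]
plain_file = [
    {"Class": "History", "Test1": 82},
    {"Class": "Math", "Test1": 74},
]
files = [(indexed_file, build_index(indexed_file, "Class")), (plain_file, None)]

# A prefix match rather than an exact match, to show that the index helps
# with filter types beyond equality.
result = []
for rows, index in files:
    result.extend(filter_file(rows, "Class", lambda c: c.startswith("Ma"), index))
print([row["Test1"] for row in result])
```

Both paths return the same rows; the index only changes how many predicate evaluations are needed to find them.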
5672

@@ -80,8 +96,11 @@ If desired, you can [disable](#disabling-predicate-pushdown-features) the use of
8096

8197
Under certain circumstances, you may want to disable specific predicate pushdown features. These settings are global and will affect all pushdown operations across the Deephaven engine. The following properties affect pushdown:
8298

99+
- [`QueryTable.useDataIndexForWhere`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#USE_DATA_INDEX_FOR_WHERE) – enables the use of Deephaven table-level data indexes when filtering.
83100
- [`QueryTable.disableWherePushdownParquetRowGroupMetadata`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_PARQUET_ROW_GROUP_METADATA) – disables consideration of Parquet row group metadata when filtering.
84-
- [`QueryTable.disableWherePushdownDataIndex`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_DATA_INDEX) – disables the use of Deephaven data indexes when filtering.
101+
- [`QueryTable.disableWherePushdownDataIndex`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_DATA_INDEX) – disables the use of file-level Deephaven data indexes when filtering.
102+
- [`QueryTable.disableWherePushdownParquetDictionary`](https://docs.deephaven.io/core/javadoc/io/deephaven/engine/table/impl/QueryTable.html#DISABLE_WHERE_PUSHDOWN_PARQUET_DICTIONARY) – disables the use of dictionary encoding when filtering.
103+
-
85104

86105
For more information, see the [Query table configuration](../conceptual/query-table-configuration.md) documentation.
87106
