|
| 1 | +# Materialized Views |
| 2 | + |
| 3 | +Materialized views in Apache Drill provide a mechanism to store pre-computed query results for improved query performance. Unlike regular views which are virtual and execute the underlying query each time they are accessed, materialized views persist the query results as physical data that can be queried directly. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Materialized views are useful for: |
| 8 | +- Accelerating frequently executed queries with complex aggregations or joins |
| 9 | +- Reducing compute resources for repetitive analytical workloads |
| 10 | +- Providing consistent snapshots of data at a point in time |
| 11 | + |
| 12 | +Drill's materialized view implementation includes: |
| 13 | +- SQL syntax for creating, dropping, and refreshing materialized views |
| 14 | +- Automatic query rewriting using Calcite's SubstitutionVisitor |
| 15 | +- Integration with Drill Metastore for centralized metadata management |
| 16 | +- Parquet-based data storage for efficient columnar access |
| 17 | + |
| 18 | +## SQL Syntax |
| 19 | + |
| 20 | +### CREATE MATERIALIZED VIEW |
| 21 | + |
| 22 | +```sql |
| 23 | +CREATE MATERIALIZED VIEW [schema.]view_name AS select_statement |
| 24 | +CREATE OR REPLACE MATERIALIZED VIEW [schema.]view_name AS select_statement |
| 25 | +CREATE MATERIALIZED VIEW IF NOT EXISTS [schema.]view_name AS select_statement |
| 26 | +``` |
| 27 | + |
| 28 | +Examples: |
| 29 | + |
| 30 | +```sql |
| 31 | +-- Create a materialized view with aggregations |
| 32 | +CREATE MATERIALIZED VIEW dfs.tmp.sales_summary AS |
| 33 | +SELECT region, product_category, SUM(amount) as total_sales, COUNT(*) as num_transactions |
| 34 | +FROM dfs.`/data/sales` |
| 35 | +GROUP BY region, product_category; |
| 36 | + |
| 37 | +-- Create or replace an existing materialized view |
| 38 | +CREATE OR REPLACE MATERIALIZED VIEW dfs.tmp.customer_stats AS |
| 39 | +SELECT customer_id, COUNT(*) as order_count, AVG(order_total) as avg_order |
| 40 | +FROM dfs.`/data/orders` |
| 41 | +GROUP BY customer_id; |
| 42 | + |
| 43 | +-- Create only if it doesn't exist |
| 44 | +CREATE MATERIALIZED VIEW IF NOT EXISTS dfs.tmp.daily_metrics AS |
| 45 | +SELECT date_col, SUM(value) as daily_total |
| 46 | +FROM dfs.`/data/metrics` |
| 47 | +GROUP BY date_col; |
| 48 | +``` |
| 49 | + |
| 50 | +### DROP MATERIALIZED VIEW |
| 51 | + |
| 52 | +```sql |
| 53 | +DROP MATERIALIZED VIEW [schema.]view_name |
| 54 | +DROP MATERIALIZED VIEW IF EXISTS [schema.]view_name |
| 55 | +``` |
| 56 | + |
| 57 | +Examples: |
| 58 | + |
| 59 | +```sql |
| 60 | +-- Drop a materialized view (error if not exists) |
| 61 | +DROP MATERIALIZED VIEW dfs.tmp.sales_summary; |
| 62 | + |
| 63 | +-- Drop only if it exists (no error if not exists) |
| 64 | +DROP MATERIALIZED VIEW IF EXISTS dfs.tmp.old_view; |
| 65 | +``` |
| 66 | + |
| 67 | +### REFRESH MATERIALIZED VIEW |
| 68 | + |
| 69 | +```sql |
| 70 | +REFRESH MATERIALIZED VIEW [schema.]view_name |
| 71 | +``` |
| 72 | + |
| 73 | +The REFRESH command re-executes the underlying query and replaces the stored data with fresh results. |
| 74 | + |
| 75 | +Example: |
| 76 | + |
| 77 | +```sql |
| 78 | +-- Refresh the materialized view with current data |
| 79 | +REFRESH MATERIALIZED VIEW dfs.tmp.sales_summary; |
| 80 | +``` |
| 81 | + |
| 82 | +## Query Rewriting |
| 83 | + |
| 84 | +Drill supports automatic query rewriting where queries against base tables can be transparently rewritten to use materialized views when appropriate. This feature leverages Apache Calcite's SubstitutionVisitor for structural query matching. |
| 85 | + |
| 86 | +### Enabling Query Rewriting |
| 87 | + |
| 88 | +Query rewriting is controlled by the `planner.enable_materialized_view_rewrite` option: |
| 89 | + |
| 90 | +```sql |
| 91 | +-- Enable materialized view rewriting (enabled by default) |
| 92 | +SET `planner.enable_materialized_view_rewrite` = true; |
| 93 | + |
| 94 | +-- Disable materialized view rewriting |
| 95 | +SET `planner.enable_materialized_view_rewrite` = false; |
| 96 | +``` |
| 97 | + |
| 98 | +### How Rewriting Works |
| 99 | + |
| 100 | +When query rewriting is enabled, Drill's query planner: |
| 101 | + |
| 102 | +1. Discovers all available materialized views in accessible schemas |
| 103 | +2. Filters candidates to those with COMPLETE refresh status |
| 104 | +3. For each candidate, parses the MV's defining SQL and converts it to a relational expression |
| 105 | +4. Uses Calcite's SubstitutionVisitor to check if the MV's query structure matches part or all of the user's query |
| 106 | +5. If a match is found, substitutes the matching portion with a scan of the materialized view data |
| 107 | +6. Selects the rewritten plan if it offers better performance characteristics |
| 108 | + |
| 109 | +### Rewriting Scenarios |
| 110 | + |
| 111 | +Query rewriting can apply in several scenarios: |
| 112 | + |
| 113 | +**Exact Match**: The user's query exactly matches the MV definition. |
| 114 | + |
| 115 | +```sql |
| 116 | +-- MV definition |
| 117 | +CREATE MATERIALIZED VIEW dfs.tmp.region_totals AS |
| 118 | +SELECT r_regionkey, COUNT(*) as cnt FROM cp.`region.json` GROUP BY r_regionkey; |
| 119 | + |
| 120 | +-- This query will use the MV |
| 121 | +SELECT r_regionkey, COUNT(*) as cnt FROM cp.`region.json` GROUP BY r_regionkey; |
| 122 | +``` |
| 123 | + |
| 124 | +**Partial Match with Additional Filters**: The user's query adds filters on top of the MV. |
| 125 | + |
| 126 | +```sql |
| 127 | +-- This query may use the MV and apply the filter |
| 128 | +SELECT r_regionkey, cnt FROM dfs.tmp.region_totals WHERE cnt > 10; |
| 129 | +``` |
| 130 | + |
| 131 | +**Aggregate Rollup**: Higher-level aggregations computed from MV aggregates. |
| 132 | + |
| 133 | +### Viewing the Execution Plan |
| 134 | + |
| 135 | +Use EXPLAIN to see if a materialized view is being used: |
| 136 | + |
| 137 | +```sql |
| 138 | +EXPLAIN PLAN FOR |
| 139 | +SELECT r_regionkey, COUNT(*) FROM cp.`region.json` GROUP BY r_regionkey; |
| 140 | +``` |
| 141 | + |
| 142 | +If the MV is used, the plan will show a scan of the materialized view data location rather than the original table. |
| 143 | + |
| 144 | +## Storage Architecture |
| 145 | + |
| 146 | +### Definition Storage |
| 147 | + |
| 148 | +Materialized view definitions are stored as JSON files with the `.materialized_view.drill` extension in the workspace directory. This follows the same pattern as regular Drill views (`.view.drill` files). |
| 149 | + |
| 150 | +The definition file contains: |
| 151 | +- View name |
| 152 | +- Defining SQL statement |
| 153 | +- Field names and types |
| 154 | +- Workspace schema path |
| 155 | +- Data storage path |
| 156 | +- Last refresh timestamp |
| 157 | +- Refresh status (PENDING or COMPLETE) |
| 158 | + |
| 159 | +Example definition file structure: |
| 160 | + |
| 161 | +```json |
| 162 | +{ |
| 163 | + "name": "sales_summary", |
| 164 | + "sql": "SELECT region, SUM(amount) as total FROM sales GROUP BY region", |
| 165 | + "fields": [ |
| 166 | + {"name": "region", "type": "VARCHAR"}, |
| 167 | + {"name": "total", "type": "DOUBLE"} |
| 168 | + ], |
| 169 | + "workspaceSchemaPath": ["dfs", "tmp"], |
| 170 | + "dataStoragePath": "sales_summary", |
| 171 | + "lastRefreshTime": 1706900000000, |
| 172 | + "refreshStatus": "COMPLETE" |
| 173 | +} |
| 174 | +``` |
| 175 | + |
| 176 | +### Data Storage |
| 177 | + |
| 178 | +Materialized view data is stored as Parquet files in a directory named `{view_name}_mv_data` within the workspace. Parquet format provides: |
| 179 | +- Efficient columnar storage |
| 180 | +- Compression |
| 181 | +- Predicate pushdown support |
| 182 | +- Schema evolution capabilities |
| 183 | + |
| 184 | +For a materialized view named `sales_summary` in `dfs.tmp`, the storage structure would be: |
| 185 | + |
| 186 | +``` |
| 187 | +/tmp/ |
| 188 | + sales_summary.materialized_view.drill # Definition file |
| 189 | + sales_summary_mv_data/ # Data directory |
| 190 | + 0_0_0.parquet # Data files |
| 191 | + 0_0_1.parquet |
| 192 | + ... |
| 193 | +``` |
| 194 | + |
| 195 | +## Metastore Integration |
| 196 | + |
| 197 | +When Drill Metastore is enabled, materialized view metadata is automatically synchronized to the central metastore. This provides: |
| 198 | +- Centralized metadata management across the cluster |
| 199 | +- Better discoverability of materialized views |
| 200 | +- Integration with metadata-driven query optimization |
| 201 | + |
| 202 | +### Enabling Metastore Integration |
| 203 | + |
| 204 | +Set the `metastore.enabled` option to enable metastore integration: |
| 205 | + |
| 206 | +```sql |
| 207 | +SET `metastore.enabled` = true; |
| 208 | +``` |
| 209 | + |
| 210 | +When enabled, the following operations automatically sync to the metastore: |
| 211 | +- CREATE MATERIALIZED VIEW: Stores MV metadata in metastore |
| 212 | +- DROP MATERIALIZED VIEW: Removes MV metadata from metastore |
| 213 | +- REFRESH MATERIALIZED VIEW: Updates MV metadata in metastore |
| 214 | + |
| 215 | +### Metastore Schema |
| 216 | + |
| 217 | +The MaterializedViewMetadataUnit stored in the metastore contains: |
| 218 | + |
| 219 | +| Field | Type | Description | |
| 220 | +|-------|------|-------------| |
| 221 | +| storagePlugin | String | Storage plugin name (e.g., "dfs") | |
| 222 | +| workspace | String | Workspace name (e.g., "tmp") | |
| 223 | +| name | String | Materialized view name | |
| 224 | +| owner | String | Owner username | |
| 225 | +| sql | String | Defining SQL statement | |
| 226 | +| workspaceSchemaPath | List | Schema path components | |
| 227 | +| dataLocation | String | Path to data directory | |
| 228 | +| refreshStatus | String | PENDING or COMPLETE | |
| 229 | +| lastRefreshTime | Long | Timestamp of last refresh | |
| 230 | +| lastModifiedTime | Long | Timestamp of last modification | |
| 231 | + |
| 232 | +## Configuration Options |
| 233 | + |
| 234 | +| Option | Default | Description | |
| 235 | +|--------|---------|-------------| |
| 236 | +| `planner.enable_materialized_view_rewrite` | true | Enables automatic query rewriting to use materialized views | |
| 237 | +| `metastore.enabled` | false | Enables Drill Metastore for centralized metadata storage | |
| 238 | + |
| 239 | +## Lifecycle Management |
| 240 | + |
| 241 | +### Creating a Materialized View |
| 242 | + |
| 243 | +1. Parse and validate the SQL statement |
| 244 | +2. Create the data directory in the workspace |
| 245 | +3. Execute the defining query and write results as Parquet |
| 246 | +4. Write the definition file with COMPLETE status |
| 247 | +5. Sync metadata to metastore (if enabled) |
| 248 | + |
| 249 | +### Refreshing a Materialized View |
| 250 | + |
| 251 | +1. Read the existing definition file |
| 252 | +2. Delete the existing data directory |
| 253 | +3. Re-execute the defining query and write new results |
| 254 | +4. Update the definition file with new refresh timestamp |
| 255 | +5. Sync updated metadata to metastore (if enabled) |
| 256 | + |
| 257 | +### Dropping a Materialized View |
| 258 | + |
| 259 | +1. Delete the definition file |
| 260 | +2. Delete the data directory and all contents |
| 261 | +3. Remove metadata from metastore (if enabled) |
| 262 | + |
| 263 | +## Limitations |
| 264 | + |
| 265 | +Current limitations of the materialized view implementation: |
| 266 | + |
| 267 | +1. **Full Refresh Only**: Incremental refresh is not yet supported. Each refresh completely replaces the stored data. |
| 268 | + |
| 269 | +2. **No Automatic Refresh**: Materialized views must be manually refreshed. There is no automatic refresh mechanism based on source data changes. |
| 270 | + |
| 271 | +3. **Single Workspace**: Materialized views can only be created in file-system based workspaces that support write operations. |
| 272 | + |
| 273 | +4. **No Partitioning**: Materialized view data is not partitioned. All data is stored in a single directory. |
| 274 | + |
| 275 | +5. **Query Rewriting Scope**: Query rewriting works best for exact or near-exact matches. Complex transformations may not be recognized. |
| 276 | + |
| 277 | +## Implementation Details |
| 278 | + |
| 279 | +### Key Classes |
| 280 | + |
| 281 | +| Class | Package | Description | |
| 282 | +|-------|---------|-------------| |
| 283 | +| MaterializedView | org.apache.drill.exec.dotdrill | Data model for MV definition | |
| 284 | +| DrillMaterializedViewTable | org.apache.drill.exec.planner.logical | TranslatableTable implementation | |
| 285 | +| MaterializedViewHandler | org.apache.drill.exec.planner.sql.handlers | SQL handler for CREATE/DROP/REFRESH | |
| 286 | +| MaterializedViewRewriter | org.apache.drill.exec.planner.logical | Query rewriting using Calcite | |
| 287 | +| SqlCreateMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for CREATE | |
| 288 | +| SqlDropMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for DROP | |
| 289 | +| SqlRefreshMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for REFRESH | |
| 290 | + |
| 291 | +### Parser Grammar |
| 292 | + |
| 293 | +The SQL parser grammar for materialized views is defined in `parserImpls.ftl`. The grammar supports: |
| 294 | +- CREATE [OR REPLACE] MATERIALIZED VIEW [IF NOT EXISTS] |
| 295 | +- DROP MATERIALIZED VIEW [IF EXISTS] |
| 296 | +- REFRESH MATERIALIZED VIEW |
| 297 | + |
| 298 | +### Query Rewriting Process |
| 299 | + |
| 300 | +The MaterializedViewRewriter class implements query rewriting: |
| 301 | + |
| 302 | +1. **Discovery**: Scans all accessible schemas for materialized views |
| 303 | +2. **Filtering**: Selects candidates with COMPLETE refresh status |
| 304 | +3. **Matching**: Uses Calcite's SubstitutionVisitor to match query structures |
| 305 | +4. **Substitution**: Replaces matched portions with MV scans |
| 306 | +5. **Selection**: Returns the first successful substitution |
| 307 | + |
| 308 | +The SubstitutionVisitor performs structural matching by: |
| 309 | +- Comparing relational expression trees |
| 310 | +- Identifying equivalent subexpressions |
| 311 | +- Handling column renaming and reordering |
| 312 | +- Supporting partial matches with residual predicates |
| 313 | + |
| 314 | +## Testing |
| 315 | + |
| 316 | +### Unit Tests |
| 317 | + |
| 318 | +- `TestMaterializedViewSqlParser`: Parser syntax validation |
| 319 | +- `TestMaterializedView`: Data model serialization tests |
| 320 | + |
| 321 | +### Integration Tests |
| 322 | + |
| 323 | +- `TestMaterializedViewSupport`: End-to-end CREATE/DROP/REFRESH tests |
| 324 | +- `TestMaterializedViewRewriting`: Query rewriting scenarios |
| 325 | + |
| 326 | +Run the tests with: |
| 327 | + |
| 328 | +```bash |
| 329 | +mvn test -pl exec/java-exec -Dtest='TestMaterializedView*' |
| 330 | +``` |
| 331 | + |
| 332 | +## Future Enhancements |
| 333 | + |
| 334 | +Planned improvements for future releases: |
| 335 | + |
| 336 | +1. **Incremental Refresh**: Support for refreshing only changed data based on source table modifications. |
| 337 | + |
| 338 | +2. **Automatic Refresh**: Scheduled or trigger-based automatic refresh mechanisms. |
| 339 | + |
| 340 | +3. **Partitioned Storage**: Partition materialized view data for better query performance. |
| 341 | + |
| 342 | +4. **Cost-Based Selection**: When multiple MVs match, select based on estimated query cost. |
| 343 | + |
| 344 | +5. **Staleness Tracking**: Track source table changes to identify stale materialized views. |
| 345 | + |
| 346 | +6. **INFORMATION_SCHEMA Integration**: Expose materialized views in INFORMATION_SCHEMA tables. |
0 commit comments