Commit 2ef36d4

committed
Final work
1 parent 4f70b7e commit 2ef36d4

23 files changed

+2005
-57
lines changed

docs/dev/DevDocs.md

Lines changed: 4 additions & 0 deletions
@@ -23,3 +23,7 @@ For more info about the use of maven see [Maven.md](Maven.md)
## Jetty 12 Migration

For information about the Jetty 12 upgrade, known limitations, and developer guidelines see [Jetty12Migration.md](Jetty12Migration.md)

## Materialized Views

For information about materialized view support, including SQL syntax, query rewriting, and metastore integration, see [MaterializedViews.md](MaterializedViews.md)

docs/dev/MaterializedViews.md

Lines changed: 346 additions & 0 deletions
@@ -0,0 +1,346 @@
# Materialized Views

Materialized views in Apache Drill provide a mechanism to store pre-computed query results for improved query performance. Unlike regular views, which are virtual and execute the underlying query each time they are accessed, materialized views persist the query results as physical data that can be queried directly.

## Overview

Materialized views are useful for:
- Accelerating frequently executed queries with complex aggregations or joins
- Reducing compute resources for repetitive analytical workloads
- Providing consistent snapshots of data at a point in time

Drill's materialized view implementation includes:
- SQL syntax for creating, dropping, and refreshing materialized views
- Automatic query rewriting using Calcite's SubstitutionVisitor
- Integration with Drill Metastore for centralized metadata management
- Parquet-based data storage for efficient columnar access

## SQL Syntax

### CREATE MATERIALIZED VIEW

```sql
CREATE MATERIALIZED VIEW [schema.]view_name AS select_statement
CREATE OR REPLACE MATERIALIZED VIEW [schema.]view_name AS select_statement
CREATE MATERIALIZED VIEW IF NOT EXISTS [schema.]view_name AS select_statement
```

Examples:

```sql
-- Create a materialized view with aggregations
CREATE MATERIALIZED VIEW dfs.tmp.sales_summary AS
SELECT region, product_category, SUM(amount) as total_sales, COUNT(*) as num_transactions
FROM dfs.`/data/sales`
GROUP BY region, product_category;

-- Create or replace an existing materialized view
CREATE OR REPLACE MATERIALIZED VIEW dfs.tmp.customer_stats AS
SELECT customer_id, COUNT(*) as order_count, AVG(order_total) as avg_order
FROM dfs.`/data/orders`
GROUP BY customer_id;

-- Create only if it doesn't exist
CREATE MATERIALIZED VIEW IF NOT EXISTS dfs.tmp.daily_metrics AS
SELECT date_col, SUM(value) as daily_total
FROM dfs.`/data/metrics`
GROUP BY date_col;
```

### DROP MATERIALIZED VIEW

```sql
DROP MATERIALIZED VIEW [schema.]view_name
DROP MATERIALIZED VIEW IF EXISTS [schema.]view_name
```

Examples:

```sql
-- Drop a materialized view (error if it does not exist)
DROP MATERIALIZED VIEW dfs.tmp.sales_summary;

-- Drop only if it exists (no error if it does not exist)
DROP MATERIALIZED VIEW IF EXISTS dfs.tmp.old_view;
```

### REFRESH MATERIALIZED VIEW

```sql
REFRESH MATERIALIZED VIEW [schema.]view_name
```

The REFRESH command re-executes the underlying query and replaces the stored data with fresh results.

Example:

```sql
-- Refresh the materialized view with current data
REFRESH MATERIALIZED VIEW dfs.tmp.sales_summary;
```

## Query Rewriting

Drill supports automatic query rewriting: queries against base tables can be transparently rewritten to use materialized views when appropriate. This feature leverages Apache Calcite's SubstitutionVisitor for structural query matching.

### Enabling Query Rewriting

Query rewriting is controlled by the `planner.enable_materialized_view_rewrite` option:

```sql
-- Enable materialized view rewriting (enabled by default)
SET `planner.enable_materialized_view_rewrite` = true;

-- Disable materialized view rewriting
SET `planner.enable_materialized_view_rewrite` = false;
```

### How Rewriting Works

When query rewriting is enabled, Drill's query planner:

1. Discovers all available materialized views in accessible schemas
2. Filters candidates to those with COMPLETE refresh status
3. For each candidate, parses the MV's defining SQL and converts it to a relational expression
4. Uses Calcite's SubstitutionVisitor to check if the MV's query structure matches part or all of the user's query
5. If a match is found, substitutes the matching portion with a scan of the materialized view data
6. Uses the first successful substitution; when several MVs match, the choice is not yet cost-based (see Future Enhancements)
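
The steps above can be sketched as a small, self-contained model. This is only an illustration of the control flow, not Drill's implementation: the `MaterializedView` dataclass, `normalize`, and `rewrite` are hypothetical names, and a normalized text comparison stands in for the structural matching that Calcite's SubstitutionVisitor performs on relational expressions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MaterializedView:
    """Hypothetical stand-in for an MV definition (not Drill's class)."""
    name: str
    sql: str
    refresh_status: str  # "PENDING" or "COMPLETE"


def normalize(sql: str) -> str:
    """Toy normalization: trim, drop a trailing ';', collapse whitespace, lowercase."""
    return " ".join(sql.strip().rstrip(";").split()).lower()


def rewrite(query: str, views: List[MaterializedView]) -> Optional[str]:
    """Return a scan of the first matching COMPLETE view, or None if none match."""
    target = normalize(query)
    for view in views:  # step 1: iterate over the discovered views
        if view.refresh_status != "COMPLETE":
            continue  # step 2: skip views whose data is not ready
        if normalize(view.sql) == target:  # steps 3-4: toy exact-text match
            return f"SELECT * FROM {view.name}"  # step 5: substitute an MV scan
    return None  # no match: the caller keeps the original query
```

Calling `rewrite(user_sql, views)` with a list that contains a PENDING view and a COMPLETE view defined by the same SQL returns a scan of the COMPLETE one only, which mirrors the filtering step and the first-match behaviour.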

### Rewriting Scenarios

Query rewriting can apply in several scenarios:

**Exact Match**: The user's query exactly matches the MV definition.

```sql
-- MV definition
CREATE MATERIALIZED VIEW dfs.tmp.region_totals AS
SELECT r_regionkey, COUNT(*) as cnt FROM cp.`region.json` GROUP BY r_regionkey;

-- This query will use the MV
SELECT r_regionkey, COUNT(*) as cnt FROM cp.`region.json` GROUP BY r_regionkey;
```

**Partial Match with Additional Filters**: The user's query matches the MV definition and adds filters on top of it.

```sql
-- A base-table query that adds a filter on the aggregate ...
SELECT r_regionkey, COUNT(*) as cnt FROM cp.`region.json` GROUP BY r_regionkey HAVING COUNT(*) > 10;

-- ... may be rewritten to the equivalent of a filtered scan of the MV
SELECT r_regionkey, cnt FROM dfs.tmp.region_totals WHERE cnt > 10;
```

**Aggregate Rollup**: A coarser aggregation is computed from the MV's finer-grained aggregates, for example regional totals derived from per-region, per-category sums.
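
To make the rollup case concrete, the sketch below uses Python's built-in `sqlite3` module rather than Drill, so it only demonstrates the arithmetic behind a rollup, not Drill's planner. A plain table named `sales_summary` stands in for the materialized view, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product_category TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'books', 10), ('east', 'books', 5),
        ('east', 'toys', 7), ('west', 'toys', 3);

    -- Stand-in for the materialized view: a table of pre-aggregated rows.
    CREATE TABLE sales_summary AS
    SELECT region, product_category,
           SUM(amount) AS total_sales, COUNT(*) AS num_transactions
    FROM sales
    GROUP BY region, product_category;
""")

# Coarser query answered directly from the base table ...
from_base = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

# ... and the same answer rolled up from the summary table:
# SUM rolls up as a SUM of partial sums, COUNT as a SUM of partial counts.
from_summary = conn.execute(
    "SELECT region, SUM(total_sales), SUM(num_transactions) FROM sales_summary "
    "GROUP BY region ORDER BY region"
).fetchall()

print(from_base == from_summary)
```

Not every aggregate rolls up this way: an average cannot be recomputed from stored averages alone, so a summary intended for rollups generally needs to keep the sum and the count.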

### Viewing the Execution Plan

Use EXPLAIN to see if a materialized view is being used:

```sql
EXPLAIN PLAN FOR
SELECT r_regionkey, COUNT(*) FROM cp.`region.json` GROUP BY r_regionkey;
```

If the MV is used, the plan will show a scan of the materialized view data location rather than the original table.

## Storage Architecture

### Definition Storage

Materialized view definitions are stored as JSON files with the `.materialized_view.drill` extension in the workspace directory. This follows the same pattern as regular Drill views (`.view.drill` files).

The definition file contains:
- View name
- Defining SQL statement
- Field names and types
- Workspace schema path
- Data storage path
- Last refresh timestamp
- Refresh status (PENDING or COMPLETE)

Example definition file structure:

```json
{
  "name": "sales_summary",
  "sql": "SELECT region, SUM(amount) as total FROM sales GROUP BY region",
  "fields": [
    {"name": "region", "type": "VARCHAR"},
    {"name": "total", "type": "DOUBLE"}
  ],
  "workspaceSchemaPath": ["dfs", "tmp"],
  "dataStoragePath": "sales_summary",
  "lastRefreshTime": 1706900000000,
  "refreshStatus": "COMPLETE"
}
```
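
As a rough illustration of how a tool might consume such a file, the sketch below parses a trimmed copy of the example and inspects two fields. The helper names are hypothetical, and treating `lastRefreshTime` as milliseconds since the Unix epoch is an assumption based on the magnitude of the example value, not something the definition format states.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the example definition above.
definition_text = """
{
  "name": "sales_summary",
  "refreshStatus": "COMPLETE",
  "lastRefreshTime": 1706900000000
}
"""


def is_usable_for_rewrite(definition: dict) -> bool:
    """Only views with COMPLETE refresh status are rewrite candidates."""
    return definition.get("refreshStatus") == "COMPLETE"


def last_refresh_utc(definition: dict) -> datetime:
    """Assumption: lastRefreshTime is epoch milliseconds."""
    return datetime.fromtimestamp(definition["lastRefreshTime"] / 1000, tz=timezone.utc)


definition = json.loads(definition_text)
print(definition["name"], is_usable_for_rewrite(definition), last_refresh_utc(definition))
```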

### Data Storage

Materialized view data is stored as Parquet files in a directory named `{view_name}_mv_data` within the workspace. Parquet format provides:
- Efficient columnar storage
- Compression
- Predicate pushdown support
- Schema evolution capabilities

For a materialized view named `sales_summary` in `dfs.tmp`, the storage structure would be:

```
/tmp/
  sales_summary.materialized_view.drill    # Definition file
  sales_summary_mv_data/                   # Data directory
    0_0_0.parquet                          # Data files
    0_0_1.parquet
    ...
```
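
The naming convention in this layout can be expressed as two small helpers. This is a sketch that follows the file extension and directory suffix described in this section; the function and constant names are hypothetical and do not correspond to Drill classes.

```python
from pathlib import PurePosixPath

DEFINITION_SUFFIX = ".materialized_view.drill"  # from "Definition Storage"
DATA_DIR_SUFFIX = "_mv_data"                    # from "Data Storage"


def definition_file(workspace_root: str, view_name: str) -> PurePosixPath:
    """Path of the JSON definition file for a view in a workspace."""
    return PurePosixPath(workspace_root) / f"{view_name}{DEFINITION_SUFFIX}"


def data_directory(workspace_root: str, view_name: str) -> PurePosixPath:
    """Path of the directory holding the view's Parquet data files."""
    return PurePosixPath(workspace_root) / f"{view_name}{DATA_DIR_SUFFIX}"


print(definition_file("/tmp", "sales_summary"))
print(data_directory("/tmp", "sales_summary"))
```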

## Metastore Integration

When Drill Metastore is enabled, materialized view metadata is automatically synchronized to the central metastore. This provides:
- Centralized metadata management across the cluster
- Better discoverability of materialized views
- Integration with metadata-driven query optimization

### Enabling Metastore Integration

Set the `metastore.enabled` option to enable metastore integration:

```sql
SET `metastore.enabled` = true;
```

When enabled, the following operations automatically sync to the metastore:
- CREATE MATERIALIZED VIEW: Stores MV metadata in metastore
- DROP MATERIALIZED VIEW: Removes MV metadata from metastore
- REFRESH MATERIALIZED VIEW: Updates MV metadata in metastore

### Metastore Schema

The MaterializedViewMetadataUnit stored in the metastore contains:

| Field | Type | Description |
|-------|------|-------------|
| storagePlugin | String | Storage plugin name (e.g., "dfs") |
| workspace | String | Workspace name (e.g., "tmp") |
| name | String | Materialized view name |
| owner | String | Owner username |
| sql | String | Defining SQL statement |
| workspaceSchemaPath | List | Schema path components |
| dataLocation | String | Path to data directory |
| refreshStatus | String | PENDING or COMPLETE |
| lastRefreshTime | Long | Timestamp of last refresh |
| lastModifiedTime | Long | Timestamp of last modification |

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `planner.enable_materialized_view_rewrite` | true | Enables automatic query rewriting to use materialized views |
| `metastore.enabled` | false | Enables Drill Metastore for centralized metadata storage |

## Lifecycle Management

### Creating a Materialized View

1. Parse and validate the SQL statement
2. Create the data directory in the workspace
3. Execute the defining query and write results as Parquet
4. Write the definition file with COMPLETE status
5. Sync metadata to metastore (if enabled)

### Refreshing a Materialized View

1. Read the existing definition file
2. Delete the existing data directory
3. Re-execute the defining query and write new results
4. Update the definition file with new refresh timestamp
5. Sync updated metadata to metastore (if enabled)

### Dropping a Materialized View

1. Delete the definition file
2. Delete the data directory and all contents
3. Remove metadata from metastore (if enabled)
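
The three lifecycles can be modelled on a local directory as follows. This is a simplified sketch and not Drill's handler code: the function names are hypothetical, the `run_query` callback stands in for query execution, a JSON file stands in for the Parquet output, SQL validation is skipped, and the metastore sync steps are omitted.

```python
import json
import shutil
import time
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, object]]


def _paths(workspace: Path, name: str) -> Tuple[Path, Path]:
    """Definition file and data directory for a view, per the storage layout."""
    return workspace / f"{name}.materialized_view.drill", workspace / f"{name}_mv_data"


def _write_data(data_dir: Path, rows: Rows) -> None:
    data_dir.mkdir(parents=True)
    # Assumption: JSON stands in for the Parquet files Drill would write.
    (data_dir / "0_0_0.json").write_text(json.dumps(rows))


def create(workspace: Path, name: str, sql: str, run_query: Callable[[str], Rows]) -> None:
    definition_file, data_dir = _paths(workspace, name)
    _write_data(data_dir, run_query(sql))  # create directory, execute, write results
    definition = {
        "name": name,
        "sql": sql,
        "lastRefreshTime": int(time.time() * 1000),
        "refreshStatus": "COMPLETE",
    }
    definition_file.write_text(json.dumps(definition))  # written after the data


def refresh(workspace: Path, name: str, run_query: Callable[[str], Rows]) -> None:
    definition_file, data_dir = _paths(workspace, name)
    definition = json.loads(definition_file.read_text())  # read existing definition
    shutil.rmtree(data_dir)                                # delete the old data
    _write_data(data_dir, run_query(definition["sql"]))   # re-execute and rewrite
    definition["lastRefreshTime"] = int(time.time() * 1000)
    definition_file.write_text(json.dumps(definition))


def drop(workspace: Path, name: str) -> None:
    definition_file, data_dir = _paths(workspace, name)
    definition_file.unlink()   # delete the definition file
    shutil.rmtree(data_dir)    # delete the data directory and all contents
```

Because the refresh in this model deletes the old data before writing the new results, readers of the view have nothing to scan in between; that window is a property of the delete-then-rewrite order listed above.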

## Limitations

Current limitations of the materialized view implementation:

1. **Full Refresh Only**: Incremental refresh is not yet supported. Each refresh completely replaces the stored data.

2. **No Automatic Refresh**: Materialized views must be manually refreshed. There is no automatic refresh mechanism based on source data changes.

3. **Writable File-System Workspaces Only**: Materialized views can only be created in file-system based workspaces that support write operations.

4. **No Partitioning**: Materialized view data is not partitioned. All data is stored in a single directory.

5. **Query Rewriting Scope**: Query rewriting works best for exact or near-exact matches. Complex transformations may not be recognized.

## Implementation Details

### Key Classes

| Class | Package | Description |
|-------|---------|-------------|
| MaterializedView | org.apache.drill.exec.dotdrill | Data model for MV definition |
| DrillMaterializedViewTable | org.apache.drill.exec.planner.logical | TranslatableTable implementation |
| MaterializedViewHandler | org.apache.drill.exec.planner.sql.handlers | SQL handler for CREATE/DROP/REFRESH |
| MaterializedViewRewriter | org.apache.drill.exec.planner.logical | Query rewriting using Calcite |
| SqlCreateMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for CREATE |
| SqlDropMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for DROP |
| SqlRefreshMaterializedView | org.apache.drill.exec.planner.sql.parser | SQL parser for REFRESH |

### Parser Grammar

The SQL parser grammar for materialized views is defined in `parserImpls.ftl`. The grammar supports:
- CREATE [OR REPLACE] MATERIALIZED VIEW [IF NOT EXISTS]
- DROP MATERIALIZED VIEW [IF EXISTS]
- REFRESH MATERIALIZED VIEW

### Query Rewriting Process

The MaterializedViewRewriter class implements query rewriting:

1. **Discovery**: Scans all accessible schemas for materialized views
2. **Filtering**: Selects candidates with COMPLETE refresh status
3. **Matching**: Uses Calcite's SubstitutionVisitor to match query structures
4. **Substitution**: Replaces matched portions with MV scans
5. **Selection**: Returns the first successful substitution

The SubstitutionVisitor performs structural matching by:
- Comparing relational expression trees
- Identifying equivalent subexpressions
- Handling column renaming and reordering
- Supporting partial matches with residual predicates

## Testing

### Unit Tests

- `TestMaterializedViewSqlParser`: Parser syntax validation
- `TestMaterializedView`: Data model serialization tests

### Integration Tests

- `TestMaterializedViewSupport`: End-to-end CREATE/DROP/REFRESH tests
- `TestMaterializedViewRewriting`: Query rewriting scenarios

Run the tests with:

```bash
mvn test -pl exec/java-exec -Dtest='TestMaterializedView*'
```

## Future Enhancements

Planned improvements for future releases:

1. **Incremental Refresh**: Support for refreshing only changed data based on source table modifications.

2. **Automatic Refresh**: Scheduled or trigger-based automatic refresh mechanisms.

3. **Partitioned Storage**: Partition materialized view data for better query performance.

4. **Cost-Based Selection**: When multiple MVs match, select based on estimated query cost.

5. **Staleness Tracking**: Track source table changes to identify stale materialized views.

6. **INFORMATION_SCHEMA Integration**: Expose materialized views in INFORMATION_SCHEMA tables.