Commit da8ed33

Integration test (#10)

* wip integration test
* working integration test
* fix license header + update readme

1 parent e1ba254 commit da8ed33

File tree

4 files changed: +600 -9 lines changed


README.md

Lines changed: 15 additions & 3 deletions

````diff
@@ -2,9 +2,9 @@
 
 An implementation of incremental view maintenance & query rewriting for materialized views in DataFusion.
 
-A **materialized view** is a view whose query has been pre-computed and saved for later use. This can drastically speed up workloads by pre-computing at least a large fragment of a user-provided query. Furthermore, by implementing a _view matching_ algorithm, we can implement an optimizer that rewrites queries to automatically make use of materialized views where possible and beneficial, a concept known as *query rewriting*.
+A **materialized view** is a view whose query has been pre-computed and saved for later use. This can drastically speed up workloads by pre-computing at least a large fragment of a user-provided query. Furthermore, by implementing a _view matching_ algorithm, we can implement an optimizer that rewrites queries to automatically make use of materialized views where possible and beneficial, a concept known as _query rewriting_.
 
-Efficiently maintaining the up-to-dateness of a materialized view is a problem known as *incremental view maintenance*. It is a hard problem in general, but we make some simplifying assumptions:
+Efficiently maintaining the up-to-dateness of a materialized view is a problem known as _incremental view maintenance_. It is a hard problem in general, but we make some simplifying assumptions:
 
 * Data is stored as Hive-partitioned files in object storage.
 * The smallest unit of data that can be updated is a single file.
@@ -14,7 +14,7 @@ This is a typical pattern with DataFusion, as files in object storage usually ar
 ## Example
 
 Here we walk through a hypothetical example of setting up a materialized view, to illustrate
-what this library offers. The core of the incremental view maintenance implementation is a UDTF (User-Defined Table Function),
+what this library offers. The core of the incremental view maintenance implementation is a UDTF (User-Defined Table Function),
 called `mv_dependencies`, that outputs a build graph for a materialized view. This gives users the information they need to determine
 when partitions of the materialized view need to be recomputed.
 
@@ -52,3 +52,15 @@ SELECT * FROM mv_dependencies('m1');
 +--------------------+----------------------+---------------------+-------------------+--------------------------------------+----------------------+
 ```
 
+## More detailed example (with code)
+
+As of now, actually implementing materialized views is somewhat complicated, as the library is initially focused on providing a minimal kernel of functionality that can be shared across multiple implementations of materialized views. Broadly, the process includes these steps:
+
+* Define a custom `MaterializedListingTable` type that implements `Materialized`
+* Register the type globally using the `register_materialized` global function
+* Initialize the `FileMetadata` component
+* Initialize the `RowMetadataRegistry`
+* Register the `mv_dependencies` and `stale_files` UDTFs (User Defined Table Functions) in your DataFusion `SessionContext`
+* Periodically regenerate directories marked as stale by `stale_files`
+
+A full walkthrough of this process including implementation can be seen in an integration test, under [`tests/materialized_listing_table.rs`](tests/materialized_listing_table.rs).
````
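The workflow these steps describe boils down to timestamp bookkeeping: a target partition must be regenerated when it is missing or older than the newest of its source files. A minimal, self-contained Rust sketch of that rule and of the "periodically regenerate stale directories" loop (the `DependencyRow` struct and field names are illustrative, not the library's API):

```rust
/// Illustrative build-graph row, loosely modeled on what `mv_dependencies`
/// reports per target partition. Names here are illustrative only.
#[derive(Debug, Clone)]
struct DependencyRow {
    target: String,
    target_last_modified: Option<u64>, // None: partition never materialized
    sources_last_modified: u64,        // max over all source files feeding the target
}

/// A target is stale when it is missing or older than its newest source.
fn stale_targets(rows: &[DependencyRow]) -> Vec<String> {
    rows.iter()
        .filter(|r| r.target_last_modified.unwrap_or(0) < r.sources_last_modified)
        .map(|r| r.target.clone())
        .collect()
}

/// "Periodically regenerate directories marked as stale": recompute each
/// stale partition and stamp it with the current time.
fn regenerate(rows: &mut [DependencyRow], now: u64) {
    for row in rows.iter_mut() {
        if row.target_last_modified.unwrap_or(0) < row.sources_last_modified {
            // ... recompute the partition's files here ...
            row.target_last_modified = Some(now);
        }
    }
}

fn main() {
    let mut rows = vec![
        DependencyRow { target: "year=2023/".into(), target_last_modified: Some(100), sources_last_modified: 90 },
        DependencyRow { target: "year=2024/".into(), target_last_modified: None, sources_last_modified: 90 },
    ];
    assert_eq!(stale_targets(&rows), vec!["year=2024/".to_string()]);
    regenerate(&mut rows, 200);
    assert!(stale_targets(&rows).is_empty());
    println!("ok");
}
```

After regeneration the rebuilt partitions carry a fresh timestamp, so a subsequent staleness scan reports nothing to do.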

src/materialized/dependencies.rs

Lines changed: 3 additions & 6 deletions

```diff
@@ -200,12 +200,9 @@ impl TableFunctionImpl for StaleFilesUdtf {
             col("expected_target").alias("target"),
             col("target_last_modified"),
             col("sources_last_modified"),
-            coalesce(vec![
-                col("target_last_modified"),
-                lit(ScalarValue::TimestampNanosecond(Some(0), None)),
-            ])
-            .lt(col("sources_last_modified"))
-            .alias("is_stale"),
+            nvl(col("target_last_modified"), lit(0))
+                .lt(col("sources_last_modified"))
+                .alias("is_stale"),
         ])?
         .build()?;
 
```
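The change above swaps a `coalesce` over a nanosecond-timestamp literal for the simpler `nvl(col(...), lit(0))`; both default a missing `target_last_modified` to the epoch, so a never-materialized target always compares as stale. In plain Rust terms (a sketch of the predicate's semantics, not the DataFusion expression API):

```rust
/// Mirrors `nvl(target_last_modified, 0) < sources_last_modified`:
/// a missing target timestamp defaults to 0, i.e. stale whenever
/// any source file has ever been written.
fn is_stale(target_last_modified: Option<i64>, sources_last_modified: i64) -> bool {
    target_last_modified.unwrap_or(0) < sources_last_modified
}

fn main() {
    assert!(is_stale(None, 90));       // never materialized -> stale
    assert!(is_stale(Some(50), 90));   // older than newest source -> stale
    assert!(!is_stale(Some(100), 90)); // newer than all sources -> fresh
    println!("ok");
}
```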

src/materialized/row_metadata.rs

Lines changed: 13 additions & 0 deletions

```diff
@@ -30,6 +30,7 @@ use super::{file_metadata::FileMetadata, hive_partition::hive_partition, META_CO
 #[derive(Default)]
 pub struct RowMetadataRegistry {
     metadata_sources: DashMap<String, Arc<dyn RowMetadataSource>>,
+    default_source: Option<Arc<dyn RowMetadataSource>>,
 }
 
 impl std::fmt::Debug for RowMetadataRegistry {
@@ -48,6 +49,17 @@ impl std::fmt::Debug for RowMetadataRegistry {
 }
 
 impl RowMetadataRegistry {
+    /// Initializes this `RowMetadataRegistry` with a default `RowMetadataSource`
+    /// to be used if a table has not been explicitly registered with a specific source.
+    ///
+    /// Typically the [`FileMetadata`] source should be used as the default.
+    pub fn new_with_default_source(default_source: Arc<dyn RowMetadataSource>) -> Self {
+        Self {
+            default_source: Some(default_source),
+            ..Default::default()
+        }
+    }
+
     /// Registers a metadata source for a specific table.
     /// Returns the previously registered source for this table, if any
     pub fn register_source(
@@ -63,6 +75,7 @@ impl RowMetadataRegistry {
         self.metadata_sources
             .get(&table.to_string())
             .map(|o| Arc::clone(o.value()))
+            .or_else(|| self.default_source.clone())
             .ok_or_else(|| DataFusionError::Internal(format!("No metadata source for {}", table)))
     }
 }
```
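The `default_source` fallback added above is just an `Option` consulted after the per-table map misses. A stripped-down sketch of that lookup pattern, using a plain `HashMap` in place of the crate's `DashMap` and illustrative trait/type names:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-ins for the crate's trait and its file-based source.
trait RowMetadataSource {
    fn name(&self) -> &str;
}

struct FileMetadataSource;
impl RowMetadataSource for FileMetadataSource {
    fn name(&self) -> &str { "file_metadata" }
}

struct Registry {
    sources: HashMap<String, Arc<dyn RowMetadataSource>>,
    default_source: Option<Arc<dyn RowMetadataSource>>,
}

impl Registry {
    /// Per-table source if registered, else the default, else an error --
    /// the same cascade as `get(...).map(...).or_else(...).ok_or_else(...)`.
    fn get_source(&self, table: &str) -> Result<Arc<dyn RowMetadataSource>, String> {
        self.sources
            .get(table)
            .cloned()
            .or_else(|| self.default_source.clone())
            .ok_or_else(|| format!("No metadata source for {table}"))
    }
}

fn main() {
    let default: Arc<dyn RowMetadataSource> = Arc::new(FileMetadataSource);
    let registry = Registry { sources: HashMap::new(), default_source: Some(default) };
    // No explicit registration for "t1": the default source is returned.
    assert_eq!(registry.get_source("t1").unwrap().name(), "file_metadata");

    let bare = Registry { sources: HashMap::new(), default_source: None };
    // Without a default, an unregistered table is still an error.
    assert!(bare.get_source("t1").is_err());
    println!("ok");
}
```

This keeps the previous behavior (error on unknown table) when no default is configured, while letting callers opt in to a catch-all source such as `FileMetadata`.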
