Summary
Add External Tables that read data in place from object storage or local files (Parquet/ORC/JSONL) and optionally via Iceberg/Delta table metadata—no ingest required. Features include predicate/column pushdown, adaptive file pruning, stats caching, and alignment with LatticeDB’s bitemporal semantics (querying external snapshots AS OF a commit/timestamp). This enables “lakehouse” interoperability while keeping LatticeDB as the query/transaction/lineage brain.
Motivation
- Zero-copy analytics: Query data lakes without ETL. Faster time-to-first-query and cheaper storage.
- Federated pipelines: Mix external fact tables with internal transactional dimensions.
- Temporal truth: Ask “what did the lake contain at commit C or time T?”—consistent with LatticeDB time-travel.
- Ops simplicity: Avoid duplicate copies; cache only what helps performance.
Proposed Design
1) SQL Surface
```sql
-- Register a location-backed external table (Parquet)
CREATE EXTERNAL TABLE ext_sales (
  id     BIGINT,
  region TEXT,
  amount DOUBLE,
  ts     TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://data/sales/2025/'
WITH (
  credentials    = 'aws_profile:analytics',
  partitioning   = 'region, dt',
  cache          = 'metadata', -- none | metadata | data
  format_options = 'parquet.read_batch=8192'
);

-- Iceberg/Delta catalogs
CREATE EXTERNAL TABLE ext_events
USING ICEBERG
CATALOG 'glue'
NAMESPACE 'prod.analytics'
TABLE 'events';

-- Time-travel (metadata-backed)
SELECT COUNT(*)
FROM ext_events FOR SYSTEM_TIME AS OF '2025-09-01T00:00:00Z';
```
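Under the hood, resolving `FOR SYSTEM_TIME AS OF` against a snapshot log amounts to picking the latest snapshot committed at or before the requested timestamp. A minimal Python sketch of that lookup (`SNAPSHOT_LOG` and `resolve_as_of` are illustrative names, not LatticeDB's actual resolver):

```python
import bisect

# (commit_timestamp, snapshot_id) pairs, kept sorted by commit timestamp,
# as they would appear in an Iceberg snapshot log or Delta txn log.
SNAPSHOT_LOG = [
    (100, "snap-a"),
    (200, "snap-b"),
    (300, "snap-c"),
]

def resolve_as_of(ts: int) -> str:
    """Return the id of the latest snapshot committed at or before `ts`."""
    times = [t for t, _ in SNAPSHOT_LOG]
    i = bisect.bisect_right(times, ts)  # first snapshot strictly after ts
    if i == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return SNAPSHOT_LOG[i - 1][1]
```

An `AS OF` timestamp between two commits therefore resolves to the earlier commit, which is what makes repeated time-travel queries deterministic.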
```sql
-- Join external with internal
SELECT e.user_id, SUM(e.spend)
FROM ext_events e
JOIN users u ON u.id = e.user_id
WHERE e.region = 'NA' AND e.ts >= NOW() - INTERVAL '7 days'
GROUP BY e.user_id;
```
2) Engine & Planner
- Connectors: Parquet/ORC/JSONL readers; Iceberg/Delta metadata resolvers (manifest/txn log).
- Pruning: partition + statistics pruning (min/max, Bloom), file/row-group skipping.
- Pushdown: projection and predicate pushdown; lazy decompression; vectorized scans.
- Statistics cache: background job to cache schema, row-group stats, and partition maps in the catalog.
- Temporal alignment:
  - Iceberg: use snapshot/sequence numbers to resolve AS OF.
  - Delta: use version or timestamp to resolve log state.
  - Raw files: optional watermark checkpoints recorded in LatticeDB to emulate a snapshot boundary.
- IO & parallelism: multi-file parallel reads; adaptive concurrency based on file sizes & S3 throttling hints.
- Security: credential objects (KMS-encrypted); row/column masking and DP policies applied post-scan.
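To make the pruning step concrete, here is a minimal Python sketch of min/max statistics pruning: a row group can be skipped whenever its value range is disjoint from the predicate range. `RowGroupStats` and `can_skip` are illustrative names, not part of LatticeDB's actual scanner:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Min/max statistics for one column within a Parquet row group."""
    min_val: float
    max_val: float

def can_skip(stats: RowGroupStats, lo: float, hi: float) -> bool:
    """Skip the row group when its [min, max] range cannot overlap
    the predicate range [lo, hi]."""
    return stats.max_val < lo or stats.min_val > hi

# Example: predicate `amount BETWEEN 100 AND 200` against three row groups
groups = [RowGroupStats(0, 50), RowGroupStats(90, 150), RowGroupStats(300, 400)]
skipped = [can_skip(g, 100, 200) for g in groups]
```

The same containment test generalizes to partition-value pruning and, with Bloom filters, to equality predicates on high-cardinality columns.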
3) GUI/UX
- External Table Wizard (select source, schema inference preview, partition hints).
- File/partition browser with row-group stats and projected cost.
- Snapshot picker for Iceberg/Delta (by timestamp or snapshot id).
- Diagnostics: % rows pruned, # files touched, scan throughput, cache hit rate.
4) Caching Strategy
- Metadata cache in catalog tables with TTL/invalidations on source change.
- Optional data cache: hot row-groups to local SSD with ref-counts; checksum to detect staleness.
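The TTL-plus-invalidation behavior above can be sketched as a small cache keyed by (table, file), where each entry also records the source object's etag/checksum so a change at the source invalidates it even before the TTL expires. All names here are hypothetical, not LatticeDB's actual catalog API:

```python
import time

class MetadataCache:
    """TTL cache for file-level metadata; entries store the source
    object's etag so source changes invalidate them immediately."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (stats, etag, stored_at)

    def put(self, key, stats, etag):
        self._entries[key] = (stats, etag, time.monotonic())

    def get(self, key, current_etag):
        """Return cached stats, or None if missing, expired, or stale."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stats, etag, stored_at = entry
        expired = time.monotonic() - stored_at > self.ttl
        stale = etag != current_etag
        if expired or stale:
            del self._entries[key]  # drop so the planner refetches stats
            return None
        return stats
```

Checking the etag on every read keeps the cache correct under file drift at the cost of one cheap HEAD-style lookup per planning pass.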
5) Error Handling & Correctness
- Schema evolution: soft-fail on additive columns; mapping rules for type widening; explicit policy for incompatible changes.
- File drift: detect deletes/overwrites; surface "inconsistent snapshot" warnings unless resolved via FOR SYSTEM_TIME.
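One way the type-widening policy could be expressed is as an explicit allow-list of lossless widenings, with everything else rejected. This is a hedged sketch of one possible policy, not the proposal's final rule set:

```python
# Hypothetical allow-list of lossless widenings (stored type -> declared type).
WIDENING = {
    ("INT", "BIGINT"),
    ("INT", "DOUBLE"),
    ("FLOAT", "DOUBLE"),
}

def resolve(stored: str, declared: str) -> str:
    """Classify a stored-vs-declared type pair as 'ok', 'widen', or 'reject'."""
    if stored == declared:
        return "ok"
    if (stored, declared) in WIDENING:
        return "widen"
    return "reject"  # incompatible change: surface an explicit error
```

Keeping the table explicit (rather than inferring widenings) makes the "explicit policy for incompatible changes" auditable and easy to extend per deployment.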
Milestones
- MVP
  - Parquet reader (local + S3), projection/predicate pushdown, partition pruning, metadata cache.
  - CREATE EXTERNAL TABLE ... STORED AS PARQUET/JSONL.
- Iceberg/Delta
  - Snapshot resolution, commit/time AS OF queries, manifest/txn-log pruning.
- Performance & Ops
  - Parallel readers, adaptive batch sizes, optional data cache, GUI wizard & diagnostics.
- Advanced
  - ORC support, Bloom filters, schema evolution tooling, credential managers (AWS/GCP/Azure).
Acceptance Criteria
- Query on a 10TB partitioned Parquet dataset touches ≤5% of files when predicates are selective (demonstrated in diagnostics).
- FOR SYSTEM_TIME AS OF on Iceberg/Delta returns rows consistent with the referenced snapshot/version.
- Pushdown verified via EXPLAIN (scan nodes show projected columns and pushed predicates).
- Metadata cache improves planning latency by ≥3× on the second query against the same dataset.
Checklist
- Parquet reader + projection/predicate pushdown
- Partition & stats pruning + catalog cache
- S3/GCS/Azure credential objects
- Iceberg snapshot resolution
- Delta log reader & versioned queries
- GUI wizard + diagnostics panel
- Tests: correctness, pruning efficacy, schema evolution
- Docs & examples (lake join with internal tables)