Feature: Zero-Copy External Tables for Parquet/JSON/ORC (+ Iceberg/Delta metadata) with Time-Travel Alignment #20

@hoangsonww

Description

Summary

Add External Tables that read data in place from object storage or local files (Parquet/ORC/JSONL) and optionally via Iceberg/Delta table metadata—no ingest required. Features include predicate/column pushdown, adaptive file pruning, stats caching, and alignment with LatticeDB’s bitemporal semantics (querying external snapshots AS OF a commit/timestamp). This enables “lakehouse” interoperability while keeping LatticeDB as the query/transaction/lineage brain.


Motivation

  • Zero-copy analytics: Query data lakes without ETL. Faster time-to-first-query and cheaper storage.
  • Federated pipelines: Mix external fact tables with internal transactional dimensions.
  • Temporal truth: Ask “what did the lake contain at commit C or time T?”—consistent with LatticeDB time-travel.
  • Ops simplicity: Avoid duplicate copies; cache only what helps performance.

Proposed Design

1) SQL Surface

-- Register a location-backed external table (Parquet)
CREATE EXTERNAL TABLE ext_sales (
  id BIGINT,
  region TEXT,
  amount DOUBLE,
  ts TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://data/sales/2025/'
WITH (
  credentials = 'aws_profile:analytics',
  partitioning = 'region, dt',
  cache = 'metadata',             -- none | metadata | data
  format_options = 'parquet.read_batch=8192'
);

-- Iceberg/Delta catalogs
CREATE EXTERNAL TABLE ext_events
USING ICEBERG
CATALOG 'glue'
NAMESPACE 'prod.analytics'
TABLE 'events';

-- Time-travel (metadata-backed)
SELECT COUNT(*)
FROM ext_events FOR SYSTEM_TIME AS OF '2025-09-01T00:00:00Z';

-- Join external with internal
SELECT e.user_id, SUM(e.spend)
FROM ext_events e
JOIN users u ON u.id = e.user_id
WHERE e.region='NA' AND e.ts >= NOW() - INTERVAL '7 days';
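For a query like the last one, the planner can combine partition pruning (`region='NA'`) with row-group statistics pruning before any bytes are read. The following is a minimal Python sketch of that idea; `FileStats`, `prune_files`, and the stats layout are hypothetical illustrations, not LatticeDB's actual catalog schema:

```python
from dataclasses import dataclass

# Hypothetical catalog entry: per-file partition values plus per-column
# min/max statistics, as a Parquet footer or a stats cache would provide them.
@dataclass
class FileStats:
    path: str
    partitions: dict    # e.g. {"region": "NA"}
    col_minmax: dict    # e.g. {"amount": (10.0, 200.0)}

def prune_files(files, partition_filters, range_filters):
    """Keep only files that can possibly satisfy the predicates.

    partition_filters: {col: required_value} — exact partition matches.
    range_filters: {col: (lo, hi)} — a file is skipped when its stored
    min/max interval for the column does not overlap the predicate interval.
    """
    survivors = []
    for f in files:
        # Partition pruning: wrong partition value ⇒ skip the whole file.
        if any(f.partitions.get(c) != v for c, v in partition_filters.items()):
            continue
        # Statistics pruning: disjoint [min, max] ⇒ no row can match.
        skip = False
        for c, (lo, hi) in range_filters.items():
            if c in f.col_minmax:
                fmin, fmax = f.col_minmax[c]
                if fmax < lo or fmin > hi:
                    skip = True
                    break
        if not skip:
            survivors.append(f.path)
    return survivors
```

The same overlap test applies recursively at row-group granularity inside each surviving file, which is where the "≤5% of files touched" acceptance criterion is measured.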

2) Engine & Planner

  • Connectors: Parquet/ORC/JSONL readers; Iceberg/Delta metadata resolvers (manifest/txn log).

  • Pruning: Partition + statistics pruning (min/max, Bloom), file/row-group skipping.

  • Pushdown: Projection and predicate pushdown; lazy decompression; vectorized scans.

  • Statistics cache: Background job to cache schema, row-group stats, and partition maps in the catalog.

  • Temporal alignment:

    • Iceberg: use snapshot/sequence numbers to resolve AS OF.
    • Delta: use version or timestamp to resolve log state.
    • Raw files: optional watermark checkpoints recorded in LatticeDB to emulate a snapshot boundary.
  • IO & parallelism: Multi-file parallel reads; adaptive concurrency based on file sizes & S3 throttling hints.

  • Security: Credential objects (KMS-encrypted), row/column masking and DP policies applied post-scan.
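The temporal-alignment rule above reduces to "find the latest snapshot committed at or before the requested time." A sketch of that resolution step, assuming a sorted snapshot log of (commit timestamp, snapshot id) pairs as Iceberg metadata would expose (the log contents here are illustrative):

```python
import bisect
from datetime import datetime, timezone

# Hypothetical snapshot log, sorted by commit time, as an Iceberg
# metadata file would expose it: (commit_timestamp, snapshot_id).
SNAPSHOTS = [
    (datetime(2025, 8, 30, tzinfo=timezone.utc), 101),
    (datetime(2025, 8, 31, tzinfo=timezone.utc), 102),
    (datetime(2025, 9, 2, tzinfo=timezone.utc), 103),
]

def resolve_as_of(snapshots, ts):
    """Return the id of the latest snapshot committed at or before `ts`.

    FOR SYSTEM_TIME AS OF '2025-09-01T00:00:00Z' against the log above
    should read snapshot 102: the last commit with timestamp <= ts.
    """
    times = [t for t, _ in snapshots]
    i = bisect.bisect_right(times, ts)
    if i == 0:
        raise ValueError(f"no snapshot exists at or before {ts.isoformat()}")
    return snapshots[i - 1][1]
```

Delta's version/timestamp resolution follows the same shape over its transaction log; for raw files, the watermark checkpoints stand in for the snapshot log.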

3) GUI/UX

  • External Table Wizard (select source, schema inference preview, partition hints).
  • File/partition browser with row-group stats and projected cost.
  • Snapshot picker for Iceberg/Delta (by timestamp or snapshot id).
  • Diagnostics: % rows pruned, # files touched, scan throughput, cache hit rate.

4) Caching Strategy

  • Metadata cache in catalog tables with TTL/invalidations on source change.
  • Optional data cache: hot row-groups to local SSD with ref-counts; checksum to detect staleness.
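A minimal sketch of the TTL-plus-invalidation behavior described above, assuming the source change signal is a checksum over the file listing (names and sizes); class and method names are illustrative, not the actual catalog API:

```python
import hashlib
import time

class MetadataCache:
    """TTL'd metadata cache keyed by table location.

    Each entry carries a checksum of the source listing; a mismatch on
    lookup means the source changed, so the entry is treated as stale
    even inside its TTL window.
    """
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock              # injectable for testing
        self._entries = {}              # location -> (expires_at, checksum, payload)

    @staticmethod
    def checksum(listing):
        blob = "\n".join(f"{name}:{size}" for name, size in sorted(listing))
        return hashlib.sha256(blob.encode()).hexdigest()

    def put(self, location, listing, payload):
        self._entries[location] = (
            self.clock() + self.ttl, self.checksum(listing), payload)

    def get(self, location, current_listing):
        entry = self._entries.get(location)
        if entry is None:
            return None
        expires_at, csum, payload = entry
        if self.clock() > expires_at or csum != self.checksum(current_listing):
            del self._entries[location]  # expired, or the source drifted
            return None
        return payload
```

In practice the checksum input would be whatever cheap change signal the store offers (ETags, Iceberg snapshot id, Delta version) rather than a full listing.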

5) Error Handling & Correctness

  • Schema evolution: soft-fail on additive columns; mapping rules for type widening; explicit policy for incompatible changes.
  • File drift: Detect deletes/overwrites; surface “inconsistent snapshot” warnings unless resolved via FOR SYSTEM_TIME.
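The schema-evolution policy above can be expressed as a per-column classification. A sketch, where the widening table and type names are illustrative assumptions rather than the final rule set:

```python
# Hypothetical widening rules: a declared type may be observed as any type
# it maps to here without breaking readers.
WIDENINGS = {
    ("INT", "BIGINT"), ("FLOAT", "DOUBLE"),
    ("INT", "DOUBLE"), ("BIGINT", "DOUBLE"),
}

def check_evolution(declared, observed):
    """Classify each observed column against the declared external schema.

    Returns one of 'ok' | 'added' | 'widened' | 'incompatible' per column:
    additive columns soft-fail, widenings are mapped automatically, and
    anything else requires an explicit policy decision.
    """
    report = {}
    for col, typ in observed.items():
        if col not in declared:
            report[col] = "added"
        elif declared[col] == typ:
            report[col] = "ok"
        elif (declared[col], typ) in WIDENINGS:
            report[col] = "widened"
        else:
            report[col] = "incompatible"
    return report
```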

Milestones

  1. MVP
  • Parquet reader (local + S3), projection/predicate pushdown, partition pruning, metadata cache.
  • CREATE EXTERNAL TABLE ... STORED AS PARQUET/JSONL.
  2. Iceberg/Delta
  • Snapshot resolution, commit/time AS OF queries, manifest/txn-log pruning.
  3. Performance & Ops
  • Parallel readers, adaptive batch sizes, optional data cache, GUI wizard & diagnostics.
  4. Advanced
  • ORC support, Bloom filters, schema evolution tooling, credential managers (AWS/GCP/Azure).

Acceptance Criteria

  • Query on a 10TB partitioned Parquet dataset touches ≤5% of files when predicates are selective (demonstrated in diagnostics).
  • FOR SYSTEM_TIME AS OF on Iceberg/Delta returns rows consistent with the referenced snapshot/version.
  • Pushdown verified via EXPLAIN (scan nodes show projected columns and pushed predicates).
  • Metadata cache improves planning latency by ≥3× on second query to same dataset.

Checklist

  • Parquet reader + projection/predicate pushdown
  • Partition & stats pruning + catalog cache
  • S3/GCS/Azure credential objects
  • Iceberg snapshot resolution
  • Delta log reader & versioned queries
  • GUI wizard + diagnostics panel
  • Tests: correctness, pruning efficacy, schema evolution
  • Docs & examples (lake join with internal tables)
