# Delta Lake Destination — Implementation Plan

## Goals
- Implement a `DeltaDestination` that satisfies `etl::destination::Destination` for:
  - Initial table sync (`truncate_table`, `write_table_rows`)
  - CDC (`write_events`: inserts, updates, deletes, truncates)
- Preserve correctness and ordering (LSN last-wins) with durable, atomic commits.
- Avoid micro-batches with sensible batching and file sizing; support compaction.
- Support schema evolution (additive) and configurable partitioning.
- Keep idempotency and crash-safety consistent with current pipeline semantics.

## Scope and Non-Goals
- Scope: Write-path only (append/merge/delete to a Delta table), optional compaction.
- Non-goals: Reader/query engine; complex schema rewrites (rename/drop); cross-table transactions.

## Architecture

- New module: `etl-destinations/src/delta/`
  - `mod.rs`: re-exports
  - `core.rs`: `DeltaDestination<S>` implementation of `Destination`
  - `client.rs`: thin wrapper over `delta-rs` ops and object store setup
  - `schema.rs`: mapping from `etl::types::TableSchema` to Delta/Arrow schema
  - `encoding.rs`: `TableRow`/`Cell` → Arrow arrays/RecordBatch
  - `validation.rs`: config validation (paths, partitions)
- Dependencies (feature-gated, e.g. `delta`):
  - `deltalake` (delta-rs), `arrow` (aligned with delta-rs), `object_store`, `parquet`, `bytes`, `serde_json`
- Config additions in `etl-config`:
  - `DestinationConfig::Delta { base_uri, warehouse: Option<String>, table_prefix: Option<String>, partition_columns: Option<Vec<String>>, max_concurrent_writes: usize, target_file_size_mb: usize, enable_cdf: bool, optimize_after_commits: Option<u64> }` (sketched below)
- Wiring in `etl-replicator/src/core.rs`:
  - Add a `DestinationConfig::Delta` arm that constructs `DeltaDestination` with `StateStore + SchemaStore`
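
A minimal sketch of the proposed variant, assuming the existing serde conventions in `etl-config` (derive set, tagging, and the surrounding variants are placeholders, not the real enum):

```rust
use serde::{Deserialize, Serialize};

/// Sketch only: the real enum lives in `etl-config` next to the existing variants.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum DestinationConfig {
    // ...existing variants (e.g. BigQuery, Memory)
    Delta {
        /// Root URI of the lakehouse, e.g. "s3://bucket/warehouse" or "file:///tmp/delta".
        base_uri: String,
        warehouse: Option<String>,
        table_prefix: Option<String>,
        /// Partition columns applied to destination tables; validated against the schema.
        partition_columns: Option<Vec<String>>,
        max_concurrent_writes: usize,
        target_file_size_mb: usize,
        enable_cdf: bool,
        /// Run OPTIMIZE-style compaction every N commits when set.
        optimize_after_commits: Option<u64>,
    },
}
```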

## Semantics

- Table naming and mapping
  - Destination table path per source `TableId`:
    - `<base_uri>/<table_prefix>/<schema>__<table>` (escape `_` as needed; mirror BigQuery naming rule)
  - Persist mapping in `SchemaStore`/`StateStore` (`table_mappings`) to remain stable across restarts.
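
A rough sketch of the path derivation; identifier escaping is omitted here and should mirror the existing BigQuery naming rule, and the function name is illustrative:

```rust
/// Builds the Delta table location for a source table, e.g.
/// "s3://bucket/warehouse/etl/public__orders".
fn delta_table_uri(
    base_uri: &str,
    table_prefix: Option<&str>,
    schema: &str,
    table: &str,
) -> String {
    let base = base_uri.trim_end_matches('/');
    let name = format!("{schema}__{table}");
    match table_prefix {
        Some(prefix) => format!("{base}/{prefix}/{name}"),
        None => format!("{base}/{name}"),
    }
}
```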

- Schema mapping
  - `TableSchema` → Arrow/Delta schema:
    - Scalars: bool/int/float/text/timestamp/uuid/json → Arrow equivalents
    - Numeric/decimal: if precision/scale unknown, map to string (match current practice)
    - Arrays: Arrow `List`
  - Always include PK columns (from source schema metadata)
  - For additive changes: `ALTER TABLE ADD COLUMN` (nullable). Do not drop/rename in v1.
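
A hedged sketch of the scalar mapping; `SourceType` is a stand-in for whatever column-type enum `etl::types` exposes, and uuid/json/unbounded numerics fall back to UTF-8 strings:

```rust
use arrow::datatypes::{DataType, TimeUnit};

/// Stand-in for the source column-type enum exposed by `etl::types`.
enum SourceType {
    Bool,
    Int4,
    Int8,
    Float8,
    Text,
    TimestampTz,
    Uuid,
    Json,
    Numeric,
}

/// Maps a scalar source type to an Arrow type; array columns would wrap the
/// element type in `DataType::List`. Numeric falls back to Utf8 when
/// precision/scale are unknown, matching current practice.
fn to_arrow(ty: &SourceType) -> DataType {
    match ty {
        SourceType::Bool => DataType::Boolean,
        SourceType::Int4 => DataType::Int32,
        SourceType::Int8 => DataType::Int64,
        SourceType::Float8 => DataType::Float64,
        SourceType::TimestampTz => DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())),
        SourceType::Text | SourceType::Uuid | SourceType::Json | SourceType::Numeric => DataType::Utf8,
    }
}
```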

- Initial table sync
  - `truncate_table`:
    - Prefer an atomic empty snapshot: commit with "remove all" (delete predicate `true`) or recreate the table version (depending on delta-rs capability).
  - `write_table_rows(table_id, rows)`:
    - Convert to Arrow `RecordBatch` in chunks sized to `target_file_size_mb`.
    - Use the delta-rs writer in `append` mode. Optional partitioning by configured columns.
    - Parallelize per table if the caller invokes concurrently; respect `max_concurrent_writes` for internal splits.
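
A sketch of the append path against delta-rs, assuming a recent `deltalake` release with write operations enabled (the `datafusion` feature in current versions); exact module paths and builder names should be checked against the pinned version, and `append_batches` is an illustrative helper:

```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::protocol::SaveMode;
use deltalake::{DeltaOps, DeltaTableError};

/// Appends already-encoded record batches to the table at `table_uri`
/// as a single Delta commit.
async fn append_batches(table_uri: &str, batches: Vec<RecordBatch>) -> Result<(), DeltaTableError> {
    let ops = DeltaOps::try_from_uri(table_uri).await?;
    ops.write(batches).with_save_mode(SaveMode::Append).await?;
    Ok(())
}
```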

- CDC (`write_events`)
  - Group events by `table_id`. For each table:
    - Build three in-memory sets from the batch:
      - Upserts (Insert + Update): last-wins by PK using LSN order within the batch.
      - Deletes: by PK.
      - Affected PK set = upsert_keys ∪ delete_keys.
    - Transactional commit:
      1) Delete all rows with PK in the affected set (Delta delete predicate; for composite PKs, build a predicate disjunction or write a temporary helper file + merge path).
      2) Append the deduped upsert rows as new files.
  - Ordering/idempotency:
    - Last-wins inside a batch via LSN; across batches the pipeline guarantees ordered delivery and only advances the LSN after a successful commit.
    - Optional: use the Delta `txn` app-level id for extra dedupe safety, with `appId="etl-<pipeline>-<table>"` and a monotonic version (e.g., a per-table sequence stored in `StateStore`).
  - Truncate events:
    - Handle the same as `truncate_table` and continue.
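
For the delete step, one option is to render SQL-style predicates over the affected keys, chunked so no single predicate grows unbounded; this sketch assumes a single string-typed PK column (composite keys would OR together per-row conjunctions):

```rust
/// Renders chunked `pk IN (...)` predicates for the affected keys.
/// String keys are single-quoted with embedded quotes doubled.
fn pk_delete_predicates(pk_col: &str, keys: &[String], chunk_size: usize) -> Vec<String> {
    keys.chunks(chunk_size)
        .map(|chunk| {
            let list = chunk
                .iter()
                .map(|k| format!("'{}'", k.replace('\'', "''")))
                .collect::<Vec<_>>()
                .join(", ");
            format!("{pk_col} IN ({list})")
        })
        .collect()
}
```

Each rendered predicate feeds the Delta delete operation; whether multiple chunks can share a single transaction depends on what the pinned delta-rs version exposes.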

- Partitioning
  - Default: no partitioning.
  - Optional per-table partition columns from config (validate existence and low cardinality).
  - Warn if PK chosen as partition key (can cause skew and small files).
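
Validation can be a pure function over the cached schema; the column/PK inputs below are stand-ins for whatever `TableSchema` exposes:

```rust
/// Checks configured partition columns against the table schema.
/// Returns hard errors (unknown columns) and warnings (PK used as partition key).
fn validate_partition_columns(
    partition_columns: &[String],
    schema_columns: &[String],
    pk_columns: &[String],
) -> (Vec<String>, Vec<String>) {
    let mut errors = Vec::new();
    let mut warnings = Vec::new();
    for col in partition_columns {
        if !schema_columns.contains(col) {
            errors.push(format!("partition column `{col}` not found in table schema"));
        }
        if pk_columns.contains(col) {
            warnings.push(format!(
                "partition column `{col}` is a primary key column; expect skew and small files"
            ));
        }
    }
    (errors, warnings)
}
```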

- Micro-batch mitigation and file sizing
  - Accumulate rows into writer-side batches targeting ~`target_file_size_mb` (e.g., 128–256 MB).
  - During low throughput, still flush on the pipeline's `batch.max_fill_ms` to bound latency, but coalesce inside the destination before closing Parquet files when possible.
  - Optional background compaction: run `OPTIMIZE` (small-file coalescing) every N commits if `optimize_after_commits` is set.
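
A rough sketch of size-based chunking; the per-row byte estimate is an assumption (a cheap heuristic over cell widths), and the real `encoding.rs` would build Arrow arrays column-wise per chunk:

```rust
/// Splits rows into chunks whose estimated encoded size stays near the target,
/// so each flush closes Parquet files of roughly `target_file_size_mb`
/// instead of many small ones.
fn chunk_rows_by_size<R>(
    rows: Vec<R>,
    estimate_bytes: impl Fn(&R) -> usize,
    target_file_size_mb: usize,
) -> Vec<Vec<R>> {
    let target = target_file_size_mb * 1024 * 1024;
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for row in rows {
        let size = estimate_bytes(&row);
        if !current.is_empty() && current_bytes + size > target {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current_bytes += size;
        current.push(row);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```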

- Schema evolution during CDC
  - On schema cache change (from `SchemaStore`), reconcile:
    - Add missing columns as nullable in Delta.
    - Fill absent values as null/default on write.
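
Reconciliation can start from a plain diff of the source-derived Arrow schema against the current Delta table schema; applying the result as nullable column additions goes through whatever metadata/alter path the pinned delta-rs version exposes:

```rust
use std::sync::Arc;

use arrow::datatypes::{Field, Schema};

/// Returns fields present in the desired (source-derived) schema but missing
/// from the current table schema, forced to nullable so the addition stays
/// backward compatible with existing files.
fn missing_columns(desired: &Schema, current: &Schema) -> Vec<Arc<Field>> {
    desired
        .fields()
        .iter()
        .filter(|field| current.field_with_name(field.name()).is_err())
        .map(|field| Arc::new(field.as_ref().clone().with_nullable(true)))
        .collect()
}
```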

- Error handling and retries
  - Destination writes are idempotent given delete-then-append per affected PK set.
  - Commit failures: no LSN advance; the batch will be retried; repeating delete+append results in the same final state.
  - Surface structured errors; include commit metrics.

- Metrics
  - Counters: rows written, deletes applied, commits, optimized files.
  - Gauges: file sizes, rows per file, commit duration.
  - Logs: per-table commit stats; bytes written.

## Pseudocode

- write_table_rows
```
fn write_table_rows(table_id, table_rows):
  ensure_table_exists_and_schema(table_id)
  batches = chunk_into_record_batches(table_rows, target_file_size_mb)
  for batch in batches:
    delta_ops(table_id).append(batch).await
```

- write_events
```
fn write_events(events):
  events_by_table = group_by_table(events)

  for (table_id, evs) in events_by_table:
    ensure_table_exists_and_schema(table_id)

    // Deduplicate by PK with last-wins using (commit_lsn, start_lsn)
    upserts_by_pk = HashMap<PK, Row>
    delete_pks = HashSet<PK>
    for e in evs.in_order():
      match e {
        Insert|Update => upserts_by_pk.insert(pk(e.row), e.row) // overwrite: last wins
        Delete => { upserts_by_pk.remove(&pk(e)); delete_pks.insert(pk(e)); }
        Truncate => {
          // A truncate supersedes everything accumulated so far in this batch,
          // so clear the pending upserts/deletes before continuing.
          handle_truncate(table_id)
          upserts_by_pk.clear(); delete_pks.clear()
        }
      }

    affected_pks = union(keys(upserts_by_pk), delete_pks)

    // One transaction: delete + append
    begin_tx(table_id, app_id, maybe_txn_version)
    if !affected_pks.is_empty():
      delete_where_pk_in(affected_pks)
    if !upserts_by_pk.is_empty():
      record_batches = chunk_into_record_batches(values(upserts_by_pk), target_file_size_mb)
      for rb in record_batches:
        append(rb)
    commit_tx()
```

## Integration Points

- `etl-config`:
  - Add `Delta` variant in `DestinationConfig` + serde.
  - Validation for URIs and partition columns.

- `etl-replicator/src/core.rs`:
  - Handle `DestinationConfig::Delta` creation.

- `etl-destinations`:
  - New `delta` module; feature flag; export `DeltaDestination`.

- `etl`:
  - Reuse existing batching; no changes required.
  - Optional: a small helper to get PK metadata from `TableSchema` if not already exposed.

## Testing Plan

- Unit tests (file:// local object store):
  - Create table; append rows; verify snapshot row count.
  - CDC last-wins semantics within a batch; across multiple batches.
  - Deletes only; upserts+deletes on same key.
  - Truncate behavior.
  - Schema add column: write with/without new column present.
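
The in-batch last-wins rule is easy to cover before any Delta I/O exists; the dedup helper below is a self-contained stand-in for the corresponding step in `write_events` (integer PKs and string payloads are illustrative, not the real row types):

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone)]
enum Op {
    Insert(i64, &'static str),
    Update(i64, &'static str),
    Delete(i64),
}

/// Stand-in for the dedup step: returns (upserts by PK, delete PKs), last-wins in input order.
fn dedup(ops: &[Op]) -> (HashMap<i64, &'static str>, HashSet<i64>) {
    let mut upserts = HashMap::new();
    let mut deletes = HashSet::new();
    for op in ops {
        match op {
            Op::Insert(pk, row) | Op::Update(pk, row) => {
                deletes.remove(pk);
                upserts.insert(*pk, *row);
            }
            Op::Delete(pk) => {
                upserts.remove(pk);
                deletes.insert(*pk);
            }
        }
    }
    (upserts, deletes)
}

#[test]
fn last_wins_within_batch() {
    let ops = [
        Op::Insert(1, "a"),
        Op::Update(1, "b"),
        Op::Delete(2),
        Op::Insert(2, "c"),
    ];
    let (upserts, deletes) = dedup(&ops);
    assert_eq!(upserts[&1], "b"); // update overwrites the earlier insert
    assert_eq!(upserts[&2], "c"); // insert after delete resurrects the row
    assert!(deletes.is_empty());
}
```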

- Integration tests:
  - Run pipeline with memory source into Delta file store; verify final state.
  - Idempotency: inject failure after write, before LSN advance; rerun.

- Performance/sizing tests:
  - Validate file sizes approach target.
  - Validate compaction reduces small files.

## Milestones

- M1: Scaffolding and config; create/append; initial sync end-to-end.
- M2: CDC write path (delete+append), last-wins, idempotency; truncate.
- M3: Schema evolution (add column), partitioning, metrics.
- M4: Compaction/OPTIMIZE, tuning, docs, examples.

## Risks and Mitigations

- Delete predicate scalability for large PK sets:
  - Mitigate with chunked delete predicates or a temporary helper table + merge (if delta-rs supports it).
- Delta merge support maturity:
  - Start with delete+append; add a merge path when stable.
- Small files during low throughput:
  - Larger writer buffers; periodic compaction; configurable flush.
- Schema drift:
  - Additive only in v1; strict validation and logging on incompatible changes.

## Documentation

- Add a `docs/tutorials` entry for Delta destination setup and configuration.
- Example configs for S3/GCS/Azure using `object_store` env/creds.