Support time travel queries #179

Description

@sfc-gh-okalaci

from @sfc-gh-mslot

Since all data and metadata is immutable, Iceberg snapshots remain queryable until explicitly deleted. In principle, this allows querying the table in a past state (time travel).

We could enable time travel through a dummy filter like `WHERE as_of('2024-01-01 00:00') AND ...`, which would trigger planner/executor hooks to use an older snapshot.

Time travel also is a form of restoring from backup.

from @sfc-gh-abozkurt

Should we support branching and tagging as well? There are two ways to time-travel query an Iceberg table: `AS OF <branch | tag | snapshot-id>` or `AS OF <timestamp>`.

from @sfc-gh-mslot

Since we already support schema changes, we should probably consider that past snapshots/branches will have different schemas and cannot be queried via a filter.

It seems preferable to introduce something like:
```sql
create table branch ()
using iceberg
with (branch_from = 'iceberg_table');

create table past_branch ()
using iceberg
with (branch_from = 'iceberg_table', as_of = '2024-11-01 03:00:00');
```

which are wired to the same metadata JSON, probably via a new catalog table that maps between PostgreSQL tables and branch names.


from @sfc-gh-okalaci 

> which are wired to the same metadata JSON, probably via a new catalog table that maps between PostgreSQL tables and branch names.

I think this makes sense (and is very smart 💯). It probably saves us from a lot of trouble. For example, we could keep the relationId -> currentSnapshot mapping in the code (e.g., GetCurrentSnapshot), so the majority of the code flow would not change.

Still, GetCurrentSnapshot should be updated so that, for tables originating from a branch, we find the snapshot associated with the branch in `refs`, such as:

```json
  "refs": {
    "test_branch": {
      "snapshot-id": 3638654959223475932,
      "type": "branch"
    },
    "main": {
      "snapshot-id": 2774914780655428277,
      "type": "branch"
    }
  }
```

And where we currently use `current-schema-id` from metadata.json, we should now find the `schema-id` of the given snapshot instead:

```json
"snapshots": [
    {
      "sequence-number": 1,
      "snapshot-id": 3638654959223475932,
      "timestamp-ms": 1732190894178,
      "summary": {....
      "manifest-list": "s3://testbucketcdw/spark_test/public/sample2/metadata/snap-3638654959223475932-1-91fa9309-7fa0-4a1b-aa9f-35942d756766.avro",
      "schema-id": 0
    },
```
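As a rough sketch of that lookup (Python over the parsed metadata.json dict; the function names and the sample schema-id for `main` are illustrative, not the extension's actual API):

```python
# Sketch: resolve the current snapshot and its schema for a table that
# originates from a branch, using the parsed metadata.json contents.
# Function names here are hypothetical illustrations.

def resolve_branch_snapshot(metadata, branch_name):
    """Find the snapshot-id that the given branch ref points to."""
    ref = metadata["refs"][branch_name]
    assert ref["type"] == "branch", f"{branch_name} is not a branch"
    return ref["snapshot-id"]

def schema_id_for_snapshot(metadata, snapshot_id):
    """Look up the schema-id recorded on the snapshot itself,
    instead of relying on the table-level current-schema-id."""
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == snapshot_id:
            return snapshot["schema-id"]
    raise KeyError(f"snapshot {snapshot_id} not found")

# Sample data mirroring the excerpts above (main's schema-id is made up).
metadata = {
    "refs": {
        "test_branch": {"snapshot-id": 3638654959223475932, "type": "branch"},
        "main": {"snapshot-id": 2774914780655428277, "type": "branch"},
    },
    "snapshots": [
        {"snapshot-id": 3638654959223475932, "schema-id": 0},
        {"snapshot-id": 2774914780655428277, "schema-id": 1},
    ],
}

snap = resolve_branch_snapshot(metadata, "test_branch")
print(snap, schema_id_for_snapshot(metadata, snap))
```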

Some things to be careful about:

  • Snapshot expiration / deletion queue: We currently add files to the deletion queue when a file is referenced by expired snapshots but not by any non-expired snapshot. Now, we should probably expand the definition of expired/non-expired snapshots and follow this algorithm: https://iceberg.apache.org/spec/#snapshot-retention-policy

  • Locking: Now that multiple tables modify the same metadata.json, we should be careful with locking. The `SELECT ... FOR UPDATE` we already take while accessing iceberg_tables might be good enough, but let's still be careful about concurrent DML/DDL/SELECTs, etc.

  • `iceberg_files()` UDF?
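The retention idea from the linked spec section could be sketched along these lines (a deliberately simplified illustration: real retention walks `parent-snapshot-id` ancestry and reads per-ref `max-snapshot-age-ms` / `min-snapshots-to-keep` settings; here a flat timestamp-ordered history and explicit parameters stand in for that):

```python
# Simplified sketch of snapshot retention: a snapshot survives
# expiration if a ref points at it, or if it sits in a branch's
# recent history (newer than max_age_ms, or among the min_to_keep
# most recent snapshots at or before the branch head).
def retained_snapshots(snapshots, refs, now_ms, max_age_ms, min_to_keep):
    by_id = {s["snapshot-id"]: s for s in snapshots}
    keep = set()
    for ref in refs.values():
        keep.add(ref["snapshot-id"])          # refs always pin their target
        if ref["type"] != "branch":
            continue                          # tags keep only their snapshot
        head_ts = by_id[ref["snapshot-id"]]["timestamp-ms"]
        # stand-in for ancestry: snapshots at or before the head, newest first
        history = sorted(
            (s for s in snapshots if s["timestamp-ms"] <= head_ts),
            key=lambda s: s["timestamp-ms"], reverse=True)
        for i, s in enumerate(history):
            if i < min_to_keep or now_ms - s["timestamp-ms"] <= max_age_ms:
                keep.add(s["snapshot-id"])
    return keep

# Illustrative sample data: three snapshots, one branch at the newest.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 2000},
    {"snapshot-id": 3, "timestamp-ms": 3000},
]
refs = {"main": {"snapshot-id": 3, "type": "branch"}}
print(sorted(retained_snapshots(snapshots, refs, 10000, 5000, 1)))
```

Files referenced only by snapshots outside the returned set would then become candidates for the deletion queue.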
