
Table Sampling in the Iceberg Connector #21963

@ZacBlanco

Description


Background

Query optimizers use table statistics to make cost-based decisions at optimization time, such as estimating the output size of operators like joins and filters. Accurate statistics are imperative for the optimizer to choose good query plans. The downside is that for large tables, such as those on the scale of tens to thousands of terabytes, scanning the entire table to compute statistics is costly.

There are ways to reduce the cost of generating statistics. One such way is through table samples. Table samples are a useful way to allow a DBMS to calculate statistics without resorting to a full table scan every time the statistics need to be updated. A sample is essentially a subset of rows from the original table. For the sample to be representative, each row from the original table should have an equal likelihood of being chosen [1]. Recently, we've added support for fixed-size Bernoulli sampling with #21296.

In this issue I'm proposing we add a full set of features to support statistics via sampling in the Iceberg connector. The details are described in the following sections.

Table Sampling in Iceberg

The first section describes the general user experience. The following section describes some of the implementation details.

UX

  • The overall usage would resemble the following steps:
    • Introduce a boolean configuration property in the Iceberg connector, iceberg.use_sample_tables (to be used in later steps)
    • Iceberg users run a procedure which generates (but does not populate) a sample table inside the existing table's directory, mirroring the existing table's structure. The generated sample table is also an Iceberg table.
      • CALL {catalog}.system.create_sample_table('{table}')
      • If the Iceberg table's location is /warehouse/database/table, then a new Iceberg table is created at /warehouse/database/table/samples. Because Iceberg tables only use files that are specifically listed in a manifest, we can muck around with the directory structure around the table without incurring any change in query behavior for the main table.
    • Iceberg users populate sample tables with a specific query using the recently-added reservoir_sample function:
      •  INSERT INTO {schema}."{table}$samples"
         SELECT reservoir_sample(...)
         FROM {schema}.{table};
      • Currently, reservoir_sample needs to process every single row to guarantee a Bernoulli sample. However, if a perfect sample is not a requirement, it may be desirable to generate the reservoir sample from an even smaller subset of data to avoid scanning a large amount of data, e.g.
        • INSERT INTO {schema}."{table}$samples"
                SELECT reservoir_sample(...)
                FROM {schema}.{table} TABLESAMPLE SYSTEM (1)
      • This will have the effect of producing a sample with very few resources at the cost of some accuracy. It may only be desirable for tables which are extremely large (e.g. thousands of TB).
    • Users run the ANALYZE command on all desired tables.
      • The connector selects the sample table handle rather than the true table handle when executing the ANALYZE query if iceberg.use_sample_tables is set to true.
      • The resulting stats are then stored alongside the sample table, completely independently of the real table.
    • A new query using one or more of the iceberg tables comes in:
      • SELECT ... FROM {table};
      • When the optimizer calls getTableStatistics on {table}, the connector will check iceberg.use_sample_tables and, if the sample table exists, it will attempt to read the statistics from the sample table rather than the real table.

Implementation Details

There are four cases where we need to inject logic into the existing operation of the Iceberg connector in order to support sample tables. All other changes should be logically separate.

  1. Using the sample table handle when the user executes ANALYZE and the configuration is enabled.
  2. Returning statistics from the sample when the optimizer calls getTableStatistics.
  3. Generating splits from the sample table in the IcebergSplitManager when reading from the sample table.
  4. Writing pages to the sample table when populating data within the IcebergPageSinkProvider.

These should each be straightforward, as it is just a matter of first checking the iceberg.use_sample_tables feature flag and then checking whether the sample table exists. If either of those checks returns false, the connector continues using its existing logic. A rough sketch of this shared gate is shown below.
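The check could live in a single helper that all four call sites reuse. The sketch below is illustrative only: isUseSampleTables and sampleTableExists are hypothetical helper names (nothing with these names exists in the connector today), while ConnectorSession and IcebergTableHandle are the existing SPI and connector types.

    // Illustrative sketch only; isUseSampleTables and sampleTableExists are
    // hypothetical helpers, not existing connector APIs.
    private boolean shouldUseSampleTable(ConnectorSession session, IcebergTableHandle table)
    {
        // Gate 1: the iceberg.use_sample_tables flag must be enabled for this session.
        if (!isUseSampleTables(session)) {
            return false;
        }
        // Gate 2: a sample table must actually have been created for this table.
        return sampleTableExists(session, table);
    }

Each call site would then either swap in the sample table handle (ANALYZE, split generation, page sink) or read statistics from the sample, and otherwise fall through to the existing logic.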

The following sections flesh out a few more areas where additional details for the implementation may be desirable.

Sample Table Handles

Within the Iceberg connector, there is an enum named IcebergTableType. We would add a new entry to the enum named SAMPLES, and a table handle could have this set as its table type. When generating splits, inserting data, or retrieving statistics, the table type on the handle should be checked and the logic adjusted accordingly to use the Iceberg table stored at the sample path rather than the true table.

One benefit of creating the sample tables as Iceberg tables is that we can re-use the infrastructure of the Iceberg connector to read them. All that is necessary is to construct an IcebergTableHandle correctly and pass it to the relevant functions.
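For illustration, the enum change and a read path branching on it might look like the sketch below. The SAMPLES entry is the proposal; the other enum entries are elided, and isSampleTable, loadSampleTable, and loadDataTable are hypothetical helper names rather than existing methods.

    // Proposed addition to the connector's existing IcebergTableType enum.
    public enum IcebergTableType
    {
        DATA,
        // ... other existing entries elided ...
        SAMPLES, // new: the handle refers to the sample table co-located with the real table
    }

    // Hypothetical branching inside a read path (split generation, page sink, or
    // statistics retrieval); isSampleTable and the load* helpers are illustrative only.
    private org.apache.iceberg.Table resolveIcebergTable(ConnectorSession session, IcebergTableHandle handle)
    {
        if (isSampleTable(handle)) {
            // Open the Iceberg table stored at <tableLocation>/samples.
            return loadSampleTable(session, handle);
        }
        // Existing behavior: open the true table.
        return loadDataTable(session, handle);
    }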

Correctly Generating Stats through ANALYZE

When using samples, generally all of the stats except for the number of distinct values (NDV) map 1:1 onto the stats for the full table. So, there needs to be some additional logic to estimate the number of distinct values in the real dataset based on the sample [2].

A recent PR, #20993, introduced the ability for connectors to specify the function used for a specific column statistic aggregation. When analyzing a sample table, we can replace the approx_distinct function with a separate function which uses one of the sample-based NDV estimators. We'll then write the distinct-value count produced by that function into the Iceberg table. Initially we can use the Chao estimator [3] as it's easy to implement, but it can be replaced with different estimators as we see fit.
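For reference, the Chao (Chao1) estimate is d + f1^2 / (2 * f2), where d is the number of distinct values observed in the sample, f1 is the number of values seen exactly once, and f2 is the number seen exactly twice. Below is a minimal, self-contained sketch of that computation; how it would be packaged as an aggregation function via the mechanism from #20993 is intentionally left out.

    import java.util.Map;

    // Self-contained sketch of the Chao1 NDV estimate, computed from a map of
    // value -> number of occurrences observed in the sample.
    public final class ChaoEstimator
    {
        private ChaoEstimator() {}

        public static double estimateNdv(Map<?, Long> frequencies)
        {
            long observedDistinct = frequencies.size();
            long f1 = frequencies.values().stream().filter(c -> c == 1).count(); // seen exactly once
            long f2 = frequencies.values().stream().filter(c -> c == 2).count(); // seen exactly twice

            if (f2 == 0) {
                // Bias-corrected variant used when no value appears exactly twice.
                return observedDistinct + (f1 * (f1 - 1)) / 2.0;
            }
            // Chao1: observed distinct values plus f1^2 / (2 * f2).
            return observedDistinct + (f1 * f1) / (2.0 * f2);
        }
    }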

Statistics for Partitioned Tables

Currently, Iceberg does not fully support partition-level statistics [4]. Once partition statistics are officially released, the Iceberg connector should be updated to support collecting and reading the partition-level stats. As long as the sample table creation code ensures the sample table's layout is exactly the same as the true table's, we should not need to change the stats calculation code path in the connector's implementation of getTableStatistics.

Sample Maintenance

Sample maintenance can be performed through a set of SQL statements which apply the updates, deletes, and inserts that occur on the real table to the records in the sample. With the introduction of Iceberg's changelog table in #20937, we can efficiently update the sample after the table has been updated without needing to re-sample the entire table.

Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure for sample maintenance is already available once the samples are generated. The biggest hurdle for users is that sample maintenance will need to be done manually. We can document the maintenance process for samples, but there currently isn't a way to schedule or automatically run the correct set of maintenance queries for the sample tables.

Further discussion on sample maintenance can be done in another issue.

Prototype

A prototype containing most of the implementation above is at #21955.

References

  1. Bernoulli Sampling

  2. Sample-Based NDV Estimation

  3. Chao Estimator

  4. Iceberg Partition Stats Issue Tracker - Issue 8450
