
Table Sampling in the Iceberg Connector #21963

@ZacBlanco

Description


Background

Query optimizers use table statistics to make cost-based decisions at optimization time, such as estimating the output size of operators like joins and filters. Accurate statistics are imperative for the optimizer to choose good query plans. The downside is that for large tables, such as those on the scale of tens to thousands of terabytes, scanning the entire table to compute statistics is costly.

There are ways to reduce the cost of generating statistics. One such way is through table samples. Table samples are a useful way to allow a DBMS to calculate statistics without resorting to a full table scan every time the statistics need to be updated. A sample is essentially a subset of rows from the original table. For the sample to be representative, each row from the original table should have an equal likelihood of being chosen [1]. Recently, we've added support for fixed-size Bernoulli sampling with #21296.

In this issue I'm proposing we add a full set of features to support statistics via sampling in the Iceberg connector. The details are described in the following sections.

Table Sampling in Iceberg

The first section describes the general user experience. The following section describes some of the implementation details.

UX

  • The overall usage would resemble the following steps:
    • Introduce a boolean configuration property in the Iceberg connector, iceberg.use_sample_tables (to be used in later steps)
    • Iceberg users run a procedure which generates (but does not populate) a sample table inside the existing table's directory, mirroring the existing table's structure. The generated sample table is also an Iceberg table.
      • CALL {catalog}.system.create_sample_table('{table}')
      • If the Iceberg table's location is /warehouse/database/table, then a new Iceberg table is created at /warehouse/database/table/samples. Because Iceberg tables only use files that are specifically listed in a manifest, we can muck around with the directory structure around the table without incurring any change in query behavior for the main table.
    • Iceberg users populate sample tables with a specific query using the recently-added reservoir_sample function:
      •  INSERT INTO {schema}."{table}$samples"
         SELECT reservoir_sample(...)
         FROM {schema}.{table};
      • Currently, reservoir_sample needs to process every single row to guarantee a Bernoulli sample. However, if a perfect sample is not a requirement, it may be desirable to generate the reservoir sample from an even smaller subset of data to avoid scanning a large amount of data, e.g.
        • INSERT INTO {schema}."{table}$samples"
                SELECT reservoir_sample(...)
                FROM {schema}.{table} TABLESAMPLE SYSTEM (1)
      • This will have the effect of producing a sample with very few resources at the cost of some accuracy. It may only be desirable for tables which are extremely large (e.g. thousands of TB).
    • Users run the ANALYZE command on all desired tables.
      • The connector selects the sample table handle rather than the true table handle when executing the ANALYZE query if iceberg.use_sample_tables is set to true.
      • The resulting stats are then stored alongside the sample table, completely independently of the real table.
    • A new query using one or more of the iceberg tables comes in:
      • SELECT ... FROM {table};
      • When the optimizer calls getTableStatistics on {table}, the connector will check iceberg.use_sample_tables and, if the sample table exists, it will attempt to read the statistics from the sample table rather than the real table.

Implementation Details

There are four cases where we need to inject logic into the existing operation of the Iceberg connector in order to support sample tables. All other changes should be logically separate.

  1. Using the sample table handle when the user executes ANALYZE and the configuration is enabled.
  2. Returning statistics from the sample when the optimizer calls getTableStatistics.
  3. Generating splits from the sample table in the IcebergSplitManager when reading from the sample table.
  4. Writing pages to the sample table when populating data within the IcebergPageSinkProvider.

These should each be straightforward, as it is just a matter of first checking the iceberg.use_sample_tables feature flag and then checking whether the sample table exists. If either of those checks returns false, the connector continues using its existing logic. A rough sketch of this shared gate is shown below.
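The check could live in a single helper that all four call sites reuse. The sketch below is illustrative only: isUseSampleTables and sampleTableExists are hypothetical helper names (nothing with these names exists in the connector today), while ConnectorSession and IcebergTableHandle are the existing SPI and connector types.

    // Illustrative sketch only; isUseSampleTables and sampleTableExists are
    // hypothetical helpers, not existing connector APIs.
    private boolean shouldUseSampleTable(ConnectorSession session, IcebergTableHandle table)
    {
        // Gate 1: the iceberg.use_sample_tables flag must be enabled for this session.
        if (!isUseSampleTables(session)) {
            return false;
        }
        // Gate 2: a sample table must actually have been created for this table.
        return sampleTableExists(session, table);
    }

Each call site would then either swap in the sample table handle (ANALYZE, split generation, page sink) or read statistics from the sample, and otherwise fall through to the existing logic.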

The following sections flesh out a few more areas where additional details for the implementation may be desirable.

Sample Table Handles

Within the Iceberg connector, there is an enum named IcebergTableType. We would add a new entry to the enum named SAMPLES, and a table handle could have this set as its table type. When generating splits, inserting data, or retrieving statistics, the table type on the handle should be checked and the logic adjusted accordingly to use the Iceberg table stored at the sample path rather than the true table.

One benefit of creating the sample tables as Iceberg tables is that we can re-use the infrastructure of the Iceberg connector to read them. All that is necessary is to construct an IcebergTableHandle correctly and pass it to the relevant functions.
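For illustration, the enum change and a read path branching on it might look like the sketch below. The SAMPLES entry is the proposal; the other enum entries are elided, and isSampleTable, loadSampleTable, and loadDataTable are hypothetical helper names rather than existing methods.

    // Proposed addition to the connector's existing IcebergTableType enum.
    public enum IcebergTableType
    {
        DATA,
        // ... other existing entries elided ...
        SAMPLES, // new: the handle refers to the sample table co-located with the real table
    }

    // Hypothetical branching inside a read path (split generation, page sink, or
    // statistics retrieval); isSampleTable and the load* helpers are illustrative only.
    private org.apache.iceberg.Table resolveIcebergTable(ConnectorSession session, IcebergTableHandle handle)
    {
        if (isSampleTable(handle)) {
            // Open the Iceberg table stored at <tableLocation>/samples.
            return loadSampleTable(session, handle);
        }
        // Existing behavior: open the true table.
        return loadDataTable(session, handle);
    }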

Correctly Generating Stats through ANALYZE

When using samples, generally all of the stats except for the number of distinct values (NDV) map 1:1 onto the stats for the full table. So, there needs to be some additional logic to estimate the number of distinct values in the real dataset based on the sample [2].

A recent PR, #20993, introduced the ability for connectors to specify the function used for a specific column statistic aggregation. When analyzing a sample table, we can replace the approx_distinct function with a separate function which uses one of the sample-based NDV estimators. We'll then write the distinct-value count produced by that function into the Iceberg table. Initially we can use the Chao estimator [3] as it's easy to implement, but it can be replaced with different estimators as we see fit.
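For reference, the Chao (Chao1) estimate is d + f1^2 / (2 * f2), where d is the number of distinct values observed in the sample, f1 is the number of values seen exactly once, and f2 is the number seen exactly twice. Below is a minimal, self-contained sketch of that computation; how it would be packaged as an aggregation function via the mechanism from #20993 is intentionally left out.

    import java.util.Map;

    // Self-contained sketch of the Chao1 NDV estimate, computed from a map of
    // value -> number of occurrences observed in the sample.
    public final class ChaoEstimator
    {
        private ChaoEstimator() {}

        public static double estimateNdv(Map<?, Long> frequencies)
        {
            long observedDistinct = frequencies.size();
            long f1 = frequencies.values().stream().filter(c -> c == 1).count(); // seen exactly once
            long f2 = frequencies.values().stream().filter(c -> c == 2).count(); // seen exactly twice

            if (f2 == 0) {
                // Bias-corrected variant used when no value appears exactly twice.
                return observedDistinct + (f1 * (f1 - 1)) / 2.0;
            }
            // Chao1: observed distinct values plus f1^2 / (2 * f2).
            return observedDistinct + (f1 * f1) / (2.0 * f2);
        }
    }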

Statistics for Partitioned Tables

Currently, Iceberg does not fully support partition-level statistics [4]. Once partition statistics are officially released, the Iceberg connector should be updated to support collecting and reading the partition-level stats. As long as the sample table creation code ensures the sample table's layout is exactly the same as the true table's, we should not need to change the stats calculation code path in the connector's implementation of getTableStatistics.

Sample Maintenance

Sample maintenance can be performed through a set of SQL statements which apply the updates, deletes, and inserts that occur on the real table to the records in the sample. With the introduction of Iceberg's changelog table in #20937, we can efficiently update the sample after the table has been updated without needing to re-sample the entire table.

Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure for sample maintenance is already available once the samples are generated. The biggest hurdle for users is that sample maintenance will need to be done manually. We can document the maintenance process for samples, but there currently isn't a way to schedule or automatically run the correct set of maintenance queries for the sample tables.

Further discussion on sample maintenance can be done in another issue.

Prototype

A prototype containing most of the implementation above is at #21955.

References

  1. Bernoulli Sampling

  2. Sample-Based NDV Estimation

  3. Chao Estimator

  4. Iceberg Partition Stats Issue Tracker - Issue 8450
