# **RFC-0001 for Presto**

## [Table Sampling in the Iceberg Connector]

Proposers

* @ZacBlanco

## [Related Issues]

- prestodb/presto#20937
- prestodb/presto#20993
- prestodb/presto#20776
- prestodb/presto#20720


## Summary

Collecting statistics and keeping them up to date is costly. This proposal introduces the concept
of "stored samples" in the Iceberg connector. Stored samples allow statistics to be derived for a
table and its partitions by storing a small sample of rows alongside the true table.

## Background

Query optimizers use table statistics to make cost-based decisions at optimization time about the
output size of particular operators such as joins and filters. Accurate statistics are imperative
for the optimizer to make the best query planning decisions. The downside is that for large tables,
such as those on the scale of tens to thousands of terabytes, scanning the entire table to compute
statistics is costly.

There are ways to reduce the cost of generating statistics. One of them is table samples. Table
samples allow a DBMS to calculate statistics without resorting to a full table scan every time the
statistics need to be updated. A sample is essentially a subset of rows from the original table.
Each row from the original table should have an equal likelihood of being chosen for the sample in
order to produce the most representative sample[^4]. Recently, we added support for fixed-size
Bernoulli sampling with #21296.

In this RFC I'm proposing that we add a full set of features to support statistics via sampling in
the Iceberg connector. The details are described in the following sections.

### [Optional] Goals

- Provide the optimizer with table statistics without having to resort to full table scans.
- Provide a way to update statistics in a cost-effective manner in the face of updates.

### [Optional] Non-goals


## Proposed Implementation

The first section describes the general user experience. The section after it describes some of the
implementation details.

### UX

- The overall usage would resemble the following steps (a consolidated example follows this list):
  - Introduce a boolean configuration property in the Iceberg connector, `iceberg.use_sample_tables` (to be used later).
  - Iceberg users run a procedure which generates (but does not populate) a sample table within the existing table's directory structure. The generated sample table is also an Iceberg table.
    - ```sql
      CALL {catalog}.system.create_sample_table('{table}')
      ```
    - If the Iceberg table's location is `/warehouse/database/table`, then a new Iceberg table is created at `/warehouse/database/table/samples`. Because Iceberg tables only use files that are specifically listed in a manifest, we can add to the directory structure around the table without changing query behavior for the main table.
  - Iceberg users populate sample tables with a specific query using the recently added `reservoir_sample` function:
    - ```sql
      INSERT INTO "{schema}.{table}$samples"
      SELECT reservoir_sample(...)
      FROM {schema}.{table};
      ```
    - Currently, `reservoir_sample` needs to process every single row to guarantee a [Bernoulli sample](https://en.wikipedia.org/wiki/Bernoulli_sampling). However, if a perfect sample is not a requirement, it may be desirable to generate the reservoir sample from an even smaller subset of data to avoid scanning a large amount of data, e.g.
    - ```sql
      INSERT INTO "{schema}.{table}$samples"
      SELECT reservoir_sample(...)
      FROM {schema}.{table} TABLESAMPLE SYSTEM (1);
      ```
    - This will produce a sample using very few resources at the cost of accuracy. It is likely only desirable for tables which are extremely large (e.g. 1000s of TB).
  - Users run the `ANALYZE` command on all desired tables.
    - The connector selects the sample table handle rather than the true table handle when executing the `ANALYZE` query if `iceberg.use_sample_tables` is set to true.
    - The resulting stats are then stored alongside the sample table, completely independent of the real table.
  - A new query using one or more of the Iceberg tables comes in:
    - ```sql
      SELECT ... FROM {table};
      ```
    - When the optimizer calls `getTableStatistics` on `{table}`, the connector checks `iceberg.use_sample_tables` and, if the sample table exists, attempts to read the statistics from the sample table rather than the real table.
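Putting the steps above together, an end-to-end session might look like the sketch below. The schema and table names (`tpch.orders`) and the catalog name (`iceberg`) are purely illustrative, the `reservoir_sample` arguments are elided just as in the examples above, the session is assumed to already be using the Iceberg catalog, and `iceberg.use_sample_tables` is assumed to be enabled in the catalog configuration.

```sql
-- 1. Create the (empty) sample table alongside the existing table.
CALL iceberg.system.create_sample_table('tpch.orders');

-- 2. Populate it with a reservoir sample of the real table
--    (reservoir_sample arguments elided, as in the steps above).
INSERT INTO "tpch.orders$samples"
SELECT reservoir_sample(...)
FROM tpch.orders;

-- 3. Collect statistics. With iceberg.use_sample_tables enabled, the connector
--    swaps in the sample table handle and analyzes the sample instead of the
--    full table.
ANALYZE tpch.orders;

-- 4. Statistics served to the optimizer now come from the sample.
SHOW STATS FOR tpch.orders;
```

From the user's perspective the only new steps are the `create_sample_table` call and the `INSERT`; `ANALYZE` and `SHOW STATS` keep their existing syntax.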
### Implementation Details

There are four places where we need to inject logic into the existing operation of the Iceberg
connector in order to support sample tables. All other changes should be logically separate.

1. Using the sample table handle when the user executes `ANALYZE` and the configuration is enabled.
2. Returning statistics from the sample when the optimizer calls `getTableStatistics`.
3. Generating splits from the sample table in the `IcebergSplitManager` when reading from the sample table.
4. Writing pages to the sample table when populating data within the `IcebergPageSinkProvider`.

Each of these should be straightforward, as it is just a matter of first checking the
`iceberg.use_sample_tables` feature flag and then checking whether the sample table exists. If
either check fails, the connector continues using its existing logic.

The following sections flesh out a few areas where additional implementation details may be
desirable.

#### Sample Table Handles

Within the Iceberg connector there is an enum, `IcebergTableType`. We would add a new entry to the
enum named `SAMPLES`, and a table handle could have this set as its table type. When generating
splits, inserting data, or retrieving statistics, the table type on the handle should be checked and
the logic adjusted accordingly to use the Iceberg table stored at the sample path rather than the
true table.

One benefit of creating the sample tables as Iceberg tables is that we can reuse the infrastructure
of the Iceberg connector to read them. All that is necessary is to construct the
`IcebergTableHandle` correctly and pass it to the necessary functions.

#### Time Travel and Schema Updates

The Iceberg connector supports time travel as well as updates to the table's columns and partition
spec. The sample table should undergo the same schema updates as the real table whenever they occur.
In the implementation, this means that when the connector calls `Metadata#renameColumn`,
`dropColumn`, etc., the operation should run on both the real and the sample table.

#### Correctly Generating Stats through ANALYZE

When using samples, all of the stats except for the number of distinct values generally map 1:1 onto
the stats for the full table. So, there needs to be some additional logic to estimate the number of
distinct values in the real dataset based on the sample[^2].

A recent PR, #20993, introduced the ability for connectors to specify the function used for a
specific column statistic aggregation. When analyzing a sample table, we can replace the
`approx_distinct` function with a separate function which uses one of the sample-based NDV
estimators. We'll then write the distinct-value estimate produced by that function into the Iceberg
table. Initially we can use the Chao estimator[^1], as it is easy to implement, but we can replace
it with different estimators as we see fit.
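For intuition, the Chao1 estimate is `d + f1^2 / (2 * f2)`, where `d` is the number of distinct values observed in the sample and `f1` and `f2` are the number of values that appear exactly once and exactly twice. The query below is only an illustration of that arithmetic against a hypothetical sample table and column; in the actual implementation this logic would live inside the connector-supplied aggregation function rather than in a user query.

```sql
-- Illustration only: Chao1 NDV estimate for one column of a hypothetical
-- sample table. d = distinct values in the sample, f1/f2 = values appearing
-- exactly once/twice; greatest(..., 1) is a crude guard against f2 = 0.
WITH value_counts AS (
    SELECT orderpriority, count(*) AS occurrences
    FROM "tpch.orders$samples"
    GROUP BY orderpriority
)
SELECT count(*)                                           -- d
       + pow(count_if(occurrences = 1), 2)                -- f1^2
         / (2.0 * greatest(count_if(occurrences = 2), 1)) -- 2 * f2
       AS estimated_ndv
FROM value_counts;
```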
#### Statistics for Partitioned Tables

Currently, Iceberg does not fully support partition-level statistics[^3]. Once partition statistics
are officially released, the Iceberg connector should be updated to support collecting and reading
partition-level stats. As long as the sample table creation code ensures the sample table's layout
is exactly the same as the true table's, we should not need to change the stats calculation code
path in the connector's implementation of `getTableStatistics`.

#### Sample Metadata

For external systems to determine when and how to update the sample, additional metadata may need to
be stored, such as the last snapshot ID of the table used to create or update the sample, or the
sampling method and/or source used.

As long as the stored metadata is small, say less than 1KiB, it is probably fine to store this
information in the Iceberg table's properties, which is a simple `map`.

#### Sample Maintenance

Sample maintenance can be performed through a set of SQL statements which apply to the sample the
updates, deletes, and inserts that occur on the real table. With the introduction of Iceberg's
changelog table in #20937, we can efficiently update the sample after the table has been updated
without needing to re-sample the entire table.

Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure
for sample maintenance is already available once the samples are generated. The biggest hurdle for
users will be that sample maintenance will need to be done manually. We can document the maintenance
process for samples, but there currently isn't a way to schedule or automatically run the correct
set of maintenance queries for the sample tables.

Further discussion on sample maintenance can be done in another issue.

### Prototype

A prototype containing most of the implementation above is at https://github.com/prestodb/presto/pull/21955.

## [Optional] Metrics

We should be able to see improvements in query plans and execution times without running full table
scans to generate the samples. Additionally, the cost of maintaining statistics should decrease. I
would recommend measuring the amount of data scanned across multiple `ANALYZE` queries and comparing
it to the amount of data scanned for sample creation and maintenance.

## [Optional] Other Approaches Considered

### Table-Level Storage in Puffin files

One alternative approach is to generate the samples and store them in Puffin files[^puffin_files].
Puffin files may store statistics or other arbitrary data at the table level. This approach would
allow us to store samples without having to generate a secondary table while still letting us
version the samples, because each snapshot can have its own association with a specific Puffin file.
I like this approach because it doesn't require us to "own" any kind of secondary table and would
let other systems read the sample generated by Presto. I do have a few large concerns, noted below.

The first downside to this approach is that you lose fine-grained control over the sample because
you can no longer treat it as a regular table. Having the sample be a separate Iceberg table lets
you control it at a fine granularity using SQL: you can update/delete/insert/select specific rows
from the sample or run queries against it. `ANALYZE` is basically just a SQL query with many
aggregate functions, and it would require work to figure out how to use a Puffin file as a data
source for the `ANALYZE` query.

Second, it isn't clear to me yet exactly how we would go from the sample to a set of statistics. To
generate a Puffin file blob with the sample, we would need to run the `reservoir_sample` aggregation
function. This would require updating the `TableStatisticType` enum to include a new `SAMPLE` entry;
the connector could then declare that it supports samples. After declaring support and running the
`ANALYZE` aggregation query, we could store the sample in the table. However, the part I don't have
a clear answer to is how you would then run the `ANALYZE` aggregation again on the sample to
generate all the necessary column statistics, and where exactly to store them.

My trouble with this is that I can't envision a clear path from a Puffin file blob to a set of
column statistics usable by the optimizer without re-implementing a lot of what `ANALYZE` already
does.

## Adoption Plan

Users would need to opt in to this feature by enabling the `iceberg.use_sample_tables` feature flag,
and samples would only be used if a sample table exists for the given table.

## Test Plan

How do we ensure the feature works as expected? Mention if any functional tests/integration tests
are needed. Special mention for product-test changes. If any PoC has been done already, please
mention the relevant test results here that you think will bolster your case of getting this RFC
approved.

### References

[^1]: [Chao Estimator](https://www.jstor.org/stable/2290471)
[^2]: [Sample-Based NDV Estimation](https://vldb.org/conf/1995/P311.PDF)
[^3]: Iceberg Partition Stats Issue Tracker - Issue 8450
[^4]: [Bernoulli Sampling](https://en.wikipedia.org/wiki/Bernoulli_sampling)
[^puffin_files]: [Puffin Spec](https://iceberg.apache.org/puffin-spec/)