diff --git a/docs/docs/flink/procedures.md b/docs/docs/flink/procedures.md
index 1e34926ae385..766cb5d105a9 100644
--- a/docs/docs/flink/procedures.md
+++ b/docs/docs/flink/procedures.md
@@ -847,7 +847,7 @@ All available procedures are listed below.
 CALL [catalog.]sys.rescale(`table` => 'identifier', `bucket_num` => bucket_num, `partition` => 'partition', `scan_parallelism` => scan_parallelism, `sink_parallelism` => sink_parallelism)
-Rescale one partition of a table. Arguments:
+Rescale one partition of a table. For partitioned tables, different partitions can have different bucket counts after rescaling. Arguments:
  • table: The target table identifier. Cannot be empty.
  • bucket_num: Resulting bucket number after rescale. The default value of argument bucket_num is the current bucket number of the table. Cannot be empty for postpone bucket tables.
  • partition: What partition to rescale. For partitioned table this argument cannot be empty.
diff --git a/docs/docs/maintenance/rescale-bucket.md b/docs/docs/maintenance/rescale-bucket.md
index 1a304525345e..74889941f7ff 100644
--- a/docs/docs/maintenance/rescale-bucket.md
+++ b/docs/docs/maintenance/rescale-bucket.md
@@ -45,14 +45,17 @@ Please note that
 - `ALTER TABLE` only modifies the table's metadata and will **NOT** reorganize or reformat existing data.
   Reorganize existing data must be achieved by `INSERT OVERWRITE`.
 - Rescale bucket number does not influence the read and running write jobs.
-- Once the bucket number is changed, any newly scheduled `INSERT INTO` jobs which write to without-reorganized
-  existing table/partition will throw a `TableException` with message like
+- **Partitioned tables** support per-partition bucket counts. Each partition retains its own bucket
+  count from its data files, and the new bucket count only applies to newly created partitions or partitions that
+  have been reorganized with `INSERT OVERWRITE`.
+- **Unpartitioned tables** require a full rescale before writing. If you change the bucket number and attempt
+  to write without reorganizing the data first, a `RuntimeException` will be thrown:
   ```text
-  Try to write table/partition ... with a new bucket num ...,
+  Try to write table with a new bucket num ...,
   but the previous bucket num is ... Please switch to batch mode,
   and perform INSERT OVERWRITE to rescale current data layout first.
   ```
-- For partitioned table, it is possible to have different bucket number for different partitions. *E.g.*
+- For partitioned tables, it is possible to have different bucket numbers for different partitions. *E.g.*
   ```sql
   ALTER TABLE my_table SET ('bucket' = '4');
   INSERT OVERWRITE my_table PARTITION (dt = '2022-01-01')
@@ -62,6 +65,8 @@ Please note that
   INSERT OVERWRITE my_table PARTITION (dt = '2022-01-02')
   SELECT * FROM ...;
   ```
+  After these operations, partition `dt=2022-01-01` uses 4 buckets, `dt=2022-01-02` uses 8 buckets, and any
+  new partitions will use the latest table-level default (8 buckets in this case).
 - During overwrite period, make sure there are no other jobs writing the same table/partition.
 
 ## Use Case
@@ -121,8 +126,12 @@ and the job's latency keeps increasing. To improve the data freshness, users can
   -- scaling out
   ALTER TABLE verified_orders SET ('bucket' = '32');
   ```
-- Switch to the batch mode and overwrite the current partition(s) to which the streaming job is writing
+- Use the `rescale` procedure or switch to batch mode and overwrite the partition(s) that need rescaling
   ```sql
+  -- Option 1: Use the rescale procedure (recommended)
+  CALL sys.rescale(`table` => 'default.verified_orders', `bucket_num` => 32, `partition` => 'dt=2022-06-22');
+
+  -- Option 2: Manual batch overwrite
   SET 'execution.runtime-mode' = 'batch';
   -- suppose today is 2022-06-22
   -- case 1: there is no late event which updates the historical partitions, thus overwrite today's partition is enough
@@ -142,8 +151,11 @@ and the job's latency keeps increasing. To improve the data freshness, users can
   FROM verified_orders
   WHERE dt IN ('2022-06-20', '2022-06-21', '2022-06-22');
   ```
-- After overwrite job has finished, switch back to streaming mode. And now, the parallelism can be increased alongside with bucket number to restore the streaming job from the savepoint
-( see [Start a SQL Job from a savepoint](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sqlclient/#start-a-sql-job-from-a-savepoint) )
+- After the overwrite job has finished, switch back to streaming mode. The parallelism can be increased alongside
+  the bucket number to restore the streaming job from the savepoint
+  ( see [Start a SQL Job from a savepoint](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sqlclient/#start-a-sql-job-from-a-savepoint) ).
+  Note that for partitioned tables, each partition retains its own bucket count, so only the rescaled partitions
+  are affected.
   ```sql
   SET 'execution.runtime-mode' = 'streaming';
   SET 'execution.savepoint.path' = ;
diff --git a/docs/docs/primary-key-table/data-distribution.md b/docs/docs/primary-key-table/data-distribution.md
index 8b74724ff0e4..30ef3a1ddef9 100644
--- a/docs/docs/primary-key-table/data-distribution.md
+++ b/docs/docs/primary-key-table/data-distribution.md
@@ -1,8 +1,10 @@
 ---
 title: "Data Distribution"
-sidebar_position: 2
+weight: 2
+type: docs
+aliases:
+- /primary-key-table/data-distribution.html
 ---
-