Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
Today, enabling lakehouse for an existing table only works reliably if the table was created after the cluster had already enabled datalake support. Here, "enabled datalake support" means the cluster had already configured datalake.format; in the current behavior, setting datalake.format is treated as enabling datalake support. This causes a compatibility problem for the following user flow:
1. Create a Fluss table when the cluster has not explicitly enabled lakehouse.
2. Later configure the cluster to enable lakehouse.
3. Enable lakehouse for the existing table.
At the moment, step 3 fails for tables created before cluster-level lakehouse was enabled. The root issue is that datalake.format currently serves two roles at the same time:
- selecting the lake-format-specific bucketing / key-encoding behavior; and
- indicating that the cluster is ready to create and manage lake tables.
This makes the semantics unclear for new deployments that want to pre-bind the future lake format (for example Paimon, so that bucketing stays consistent) but do not want users to enable lakehouse for tables until the cluster is explicitly switched on.
We need a backward-compatible way to separate "legacy cluster behavior" from "new cluster behavior", while still allowing tables that were created before table.datalake.enabled=true was set to have lakehouse enabled later, provided their bucketing format was already fixed at creation time.
Solution
Introduce a new cluster config datalake.enabled with compatibility semantics:
- datalake.enabled is unset: treat the cluster as a legacy cluster and keep the current behavior unchanged.
- datalake.enabled=false: treat the cluster as a new-style cluster in "pre-bind only" mode.
- datalake.enabled=true: treat the cluster as a new-style cluster with lakehouse fully enabled.
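A minimal sketch of how the three states above could be resolved from the cluster configuration. The enum `ClusterLakeMode` and the method `fromClusterConfig` are hypothetical names used for illustration only, not existing Fluss types, and the plain `Map<String, String>` stands in for whatever configuration class the server actually uses. Rejecting values other than `true` / `false` is an assumption of this sketch, not part of the proposal.

```java
import java.util.Map;

/** Hypothetical tri-state view of the proposed cluster config (not an existing Fluss type). */
enum ClusterLakeMode {
    LEGACY,        // datalake.enabled is unset: keep current behavior
    PRE_BIND_ONLY, // datalake.enabled=false: format is bound, lakehouse not yet usable
    ENABLED;       // datalake.enabled=true: lakehouse fully enabled

    /** Resolves the mode from raw cluster properties (a Map is an assumption of this sketch). */
    static ClusterLakeMode fromClusterConfig(Map<String, String> clusterConf) {
        String raw = clusterConf.get("datalake.enabled");
        if (raw == null) {
            // Key absent: treat as a legacy cluster.
            return LEGACY;
        }
        String value = raw.trim();
        if (value.equalsIgnoreCase("true")) {
            return ENABLED;
        }
        if (value.equalsIgnoreCase("false")) {
            return PRE_BIND_ONLY;
        }
        // Assumption: anything else is a configuration error rather than "false".
        throw new IllegalArgumentException("Invalid value for 'datalake.enabled': " + raw);
    }
}
```

Distinguishing "unset" from "false" is the point of the sketch: a plain boolean with a default value would collapse the legacy and pre-bind-only cases into one.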
For clusters where datalake.enabled is explicitly configured (either true or false):
- require datalake.format to be configured;
- automatically persist table.datalake.format=<cluster datalake.format> into newly created tables;
- when datalake.enabled=false, do not allow creating/enabling lake tables yet;
- when datalake.enabled=true, allow ALTER TABLE ... SET ('table.datalake.enabled'='true') for tables whose table.datalake.format already matches the cluster datalake.format.
This keeps old clusters fully compatible while enabling the desired flow for new clusters:
1. Create cluster with datalake.enabled=false and datalake.format=paimon.
2. Create table; Fluss auto-persists table.datalake.format=paimon, so writes already follow Paimon bucketing.
3. Later switch cluster to datalake.enabled=true.
4. Enable lakehouse for the existing table successfully.
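The table-creation half of this flow could look roughly like the sketch below. `TableLakeDefaultsSketch` and `applyLakeDefaults` are hypothetical names, and the property maps are stand-ins for the real cluster and table descriptors; this is not the actual `CoordinatorService.applySystemDefaults(...)` logic. Two points are assumptions of the sketch: the legacy path is simply passed through rather than modeled, and a user-supplied `table.datalake.format` is overwritten by the cluster value, which the proposal does not specify.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of pre-binding the lake format at table creation. */
final class TableLakeDefaultsSketch {

    /** Returns the table properties to persist for a newly created table. */
    static Map<String, String> applyLakeDefaults(
            Map<String, String> clusterConf, Map<String, String> requestedProps) {
        Map<String, String> props = new HashMap<>(requestedProps);

        String clusterEnabled = clusterConf.get("datalake.enabled");
        if (clusterEnabled == null) {
            // Legacy cluster: current behavior is kept and not modeled in this sketch.
            return props;
        }

        // New-style cluster: datalake.format is mandatory.
        String clusterFormat = clusterConf.get("datalake.format");
        if (clusterFormat == null || clusterFormat.trim().isEmpty()) {
            throw new IllegalStateException(
                    "'datalake.format' must be configured when 'datalake.enabled' is set.");
        }

        // Pre-bind only mode: creating a lake-enabled table is not allowed yet.
        boolean tableWantsLake = Boolean.parseBoolean(props.get("table.datalake.enabled"));
        if (tableWantsLake && !Boolean.parseBoolean(clusterEnabled)) {
            throw new IllegalArgumentException(
                    "Cluster has 'datalake.enabled=false'; lake tables cannot be created yet.");
        }

        // Persist the cluster format so bucketing / key encoding is fixed from the first write.
        // Assumption: any user-supplied value is overwritten.
        props.put("table.datalake.format", clusterFormat);
        return props;
    }
}
```

With a cluster config of `datalake.enabled=false` and `datalake.format=paimon`, a table created without any lake options would come back with `table.datalake.format=paimon` persisted, which is what later makes step 4 safe.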
Suggested validation rules:
- If datalake.enabled is explicitly set but datalake.format is missing, fail fast.
- If a table has no persisted table.datalake.format, keep rejecting later lakehouse enablement to avoid bucket inconsistency.
- If a table's table.datalake.format differs from the cluster datalake.format, reject enablement.
- In new-style clusters, datalake.format should be treated as immutable (or at least strongly restricted) once tables have been created with the pre-bound format.
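The enablement checks among these rules could be sketched as follows for a new-style cluster. `LakeEnablementValidatorSketch` and `rejectionReason` are hypothetical names; the real alter-table validation would work on Fluss's own config and table descriptor types, and the legacy (unset) case and the format-immutability rule are deliberately left out. Comparing formats with an exact string match is an assumption of this sketch.

```java
import java.util.Optional;

/** Hypothetical sketch of validating ALTER TABLE ... SET ('table.datalake.enabled'='true'). */
final class LakeEnablementValidatorSketch {

    /**
     * Returns a rejection reason, or an empty Optional if enablement is allowed.
     *
     * @param clusterLakeEnabled explicit value of cluster datalake.enabled (new-style cluster)
     * @param clusterFormat cluster datalake.format, may be null if missing
     * @param tableFormat persisted table.datalake.format, may be null if never persisted
     */
    static Optional<String> rejectionReason(
            boolean clusterLakeEnabled, String clusterFormat, String tableFormat) {
        if (clusterFormat == null || clusterFormat.trim().isEmpty()) {
            return Optional.of("'datalake.format' is missing on a new-style cluster");
        }
        if (!clusterLakeEnabled) {
            return Optional.of("cluster is in pre-bind only mode ('datalake.enabled=false')");
        }
        if (tableFormat == null) {
            // No pre-bound format: existing data may not follow the lake bucketing.
            return Optional.of("table has no persisted 'table.datalake.format'");
        }
        if (!tableFormat.equals(clusterFormat)) {
            return Optional.of(
                    "table format '" + tableFormat
                            + "' differs from cluster format '" + clusterFormat + "'");
        }
        return Optional.empty();
    }
}
```

Returning a reason instead of throwing keeps the sketch easy to exercise; a real implementation would more likely raise whatever exception type the alter-table path already uses.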
Affected areas likely include:
- cluster config parsing / compatibility checks;
- CoordinatorService.applySystemDefaults(...);
- LakeCatalogDynamicLoader load conditions;
- alter-table validation for table.datalake.enabled.
Anything else?
This issue is mainly about compatibility and semantic clarity:
- old clusters should continue to behave exactly as they do today;
- new clusters should be able to pre-bind lake-format bucketing without exposing lakehouse functionality too early;
- users should be able to create a table first, enable cluster lakehouse later, and then enable lakehouse on that table successfully.
Willingness to contribute
- I'm willing to submit a PR!