Allow enabling lakehouse on tables created before cluster-level lakehouse is enabled #2908

@luoyuxia

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Today, enabling lakehouse on an existing table works only if the table was created after the cluster enabled datalake support. Here, "enabled datalake support" means the cluster had datalake.format configured: in the current behavior, setting datalake.format is what enables datalake support. This causes a compatibility problem for the following user flow:

  1. Create a Fluss table when the cluster has not explicitly enabled lakehouse.
  2. Later configure the cluster to enable lakehouse.
  3. Enable lakehouse for the existing table.

At the moment, step 3 fails for tables created before cluster-level lakehouse was enabled. The root issue is that datalake.format currently serves two roles at the same time:

  • selecting the lake-format-specific bucketing / key-encoding behavior; and
  • indicating that the cluster is ready to create and manage lake tables.

This makes the semantics unclear for new deployments that want to pre-bind the future lake format (for example Paimon, so that bucketing stays consistent) but do not want users to enable lakehouse for tables until the cluster is explicitly switched on.

We need a backward-compatible way to separate "legacy cluster behavior" from "new cluster behavior", while still allowing tables created before cluster-level lakehouse was enabled to set table.datalake.enabled=true later, provided their bucketing format was already fixed at creation time.

Solution

Introduce a new cluster config datalake.enabled with compatibility semantics:

  • datalake.enabled is unset: treat the cluster as a legacy cluster and keep the current behavior unchanged.
  • datalake.enabled=false: treat the cluster as a new-style cluster in "pre-bind only" mode.
  • datalake.enabled=true: treat the cluster as a new-style cluster with lakehouse fully enabled.
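The three states above could be resolved along these lines. This is only a sketch: the class `ClusterLakeModeSketch`, the `Mode` enum, and the use of a plain `Map<String, String>` for the cluster config are hypothetical and not existing Fluss code.

```java
import java.util.Map;

/** Hypothetical sketch: maps the raw datalake.enabled setting to a cluster mode. */
final class ClusterLakeModeSketch {

    enum Mode {
        LEGACY,        // datalake.enabled unset: keep current behavior unchanged
        PRE_BIND_ONLY, // datalake.enabled=false: format is bound, lakehouse stays off
        ENABLED        // datalake.enabled=true: lakehouse fully enabled
    }

    /** The cluster config is modeled as a plain string map (an assumption). */
    static Mode resolve(Map<String, String> clusterConf) {
        String raw = clusterConf.get("datalake.enabled");
        if (raw == null) {
            // The key is absent, so the cluster is treated as a legacy cluster.
            return Mode.LEGACY;
        }
        // Boolean.parseBoolean returns true only for "true" (case-insensitive);
        // any other explicit value is treated as false in this sketch.
        return Boolean.parseBoolean(raw.trim()) ? Mode.ENABLED : Mode.PRE_BIND_ONLY;
    }
}
```

The key point is that "unset" and "false" are distinct states, so the check has to look at whether the key is present rather than read a boolean with a default value.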

For clusters where datalake.enabled is explicitly configured (either true or false):

  • require datalake.format to be configured;
  • automatically persist table.datalake.format=<cluster datalake.format> into newly created tables;
  • when datalake.enabled=false, do not allow creating/enabling lake tables yet;
  • when datalake.enabled=true, allow ALTER TABLE ... SET ('table.datalake.enabled'='true') for tables whose table.datalake.format already matches the cluster datalake.format.
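As a rough illustration of the create-table side of these rules, the sketch below pre-binds the format into the properties of a new table. `TableDefaultsSketch` and `applyLakeDefaults` are hypothetical names, configs are modeled as plain string maps, and the real logic would presumably live near CoordinatorService.applySystemDefaults(...). Overwriting a user-supplied table.datalake.format is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of pre-binding the lake format at table creation time. */
final class TableDefaultsSketch {

    static Map<String, String> applyLakeDefaults(
            Map<String, String> clusterConf, Map<String, String> tableProps) {
        Map<String, String> result = new HashMap<>(tableProps);
        if (!clusterConf.containsKey("datalake.enabled")) {
            // Legacy cluster: existing behavior applies and is not modeled here.
            return result;
        }
        String format = clusterConf.get("datalake.format");
        if (format == null || format.isEmpty()) {
            throw new IllegalStateException(
                    "datalake.enabled is set explicitly, so datalake.format must be configured");
        }
        // Persist the cluster format so writes follow its bucketing from day one.
        // Assumption: the cluster value wins over any user-supplied value.
        result.put("table.datalake.format", format);

        boolean clusterEnabled = Boolean.parseBoolean(clusterConf.get("datalake.enabled"));
        boolean tableWantsLake = Boolean.parseBoolean(result.get("table.datalake.enabled"));
        if (!clusterEnabled && tableWantsLake) {
            // Pre-bind only mode: lake tables cannot be created yet.
            throw new IllegalArgumentException(
                    "cluster has datalake.enabled=false; lake tables cannot be created yet");
        }
        return result;
    }
}
```

With datalake.enabled=false and datalake.format=paimon, a table created without any lake properties would come back with table.datalake.format=paimon and nothing else changed.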

This keeps old clusters fully compatible while enabling the desired flow for new clusters:

  1. Create cluster with datalake.enabled=false and datalake.format=paimon.
  2. Create table; Fluss auto-persists table.datalake.format=paimon, so writes already follow Paimon bucketing.
  3. Later switch cluster to datalake.enabled=true.
  4. Enable lakehouse for the existing table successfully.

Suggested validation rules:

  • If datalake.enabled is explicitly set but datalake.format is missing, fail fast.
  • If a table has no persisted table.datalake.format, keep rejecting later lakehouse enablement to avoid bucket inconsistency.
  • If a table's table.datalake.format differs from the cluster datalake.format, reject enablement.
  • In new-style clusters, datalake.format should be treated as immutable (or at least strongly restricted) once tables have been created with the pre-bound format.
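One possible shape for the alter-table check is sketched below. `EnableLakehouseCheckSketch` and `rejectionReason` are hypothetical names, configs are again modeled as plain string maps, and the comparison uses an exact string match on the format. The immutability rule for datalake.format is not modeled.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of validating SET 'table.datalake.enabled'='true'. */
final class EnableLakehouseCheckSketch {

    /** Returns a rejection reason, or empty if these rules allow enablement. */
    static Optional<String> rejectionReason(
            Map<String, String> clusterConf, Map<String, String> tableProps) {
        String enabled = clusterConf.get("datalake.enabled");
        if (enabled == null) {
            // Legacy cluster: existing checks apply and are not modeled here.
            return Optional.empty();
        }
        String clusterFormat = clusterConf.get("datalake.format");
        if (clusterFormat == null || clusterFormat.isEmpty()) {
            return Optional.of("datalake.enabled is set but datalake.format is missing");
        }
        if (!Boolean.parseBoolean(enabled)) {
            return Optional.of("cluster is in pre-bind only mode (datalake.enabled=false)");
        }
        String tableFormat = tableProps.get("table.datalake.format");
        if (tableFormat == null) {
            // No pre-bound format: existing data may not follow the lake bucketing.
            return Optional.of("table has no persisted table.datalake.format");
        }
        if (!tableFormat.equals(clusterFormat)) {
            return Optional.of("table format " + tableFormat
                    + " differs from cluster format " + clusterFormat);
        }
        return Optional.empty();
    }
}
```

Under these assumptions, a table carrying table.datalake.format=paimon is rejected while the cluster has datalake.enabled=false and accepted once the cluster switches to datalake.enabled=true with datalake.format=paimon, which matches the four-step flow described earlier.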

Affected areas likely include:

  • cluster config parsing / compatibility checks;
  • CoordinatorService.applySystemDefaults(...);
  • LakeCatalogDynamicLoader load conditions;
  • alter-table validation for table.datalake.enabled.

Anything else?

This issue is mainly about compatibility and semantic clarity:

  • old clusters should continue to behave exactly as they do today;
  • new clusters should be able to pre-bind lake-format bucketing without exposing lakehouse functionality too early;
  • users should be able to create a table first, enable cluster lakehouse later, and then enable lakehouse on that table successfully.

Willingness to contribute

  • I'm willing to submit a PR!
