Skip to content

Flaky TableChangeWatcherTest.testSchemaChanges due to CuratorCache initial sync race #2935

@platinumhamburg

Description

@platinumhamburg

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.9.0 (latest release)

Please describe the bug 🐞

Description

TableChangeWatcherTest.testSchemaChanges intermittently fails on CI with a missing CreateTableEvent for one of the 10 created tables.

Observed Failure

Expecting actual:
  [SchemaChangeEvent{table_0}, CreateTableEvent{table_0}, ...,
   SchemaChangeEvent{table_4}, SchemaChangeEvent{table_5}, CreateTableEvent{table_5}, ...]
but could not find:
  [CreateTableEvent{table_4}]

Key observations from CI log:

  • 20 SchemaChangeEvents received (expected 10) — table_4 has a duplicate SchemaChangeEvent
  • 19 CreateTableEvents received (expected 10) — table_4's CreateTableEvent is missing
  • The missing event is replaced by an extra SchemaChangeEvent, suggesting stale data interference

Root Cause Hypothesis

testTableChanges and testSchemaChanges both create tables named table_0 through table_9. In beforeEach, the database is dropped and recreated, and a new TableChangeWatcher (backed by CuratorCache) is started. However, CuratorCache.start() performs an asynchronous initial tree scan.

If the initial sync has not completed before new tables are created, the cache may hold stale oldData from the previous test's ZK nodes. When the new table_4 node is written, CuratorCache fires NODE_CHANGED with non-empty oldData (from the stale cache), causing the event to be routed to processTableRegistrationChange() instead of processCreateTable():

// TableChangeWatcher.java:134-140
if (oldData != null && oldData.getData() != null && oldData.getData().length > 0) {
    processTableRegistrationChange(tablePath, newData);  // ← wrong branch
} else {
    processCreateTable(tablePath, newData);  // ← correct branch
}

This explains why:

  • Only 1 table is affected (depends on timing of initial sync vs table creation)
  • SchemaChangeEvent is duplicated (initial sync picks up stale schema node)
  • CreateTableEvent is lost (routed to wrong handler)

Reproducibility

This race cannot be reliably reproduced locally (50/50 passes) because local ZooKeeper is fast enough that CuratorCache initial sync completes before any table creation begins. The race only manifests on slow CI hosts with shared resources.

Solution

  • Add awaitInitialized() to TableChangeWatcher using CuratorCacheListener.initialized() callback
  • Call awaitInitialized() in test beforeEach after start() to ensure cache is fully synced
  • Clear stale events after initial sync completes

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions