Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.9.0 (latest release)
Please describe the bug 🐞
Description
TableChangeWatcherTest.testSchemaChanges intermittently fails on CI with a missing CreateTableEvent for one of the 10 created tables.
Observed Failure
```
Expecting actual:
  [SchemaChangeEvent{table_0}, CreateTableEvent{table_0}, ...,
   SchemaChangeEvent{table_4}, SchemaChangeEvent{table_5}, CreateTableEvent{table_5}, ...]
but could not find:
  [CreateTableEvent{table_4}]
```
Key observations from the CI log:
- 20 `SchemaChangeEvent`s received (expected 10): `table_4` has a duplicate `SchemaChangeEvent`
- 19 `CreateTableEvent`s received (expected 10): `table_4`'s `CreateTableEvent` is missing
- The missing event is replaced by an extra `SchemaChangeEvent`, suggesting stale data interference
Root Cause Hypothesis
`testTableChanges` and `testSchemaChanges` both create tables named `table_0` through `table_9`. In `beforeEach`, the database is dropped and recreated, and a new `TableChangeWatcher` (backed by `CuratorCache`) is started. However, `CuratorCache.start()` performs its initial tree scan asynchronously.
If the initial sync has not completed before new tables are created, the cache may still hold stale `oldData` from the previous test's ZK nodes. When the new `table_4` node is written, `CuratorCache` fires `NODE_CHANGED` with non-empty `oldData` (from the stale cache), causing the event to be routed to `processTableRegistrationChange()` instead of `processCreateTable()`:
```java
// TableChangeWatcher.java:134-140
if (oldData != null && oldData.getData() != null && oldData.getData().length > 0) {
    processTableRegistrationChange(tablePath, newData); // ← wrong branch
} else {
    processCreateTable(tablePath, newData); // ← correct branch
}
```
This explains why:
- Only one table is affected (it depends on the timing of the initial sync relative to table creation)
- `SchemaChangeEvent` is duplicated (the initial sync picks up the stale schema node)
- `CreateTableEvent` is lost (the event is routed to the wrong handler)
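The misrouting can be illustrated with a minimal stand-alone sketch of the branch above. This is a simulation, not the real Fluss classes: the event names mirror the issue, and the `onNodeChanged` signature is simplified to just the data payloads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the oldData-based routing in TableChangeWatcher.
// A non-empty oldData (stale cache) sends a freshly created table down the
// schema-change path instead of the create path.
public class RoutingRaceSketch {
    static final List<String> events = new ArrayList<>();

    // Mirrors the branch at TableChangeWatcher.java:134-140 (simplified).
    static void onNodeChanged(String tablePath, byte[] oldData, byte[] newData) {
        if (oldData != null && oldData.length > 0) {
            events.add("SchemaChangeEvent{" + tablePath + "}"); // wrong branch for a new table
        } else {
            events.add("CreateTableEvent{" + tablePath + "}");  // correct branch
        }
    }

    public static void main(String[] args) {
        // Cache fully synced: oldData is null, so the create path fires.
        onNodeChanged("table_3", null, "v1".getBytes());
        // Cache still holds stale data from the previous test: oldData is
        // non-empty, so the CreateTableEvent is silently replaced.
        onNodeChanged("table_4", "stale".getBytes(), "v1".getBytes());
        System.out.println(events);
        // → [CreateTableEvent{table_3}, SchemaChangeEvent{table_4}]
    }
}
```

The output matches the CI failure pattern: `table_4` gets an extra `SchemaChangeEvent` and no `CreateTableEvent`.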
Reproducibility
This race cannot be reliably reproduced locally (50 out of 50 local runs pass) because local ZooKeeper is fast enough that the CuratorCache initial sync completes before any table creation begins. The race only manifests on slow CI hosts with shared resources.
Solution
- Add `awaitInitialized()` to `TableChangeWatcher` using the `CuratorCacheListener.initialized()` callback
- Call `awaitInitialized()` in the test's `beforeEach` after `start()` to ensure the cache is fully synced
- Clear stale events after the initial sync completes
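The gating step can be sketched as follows. This is a self-contained illustration of the latch pattern, not the actual Fluss change: the Curator dependency is replaced by a plain `Runnable`, which in the real watcher would be wired up via `CuratorCacheListener.builder().forInitialized(...)`; the method names are assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the proposed awaitInitialized() gate: a CountDownLatch that the
// cache's initialized() callback releases, so callers can block until the
// asynchronous initial tree scan has completed.
public class AwaitInitializedSketch {
    private final CountDownLatch initialized = new CountDownLatch(1);

    // In the real watcher this Runnable would be registered with
    // CuratorCacheListener.builder().forInitialized(...).
    public Runnable initializedCallback() {
        return initialized::countDown;
    }

    // Blocks until the initial sync completes, or fails after the timeout.
    public void awaitInitialized(long timeout, TimeUnit unit) throws InterruptedException {
        if (!initialized.await(timeout, unit)) {
            throw new IllegalStateException("CuratorCache did not finish initial sync in time");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AwaitInitializedSketch watcher = new AwaitInitializedSketch();
        // Simulate the asynchronous initial scan completing on another thread.
        new Thread(watcher.initializedCallback()).start();
        watcher.awaitInitialized(5, TimeUnit.SECONDS);
        System.out.println("initialized");
    }
}
```

With this in place, `beforeEach` can call `awaitInitialized()` right after `start()`, guaranteeing no test creates tables against a half-synced cache.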
Are you willing to submit a PR?
- I'm willing to submit a PR!