Idea is to use short term mutexes to protect arrays and hashes, and long term read locks for fdb and table read access.

There is a live read/write lock for each fdb object. As long as there is a pointer to an fdb, a read lock is held on that fdb. Each remote table is protected by its own long term read lock (similar to a table lock).

There are two intervals when both table locks and the fdb live lock are held:
1) during query preparation, if remote table discovery is needed, when we retrieve and attach remote tables; these locks are released once setup is done; the sqlite_stat tables are also locked during this phase
2) during query execution; we get an fdb live lock and table locks for each remote table locked, and we release them when unlocking the remote table

Updating a table requires an exclusive lock on that table; to avoid blocking for a long time when a remote table is schema changed, we use an mvcc scheme that COWs new table objects and detects readers using trywrlock. The last reader of a stale table will free that object.

Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>
…ate retrievals Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>
/plugin-branch fdblocking3
roborivers
left a comment
Cbuild submission: Success ✓.
Regression testing: Success ✓.
The first 10 failing tests are:
2026-03-18T14:20:03EDT [131700] misstable_remsql_rte_connect_generated [failed with core dumped]
2026-03-18T14:20:03EDT [131700] misstable_remsql [failed with core dumped]
2026-03-18T14:20:03EDT [131700] sc_resume_logicalsc_generated **quarantined**
2026-03-18T14:20:03EDT [131700] sc_timepart **quarantined**
2026-03-18T14:20:03EDT [131700] queuedb_rollover_noroll1_generated **quarantined**
2026-03-18T14:20:03EDT [131700] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T14:20:03EDT [131700] reco-ddlk-sql [timeout] **quarantined**
2026-03-18T14:20:03EDT [131700] fdb_push_rte_connect_generated [timeout]
2026-03-18T14:20:03EDT [131700] fdb_push_redirect_generated [timeout]
2026-03-18T14:20:03EDT [131700] fdb_push [timeout]
roborivers
left a comment
Cbuild submission: Success ✓.
Regression testing: Success ✓.
The first 10 failing tests are:
sc_truncate_multiddl_generated [db unavailable at finish] **quarantined**
consumer_non_atomic_default_consumer_generated **quarantined**
Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>
roborivers
left a comment
Cbuild submission: Error ⚠.
Regression testing: Success ✓.
The first 10 failing tests are:
sc_truncate [db unavailable at finish]
consumer_non_atomic_default_consumer_generated **quarantined**
guid [timeout]
This is a follow-up for #5726, which was reverted in #5810.
The additional commits target two issues:
1) fix deadlock during remote table discovery due to the bdb lock and the new mutex
   the master swing thread needs the bdb write lock, and waits for bdb read locks (sql threads) to go away
   -> the sql thread doing table discovery using cdb2api runs recover deadlock to free its bdb read lock while it holds tables_mtx, and waits for the master swing thread to finish
   -> sql threads that are finishing free their table locks, which requires tables_mtx, so they block waiting for the sql thread doing discovery
   Fix is to not run cdb2api while tables_mtx is held; the trade-off is that concurrent threads will race to discover the same table, duplicating effort.
2) fix sql threads that never complete due to a new regression in remote table versioning; blocked threads spew "Remote table ... version 0, ..."
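The shape of the first fix (do the slow remote call with the mutex dropped, then re-check before publishing) can be sketched as below. This is an assumption-laden illustration, not the PR's code: `tbl_cache_t`, `get_table_version`, and `fake_discover` are hypothetical names, and `fake_discover` merely stands in for the cdb2api round trip. Only the name `tables_mtx` comes from the description above.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_TABLES 16
#define MAX_NAME 64

/* Hypothetical cache of discovered remote tables. */
typedef struct tbl_cache {
    pthread_mutex_t tables_mtx; /* short term mutex protecting the arrays */
    char names[MAX_TABLES][MAX_NAME];
    int versions[MAX_TABLES];
    int count;
} tbl_cache_t;

/* Stand-in signature for the slow remote discovery call. */
typedef int (*discover_fn)(const char *name, int *version);

/* Illustrative stub only; the counter is not thread safe. */
static int discover_calls = 0;
static int fake_discover(const char *name, int *version)
{
    (void)name;
    discover_calls++;
    *version = 7;
    return 0;
}

int cache_init(tbl_cache_t *c)
{
    c->count = 0;
    return pthread_mutex_init(&c->tables_mtx, NULL);
}

/* Caller holds tables_mtx. Returns the index of 'name' or -1. */
static int find_locked(const tbl_cache_t *c, const char *name)
{
    for (int i = 0; i < c->count; i++)
        if (strcmp(c->names[i], name) == 0) return i;
    return -1;
}

int get_table_version(tbl_cache_t *c, const char *name,
                      discover_fn discover, int *version)
{
    if (strlen(name) >= MAX_NAME) return -1;

    pthread_mutex_lock(&c->tables_mtx);
    int idx = find_locked(c, name);
    if (idx >= 0) {
        *version = c->versions[idx];
        pthread_mutex_unlock(&c->tables_mtx);
        return 0;
    }
    pthread_mutex_unlock(&c->tables_mtx);

    /* Slow call runs with tables_mtx released, so threads that only need
     * the mutex briefly (e.g. to free table locks) are not stuck behind it. */
    int discovered;
    if (discover(name, &discovered) != 0) return -1;

    pthread_mutex_lock(&c->tables_mtx);
    idx = find_locked(c, name);
    if (idx < 0) {
        /* We won the race: publish our result. */
        if (c->count == MAX_TABLES) {
            pthread_mutex_unlock(&c->tables_mtx);
            return -1;
        }
        idx = c->count++;
        snprintf(c->names[idx], MAX_NAME, "%s", name);
        c->versions[idx] = discovered;
    }
    /* Otherwise another thread published first; our work is discarded. */
    *version = c->versions[idx];
    pthread_mutex_unlock(&c->tables_mtx);
    return 0;
}
```

The discarded result in the losing thread is the duplicated effort mentioned as the trade-off: correctness is kept by the second lookup under the mutex, at the cost of possibly repeating the remote call.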