new fdb locking locking by dorinhogea · Pull Request #5821 · bloomberg/comdb2

dorinhogea · 2026-03-18T16:04:26Z

This is a follow-up for the reverted #5726 (see #5810).

The additional commits target two issues:

1) Fix a deadlock during remote table discovery, caused by the interaction of the bdb lock and the new mutex:
-> the master swing thread needs the bdb write lock and waits for the bdb read locks to be released (i.e. for sql threads to go away)
-> a sql thread doing table discovery using cdb2api runs recover deadlock to free its bdb read lock while it holds tables_mtx, and waits for the master swing thread to finish
-> sql threads that are finishing free their table locks, which requires tables_mtx, so they block waiting for the sql thread doing discovery
The fix is to not run cdb2api while tables_mtx is acquired; the trade-off is that concurrent threads will race to discover the same table, duplicating effort.
2) Fix sql threads that never complete due to a new regression in remote table versioning; blocked threads spew "Remote table ... version 0, ..."
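The fix for the first issue (no cdb2api call while tables_mtx is held, and tolerating duplicate discoveries) can be illustrated with a minimal sketch. This is not the code from this PR: `fdb_cache_t`, `remote_table_t`, `find_locked`, `discover_remote` and `get_table` are hypothetical names, and `discover_remote` is a local stand-in for the blocking remote call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from the comdb2 source. */
#define MAX_TABLES 16

typedef struct remote_table {
    char name[64];
    int version;
} remote_table_t;

typedef struct fdb_cache {
    pthread_mutex_t tables_mtx;          /* short-term: guards tables[] only */
    remote_table_t *tables[MAX_TABLES];
    int ntables;
} fdb_cache_t;

/* Caller must hold tables_mtx. */
static remote_table_t *find_locked(fdb_cache_t *c, const char *name)
{
    for (int i = 0; i < c->ntables; i++)
        if (strcmp(c->tables[i]->name, name) == 0)
            return c->tables[i];
    return NULL;
}

/* Stand-in for the slow remote discovery call (assumption: it may block). */
static remote_table_t *discover_remote(const char *name)
{
    remote_table_t *t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;
    snprintf(t->name, sizeof(t->name), "%s", name);
    t->version = 1;
    return t;
}

remote_table_t *get_table(fdb_cache_t *c, const char *name)
{
    /* Fast path: look up under the short-term mutex. */
    pthread_mutex_lock(&c->tables_mtx);
    remote_table_t *t = find_locked(c, name);
    pthread_mutex_unlock(&c->tables_mtx);
    if (t)
        return t;

    /* Slow path: the mutex is NOT held while the remote call runs, so
     * threads releasing table locks are never blocked behind discovery. */
    remote_table_t *fresh = discover_remote(name);
    if (!fresh)
        return NULL;

    /* Re-check: another thread may have discovered the same table. */
    pthread_mutex_lock(&c->tables_mtx);
    assert(c->ntables <= MAX_TABLES);
    t = find_locked(c, name);
    if (!t && c->ntables < MAX_TABLES) {
        c->tables[c->ntables++] = fresh; /* we won the race: publish */
        t = fresh;
        fresh = NULL;
    }
    pthread_mutex_unlock(&c->tables_mtx);

    free(fresh); /* duplicate (or cache full): discard; free(NULL) is a no-op */
    return t;
}
```

Calling `get_table` twice for the same name returns the same cached object and leaves a single entry in the cache; the cost, as noted above, is that two racing threads may both pay for the remote call.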

The idea is to use short-term mutexes to protect arrays and hashes, and long-term read locks for fdb and table read access.

There is a live read/write lock for each fdb object. As long as there is a pointer to an fdb, a read lock is held on that fdb. Each remote table is protected by its own long-term read lock (similar to a table lock).

There are two intervals when both the table locks and the fdb live lock are held:
1) during query preparation, if remote table discovery is needed, when we retrieve and attach remote tables; these locks are released once setup is done; the sqlite_stat tables are also locked during this phase
2) during query execution; we get an fdb live lock and table locks for each remote table locked, and we release them when unlocking the remote table

Updating a table requires an exclusive lock on that table; to avoid blocking for a long time when a remote table is schema changed, we use an MVCC scheme that creates new table objects copy-on-write (COW) and detects readers using trywrlock. The last reader of a stale table frees that object.

Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>

dorinhogea · 2026-03-18T16:04:37Z

/plugin-branch fdblocking3

roborivers

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-18T14:20:03EDT [131700] misstable_remsql_rte_connect_generated [failed with core dumped]
2026-03-18T14:20:03EDT [131700] misstable_remsql [failed with core dumped]
2026-03-18T14:20:03EDT [131700] sc_resume_logicalsc_generated **quarantined**
2026-03-18T14:20:03EDT [131700] sc_timepart **quarantined**
2026-03-18T14:20:03EDT [131700] queuedb_rollover_noroll1_generated **quarantined**
2026-03-18T14:20:03EDT [131700] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T14:20:03EDT [131700] reco-ddlk-sql [timeout] **quarantined**
2026-03-18T14:20:03EDT [131700] fdb_push_rte_connect_generated [timeout]
2026-03-18T14:20:03EDT [131700] fdb_push_redirect_generated [timeout]
2026-03-18T14:20:03EDT [131700] fdb_push [timeout]

roborivers

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
sc_truncate_multiddl_generated [db unavailable at finish] **quarantined**
consumer_non_atomic_default_consumer_generated **quarantined**

Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>

roborivers

Cbuild submission: Error ⚠.
Regression testing: Success ✓.

The first 10 failing tests are:
sc_truncate [db unavailable at finish]
consumer_non_atomic_default_consumer_generated **quarantined**
guid [timeout]

dorinhogea added 2 commits March 12, 2026 16:17

do not run cdb2api with table_mtx acquired; accept and correct duplic…

da7c92c

…ate retrievals Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>

roborivers approved these changes Mar 18, 2026

dorinhogea force-pushed the fdblocking3 branch from 92c9964 to 6094335 on March 18, 2026 20:06

roborivers approved these changes Mar 18, 2026

fix remote table version 0 regression

a0dce40

Signed-off-by: Dorin Hogea <dhogea@bloomberg.net>

dorinhogea force-pushed the fdblocking3 branch from 6094335 to a0dce40 on March 19, 2026 14:13

roborivers suggested changes Mar 19, 2026

markhannum approved these changes Mar 20, 2026

dorinhogea wants to merge 3 commits into bloomberg:main from dorinhogea:fdblocking3

3 participants
