Skip to content

PoC: Race condition prevention in worker#352

Merged
Rcc999 merged 15 commits intomasterfrom
rayane/fix/sui-worker-race-condition
Feb 19, 2026
Merged

PoC: Race condition prevention in worker#352
Rcc999 merged 15 commits intomasterfrom
rayane/fix/sui-worker-race-condition

Conversation

@Rcc999
Copy link
Contributor

@Rcc999 Rcc999 commented Feb 9, 2026

Closes: #309

Summary by Sourcery

Introduce a distributed locking mechanism to prevent concurrent executions of the Sui indexer scheduled job.

New Features:

  • Add D1Storage support for acquiring and releasing named cron locks backed by the database.
  • Introduce a cron_locks table in the initial schema to persist distributed locks for scheduled tasks.

Enhancements:

  • Guard the scheduled Sui indexer handler with a database-backed lock to skip overlapping runs and always release the lock afterward.

Tests:

  • Add unit tests covering lock acquisition, expiration handling, release, re-acquisition, and multiple independent locks.

Summary by Sourcery

Introduce a database-backed distributed lock to prevent overlapping executions of the Sui indexer scheduled job.

New Features:

  • Add D1Storage support for acquiring and releasing named cron locks persisted in the database.
  • Add a cron_locks table to the initial schema to store lock state for scheduled tasks.

Enhancements:

  • Guard the scheduled Sui indexer and redeem solver cron handler with a distributed lock to skip concurrent runs while ensuring lock release on completion or failure.

Build:

  • Extend initial database migration and test DB initialisation to include the cron_locks table.

Documentation:

  • Tidy AGENTS.md formatting and table layout for package documentation.

Tests:

  • Add unit tests covering lock acquisition, contention, expiration, release, and re-acquisition semantics for distributed locks.

… it runs one at a time across all worker instances

Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Feb 9, 2026

Reviewer's Guide

Adds a database-backed distributed locking mechanism for the Sui indexer cron job and wires it into the scheduled handler, including schema changes, storage APIs, and tests to prevent overlapping runs.

Sequence diagram for scheduled Sui indexer cron with distributed lock

sequenceDiagram
    actor Scheduler
    participant Worker
    participant D1Storage
    participant D1Database as D1_DB
    participant SuiIndexer
    participant RedeemSolver
    participant Logger

    Scheduler->>Worker: scheduled(event, env, ctx)
    Worker->>D1Storage: new D1Storage(env.DB)
    Worker->>D1Storage: acquireLock(sui-indexer-cron, ttlMs=5min)
    D1Storage->>D1Database: INSERT INTO cron_locks ... ON CONFLICT ...
    D1Database-->>D1Storage: acquired_at or null
    D1Storage-->>Worker: lockToken

    alt lock not acquired
        Worker->>Logger: warn(Cron job already running, skipping)
        Worker-->>Scheduler: return
    else lock acquired
        Worker->>D1Storage: getActiveNetworks()
        Worker->>Worker: getSecret(NBTC_MINTING_SIGNER_MNEMONIC)
        Worker->>Worker: createSuiClients(activeNetworks, mnemonic)
        par parallel tasks
            Worker->>SuiIndexer: runSuiIndexer(storage, activeNetworks, suiClients)
            Worker->>RedeemSolver: runRedeemSolver(storage, env, suiClients, activeNetworks)
        end
        Worker->>Worker: reportErrors(results, scheduled, Scheduled task error, [SuiIndexer, RedeemSolver])
        Worker->>D1Storage: releaseLock(sui-indexer-cron, lockToken)
        D1Storage->>D1Database: DELETE FROM cron_locks WHERE lock_name = ... AND acquired_at = ...
        D1Database-->>D1Storage: ok
        D1Storage-->>Worker: void
        Worker-->>Scheduler: return
    end
Loading

ER diagram for new cron_locks table for distributed locks

erDiagram
    cron_locks {
        TEXT lock_name PK
        INTEGER acquired_at
        INTEGER expires_at
    }
Loading

Class diagram for D1Storage with distributed lock methods

classDiagram
    class D1Storage {
        - any db
        + acquireLock(lockName string, ttlMs number) Promise~number|null~
        + releaseLock(lockName string, acquiredAt number) Promise~void~
    }
Loading

File-Level Changes

Change Details Files
Introduce D1-backed distributed lock APIs on D1Storage for named cron locks.
  • Add acquireLock method that uses an INSERT ... ON CONFLICT ... DO UPDATE statement with an expires_at check to implement a TTL-based lock with a numeric token (acquired_at).
  • Add releaseLock method that deletes the lock row only when both lock_name and acquired_at match, preventing one instance from releasing another instance’s lock.
  • Log lock acquisition failures via logError while returning null so callers can treat failures as no-lock-acquired.
packages/sui-indexer/src/storage.ts
Guard the Sui indexer scheduled handler with the new distributed lock to skip overlapping executions and always release the lock.
  • Acquire a named lock ("sui-indexer-cron") with a 5-minute TTL at the start of the scheduled handler and short-circuit if no lock is obtained, logging a warning.
  • Wrap the existing scheduled logic (fetch active networks, create Sui clients, run indexer and redeem solver in parallel, report errors) in a try/finally that releases the lock using the returned token.
  • Ensure that lock release is attempted even when tasks fail, preventing stale locks within the TTL window.
packages/sui-indexer/src/index.ts
Add schema and test harness support for cron_locks table to back the distributed lock.
  • Extend the initial D1 schema with a cron_locks table keyed by lock_name and storing acquired_at and expires_at timestamps.
  • Include cron_locks in the shared test DB initialization tables list so tests run against the new table.
  • Reformat the AGENTS.md package table markdown without changing semantics.
packages/btcindexer/db/migrations/0001_initial_schema.sql
packages/lib/src/test-helpers/init_db.ts
AGENTS.md
Add unit tests validating distributed lock semantics on D1Storage.
  • Test acquiring a lock when none exists writes the row to cron_locks.
  • Test that a non-expired lock prevents acquiring another lock with the same name.
  • Test that an expired lock can be replaced and that the new expires_at is in the future.
  • Test that releaseLock with the correct token removes the row and allows subsequent re-acquisition.
  • Test that an old token cannot release a renewed lock acquired after expiry, ensuring instance isolation.
packages/sui-indexer/src/storage.test.ts

Assessment against linked issues

Issue Objective Addressed Explanation
#309 Introduce a mechanism in the Sui worker to prevent concurrent executions that can cause race conditions when selecting or spending SUI (e.g., via a distributed lock).
#309 Persist and manage the concurrency control mechanism in the database schema and integrate it into the scheduled Sui indexer workflow.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a DB-backed distributed lock to prevent overlapping scheduled executions of the Sui indexer worker (and its paired redeem-solver task) by persisting named cron locks in D1.

Changes:

  • Add D1Storage.acquireLock / releaseLock methods backed by a new cron_locks table.
  • Guard scheduled() execution in packages/sui-indexer behind the DB-backed lock and ensure unlock in finally.
  • Add unit tests for lock acquisition/expiry/release behaviors and update test DB initialization to include the new table.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
packages/sui-indexer/src/storage.ts Adds D1-backed lock acquire/release helpers.
packages/sui-indexer/src/storage.test.ts Adds unit tests for the new locking behavior.
packages/sui-indexer/src/index.ts Wraps scheduled cron work in the distributed lock.
packages/lib/src/test-helpers/init_db.ts Ensures test DB teardown includes cron_locks.
packages/btcindexer/db/migrations/0001_initial_schema.sql Adds cron_locks table to the schema.

Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copy link
Contributor

@robert-zaremba robert-zaremba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is solving the issue

Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
@Rcc999 Rcc999 marked this pull request as ready for review February 17, 2026 14:52
@Rcc999 Rcc999 requested a review from a team as a code owner February 17, 2026 14:52
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • In acquireLock, any database error is treated the same as normal lock contention (returns null); consider distinguishing these cases (e.g., by rethrowing or returning a separate error state) so scheduler behavior and alerting can differentiate real failures from expected lock contention.
  • The "should not release another instance's lock after expiry" test relies on real timing with setTimeout(20), which can be flaky in CI; consider mocking Date.now() or using a controllable clock instead of wall time to make the timing-based lock expiry behavior deterministic.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `acquireLock`, any database error is treated the same as normal lock contention (returns `null`); consider distinguishing these cases (e.g., by rethrowing or returning a separate error state) so scheduler behavior and alerting can differentiate real failures from expected lock contention.
- The `"should not release another instance's lock after expiry"` test relies on real timing with `setTimeout(20)`, which can be flaky in CI; consider mocking `Date.now()` or using a controllable clock instead of wall time to make the timing-based lock expiry behavior deterministic.

## Individual Comments

### Comment 1
<location> `packages/sui-indexer/src/storage.ts:858-877` </location>
<code_context>
 		};
 	}
+
+	async acquireLock(lockName: string, ttlMs: number): Promise<number | null> {
+		const now = Date.now();
+		try {
+			const result = await this.db
+				.prepare(
+					`INSERT INTO cron_locks (lock_name, acquired_at, expires_at)
+					 VALUES (?, ?, ?)
+					 ON CONFLICT(lock_name) DO UPDATE
+					 SET acquired_at = excluded.acquired_at, expires_at = excluded.expires_at
+					 WHERE cron_locks.expires_at <= excluded.acquired_at
+					 RETURNING acquired_at`,
+				)
+				.bind(lockName, now, now + ttlMs)
+				.first<number>("acquired_at");
+			return result ?? null;
+		} catch (error) {
+			logError({ msg: "Failed to acquire lock", method: "acquireLock", lockName }, error);
+			return null;
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider distinguishing DB errors from an actually-contended lock instead of returning null in both cases.

Since `scheduled` treats `null` as “job already running,” DB failures are indistinguishable from normal lock contention and may silently mask storage issues. Consider propagating DB errors (or using a distinct error/sentinel type) so callers can handle contention vs. operational failure differently.

```suggestion
	async acquireLock(lockName: string, ttlMs: number): Promise<number | null> {
		const now = Date.now();
		try {
			const result = await this.db
				.prepare(
					`INSERT INTO cron_locks (lock_name, acquired_at, expires_at)
					 VALUES (?, ?, ?)
					 ON CONFLICT(lock_name) DO UPDATE
					 SET acquired_at = excluded.acquired_at, expires_at = excluded.expires_at
					 WHERE cron_locks.expires_at <= excluded.acquired_at
					 RETURNING acquired_at`,
				)
				.bind(lockName, now, now + ttlMs)
				.first<number>("acquired_at");
			// returns `null` when the lock is currently held (no row inserted/updated),
			// and throws on DB/operational errors (distinguishable by callers)
			return result ?? null;
		} catch (error) {
			logError({ msg: "Failed to acquire lock", method: "acquireLock", lockName }, error);
			throw error;
		}
	}
```
</issue_to_address>

### Comment 2
<location> `packages/sui-indexer/src/storage.ts:879-884` </location>
<code_context>
+		}
+	}
+
+	async releaseLock(lockName: string, acquiredAt: number): Promise<void> {
+		await this.db
+			.prepare(`DELETE FROM cron_locks WHERE lock_name = ? AND acquired_at = ?`)
+			.bind(lockName, acquiredAt)
+			.run();
+	}
 }
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider adding error handling/logging to `releaseLock` similar to `acquireLock`.

If this delete fails (e.g. due to a transient DB issue), the lock can be left behind and block future runs until expiry, with no visibility into why. Adding error logging here, similar to `acquireLock` (even if the error is rethrown), would make diagnosing stuck cron locks much easier.

```suggestion
	async releaseLock(lockName: string, acquiredAt: number): Promise<void> {
		try {
			await this.db
				.prepare(`DELETE FROM cron_locks WHERE lock_name = ? AND acquired_at = ?`)
				.bind(lockName, acquiredAt)
				.run();
		} catch (error) {
			logError({ msg: "Failed to release lock", method: "releaseLock", lockName, acquiredAt }, error);
			throw error;
		}
	}
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
Signed-off-by: Rayane Charif <rayane.charif@gonative.cc>
@Rcc999 Rcc999 merged commit 29981ca into master Feb 19, 2026
12 checks passed
@Rcc999 Rcc999 deleted the rayane/fix/sui-worker-race-condition branch February 19, 2026 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make sure we don't have race condintion when selecting / spending SUI in workers

3 participants