CCIP-9463 Durable queue for storage writer #723

Merged
mateusz-sekara merged 27 commits into main from durable-q
Feb 20, 2026

Conversation

mateusz-sekara (Collaborator) commented Feb 18, 2026

The storage writer now processes work through a DB-backed queue. The change is backward compatible: if no db is configured, we fall back to the previous in-memory implementation, mainly because the DB migrations also need to be added in the chainlink repository.

    postgres_queue_bench_test.go:181: Duration: 2.731303583s
    postgres_queue_bench_test.go:182:   Publish phase:  945.473167ms (10577 jobs/sec)
    postgres_queue_bench_test.go:183:   Published:      10000
    postgres_queue_bench_test.go:184:   Consumed:       11756 (includes retried jobs re-consumed)
    postgres_queue_bench_test.go:185:   Completed:      10000
    postgres_queue_bench_test.go:186:   Retried:        1166
    postgres_queue_bench_test.go:187:   Perm. failed:   590
    postgres_queue_bench_test.go:196:   Remaining in queue: 0, Archived: 10000, Sum: 10000 (expected 10000)

Copilot AI (Contributor) left a comment
Pull request overview

This pull request implements a durable, database-backed queue for the storage writer component of the verifier system. It replaces the in-memory batcher with a PostgreSQL-backed job queue to provide durability and crash recovery, and maintains backward compatibility by falling back to the in-memory implementation when no database connection is provided.

Changes:

  • Implements a generic PostgreSQL job queue interface with full CRUD operations, retry logic, and archival capabilities
  • Adds StorageWriterProcessorDB as a database-backed alternative to the existing StorageWriterProcessor
  • Introduces queue-batcher adapter to integrate the existing batcher pattern with the new persistent queue
  • Adds comprehensive database migrations for job queue tables
  • Updates all coordinator constructors to accept an optional database connection parameter

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 9 comments.

File Description
verifier/jobqueue/interface.go Defines generic job queue interface with Jobable trait for queue items
verifier/jobqueue/postgres_queue.go PostgreSQL implementation of job queue with row-level locking and SKIP LOCKED
verifier/jobqueue/postgres_queue_test.go Comprehensive unit tests covering all queue operations and concurrent scenarios
verifier/jobqueue/postgres_queue_bench_test.go Benchmark tests demonstrating throughput under load
verifier/storage_writer_db.go Database-backed storage writer processor that polls the queue
verifier/queue_batcher_adapter.go Adapter that forwards batcher output to the persistent queue
verifier/verification_coordinator.go Factory function to create appropriate storage writer based on DB availability
verifier/types.go Adds JobKey method to VerificationTask for queue integration
protocol/message_types.go Adds JobKey method to VerifierNodeResult for queue integration
verifier/pkg/db/migrations/postgres/00004_create_job_queues.sql Database schema for both verification_tasks and verification_results queues
verifier/testutil/test_db.go Test utility for creating PostgreSQL testcontainers with migrations
verification_coordinator_*_test.go Updates test setup to use database-backed queue
cmd/verifier/token/main.go Passes database connection to coordinators
cmd/verifier/servicefactory.go Passes database connection to coordinator
integration/pkg/constructors/committee_verifier.go Explicitly passes nil for DB (backward compatibility)
executor/pkg/adapter/adapter.go Minor code formatting improvement for type parameters
build/devenv/fakes/go.mod Adds lib/pq dependency
build/devenv/fakes/go.sum Adds dependencies for goose migrations and testcontainers


Comment on lines +33 to +36
    go func() {
        <-ctx.Done()
        wg.Wait()
    }()

Copilot AI Feb 20, 2026


The goroutine spawned on lines 33-36 will continue running until ctx.Done() is signaled, but there's no guarantee it will complete before the caller continues. This could lead to a goroutine leak if the function returns before the cleanup goroutine exits. Additionally, this cleanup goroutine waits for wg.Wait() which will only complete after forwardToQueue exits, creating a dependency where the cleanup goroutine waits for the forwarding goroutine. Consider using a more structured approach where the caller is responsible for cleanup, or ensure the cleanup goroutine signals its completion.
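The structured alternative the comment suggests could look like the sketch below: the constructor returns a close function, so the caller owns shutdown and the wait is guaranteed to observe the forwarder's exit. The `startForwarder` name and channel type are hypothetical, not the PR's code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// startForwarder launches the forwarding goroutine and returns a close
// function that cancels the context and blocks until the goroutine has
// exited, so cleanup is owned by the caller rather than a detached goroutine.
func startForwarder(in <-chan int, handle func(int)) (closeFn func()) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				return
			case v, ok := <-in:
				if !ok {
					return
				}
				handle(v)
			}
		}
	}()
	return func() {
		cancel()
		wg.Wait() // returns only after the forwarder has exited
	}
}

func main() {
	in := make(chan int)
	var got []int
	closeFn := startForwarder(in, func(v int) { got = append(got, v) })
	in <- 42 // blocks until the forwarder has taken the value
	closeFn()
	fmt.Println(got) // [42]
}
```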

Comment on lines +63 to +71
    // Publish items to queue
    if len(batch.Items) > 0 {
        publishCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
        if err := queue.Publish(publishCtx, batch.Items...); err != nil {
            lggr.Errorw("Failed to publish to queue", "error", err, "count", len(batch.Items))
        } else {
            lggr.Debugw("Published results to queue", "count", len(batch.Items))
        }
        cancel()

Copilot AI Feb 20, 2026


When queue.Publish fails on line 66, the error is logged but the batch items are silently dropped. This could lead to data loss if the queue is temporarily unavailable or if there's a transient database issue. Since this is replacing the in-memory batcher that had retry capabilities, consider implementing a retry mechanism here or at least making the data loss more visible through metrics or critical logging.

Suggested change
    // Publish items to queue
    if len(batch.Items) > 0 {
        publishCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
        if err := queue.Publish(publishCtx, batch.Items...); err != nil {
            lggr.Errorw("Failed to publish to queue", "error", err, "count", len(batch.Items))
        } else {
            lggr.Debugw("Published results to queue", "count", len(batch.Items))
        }
        cancel()
    // Publish items to queue with bounded retries to reduce data loss on transient failures
    if len(batch.Items) > 0 {
        const maxPublishRetries = 3
        for attempt := 1; attempt <= maxPublishRetries; attempt++ {
            publishCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
            err := queue.Publish(publishCtx, batch.Items...)
            cancel()
            if err == nil {
                lggr.Debugw("Published results to queue", "count", len(batch.Items))
                break
            }
            if attempt < maxPublishRetries {
                lggr.Errorw(
                    "Failed to publish to queue, will retry",
                    "error", err,
                    "count", len(batch.Items),
                    "attempt", attempt,
                    "maxAttempts", maxPublishRetries,
                )
                // Simple linear backoff between retries, but abort if context is done
                backoff := time.Duration(attempt) * time.Second
                select {
                case <-ctx.Done():
                    return
                case <-time.After(backoff):
                }
            } else {
                lggr.Errorw(
                    "Failed to publish to queue after max retries, dropping batch",
                    "error", err,
                    "count", len(batch.Items),
                    "maxAttempts", maxPublishRetries,
                )
            }
        }

Comment on lines 330 to 334
    if time.Now().After(retryDeadline) || time.Now().Equal(retryDeadline) {
        failed = append(failed, resultJobID)
    } else {
        retried = append(retried, resultJobID)
    }

Copilot AI Feb 20, 2026


There's a potential race condition when comparing the current time with retry_deadline. The comparison on line 330 uses time.Now() which could be different from the NOW() used in the SQL query on line 276. This means a job could be marked as 'pending' in the database but then classified as 'failed' in the logging counters due to the time difference between SQL execution and Go code execution. Consider using the retry_deadline returned from the query to make the decision, comparing it against the time when the query was executed (captured before the query), not a new time.Now() call.

@mateusz-sekara mateusz-sekara changed the title Durable queue for storage writer CCIP-9463 Durable queue for storage writer Feb 20, 2026
mateusz-sekara and others added 2 commits February 20, 2026 15:43 (co-authored by Copilot)
    // When the job was created
    CreatedAt time.Time
    // When processing started (nil if not started)
    StartedAt *time.Time
Contributor

could this be a NPE if we're not careful?

Collaborator (Author)

In Go code we only write to that field; it's read in the produce/consume SQL queries, so it won't result in an NPE.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 19 comments.



Comment on lines +189 to +201
    if err := s.resultQueue.Complete(ctx, jobIDs...); err != nil {
        s.lggr.Errorw("Failed to complete jobs in queue",
            "error", err,
            "batchSize", len(jobIDs),
        )
        // Continue anyway - data is written, tracking will catch up
    }

    // Update checkpoints
    s.updateCheckpoints(ctx, affectedChains)

    // Track message latencies
    s.messageTracker.TrackMessageLatencies(ctx, results)

Copilot AI Feb 20, 2026


When resultQueue.Complete() fails (lines 189-195), the code continues with checkpoint updates and message latency tracking. However, the jobs remain in 'processing' status in the database and will eventually be reclaimed and reprocessed, potentially leading to duplicate writes. Consider failing the entire operation or implementing compensating logic to handle this edge case.

Comment on lines +77 to +79
    -- Index for archive cleanup
    CREATE INDEX IF NOT EXISTS idx_verification_tasks_archive_completed
        ON verification_tasks_archive (completed_at DESC);

Copilot AI Feb 20, 2026


The archive tables have indexes on completed_at for cleanup queries, but the Complete operation that inserts into the archive uses job_id for deletion from the main table. If there are queries that look up archived jobs by job_id (e.g., for debugging or audit), consider adding an index on job_id in the archive tables as well.

    }

    // Publish items to queue
    if len(batch.Items) > 0 {

Copilot AI Feb 20, 2026


The forwardToQueue function uses the parent ctx for the timeout context on line 65, but if the parent context is already canceled, context.WithTimeout will immediately return a canceled context. The error from queue.Publish could then be misleading (context canceled vs actual publish error). Consider checking if ctx.Err() is non-nil before attempting to publish, or use context.Background() with timeout for the publish operation.

Suggested change
    if len(batch.Items) > 0 {
    if len(batch.Items) > 0 {
        // If the parent context is already canceled, avoid creating a timed context
        // that is immediately canceled and producing a misleading publish error.
        if err := ctx.Err(); err != nil {
            lggr.Errorw("Skipping publish to queue due to context error", "error", err, "count", len(batch.Items))
            return
        }

Comment on lines +254 to +255
    INSERT INTO %s
    SELECT *, NOW() as completed_at

Copilot AI Feb 20, 2026


The SELECT *, NOW() as completed_at query assumes the archive table has the exact same columns as the main table plus completed_at. This is fragile because if the schema changes (e.g., new columns added), this INSERT will fail. Consider explicitly listing all columns to make schema evolution easier and failures more obvious.

Suggested change
    INSERT INTO %s
    SELECT *, NOW() as completed_at
    INSERT INTO %s (
        job_id,
        owner_id,
        payload,
        priority,
        run_at,
        attempts,
        max_attempts,
        created_at,
        updated_at,
        task_job_id,
        completed_at
    )
    SELECT
        job_id,
        owner_id,
        payload,
        priority,
        run_at,
        attempts,
        max_attempts,
        created_at,
        updated_at,
        task_job_id,
        NOW() AS completed_at

Comment on lines +28 to +30
    wg.Go(func() {
        forwardToQueue(ctx, b.OutChannel(), queue, lggr)
    })

Copilot AI Feb 20, 2026


sync.WaitGroup from the standard library does not have a Go method. This code will fail to compile. Based on other files in the codebase that use wg.Go(), it appears you need to use a different WaitGroup type (likely from chainlink-common services package or a custom type). Check other files like storage_writer_db.go line 84 or storage_writer.go line 96 to see the correct import/type being used.

Suggested change
    wg.Go(func() {
        forwardToQueue(ctx, b.OutChannel(), queue, lggr)
    })
    wg.Add(1)
    go func() {
        defer wg.Done()
        forwardToQueue(ctx, b.OutChannel(), queue, lggr)
    }()

Comment on lines +33 to +36
    go func() {
        <-ctx.Done()
        wg.Wait()
    }()

Copilot AI Feb 20, 2026


The goroutine spawned on line 33 only waits for the WaitGroup but never properly shuts down. If ctx is canceled but the forwardToQueue goroutine is blocked (e.g., waiting on a channel), this cleanup goroutine will wait indefinitely on wg.Wait(). Consider adding proper shutdown coordination or a timeout.

    //
    //nolint:gofumpt
    func queryWithFailover[TInput any, TResponse any](
    func queryWithFailover[TInput, TResponse any](

Copilot AI Feb 20, 2026


This formatting change to the generic type parameters appears unrelated to the PR's stated purpose of adding a durable queue for storage writer. Consider moving unrelated formatting/style changes to a separate PR to keep this PR focused on its main objective.

KodeyThomas previously approved these changes Feb 20, 2026
@github-actions

Code coverage report:

Package main durable-q diff
github.com/smartcontractkit/chainlink-ccv/aggregator 48.22% 48.22% +0.00%
github.com/smartcontractkit/chainlink-ccv/bootstrap 39.26% 39.26% +0.00%
github.com/smartcontractkit/chainlink-ccv/cmd 0.00% 0.00% +0.00%
github.com/smartcontractkit/chainlink-ccv/committee 100.00% 100.00% +0.00%
github.com/smartcontractkit/chainlink-ccv/common 45.74% 45.74% +0.00%
github.com/smartcontractkit/chainlink-ccv/executor 47.21% 47.26% +0.05%
github.com/smartcontractkit/chainlink-ccv/indexer 34.11% 34.05% -0.06%
github.com/smartcontractkit/chainlink-ccv/integration 39.84% 39.74% -0.10%
github.com/smartcontractkit/chainlink-ccv/pkg 100.00% 100.00% +0.00%
github.com/smartcontractkit/chainlink-ccv/pricer 15.70% 15.70% +0.00%
github.com/smartcontractkit/chainlink-ccv/protocol 65.22% 65.30% +0.08%
github.com/smartcontractkit/chainlink-ccv/verifier 53.79% 43.04% -10.75%

@mateusz-sekara mateusz-sekara added this pull request to the merge queue Feb 20, 2026
            s.lggr.Errorw("Error processing batch", "error", err)
        }

    case <-cleanupTicker.C:
Contributor

nit: Do we want to stall the consumption while we do the cleanup? Wondering if that could introduce some delays

Merged via the queue into main with commit 58d41a9 Feb 20, 2026
24 checks passed
@mateusz-sekara mateusz-sekara deleted the durable-q branch February 20, 2026 19:04

    // Publish items to queue
    if len(batch.Items) > 0 {
        publishCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
Contributor

nit: Maybe we can move the 10 seconds to a constant or make it configurable
