Cancellation handling gap in replication publish worker causes transient infinite retry during shutdown #1865

@neekolas

Description

Severity: Informational | Likelihood: Low | Impact: Informational | Type: Vulnerability

Details

The replication publish worker does not honor context cancellation in one specific window: if cancellation occurs after the insert step but before the staged row is deleted, publishStagedEnvelope returns false and the inner retry loop continues without re-checking ctx.Done(). The worker then retries until the process exits. This does not block shutdown, preserves data integrity, and causes only minor operational/UX friction.

In pkg/api/message/publish_worker.go, start() processes each staged envelope with an inner loop: for !p.publishStagedEnvelope(stagedEnv) { time.Sleep(...) }. This loop does not select on ctx.Done(). Inside publishStagedEnvelope(), if p.ctx.Err() is non-nil immediately after the insert-and-increment step (and before attempting to delete the staged row), the function returns false. Because a canceled context stays canceled, later attempts keep returning false, so when shutdown cancels the context during this window the inner loop keeps retrying and never returns to the outer select that checks ctx.Done().

Several factors limit the impact. The shutdown sequence in pkg/server/server.go does not wait for this worker goroutine, so shutdown is not blocked. Database insert semantics are idempotent (duplicate inserts return inserted == 0), and any staged row left over is cleaned up on the next attempt or after a restart. Client-side waits after staging are bounded by a 30-second timeout (and are also canceled with the request context). Overall impact is minor: transient retry looping and a possible bounded client delay near shutdown, with no data corruption or lasting blockage.

Exploitation

Scenario 1

Operator initiates shutdown while a staged envelope is in-flight: the DB is closed before the shared context is canceled; publishStagedEnvelope sees cancellation or DB errors pre-delete and returns false; the inner loop retries every ~10ms until the process exits, causing transient log noise but not blocking shutdown.

Preconditions / Assumptions:

  • (a) API enabled; replication publish worker running
  • (b) A staged envelope is being processed when shutdown starts
  • (c) BaseServer.Shutdown closes the DB before canceling the shared context
  • (d) publishStagedEnvelope is between insert and delete when cancellation happens

Scenario 2

A client calls PublishPayerEnvelopes just before shutdown: the envelope is staged and the worker gets stuck pre-delete, so lastProcessed does not advance; waitForGatewayPublish waits up to 30 seconds (or until the request context is canceled), resulting in bounded extra latency for the client.

Preconditions / Assumptions:

  • (a) Client request to PublishPayerEnvelopes in-flight near shutdown
  • (b) Staged envelope successfully inserted for processing
  • (c) Worker cancellation occurs pre-delete so lastProcessed is not updated
  • (d) waitForGatewayPublish uses a 30-second timeout and respects request context cancellation

Scenario 3

An envelope is inserted into gateway tables but the staged row delete does not occur due to cancellation: after restart, the staged row is reprocessed; duplicate insert is ignored (inserted == 0) and the staged row is deleted safely without double-accounting.

Preconditions / Assumptions:

  • (a) Envelope insertion succeeded but staged-row deletion did not run due to cancellation
  • (b) Node is restarted and the publish worker resumes
  • (c) Insert path is idempotent (duplicate insert returns inserted == 0) enabling safe deletion of the staged row
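The recovery path in this scenario relies on the insert being idempotent. The in-memory model below is a sketch for illustration only: the store type, its fields, and the publish method are hypothetical and do not mirror the project's database queries. It shows why reprocessing a leftover staged row after restart does not double-count.

```go
package main

import "fmt"

// store is a hypothetical in-memory model of the staged and gateway tables.
type store struct {
	staged  map[int64]string // rows waiting to be published
	gateway map[int64]string // published envelopes
	usage   int              // accounting counter, bumped once per real insert
}

func newStore() *store {
	return &store{staged: map[int64]string{}, gateway: map[int64]string{}}
}

// insertGateway returns 1 for a new row and 0 for a duplicate.
func (s *store) insertGateway(id int64, payload string) int {
	if _, exists := s.gateway[id]; exists {
		return 0
	}
	s.gateway[id] = payload
	return 1
}

// publish processes one staged row. If cancelBeforeDelete is true it
// stops after the insert step, modeling cancellation in the gap.
func (s *store) publish(id int64, cancelBeforeDelete bool) bool {
	payload, ok := s.staged[id]
	if !ok {
		return true // nothing left to do
	}
	if s.insertGateway(id, payload) == 1 {
		s.usage++ // only count envelopes that were actually inserted
	}
	if cancelBeforeDelete {
		return false
	}
	delete(s.staged, id)
	return true
}

func main() {
	s := newStore()
	s.staged[7] = "envelope"

	s.publish(7, true)  // first run: canceled between insert and delete
	s.publish(7, false) // after restart: duplicate insert, then delete

	fmt.Println("gateway rows:", len(s.gateway), "usage:", s.usage, "staged rows:", len(s.staged))
}
```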

Files impacted

  • pkg/api/message/publish_worker.go

Lines 115-120:

// Infinite retry on failure to publish; we cannot
// continue to the next envelope until this one is processed
time.Sleep(p.sleepOnFailureTime)
}

p.lastProcessed.Store(stagedEnv.ID)
metrics.EmitApiStagedEnvelopeProcessingDelay(time.Since(stagedEnv.OriginatorTime))
