Cancellation handling gap in replication publish worker causes transient infinite retry during shutdown


**Severity:** Informational | **Likelihood:** Low | **Impact:** Informational | **Type:** Vulnerability

## Details

The replication publish worker does not honor context cancellation in one specific window. If cancellation occurs after the insert step but before the staged row is deleted, `publishStagedEnvelope` returns false, and the inner retry loop keeps retrying without re-checking `ctx.Done()` until the process exits. This does not block shutdown and preserves data integrity; it causes only minor operational/UX friction.

In `pkg/api/message/publish_worker.go`, `start()` processes each staged envelope with an inner loop: `for !p.publishStagedEnvelope(stagedEnv) { time.Sleep(...) }`. This loop does not select on `ctx.Done()`. Inside `publishStagedEnvelope()`, if `p.ctx.Err()` is non-nil immediately after the insert-and-increment step (and before attempting to delete the staged row), the function returns false. As a result, when shutdown cancels the context during this window, the inner loop keeps retrying and never re-enters the outer select that checks `ctx.Done()`.

Several factors limit the impact. The shutdown sequence in `pkg/server/server.go` does not wait for this worker goroutine, so shutdown is not blocked. Database insert semantics are idempotent (duplicate inserts return `inserted == 0`), and any staged row left over is cleaned up on the next attempt or restart. Client-side waits after staging are bounded by a 30-second timeout and are also canceled with the request context. Overall impact is minor: transient retry looping and a possible bounded client delay near shutdown, with no data corruption or lasting blockage.

## Exploitation

### Scenario 1

An operator initiates shutdown while a staged envelope is in flight. The DB is closed before the shared context is canceled, so `publishStagedEnvelope` sees cancellation or DB errors before the delete and returns false. The inner loop then retries every ~10ms until the process exits, causing transient log noise but not blocking shutdown.

**Preconditions / Assumptions:**
- (a) API enabled; replication publish worker running
- (b) A staged envelope is being processed when shutdown starts
- (c) BaseServer.Shutdown closes the DB before canceling the shared context
- (d) publishStagedEnvelope is between insert and delete when cancellation happens

### Scenario 2

A client calls `PublishPayerEnvelopes` just before shutdown. The envelope is staged, but the worker gets stuck before the delete, so `lastProcessed` does not advance. `waitForGatewayPublish` then waits up to 30 seconds (or until the request context is canceled), resulting in bounded extra latency for the client.

**Preconditions / Assumptions:**
- (a) Client request to PublishPayerEnvelopes in-flight near shutdown
- (b) Staged envelope successfully inserted for processing
- (c) Worker cancellation occurs pre-delete so lastProcessed is not updated
- (d) waitForGatewayPublish uses a 30-second timeout and respects request context cancellation

### Scenario 3

An envelope is inserted into the gateway tables, but the staged row is not deleted because of cancellation. After restart, the staged row is reprocessed: the duplicate insert is ignored (`inserted == 0`) and the staged row is deleted safely without double-accounting.

**Preconditions / Assumptions:**
- (a) Envelope insertion succeeded but staged-row deletion did not run due to cancellation
- (b) Node is restarted and the publish worker resumes
- (c) Insert path is idempotent (duplicate insert returns inserted == 0) enabling safe deletion of the staged row
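The reprocessing behavior in this scenario can be illustrated with a small in-memory model. This is a sketch only: the `store` type, its maps, and the `interruptBeforeDelete` flag are hypothetical stand-ins for the real database tables and the cancellation window.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the staged and gateway tables.
type store struct {
	staged  map[int64]string // staged rows awaiting publish
	gateway map[int64]string // published envelopes
	count   int              // accounting counter, bumped only on new inserts
}

func newStore() *store {
	return &store{staged: map[int64]string{}, gateway: map[int64]string{}}
}

// insertIfAbsent returns 1 if the row was inserted and 0 for a duplicate.
func (s *store) insertIfAbsent(id int64, payload string) int {
	if _, exists := s.gateway[id]; exists {
		return 0
	}
	s.gateway[id] = payload
	return 1
}

// publish follows the insert, increment, delete order described above.
// interruptBeforeDelete simulates cancellation between insert and delete.
func (s *store) publish(id int64, interruptBeforeDelete bool) bool {
	payload, ok := s.staged[id]
	if !ok {
		return true // nothing staged for this ID
	}
	if s.insertIfAbsent(id, payload) == 1 {
		s.count++ // count only rows that were actually inserted
	}
	if interruptBeforeDelete {
		return false // staged row is left behind
	}
	delete(s.staged, id)
	return true
}

func main() {
	s := newStore()
	s.staged[7] = "envelope"

	s.publish(7, true)  // first run: inserted, then interrupted before delete
	s.publish(7, false) // after restart: duplicate insert ignored, row deleted

	fmt.Println(len(s.gateway), s.count, len(s.staged)) // prints "1 1 0"
}
```

The second call finds the envelope already present, so the counter stays at 1 and only the cleanup step runs.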

## Files impacted

- `pkg/api/message/publish_worker.go`

**Lines 115-120:**
```go
// Infinite retry on failure to publish; we cannot
// continue to the next envelope until this one is processed
time.Sleep(p.sleepOnFailureTime)
}

p.lastProcessed.Store(stagedEnv.ID)
metrics.EmitApiStagedEnvelopeProcessingDelay(time.Since(stagedEnv.OriginatorTime))
```

