Skip to content

fix: drain buffered ACKs when errorsCh closes before acksCh#258

Merged
infiniteregrets merged 1 commit intomainfrom
fix/drain-acks-before-exit
Mar 29, 2026
Merged

fix: drain buffered ACKs when errorsCh closes before acksCh#258
infiniteregrets merged 1 commit intomainfrom
fix/drain-acks-before-exit

Conversation

@infiniteregrets
Copy link
Copy Markdown
Member

@infiniteregrets infiniteregrets commented Mar 29, 2026

closes #233

Summary

  • readAcksLoop defers close(errorsCh) before close(acksCh) (LIFO), so errorsCh closes first.
  • The select in readAcks could non-deterministically pick the closed errorsCh and return, abandoning buffered ACKs in acksCh (buffer size 20).
  • Lost ACKs leave successful appends in inflightQueue, which then time out and get retried on a new session, causing data duplication.
  • Fix: drain acksCh when errorsCh closes to process all pending ACKs before returning.

Test plan

  • Verify go build ./... passes
  • Confirm that under high-throughput pipelined appends, session teardown does not lose buffered ACKs

🤖 Generated with Claude Code

readAcksLoop closes errorsCh before acksCh (LIFO defer order). The
select in readAcks could non-deterministically pick the closed errorsCh
and return, abandoning buffered ACKs in acksCh. This caused successful
appends to remain in inflightQueue, time out, and get duplicated on
retry. Drain acksCh when errorsCh closes to process all pending ACKs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@infiniteregrets infiniteregrets requested a review from a team as a code owner March 29, 2026 08:56
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 29, 2026

Greptile Summary

This PR fixes a data-duplication bug in AppendSession.readAcks caused by a channel-closure ordering issue. In transportAppendSession.readAcksLoop, defers are executed LIFO — close(errorsCh) fires before close(acksCh) — so a select in readAcks could non-deterministically pick the closed errorsCh branch and return early, silently abandoning up to 20 buffered ACKs. Those un-acknowledged entries would then time out in inflightQueue and be re-sent on a new session, causing data duplication. The fix is a targeted five-line drain loop that processes all remaining acksCh items before returning when errorsCh is observed as closed.

Key changes:

  • Added a for ack := range session.acksCh drain after the errorsCh closed-channel path in readAcks; the loop is safe because acksCh will close momentarily (next defer in the same goroutine)
  • The drain correctly relies on handleAck's internal guards (inflightQueue empty-check, CAS on entry.completed) to make repeated or post-shutdown calls idempotent
  • No test is added for the specific closed-errorsCh-while-acksCh-has-buffered-items scenario, which would help prevent regression if the defer ordering in readAcksLoop ever changes

Confidence Score: 5/5

Safe to merge — the fix is minimal, logically sound, and does not introduce new error paths or races.

The only findings are P2 style/improvement suggestions (defensive context check during drain, missing regression test). The core fix correctly exploits the invariant that acksCh will always be closed by the same goroutine immediately after errorsCh, making the drain loop safe and bounded. handleAck is already idempotent under concurrent shutdown, so no new correctness risk is introduced.

No files require special attention; the single changed file is a targeted, low-risk bug fix.

Important Files Changed

Filename Overview
s2/append_session.go Adds a drain loop in readAcks to consume buffered ACKs from acksCh when errorsCh closes first (due to LIFO defer ordering in readAcksLoop), preventing successful appends from being silently lost and retried as duplicates.

Sequence Diagram

sequenceDiagram
    participant RL as readAcksLoop (goroutine)
    participant ACK as acksCh (buf=20)
    participant ERR as errorsCh (buf=1)
    participant RA as readAcks (goroutine)

    Note over RL: Normal/EOF exit — loop ends
    RL->>ACK: (buffered ACKs already in channel)
    RL->>ERR: defer close(errorsCh) [LIFO #1]
    ERR-->>RA: select picks closed errorsCh

    alt BEFORE fix
        RA->>RA: return immediately
        Note over ACK: buffered ACKs abandoned → entries time out → duplicates
    else AFTER fix
        RA->>RA: for ack := range acksCh
        RA->>RA: handleAck(ack) × N
        RL->>ACK: defer close(acksCh) [LIFO #2]
        ACK-->>RA: channel closed — range exits
        RA->>RA: return cleanly
        Note over ACK: all ACKs processed → no duplicates
    end
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: s2/append_session.go
Line: 428-430

Comment:
**No `pumpCtx` cancellation check during drain**

The new drain loop — `for ack := range session.acksCh` — does not observe `r.pumpCtx.Done()`. This means that if the pump context is cancelled (e.g. `Close()` is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal.

In practice this is harmless because:
1. `close(p.acksCh)` fires in the very next deferred statement after `close(p.errorsCh)` (both run on the same goroutine, sequentially), so the drain window is tiny.
2. The buffer cap is 20, so at most 20 `handleAck` calls can occur.
3. `handleAck` itself is safe to call after shutdown — it returns immediately when the inflight queue is empty.

If you'd like to be defensive anyway, you could break early on context cancellation:

```go
for ack := range session.acksCh {
    select {
    case <-r.pumpCtx.Done():
        return
    default:
    }
    r.handleAck(session, ack)
}
```

This is purely a hardening suggestion — the current code is functionally correct.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: s2/append_session.go
Line: 424-431

Comment:
**No regression test for the buffered-ACK drain path**

The existing test suite in `append_session_retry_test.go` doesn't appear to include a test that verifies ACKs buffered in `acksCh` are not lost when `errorsCh` closes first. Without a test, it would be easy for a future refactor of the defer order in `readAcksLoop` to silently reintroduce this bug.

A minimal test could:
1. Pre-populate `acksCh` with 1–N `*AppendAck` values.
2. Close `errorsCh` (simulating LIFO defer firing first).
3. Then close `acksCh`.
4. Assert that all submitted entries are resolved with their ACKs (not timed out or retried).

This would give clear coverage of the exact race the PR description documents.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: drain buffered ACKs when errorsCh c..." | Re-trigger Greptile

Comment on lines +428 to +430
for ack := range session.acksCh {
r.handleAck(session, ack)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No pumpCtx cancellation check during drain

The new drain loop — for ack := range session.acksCh — does not observe r.pumpCtx.Done(). This means that if the pump context is cancelled (e.g. Close() is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal.

In practice this is harmless because:

  1. close(p.acksCh) fires in the very next deferred statement after close(p.errorsCh) (both run on the same goroutine, sequentially), so the drain window is tiny.
  2. The buffer cap is 20, so at most 20 handleAck calls can occur.
  3. handleAck itself is safe to call after shutdown — it returns immediately when the inflight queue is empty.

If you'd like to be defensive anyway, you could break early on context cancellation:

for ack := range session.acksCh {
    select {
    case <-r.pumpCtx.Done():
        return
    default:
    }
    r.handleAck(session, ack)
}

This is purely a hardening suggestion — the current code is functionally correct.

Prompt To Fix With AI
This is a comment left during a code review.
Path: s2/append_session.go
Line: 428-430

Comment:
**No `pumpCtx` cancellation check during drain**

The new drain loop — `for ack := range session.acksCh` — does not observe `r.pumpCtx.Done()`. This means that if the pump context is cancelled (e.g. `Close()` is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal.

In practice this is harmless because:
1. `close(p.acksCh)` fires in the very next deferred statement after `close(p.errorsCh)` (both run on the same goroutine, sequentially), so the drain window is tiny.
2. The buffer cap is 20, so at most 20 `handleAck` calls can occur.
3. `handleAck` itself is safe to call after shutdown — it returns immediately when the inflight queue is empty.

If you'd like to be defensive anyway, you could break early on context cancellation:

```go
for ack := range session.acksCh {
    select {
    case <-r.pumpCtx.Done():
        return
    default:
    }
    r.handleAck(session, ack)
}
```

This is purely a hardening suggestion — the current code is functionally correct.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines 424 to 431
case err, ok := <-session.errorsCh:
if !ok {
// errorsCh closes before acksCh (defer order), so
// drain any buffered ACKs before returning.
for ack := range session.acksCh {
r.handleAck(session, ack)
}
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No regression test for the buffered-ACK drain path

The existing test suite in append_session_retry_test.go doesn't appear to include a test that verifies ACKs buffered in acksCh are not lost when errorsCh closes first. Without a test, it would be easy for a future refactor of the defer order in readAcksLoop to silently reintroduce this bug.

A minimal test could:

  1. Pre-populate acksCh with 1–N *AppendAck values.
  2. Close errorsCh (simulating LIFO defer firing first).
  3. Then close acksCh.
  4. Assert that all submitted entries are resolved with their ACKs (not timed out or retried).

This would give clear coverage of the exact race the PR description documents.

Prompt To Fix With AI
This is a comment left during a code review.
Path: s2/append_session.go
Line: 424-431

Comment:
**No regression test for the buffered-ACK drain path**

The existing test suite in `append_session_retry_test.go` doesn't appear to include a test that verifies ACKs buffered in `acksCh` are not lost when `errorsCh` closes first. Without a test, it would be easy for a future refactor of the defer order in `readAcksLoop` to silently reintroduce this bug.

A minimal test could:
1. Pre-populate `acksCh` with 1–N `*AppendAck` values.
2. Close `errorsCh` (simulating LIFO defer firing first).
3. Then close `acksCh`.
4. Assert that all submitted entries are resolved with their ACKs (not timed out or retried).

This would give clear coverage of the exact race the PR description documents.

How can I resolve this? If you propose a fix, please make it concise.

@infiniteregrets infiniteregrets merged commit cdcfc58 into main Mar 29, 2026
10 checks passed
@infiniteregrets infiniteregrets deleted the fix/drain-acks-before-exit branch March 29, 2026 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Detail Bug] Append session can miss buffered ACKs on clean transport shutdown, causing duplicate appends

1 participant