fix: drain buffered ACKs when errorsCh closes before acksCh by infiniteregrets · Pull Request #258 · s2-streamstore/s2-sdk-go

infiniteregrets · 2026-03-29T08:56:52Z

closes #233

Summary

readAcksLoop defers close(errorsCh) before close(acksCh) (LIFO), so errorsCh closes first.
The select in readAcks could non-deterministically pick the closed errorsCh and return, abandoning buffered ACKs in acksCh (buffer size 20).
Lost ACKs leave successful appends in inflightQueue, which then time out and get retried on a new session, causing data duplication.
Fix: drain acksCh when errorsCh closes to process all pending ACKs before returning.

Test plan

Verify go build ./... passes
Confirm that under high-throughput pipelined appends, session teardown does not lose buffered ACKs

🤖 Generated with Claude Code

readAcksLoop closes errorsCh before acksCh (LIFO defer order). The select in readAcks could non-deterministically pick the closed errorsCh and return, abandoning buffered ACKs in acksCh. This caused successful appends to remain in inflightQueue, time out, and get duplicated on retry. Drain acksCh when errorsCh closes to process all pending ACKs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-29T09:00:26Z

Greptile Summary

This PR fixes a data-duplication bug in AppendSession.readAcks caused by a channel-closure ordering issue. In transportAppendSession.readAcksLoop, defers are executed LIFO — close(errorsCh) fires before close(acksCh) — so a select in readAcks could non-deterministically pick the closed errorsCh branch and return early, silently abandoning up to 20 buffered ACKs. Those un-acknowledged entries would then time out in inflightQueue and be re-sent on a new session, causing data duplication. The fix is a targeted five-line drain loop that processes all remaining acksCh items before returning when errorsCh is observed as closed.

Key changes:

Added a for ack := range session.acksCh drain after the errorsCh closed-channel path in readAcks; the loop is safe because acksCh will close momentarily (next defer in the same goroutine)
The drain correctly relies on handleAck's internal guards (inflightQueue empty-check, CAS on entry.completed) to make repeated or post-shutdown calls idempotent
No test is added for the specific closed-errorsCh-while-acksCh-has-buffered-items scenario, which would help prevent regression if the defer ordering in readAcksLoop ever changes

Confidence Score: 5/5

Safe to merge — the fix is minimal, logically sound, and does not introduce new error paths or races.

The only findings are P2 style/improvement suggestions (defensive context check during drain, missing regression test). The core fix correctly exploits the invariant that acksCh will always be closed by the same goroutine immediately after errorsCh, making the drain loop safe and bounded. handleAck is already idempotent under concurrent shutdown, so no new correctness risk is introduced.

No files require special attention; the single changed file is a targeted, low-risk bug fix.

Important Files Changed

Filename	Overview
s2/append_session.go	Adds a drain loop in `readAcks` to consume buffered ACKs from `acksCh` when `errorsCh` closes first (due to LIFO defer ordering in `readAcksLoop`), preventing successful appends from being silently lost and retried as duplicates.

Sequence Diagram

sequenceDiagram
    participant RL as readAcksLoop (goroutine)
    participant ACK as acksCh (buf=20)
    participant ERR as errorsCh (buf=1)
    participant RA as readAcks (goroutine)

    Note over RL: Normal/EOF exit — loop ends
    RL->>ACK: (buffered ACKs already in channel)
    RL->>ERR: defer close(errorsCh) [LIFO #1]
    ERR-->>RA: select picks closed errorsCh

    alt BEFORE fix
        RA->>RA: return immediately
        Note over ACK: buffered ACKs abandoned → entries time out → duplicates
    else AFTER fix
        RA->>RA: for ack := range acksCh
        RA->>RA: handleAck(ack) × N
        RL->>ACK: defer close(acksCh) [LIFO #2]
        ACK-->>RA: channel closed — range exits
        RA->>RA: return cleanly
        Note over ACK: all ACKs processed → no duplicates
    end

Prompt To Fix All With AI

This is a comment left during a code review.
Path: s2/append_session.go
Line: 428-430

Comment:
**No `pumpCtx` cancellation check during drain**

The new drain loop — `for ack := range session.acksCh` — does not observe `r.pumpCtx.Done()`. This means that if the pump context is cancelled (e.g. `Close()` is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal.

In practice this is harmless because:
1. `close(p.acksCh)` fires in the very next deferred statement after `close(p.errorsCh)` (both run on the same goroutine, sequentially), so the drain window is tiny.
2. The buffer cap is 20, so at most 20 `handleAck` calls can occur.
3. `handleAck` itself is safe to call after shutdown — it returns immediately when the inflight queue is empty.

If you'd like to be defensive anyway, you could break early on context cancellation:

```go
for ack := range session.acksCh {
    select {
    case <-r.pumpCtx.Done():
        return
    default:
    }
    r.handleAck(session, ack)
}
```

This is purely a hardening suggestion — the current code is functionally correct.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: s2/append_session.go
Line: 424-431

Comment:
**No regression test for the buffered-ACK drain path**

The existing test suite in `append_session_retry_test.go` doesn't appear to include a test that verifies ACKs buffered in `acksCh` are not lost when `errorsCh` closes first. Without a test, it would be easy for a future refactor of the defer order in `readAcksLoop` to silently reintroduce this bug.

A minimal test could:
1. Pre-populate `acksCh` with 1–N `*AppendAck` values.
2. Close `errorsCh` (simulating LIFO defer firing first).
3. Then close `acksCh`.
4. Assert that all submitted entries are resolved with their ACKs (not timed out or retried).

This would give clear coverage of the exact race the PR description documents.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix: drain buffered ACKs when errorsCh c..." | Re-trigger Greptile}

greptile-apps · 2026-03-29T09:00:29Z

s2/append_session.go

+				for ack := range session.acksCh {
+					r.handleAck(session, ack)
+				}


No pumpCtx cancellation check during drain

The new drain loop — for ack := range session.acksCh — does not observe r.pumpCtx.Done(). This means that if the pump context is cancelled (e.g. Close() is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal.

In practice this is harmless because:

close(p.acksCh) fires in the very next deferred statement after close(p.errorsCh) (both run on the same goroutine, sequentially), so the drain window is tiny.

The buffer cap is 20, so at most 20 handleAck calls can occur.

handleAck itself is safe to call after shutdown — it returns immediately when the inflight queue is empty.

If you'd like to be defensive anyway, you could break early on context cancellation:

for ack := range session.acksCh { select { case <-r.pumpCtx.Done(): return default: } r.handleAck(session, ack) }

This is purely a hardening suggestion — the current code is functionally correct.

Prompt To Fix With AI

This is a comment left during a code review. Path: s2/append_session.go Line: 428-430 Comment: **No `pumpCtx` cancellation check during drain** The new drain loop — `for ack := range session.acksCh` — does not observe `r.pumpCtx.Done()`. This means that if the pump context is cancelled (e.g. `Close()` is called concurrently) while a handful of ACKs are still being processed, the goroutine will finish draining before exiting, rather than honouring the cancellation signal. In practice this is harmless because: 1. `close(p.acksCh)` fires in the very next deferred statement after `close(p.errorsCh)` (both run on the same goroutine, sequentially), so the drain window is tiny. 2. The buffer cap is 20, so at most 20 `handleAck` calls can occur. 3. `handleAck` itself is safe to call after shutdown — it returns immediately when the inflight queue is empty. If you'd like to be defensive anyway, you could break early on context cancellation: ```go for ack := range session.acksCh { select { case <-r.pumpCtx.Done(): return default: } r.handleAck(session, ack) } ``` This is purely a hardening suggestion — the current code is functionally correct. How can I resolve this? If you propose a fix, please make it concise.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-03-29T09:00:30Z

s2/append_session.go

 		case err, ok := <-session.errorsCh:
 			if !ok {
+				// errorsCh closes before acksCh (defer order), so
+				// drain any buffered ACKs before returning.
+				for ack := range session.acksCh {
+					r.handleAck(session, ack)
+				}
 				return


No regression test for the buffered-ACK drain path

The existing test suite in append_session_retry_test.go doesn't appear to include a test that verifies ACKs buffered in acksCh are not lost when errorsCh closes first. Without a test, it would be easy for a future refactor of the defer order in readAcksLoop to silently reintroduce this bug.

A minimal test could:

Pre-populate acksCh with 1–N *AppendAck values.

Close errorsCh (simulating LIFO defer firing first).

Then close acksCh.

Assert that all submitted entries are resolved with their ACKs (not timed out or retried).

This would give clear coverage of the exact race the PR description documents.

Prompt To Fix With AI

This is a comment left during a code review. Path: s2/append_session.go Line: 424-431 Comment: **No regression test for the buffered-ACK drain path** The existing test suite in `append_session_retry_test.go` doesn't appear to include a test that verifies ACKs buffered in `acksCh` are not lost when `errorsCh` closes first. Without a test, it would be easy for a future refactor of the defer order in `readAcksLoop` to silently reintroduce this bug. A minimal test could: 1. Pre-populate `acksCh` with 1–N `*AppendAck` values. 2. Close `errorsCh` (simulating LIFO defer firing first). 3. Then close `acksCh`. 4. Assert that all submitted entries are resolved with their ACKs (not timed out or retried). This would give clear coverage of the exact race the PR description documents. How can I resolve this? If you propose a fix, please make it concise.

infiniteregrets requested a review from a team as a code owner March 29, 2026 08:56

greptile-apps bot reviewed Mar 29, 2026

View reviewed changes

infiniteregrets merged commit cdcfc58 into main Mar 29, 2026
10 checks passed

infiniteregrets deleted the fix/drain-acks-before-exit branch March 29, 2026 09:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: drain buffered ACKs when errorsCh closes before acksCh#258

fix: drain buffered ACKs when errorsCh closes before acksCh#258
infiniteregrets merged 1 commit intomainfrom
fix/drain-acks-before-exit

infiniteregrets commented Mar 29, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 29, 2026

Uh oh!

greptile-apps bot Mar 29, 2026

Uh oh!

greptile-apps bot Mar 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

infiniteregrets commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

greptile-apps bot commented Mar 29, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

infiniteregrets commented Mar 29, 2026 •

edited

Loading