Fix partition update failure causing existing producers to close #1437
Conversation
Pull Request Overview
This PR fixes a critical bug where partition update failures would incorrectly close all existing producers, making the entire producer unavailable. The fix refactors producer list management to use only producersPtr with atomic operations, and ensures that during partition updates, new producers are only added after successful creation.
- Replaced dual-variable producer management (`producers` and `producersPtr`) with a single atomic pointer approach using `producersPtr`
- Modified error handling to close only newly created producers on failure, preserving existing working producers
- Added comprehensive test coverage for the partition update failure scenario
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pulsar/producer_impl.go | Core fix: removed producers field, added helper methods getProducer() and getProducers(), updated error handling to only close new producers on failure |
| pulsar/producer_test.go | Updated existing tests to use new getProducer() method, added TestPartitionUpdateFailed to verify fix |
| pulsar/message_chunking_test.go | Updated test to use new getProducer() method |
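As a rough illustration of the accessors mentioned in the table, here is a minimal sketch of atomic-pointer helper methods (hypothetical names and field layout, not the repository's exact code; assumes Go 1.19+ for `atomic.Pointer`):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Producer stands in for pulsar.Producer in this sketch.
type Producer interface{ Topic() string }

// producer mirrors the idea of a partitioned producer keeping its
// per-partition producers behind a single atomic pointer.
type producer struct {
	producersPtr atomic.Pointer[[]Producer] // points to an immutable slice
}

// getProducers returns a consistent snapshot of the producer list: one
// atomic load yields a stable (pointer, length) pair.
func (p *producer) getProducers() []Producer {
	ptr := p.producersPtr.Load()
	if ptr == nil {
		return nil
	}
	return *ptr
}

// getProducer returns the producer for one partition, or nil if the
// partition index is out of range.
func (p *producer) getProducer(partition int) Producer {
	producers := p.getProducers()
	if partition < 0 || partition >= len(producers) {
		return nil
	}
	return producers[partition]
}

func main() {
	p := &producer{}
	fmt.Println(p.getProducer(0)) // <nil>: no producer list published yet
}
```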
BewareMyPower left a comment:
I suggest not refactoring the producers-related code. Please just address the issue that existing producers are closed on such a failure.
Actually, I think the existing design of #286 is not safe:

```go
producers := *(*[]Producer)(atomic.LoadPointer(&p.producersPtr))
if partition >= len(producers) {
```

`producers` refers to the memory of the slice that `p.producersPtr` points to, which is the pair of the pointer and the length. However, when `p.producers = make([]Producer, newNumPartitions)` is called, the slice content could be modified, so the following race condition could happen:
| time | goroutine | action |
|---|---|---|
| t1 | A | atomic.LoadPointer(&p.producersPtr) loads the pointer p0 |
| t2 | B | modify the length from 5 to 10 |
| t3 | A | dereference the slice and read pointer p0 and new length 10 |
| t4 | B | modify the pointer from p0 to p1 |
The key point is that the modification of a slice (pointer and length) is not atomic.
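To make this concrete, here is a minimal standalone reproduction of that pattern (an `int` slice stands in for `[]Producer`; this is not the library's code). Running it with `go run -race` reports the race between the header dereference and the reassignment:

```go
package main

import (
	"sync"
	"sync/atomic"
	"unsafe"
)

func main() {
	producers := make([]int, 5)
	// producersPtr points at the slice *variable*, not at an immutable copy.
	producersPtr := unsafe.Pointer(&producers)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // goroutine A: reader
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			// The pointer load is atomic, but dereferencing it copies the
			// slice header (pointer + length) non-atomically.
			s := *(*[]int)(atomic.LoadPointer(&producersPtr))
			_ = len(s)
		}
	}()
	go func() { // goroutine B: writer
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			// Reassigning the variable rewrites the header that A is reading.
			producers = make([]int, 10)
		}
	}()
	wg.Wait()
}
```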
BewareMyPower left a comment:
After this change, it seems that `p.Lock()` in `internalCreatePartitionsProducers` becomes unnecessary, because it was added to guarantee safe access to the `producers` slice. With an atomic value, the slice pointed to by `producers` is always immutable.

However, I suggest just keeping `producers` as a slice and acquiring a read lock in `getPartition`. Removing the lock in this case looks like a premature optimization. Modifications to `producers` are rare; most of the time `internalCreatePartitionsProducers` returns quickly because `oldNumPartitions == newNumPartitions`.
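A minimal sketch of this suggested alternative (hypothetical names; `Producer` again stands in for the real interface): guard the slice with a `sync.RWMutex` so lookups take only the read lock, while the rare partition update takes the write lock.

```go
package main

import (
	"fmt"
	"sync"
)

type Producer interface{ Topic() string }

type producer struct {
	mu        sync.RWMutex
	producers []Producer
}

// getPartitionProducer takes only a read lock, so concurrent lookups do not
// block each other.
func (p *producer) getPartitionProducer(partition int) Producer {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if partition < 0 || partition >= len(p.producers) {
		return nil
	}
	return p.producers[partition]
}

// setProducers is the rare write path, e.g. after a partition update.
func (p *producer) setProducers(producers []Producer) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.producers = producers
}

func main() {
	p := &producer{}
	p.setProducers(make([]Producer, 3))
	fmt.Println(p.getPartitionProducer(1)) // <nil>: slot not yet populated
}
```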
I wrote an example locally:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	slice := make([]int, 10)
	for i := 0; i < len(slice); i++ {
		slice[i] = rand.Intn(100)
	}
	wg := sync.WaitGroup{}
	wg.Add(2)
	lock := sync.Mutex{}
	go func() {
		start := time.Now()
		sum := int64(0)
		const n = 10000000
		for i := 0; i < n; i++ {
			partition := rand.Int()
			lock.Lock()
			sum = sum + int64(slice[partition%len(slice)])
			lock.Unlock()
		}
		finished := time.Since(start)
		fmt.Println(sum, " elapsed time:", finished, " avg: ", finished.Nanoseconds()/n, " ns")
		wg.Done()
	}()
	go func() {
		for i := 0; i < 100; i++ {
			lock.Lock()
			for j := 0; j < 10; j++ {
				slice = append(slice, rand.Intn(100))
			}
			lock.Unlock()
			time.Sleep(1 * time.Millisecond)
		}
		wg.Done()
	}()
	wg.Wait()
}
```
The example outputs show that the lock contention is not as scary as many people think. I'm wondering how much can be improved by using such an atomic approach?
Without the lock (the same example with the locking lines commented out):

```go
wg.Add(1)
//wg.Add(2)
//lock := sync.Mutex{}
go func() {
	start := time.Now()
	sum := int64(0)
	const n = 10000000
	for i := 0; i < n; i++ {
		partition := rand.Int()
		//lock.Lock()
		sum = sum + int64(slice[partition%len(slice)])
		//lock.Unlock()
	}
	finished := time.Since(start)
	fmt.Println(sum, " elapsed time:", finished, " avg: ", finished.Nanoseconds()/n, " ns")
	wg.Done()
}()
```
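For comparison with the atomic approach asked about above, here is a sketch of the same benchmark rewritten around an atomic pointer to an immutable slice (not from the PR discussion; assumes Go 1.19+ for `atomic.Pointer`):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	initial := make([]int, 10)
	for i := range initial {
		initial[i] = rand.Intn(100)
	}
	var slicePtr atomic.Pointer[[]int]
	slicePtr.Store(&initial)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // reader: one atomic load per access yields a stable header
		defer wg.Done()
		start := time.Now()
		sum := int64(0)
		const n = 10000000
		for i := 0; i < n; i++ {
			partition := rand.Int()
			s := *slicePtr.Load()
			sum += int64(s[partition%len(s)])
		}
		finished := time.Since(start)
		fmt.Println(sum, " elapsed time:", finished, " avg: ", finished.Nanoseconds()/n, " ns")
	}()
	go func() { // writer: copy, extend, then publish a new immutable slice
		defer wg.Done()
		for i := 0; i < 100; i++ {
			old := *slicePtr.Load()
			next := make([]int, len(old), len(old)+10)
			copy(next, old)
			for j := 0; j < 10; j++ {
				next = append(next, rand.Intn(100))
			}
			slicePtr.Store(&next)
			time.Sleep(1 * time.Millisecond)
		}
	}()
	wg.Wait()
}
```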
Yes, we don't need the lock here in this PR. As we discussed privately, we've decided to remove it. I agree that removing the lock isn't necessary in this case; however, the immutable-view approach used in this PR is just as straightforward (if we remove the producer lock) and doesn't introduce any safety concerns. We can add the lock later if we really need one.
Motivation
When a partition update is triggered and the creation of one of the new producers fails, all existing producers are currently closed here:
pulsar-client-go/pulsar/producer_impl.go
Lines 267 to 271 in 148c10c
Modifications

Currently, two variables, `producers` and `producersPtr`, are used to manage the producer list inside `producer_impl`:

pulsar-client-go/pulsar/producer_impl.go
Lines 55 to 56 in 148c10c

Both are accessed across multiple methods inside `producer_impl`, which is misleading and makes `producersPtr` ineffective. This PR addresses the issue by:

- Using `producers` as an atomic value to access the producer list inside `producer_impl`. Only `internalCreatePartitionsProducers` can modify `producers`, which protects the producer list.
- Not modifying `producers` until all new producers are successfully created (see the sketch below).
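A hedged sketch of the resulting update path (hypothetical names such as `newPartitionProducer` and `updatePartitions`; not the repository's exact code): new producers are built into a local slice first, only those are closed on failure, and the published list is swapped only after every creation succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Stand-ins for the real pulsar types; this is a sketch, not the PR's code.
type Producer interface{ Close() }

type stubProducer struct{ partition int }

func (s *stubProducer) Close() { fmt.Println("closed new producer", s.partition) }

type producer struct {
	producersPtr atomic.Pointer[[]Producer]
}

// newPartitionProducer is a hypothetical constructor that fails for the demo.
func newPartitionProducer(partition int) (Producer, error) {
	if partition >= 7 {
		return nil, errors.New("broker unreachable")
	}
	return &stubProducer{partition: partition}, nil
}

// updatePartitions builds new producers into a local slice first, closes only
// those on failure, and publishes the new list only after full success.
func (p *producer) updatePartitions(newNumPartitions int) error {
	var old []Producer
	if ptr := p.producersPtr.Load(); ptr != nil {
		old = *ptr
	}
	if len(old) >= newNumPartitions {
		return nil // common fast path: partition count unchanged
	}
	created := make([]Producer, 0, newNumPartitions-len(old))
	for i := len(old); i < newNumPartitions; i++ {
		prod, err := newPartitionProducer(i)
		if err != nil {
			for _, c := range created {
				c.Close() // close only the producers created in this update
			}
			return err // existing producers remain untouched and usable
		}
		created = append(created, prod)
	}
	next := append(append(make([]Producer, 0, newNumPartitions), old...), created...)
	p.producersPtr.Store(&next) // publish only after every creation succeeded
	return nil
}

func main() {
	p := &producer{}
	fmt.Println(p.updatePartitions(5))  // <nil>: partitions 0..4 created
	fmt.Println(p.updatePartitions(10)) // error: producers 5 and 6 closed, 0..4 kept
}
```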
Verifying this change

This change added tests.
Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes.

Documentation