@zzzming
Contributor

@zzzming zzzming commented Nov 24, 2023

Fixes #1138

Motivation

newPartitionProducer() should retry grabCnx(), using the same retry logic that reconnectToBroker applies to its own grabCnx() call.

The Java producer already has this retry logic.

During producer creation, grabCnx() in producer_partition.go first performs a topic lookup. If a network issue occurs after the lookup succeeds but before the command to create the producer is sent, grabCnx() exits without retrying.

This implementation follows the same reconnectToBroker retry logic.

We had frequent failures during initial producer creation under unstable network conditions.

It's tricky to reproduce, but we observe the problem more frequently during the initialization stage of pods on Azure. After implementing the grabCnx() retry in newPartitionProducer(), the problem went away. The error usually shows a connection closed (EOF) by the other side, but based on the logs on the Pulsar side it was not closed by the broker (or Pulsar). It may be a network issue between the producer pod and the Pulsar cluster, which is why a grabCnx() retry is needed.

System configuration

Pulsar version: 2.10

Modifications

newPartitionProducer() now retries grabCnx(), using the same retry logic as the grabCnx() retry in reconnectToBroker.

Verifying this change

  • [x] Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as (please describe tests).

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@zzzming zzzming changed the title [fix] retry producer creation upon error after succssful topic lookip [fix] retry producer creation upon error after succssful topic lookup Nov 24, 2023
@lhotari
Member

lhotari commented Nov 24, 2023

Great work @zzzming! I'll review again after you reply to the question.

}
p.log.WithError(err).Error("Failed to create producer at newPartitionProducer")
errMsg := err.Error()
if strings.Contains(errMsg, errTopicNotFount) {
Member

Suggested change
- if strings.Contains(errMsg, errTopicNotFount) {
+ if errors.Is(err, ErrTopicNotfound) {

Contributor Author

Rebased on the latest and fixed the error evaluation per your review comment.

break
}

if strings.Contains(errMsg, "TopicTerminatedError") {
Member

Suggested change
- if strings.Contains(errMsg, "TopicTerminatedError") {
+ if errors.Is(err, ErrTopicTerminated) {

Contributor Author

fixed.

Member

@nodece nodece left a comment

LGTM. Could you rebase this PR? #1143 exports some error variables, so you need to update your PR.

@zzzming zzzming force-pushed the reconnectAfterLookup branch from c676c7b to a84c97d Compare January 12, 2024 19:00
@zzzming
Contributor Author

zzzming commented Jan 12, 2024

@nodece I fixed it based on your review comments. CI does not seem to be running. Does it require approval to run?

@eolivelli

CI triggered

@nodece
Member

nodece commented Jan 17, 2024

Ping @zzzming

Development

Successfully merging this pull request may close these issues.

retry producer creation upon error after successful topic lookup

4 participants