
Improper callback timing in leaderelection leads to the dual-leader #7350

@carp-chen

Description


Describe the bug

Under extreme concurrency caused by network latency or etcd performance issues, the operations of getting the lock, updating the locally observed record, and updating the lock are not atomic. This can result in temporary state inconsistencies:

  1. A node proactively exits the process due to a lease renewal timeout. After restarting, it fetches the lock information and considers itself the leader, updating its local observed state (triggering onStartLeading).
  2. However, since the lock has already been acquired by another node (a dual-leader scenario, where both nodes consider themselves the leader), updating the lock fails: the PATCH request returns a 409 Conflict. This exception is caught by the acquire method, which then waits for the next retry.
  3. Only on the next retry does the node discover the leader change (triggering onStopLeading).

The biggest difference between the leader election implementations in fabric8 and client-go lies in the timing of callback execution:

Java client (fabric8): The onStartLeading and onStopLeading callbacks are executed immediately when the local observed state is updated. In other words, as long as the local observed state changes, the callbacks are triggered, regardless of whether the lock is actually updated successfully.

Go client (client-go): The onStartLeading callback is executed only after the lock has been successfully updated and leadership has truly been acquired (i.e., after the acquire method is completed). The onStopLeading callback is triggered only when the renew phase times out or fails, right before the election process exits.

This difference means that the Java client may encounter the issue of "failing to update the lock but still considering itself the leader," whereas the Go client, due to its stricter callback timing, does not have this problem. The two orderings are contrasted in the sketch below.
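A minimal, self-contained Java sketch of the two orderings follows. It is illustrative only; `LockRecord`, `LockClient`, and the method names are hypothetical stand-ins, not the actual fabric8 or client-go implementations.

```java
import java.util.Objects;

// Illustrative sketch only: LockRecord and LockClient are hypothetical
// stand-ins, not fabric8 or client-go types.
class CallbackOrderingSketch {

  record LockRecord(String holderIdentity) {}

  interface LockClient {
    LockRecord get();                   // read the current leader record
    boolean update(LockRecord desired); // false models a 409 Conflict
  }

  private final LockClient lock;
  private final String myIdentity;
  private final Runnable onStartLeading;
  private LockRecord observed;

  CallbackOrderingSketch(LockClient lock, String myIdentity, Runnable onStartLeading) {
    this.lock = lock;
    this.myIdentity = myIdentity;
    this.onStartLeading = onStartLeading;
  }

  // fabric8-style ordering: the callback fires as soon as the locally
  // observed record says "I am the leader", before the lock update, so a
  // 409 on the update still leaves this node believing it is the leader.
  boolean fabric8StyleTryAcquireOrRenew(LockRecord desired) {
    LockRecord current = lock.get();                        // 1. read the lock
    if (current != null && Objects.equals(current.holderIdentity(), myIdentity)) {
      observed = current;                                   // 2. update observed state
      onStartLeading.run();                                 //    -> onStartLeading here
    }
    return lock.update(desired);                            // 3. may fail with 409
  }

  // client-go-style ordering: the callback fires only after the lock
  // update has actually succeeded, so there is no dual-leader window.
  boolean clientGoStyleTryAcquireOrRenew(LockRecord desired) {
    if (!lock.update(desired)) {
      return false;                                         // 409 -> not leader, no callback
    }
    observed = desired;
    onStartLeading.run();                                   // leadership is real at this point
    return true;
  }
}
```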

Fabric8 Kubernetes Client version

other (please specify in additional context)

Steps to reproduce

Election parameters (a configuration sketch using these values follows below):
leaseDuration=30
renewDeadline=20
retryPeriod=5
releaseOnCancel=false (enabling it reduces the probability of the issue above but does not completely prevent it)
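For reference, here is a configuration sketch with these parameters, roughly following fabric8's leader-election example. The namespace, lease name, election name, and identity are hypothetical, the durations are assumed to be seconds, and exact DSL entry points may vary slightly across client versions.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

import java.time.Duration;
import java.util.UUID;

public class LeaderElectionRepro {
  public static void main(String[] args) {
    String identity = "holder-" + UUID.randomUUID();         // hypothetical identity
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      client.leaderElector()
          .withConfig(new LeaderElectionConfigBuilder()
              .withName("example-election")                  // hypothetical name
              .withLeaseDuration(Duration.ofSeconds(30))
              .withRenewDeadline(Duration.ofSeconds(20))
              .withRetryPeriod(Duration.ofSeconds(5))
              .withReleaseOnCancel(false)
              .withLock(new LeaseLock("default", "example-lease", identity))
              .withLeaderCallbacks(new LeaderCallbacks(
                  () -> System.out.println("onStartLeading: " + identity),
                  () -> System.out.println("onStopLeading: " + identity),
                  newLeader -> System.out.println("new leader: " + newLeader)))
              .build())
          .build()
          .run();
    }
  }
}
```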

This issue can only be reproduced under extreme concurrency conditions. I believe the above explanation and timeline have made the situation clear.

Expected behavior

The timing of callback execution should follow the approach used in client-go.
The onStartLeading callback should not be triggered before the lock is actually acquired.
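To make the expected invariant concrete, the diagnostic sketch below (hypothetical names; it assumes a coordination.k8s.io/v1 Lease lock named example-lease in the default namespace) checks inside onStartLeading whether the Lease on the API server actually names this instance as the holder. With client-go-style timing this check should always pass; with the current fabric8 timing it can fail in the scenario described above.

```java
import io.fabric8.kubernetes.api.model.coordination.v1.Lease;
import io.fabric8.kubernetes.client.KubernetesClient;

public final class LeadershipCheck {

  private LeadershipCheck() {}

  // Diagnostic check, not a fix: when onStartLeading runs, the Lease on the
  // API server should already record this instance as the holder.
  public static boolean isRealLeader(KubernetesClient client, String identity) {
    Lease lease = client.leases()
        .inNamespace("default")        // hypothetical namespace
        .withName("example-lease")     // hypothetical lease name
        .get();
    return lease != null
        && lease.getSpec() != null
        && identity.equals(lease.getSpec().getHolderIdentity());
  }
}
```

Logging the result of this check from onStartLeading in the reproduction setup makes the dual-leader window visible.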

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Linux

Fabric8 Kubernetes Client Logs

Additional context

Fabric8 Kubernetes Client version: 6.12.1
Kubernetes API Server version: 1.21

Labels

bug, Waiting on feedback (issues that require feedback from user/other community members)
