Skip to content

Comments

fix: update database connection info bugs#40

Merged
jason-lynch merged 3 commits intomainfrom
fix/PLAT-98/update-connection-info
Jun 3, 2025
Merged

fix: update database connection info bugs#40
jason-lynch merged 3 commits intomainfrom
fix/PLAT-98/update-connection-info

Conversation

@jason-lynch
Copy link
Member

This resolves some issues related to updating the instance connection information when the instance IP addresses change after a restart:

  • We were not updating the instance connection info when we would refresh or update an instance.
    • This caused subsequent operations to fail from using stale connection info.
  • We were calling the Docker "service inspect" endpoint too soon after updating the service in WaitForService.
    • The Docker service API can serve stale reads if all manager nodes haven't committed the service spec.
    • WaitForService waits for the service to hit the desired state, so a stale read can cause it to exit before the service has started updating.
    • I've included a detailed comment that links to a relevant portion of the Docker CLI that uses a different implementation with a similar effect. We're able to use a simpler implementation because we don't need general-purpose logic that works for services without health checks.

PLAT-98

@jason-lynch jason-lynch requested review from mmols and tsivaprasad June 2, 2025 13:32
Instance IPs can change when their containers restart, so we need to
update their connection info on each refresh.

PLAT-98
Refactors the `orch.GetInstanceConnectionInfo` and the `ConnectionInfo`
usage to deduplicate code and behavior.

PLAT-98
This adds a short initial in `WaitForService` and adds some additional
checks to verify that the service has hit its desired state. See the
inline comment for more details about this change.

It also adds some tolerance for transient connection errors from the
Patroni API. This lengthens the time to detect a failed instance in
exchange for more durability when the Patroni reload operation causes
a brief unavailibility.

PLAT-98
@jason-lynch jason-lynch force-pushed the fix/PLAT-98/update-connection-info branch from 4a8a63c to d09ddd5 Compare June 2, 2025 20:52
return fmt.Errorf("failed to get cluster status: %w", err)
}
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we reset errCount to 0 after a successful connection attempt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The danger of doing that is that if the service is "flapping", we could end up stuck in an infinite loop. This is meant to be a very small tolerance to handle cases where Patroni becomes unavailable after we call its reload endpoint.

Does that make sense to you too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@jason-lynch jason-lynch requested a review from tsivaprasad June 3, 2025 17:11
@jason-lynch jason-lynch merged commit 3f182ae into main Jun 3, 2025
2 checks passed
@jason-lynch jason-lynch deleted the fix/PLAT-98/update-connection-info branch June 3, 2025 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants