VIP stays up when Patroni is down/not reachable #336

@wolbernd

Steps to Reproduce

  • Two servers (serverA and serverB), each with patroni and vip-manager installed and configured
  • dcs-type is set to patroni; all other trigger-related options are left at their defaults
  • serverA is currently the leader and holds the VIP
  • Stop patroni on serverA (systemctl stop patroni)

Expected Behaviour

  • serverB becomes the database leader
  • vip-manager on serverB takes the VIP
  • vip-manager on serverA releases the VIP

Current Behaviour (vip-manager 4.0.0)

  • serverB becomes the leader
  • vip-manager on serverB activates the VIP
  • vip-manager on serverA does not release the VIP and keeps trying to reclaim it, even though its DCS backend (patroni) is not reachable
  • The VIP flips back and forth between serverA and serverB because both believe they should hold it, which makes database connections unreliable

Logs

vip-manager on serverA:

Sep 30 13:22:18 serverA vip-manager[803251]: 2025-09-30T13:22:18.668+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 30 13:22:18 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:18 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:18 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:18 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:19 serverA vip-manager[803251]: 2025-09-30T13:22:19.669+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
[...]
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.681+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
Sep 30 13:22:29 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:29 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:29 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:29 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.967+0200        INFO        IP address 10.0.99.64/24 is up, must be up

vip-manager on serverB:

Sep 30 13:21:49 serverB vip-manager[501796]: 2025-09-30T13:21:49.685+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:21:59 serverB vip-manager[501796]: 2025-09-30T13:21:59.685+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:09 serverB vip-manager[501796]: 2025-09-30T13:22:09.686+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200        INFO        IP address 10.0.99.64/24 is down, must be up
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200        INFO        Configuring address 10.0.99.64/24 on enp3s0
Sep 30 13:22:29 serverB vip-manager[501796]: 2025-09-30T13:22:29.603+0200        INFO        IP address 10.0.99.64/24 is up, must be up
Sep 30 13:22:39 serverB vip-manager[501796]: 2025-09-30T13:22:39.604+0200        INFO        IP address 10.0.99.64/24 is up, must be up

Possible Solution

One possible workaround would be to amend the systemd unit of vip-manager so that it starts and stops together with patroni:

[Unit]
Description=Manages Virtual IP for Patroni
After=network-online.target
Before=patroni.service
PartOf=patroni.service

[Service]
Type=simple

ExecStart=/usr/bin/vip-manager --config=/etc/default/vip-manager.yml

Restart=on-failure

[Install]
WantedBy=multi-user.target
WantedBy=patroni.service

However, this workaround only helps when the patroni systemd unit is actually stopped (either by a user or by systemd itself if the main process crashes). It does not trigger if the patroni process hangs for some reason.

A better solution would be for vip-manager to release the VIP whenever the DCS endpoint is not reachable, since the leader role is unlikely to be held by a server on which patroni is not running.
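
To make the proposed behaviour concrete, here is a minimal Go sketch of a fail-safe leader check. It is not vip-manager's actual code: the names leaderState, probeLeader, vipShouldBeUp and newStub are hypothetical, and the sketch assumes that Patroni's GET /leader endpoint answers 200 only on the node that holds the leader lock. The point is that an unreachable endpoint (timeout, connection refused) is treated the same as "not leader", so the VIP gets released.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// leaderState is the outcome of a single probe of the Patroni REST API.
type leaderState int

const (
	stateLeader    leaderState = iota // endpoint answered 200
	stateNotLeader                    // endpoint answered with another status
	stateUnknown                      // endpoint could not be reached at all
)

// probeLeader queries url once and classifies the result.
// Any transport error (timeout, connection refused) yields stateUnknown.
func probeLeader(ctx context.Context, client *http.Client, url string) leaderState {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return stateUnknown
	}
	resp, err := client.Do(req)
	if err != nil {
		return stateUnknown
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		return stateLeader
	}
	return stateNotLeader
}

// vipShouldBeUp fails safe: only a positive answer keeps the VIP.
// stateUnknown releases it, which is the behaviour proposed above.
func vipShouldBeUp(s leaderState) bool {
	return s == stateLeader
}

// newStub starts a local test server that always answers with code.
// It stands in for Patroni so the example runs without a real cluster.
func newStub(code int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(code) }))
}

func main() {
	client := &http.Client{Timeout: time.Second}
	ctx := context.Background()

	// Stub that behaves like a node holding the leader lock.
	leader := newStub(http.StatusOK)
	defer leader.Close()
	fmt.Println("leader:     ", vipShouldBeUp(probeLeader(ctx, client, leader.URL+"/leader")))

	// Stub that behaves like a node that is not the leader.
	replica := newStub(http.StatusServiceUnavailable)
	defer replica.Close()
	fmt.Println("replica:    ", vipShouldBeUp(probeLeader(ctx, client, replica.URL+"/leader")))

	// Stub that has been shut down, like patroni after "systemctl stop patroni".
	stopped := newStub(http.StatusOK)
	stoppedURL := stopped.URL
	stopped.Close()
	fmt.Println("unreachable:", vipShouldBeUp(probeLeader(ctx, client, stoppedURL+"/leader")))
}
```

A real implementation would probably want to tolerate a small number of consecutive failed probes before releasing the VIP, so that a single slow response does not cause a needless release. That threshold is a design choice this sketch leaves out.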
