Commit 4945e7e

WPB-19318: Ensure high-availability of the postgres cluster (#807)

* add pg failover automation with repmgr
* Add a drop-in to guard the primary auto start
* add monitoring to detect split-brain and organize the playbooks
* Update postgresql configuration and documentation
* Update the doc
* fix: typo on repmgr.conf and update playbooks
* debug: test deployment
* skip demo and mini build for now
* fix: set the right dns-resolver
* feat: Enhance PostgreSQL HA cluster with unified config and comprehensive docs
  - Consolidate PostgreSQL configuration into single unified template
  - Fix split-brain detection script (correct 'rouge' to 'rogue' typo)
  - Add detailed HA features documentation with failover validation
  - Include monitoring & event system documentation
  - Add node_id and priority configuration parameters
  - Add official repmgr and PostgreSQL documentation references
  - Improve deployment commands and monitoring checks
  - Enhance split-brain protection with advanced features
* docs: Remove duplicate content from PostgreSQL HA documentation
  - Remove duplicate HA features list from Key Concepts section
  - Remove duplicate monitoring system section from Configuration Options
  - Fix incorrect numbering in monitoring commands (5 → 8)
  - Consolidate monitoring information into single comprehensive section
* docs: Clarify Kubernetes integration architecture
  - PostgreSQL cluster runs independently, not integrated with endpoint-manager
  - Explain postgres-endpoint-manager as separate component that monitors cluster externally
  - Emphasize independent operation of cluster vs endpoint management
* Optimize the doc
* Optimize the doc to have a cleaner order of texts
* Update postgres document with full command paths
* fix the repmgr reconnect time and adjust doc
* update document
* add postgresql-external values file for the CI
* add demo values
* Update with different cluster recovery scenario
* add instructions regarding rogue-detector and unmasking the pg service
* store the postgresql secret as k8s secret
* optimize the password management section
* sync k8s secrets
* refactor the sync command
1 parent f0ac1c6 commit 4945e7e

34 files changed, +2819 -1009 lines changed

.github/workflows/offline.yml

Lines changed: 13 additions & 13 deletions
@@ -1,15 +1,15 @@
 on:
   push:
     branches: [master, develop]
-    tags: [ v* ]
+    tags: [v*]
     paths-ignore:
-      - '*.md'
-      - '**/*.md'
+      - "*.md"
+      - "**/*.md"
   pull_request:
     branches: [master, develop]
     paths-ignore:
-      - '*.md'
-      - '**/*.md'
+      - "*.md"
+      - "**/*.md"
 jobs:
   # Build default profile and create local assets
   build-default:
@@ -167,16 +167,16 @@ jobs:
       - name: Process the demo profile build
         run: ./offline/demo-build/build.sh
         env:
-          GPG_PRIVATE_KEY: '${{ secrets.GPG_PRIVATE_KEY }}'
-          DOCKER_LOGIN: '${{ secrets.DOCKER_LOGIN }}'
+          GPG_PRIVATE_KEY: "${{ secrets.GPG_PRIVATE_KEY }}"
+          DOCKER_LOGIN: "${{ secrets.DOCKER_LOGIN }}"
 
       - name: Copy demo build assets tarball to S3
         run: |
           aws s3 cp offline/demo-build/output/assets.tgz s3://public.wire.com/artifacts/wire-server-deploy-static-demo-${{ steps.upload_name.outputs.UPLOAD_NAME }}.tgz
           echo "Uploaded to: https://s3-$AWS_REGION.amazonaws.com/public.wire.com/artifacts/wire-server-deploy-static-demo-${{ steps.upload_name.outputs.UPLOAD_NAME }}.tgz"
         env:
-          AWS_ACCESS_KEY_ID: '${{ secrets.AWS_ACCESS_KEY_ID }}'
-          AWS_SECRET_ACCESS_KEY: '${{ secrets.AWS_SECRET_ACCESS_KEY }}'
+          AWS_ACCESS_KEY_ID: "${{ secrets.AWS_ACCESS_KEY_ID }}"
+          AWS_SECRET_ACCESS_KEY: "${{ secrets.AWS_SECRET_ACCESS_KEY }}"
           AWS_REGION: "eu-west-1"
 
       - name: Cleanup demo build assets
@@ -208,16 +208,16 @@ jobs:
       - name: Process the min profile build
         run: ./offline/min-build/build.sh
         env:
-          GPG_PRIVATE_KEY: '${{ secrets.GPG_PRIVATE_KEY }}'
-          DOCKER_LOGIN: '${{ secrets.DOCKER_LOGIN }}'
+          GPG_PRIVATE_KEY: "${{ secrets.GPG_PRIVATE_KEY }}"
+          DOCKER_LOGIN: "${{ secrets.DOCKER_LOGIN }}"
 
       - name: Copy min build assets tarball to S3
         run: |
           aws s3 cp offline/min-build/output/assets.tgz s3://public.wire.com/artifacts/wire-server-deploy-static-min-${{ steps.upload_name.outputs.UPLOAD_NAME }}.tgz
           echo "Uploaded to: https://s3-$AWS_REGION.amazonaws.com/public.wire.com/artifacts/wire-server-deploy-static-min-${{ steps.upload_name.outputs.UPLOAD_NAME }}.tgz"
         env:
-          AWS_ACCESS_KEY_ID: '${{ secrets.AWS_ACCESS_KEY_ID }}'
-          AWS_SECRET_ACCESS_KEY: '${{ secrets.AWS_SECRET_ACCESS_KEY }}'
+          AWS_ACCESS_KEY_ID: "${{ secrets.AWS_ACCESS_KEY_ID }}"
+          AWS_SECRET_ACCESS_KEY: "${{ secrets.AWS_SECRET_ACCESS_KEY }}"
           AWS_REGION: "eu-west-1"
 
       - name: Cleanup min build assets

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ values-init-done
 
 # Envrc local overrides
 .envrc.local
-
+.vscode
 # Nix-created result symlinks
 result
 result-*

ansible/inventory/offline/99-static

Lines changed: 1 addition & 2 deletions
@@ -85,8 +85,7 @@
 postgresql_network_interface = enp1s0
 wire_dbname = wire-server
 wire_user = wire-server
-# if not defined, a random password will be generated
-# wire_pass = verysecurepassword
+wire_namespace = default # Kubernetes namespace for secret storage
 
 [elasticsearch:vars]
 # elasticsearch_network_interface = enp1s0

ansible/inventory/offline/group_vars/postgresql/postgresql.yml

Lines changed: 47 additions & 4 deletions
@@ -3,10 +3,44 @@ postgresql_version: 17
 postgresql_data_dir: /var/lib/postgresql/{{ postgresql_version }}/main
 postgresql_conf_dir: /etc/postgresql/{{ postgresql_version }}/main
 
-# Replication services configuration
-repsvc_user: repsvc
-repsvc_password: "securepassword"
-repsvc_database: repsvc_db
+# repmgr HA configuration
+repmgr_user: repmgr
+repmgr_password: "securepassword"
+repmgr_database: repmgr
+
+# Node configuration for repmgr
+repmgr_node_config:
+  postgresql1: # Maps to postgresql_rw group
+    node_id: 1
+    priority: 150
+    role: primary
+  postgresql2: # Maps to first postgresql_ro
+    node_id: 2
+    priority: 100
+    role: standby
+  postgresql3: # Maps to second postgresql_ro
+    node_id: 3
+    priority: 50
+    role: standby
+
+# repmgr settings
+# repmgrd monitoring and reconnection configuration
+# Reference: https://repmgr.org/docs/current/repmgrd-basic-configuration.html
+#
+# monitor_interval_secs: Interval in seconds between monitoring checks
+#   - Default: 2 seconds
+#   - Controls how frequently repmgr monitors the primary server status
+#
+# reconnect_attempts: Maximum number of reconnection attempts
+#   - Default: 6 attempts
+#   - Number of times repmgr will attempt to reconnect to a failed primary
+#
+# reconnect_interval: Interval in seconds between reconnection attempts
+#   - Default: 10 seconds
+#   - Time to wait between each reconnection attempt
+monitor_interval_secs: 2
+reconnect_attempts: 6
+reconnect_interval: 5
 
 # Use local packages instead of repository
 postgresql_use_repository: false # Set to true to use local packages from urls
@@ -35,3 +69,12 @@ postgresql_pkgs:
   - name: python3-psycopg2
     url: "{{ binaries_url }}/python3-psycopg2_2.9.10-1.pgdg22.04+1_amd64.deb"
     checksum: "sha256:cc2f749e3af292a67e012edeb4aa5d284f57f2d66a9a09fe5b81e5ffda73cab4"
+  - name: repmgr-common
+    url: "{{ binaries_url }}/repmgr-common_5.5.0+debpgdg-1.pgdg22.04+1_all.deb"
+    checksum: "sha256:34c660c66a9710fd4f20a66cc932741d3399dbba7e7ae4b67468b3e18f65f61c"
+  - name: repmgr
+    url: "{{ binaries_url }}/repmgr_5.5.0+debpgdg-1.pgdg22.04+1_all.deb"
+    checksum: "sha256:20c280811e758106335df1eb9954b61aa552823d3129f1e38c488fbd5efe0567"
+  - name: postgresql-17-repmgr
+    url: "{{ binaries_url }}/postgresql-17-repmgr_5.5.0+debpgdg-1.pgdg22.04+1_amd64.deb"
+    checksum: "sha256:520d6ed4d540a2bb9174ac8276f8cb686c0268c13cccb89b28a9cdbd12049df8"
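
Reviewer note on the repmgrd settings in this file: the commit keeps `reconnect_attempts` at 6 but sets `reconnect_interval` to 5, half the default of 10 quoted in the comments. The sketch below works out roughly how long repmgrd keeps retrying an unreachable primary under each setting. The formula (attempts times interval, plus one monitoring interval) is a simplifying assumption that ignores connection timeouts and the election itself, so treat the results as estimates only.

```shell
#!/bin/sh
# Values taken from ansible/inventory/offline/group_vars/postgresql/postgresql.yml.
monitor_interval_secs=2
reconnect_attempts=6
reconnect_interval=5

# Approximate time spent retrying the primary before failover handling starts.
retry_window=$((reconnect_attempts * reconnect_interval))

# Same calculation with the default interval of 10 seconds quoted in the comments.
default_window=$((reconnect_attempts * 10))

# Rough estimate (assumption): one monitoring interval may pass before
# the outage is first noticed.
detection_estimate=$((monitor_interval_secs + retry_window))

echo "retry window with this commit: ${retry_window}s"
echo "retry window with default interval: ${default_window}s"
echo "rough time until failover handling starts: ${detection_estimate}s"
```

Halving the interval shortens the retry window from about 60 to about 30 seconds, which speeds up failover but also makes it more sensitive to short network interruptions.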

ansible/postgresql-deploy.yml

Lines changed: 12 additions & 0 deletions
@@ -1,3 +1,9 @@
+- name: Clean previous deployment state
+  import_playbook: postgresql-playbooks/clean_existing_setup.yml
+  tags:
+    - postgresql
+    - cleanup
+
 - name: Install PostgreSQL packages
   import_playbook: postgresql-playbooks/postgresql-install.yml
   tags:
@@ -27,3 +33,9 @@
   tags:
     - postgresql
     - wire-setup
+
+- name: Deploy cluster monitoring
+  import_playbook: postgresql-playbooks/postgresql-monitoring.yml
+  tags:
+    - postgresql
+    - monitoring
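
Reviewer note: because each `import_playbook` entry carries its own tag, the two new stages can be selected individually with `--tags`. The sketch below only builds and prints the commands. The inventory path is an assumption taken from the file touched in this commit; a real deployment may use a different inventory file.

```shell
#!/bin/sh
# Assumption: inventory path; replace with the inventory used in your deployment.
inventory="ansible/inventory/offline/99-static"
playbook="ansible/postgresql-deploy.yml"

# Run only the new cleanup stage (tag "cleanup").
cleanup_cmd="ansible-playbook -i $inventory $playbook --tags cleanup"

# Run only the new monitoring stage (tag "monitoring").
monitoring_cmd="ansible-playbook -i $inventory $playbook --tags monitoring"

# Print rather than execute: on an existing deployment the cleanup stage
# stops PostgreSQL and drops the repmgr database.
echo "$cleanup_cmd"
echo "$monitoring_cmd"
```

Running the whole playbook with `--tags postgresql` would select every stage, since all imports share that tag.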
ansible/postgresql-playbooks/clean_existing_setup.yml

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+- name: Clean previous deployment state
+  hosts: "{{ target_nodes | default('postgresql_rw,postgresql_ro') }}"
+  become: yes
+  tasks:
+    # ===== DETECT INSTALLATION TYPE =====
+    - name: Check if PostgreSQL is installed
+      stat:
+        path: "/usr/bin/psql"
+      register: postgresql_installed
+
+    - name: Check if PostgreSQL data directory exists
+      stat:
+        path: "/var/lib/postgresql/{{ postgresql_version }}/main/PG_VERSION"
+      register: postgresql_data_exists
+
+    - name: Check if repmgr configuration exists
+      stat:
+        path: "/etc/repmgr/{{ postgresql_version }}-main/repmgr.conf"
+      register: repmgr_config_exists
+
+    - name: Determine if this is a fresh installation
+      set_fact:
+        is_fresh_install: >-
+          {{
+            not postgresql_installed.stat.exists or
+            not postgresql_data_exists.stat.exists or
+            not repmgr_config_exists.stat.exists
+          }}
+
+    - name: Display installation type
+      debug:
+        msg: |
+          {{ inventory_hostname }}: {{ 'Fresh installation detected - skipping most cleanup tasks' if is_fresh_install else 'Existing deployment detected - performing full cleanup' }}
+
+    # ===== FRESH INSTALLATION TASKS (MINIMAL) =====
+    - name: Handle fresh installation
+      block:
+        - name: Ensure basic directories exist for fresh install
+          file:
+            path: "{{ item }}"
+            state: directory
+            owner: postgres
+            group: postgres
+            mode: "0755"
+          loop:
+            - "/etc/repmgr/{{ postgresql_version }}-main"
+            - "/opt/repmgr/scripts"
+            - "/var/log/postgresql"
+          when: postgresql_installed.stat.exists
+
+        - name: Skip cleanup message for fresh install
+          debug:
+            msg: "Fresh installation - cleanup tasks skipped"
+
+      when: is_fresh_install
+
+    # ===== EXISTING DEPLOYMENT CLEANUP =====
+    - name: Handle existing deployment cleanup
+      block:
+        - name: Check if PostgreSQL service exists
+          systemd:
+            name: "postgresql@{{ postgresql_version }}-main.service"
+          register: postgresql_service_exists
+          failed_when: false
+
+        - name: Check if repmgr database exists
+          ansible.builtin.shell: |
+            sudo -u postgres psql -t -A -c "SELECT COUNT(*) FROM pg_database WHERE datname = '{{ repmgr_database }}'" postgres 2>/dev/null || echo "0"
+          register: repmgr_db_exists
+          changed_when: false
+          failed_when: false
+          when:
+            - postgresql_installed.stat.exists
+            - postgresql_service_exists.status is defined
+            - postgresql_service_exists.status.LoadState != "not-found"
+
+        - name: Drop repmgr database completely (if exists)
+          ansible.builtin.shell: |
+            sudo -u postgres psql -c "DROP DATABASE IF EXISTS {{ repmgr_database }};" postgres 2>/dev/null || true
+          failed_when: false
+          when:
+            - postgresql_installed.stat.exists
+            - repmgr_db_exists is defined
+            - repmgr_db_exists.stdout | default('0') | trim != '0'
+
+        - name: Stop any existing split-brain monitoring timer
+          systemd:
+            name: detect-rogue-primary.timer
+            state: stopped
+          failed_when: false
+
+        - name: Stop any existing split-brain monitoring service
+          systemd:
+            name: detect-rogue-primary.service
+            state: stopped
+          failed_when: false
+
+        - name: Stop any existing repmgrd service
+          systemd:
+            name: "repmgrd@{{ postgresql_version }}-main.service"
+            state: stopped
+          failed_when: false
+
+        - name: Unmask PostgreSQL services from previous deployments
+          systemd:
+            name: "postgresql@{{ postgresql_version }}-main.service"
+            masked: no
+          failed_when: false
+
+        - name: Stop PostgreSQL service for clean state
+          systemd:
+            name: "postgresql@{{ postgresql_version }}-main.service"
+            state: stopped
+          failed_when: false
+
+        - name: Remove repmgr configuration files, scripts, and systemd units
+          file:
+            path: "{{ item }}"
+            state: absent
+          failed_when: false
+          loop:
+            - "/etc/repmgr/{{ postgresql_version }}-main/repmgr.conf"
+            - "/etc/repmgr/{{ postgresql_version }}"
+            - "/etc/repmgr/{{ postgresql_version }}-main"
+            - "/var/lib/postgresql/{{ postgresql_version }}/main/recovery.conf"
+            - "/var/lib/postgresql/{{ postgresql_version }}/main/standby.signal"
+            - "/opt/repmgr/scripts"
+            - "/usr/local/bin/repmgr"
+            - "/usr/local/bin/repmgrd"
+            - "/usr/local/bin/detect_rogue_primary.sh"
+            - "/etc/systemd/system/detect-rogue-primary.service"
+            - "/etc/systemd/system/detect-rogue-primary.timer"
+            - "/etc/systemd/system/[email protected]"
+            - "/etc/systemd/system/repmgrd@{{ postgresql_version }}-main.service"
+            - "/etc/systemd/system/repmgrd@{{ postgresql_version }}.service"
+            - "/etc/sudoers.d/postgres-postgresql-management"
+            - "/etc/sudoers.d/postgres-postgresql-service"
+
+        - name: Find rogue split-brain service files
+          find:
+            paths: /etc/systemd/system
+            patterns: "detect-rogue-primary.service*"
+          register: rogue_service_files
+
+        - name: Remove rogue split-brain service files
+          file:
+            path: "{{ item.path }}"
+            state: absent
+          loop: "{{ rogue_service_files.files }}"
+          when: rogue_service_files.matched > 0
+
+      when: not is_fresh_install
+
+    # ===== COMMON TASKS FOR ALL INSTALLATIONS =====
+    - name: Reload systemd daemon after cleanup
+      systemd:
+        daemon_reload: yes
+      failed_when: false
+
+    - name: Display cleanup status
+      debug:
+        msg: |
+          Cleanup completed for {{ inventory_hostname }}:
+          - Installation type: {{ 'Fresh' if is_fresh_install else 'Existing' }}
+          - PostgreSQL installed: {{ postgresql_installed.stat.exists }}
+          - PostgreSQL data exists: {{ postgresql_data_exists.stat.exists }}
+          - repmgr config exists: {{ repmgr_config_exists.stat.exists }}
+          {% if is_fresh_install %}
+          - Action taken: Minimal setup (directories created)
+          {% else %}
+          - Action taken: Full cleanup (services stopped, configs removed)
+          {% endif %}
+          - Ready for deployment: ✅
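
Reviewer note: the playbook treats a node as a fresh installation when any one of three paths is missing, and only otherwise runs the destructive cleanup block. The sketch below mirrors that rule as a local pre-flight check, which may help predict which branch a node will take before running the playbook. The function name and the optional `root` prefix are hypothetical and exist only for illustration and testing; they are not part of the commit.

```shell
#!/bin/sh
# Returns 0 (true) when the node would count as a fresh installation:
# any one of the three paths checked by the playbook is missing.
# $1 = PostgreSQL major version, $2 = optional path prefix (hypothetical, for testing).
is_fresh_install() {
  pg_version="$1"
  root="${2:-}"
  [ -e "${root}/usr/bin/psql" ] || return 0
  [ -e "${root}/var/lib/postgresql/${pg_version}/main/PG_VERSION" ] || return 0
  [ -e "${root}/etc/repmgr/${pg_version}-main/repmgr.conf" ] || return 0
  return 1
}

if is_fresh_install 17; then
  echo "fresh installation: cleanup block would be skipped"
else
  echo "existing deployment: full cleanup block would run"
fi
```

The PostgreSQL data directory is normally readable only by the `postgres` user, so the check should be run as root (the playbook itself uses `become: yes`); otherwise an existing node may be misreported as fresh.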
