diff --git a/docs/manual-galera-troubleshooting.md b/docs/manual-galera-troubleshooting.md
new file mode 100644
index 00000000..1f564150
--- /dev/null
+++ b/docs/manual-galera-troubleshooting.md
@@ -0,0 +1,394 @@
# Manual: MariaDB Galera Split-Brain Detection and Resolution

This guide provides step-by-step manual procedures for detecting and resolving MariaDB Galera cluster split-brain scenarios in OpenShift environments. These procedures mirror the automated processes implemented in the monitoring scripts.

## Prerequisites

### 1. OpenShift Access

1. Log into the OpenShift Console at: https://console.apps.silver.devops.gov.bc.ca
2. Click your username in the top right corner
3. Select **"Copy login command"**
4. Click **"Display Token"**
5. Copy the `oc login` command (it will look like):

   ```bash
   oc login --token=sha256~<token> --server=https://api.silver.devops.gov.bc.ca:6443
   ```

### 2. Terminal Setup

1. Open Windows Command Prompt or PowerShell
2. Paste and execute the login command
3. Set your project namespace:

   ```bash
   oc project 950003-prod
   ```

## Quick Health Check

### Check All Running Pods

```bash
# Get all running pods in the namespace (test access)
oc get pods --field-selector=status.phase=Running

# Check MariaDB Galera pods specifically
oc get pods -l "app.kubernetes.io/name=mariadb-galera" --field-selector=status.phase=Running

# Check other critical services
oc get pods -l "deployment=php" --field-selector=status.phase=Running
oc get pods -l "app=redis-proxy" --field-selector=status.phase=Running
```

## Detailed Galera Cluster Health Assessment

### 1. Identify Galera Pods

```bash
# Get Galera pod names
oc get pods -l "app.kubernetes.io/name=mariadb-galera" --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'

# Get detailed pod information
oc get pods -l "app.kubernetes.io/name=mariadb-galera" -o wide
```

### 2. Check Galera Cluster Status

For each Galera pod, check the cluster status. Replace `<pod-name>` with actual pod names from step 1:

```bash
# Get environment variables (needed for MySQL access)
# For PowerShell users:
$env:MARIADB_USER = oc get secret mariadb-galera -o jsonpath='{.data.mariadb-username}' | %{[System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($_))}
$env:MARIADB_PASSWORD = oc get secret mariadb-galera -o jsonpath='{.data.mariadb-password}' | %{[System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($_))}

# For Command Prompt users (simpler approach):
# You'll need to decode the base64 values manually or use the actual username/password directly

# Note: the "$MARIADB_USER" syntax below is Bash-style; in PowerShell,
# reference the variables as $env:MARIADB_USER and $env:MARIADB_PASSWORD instead

# Check if MySQL is responsive on each pod
oc exec -it <pod-name> -- mysqladmin -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" ping

# Get detailed Galera status for each pod
oc exec -it <pod-name> -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_cluster_state_uuid');"
```

### 3. Example Health Check for Multiple Pods

Run these commands for each pod (replace `mariadb-galera-0`, `mariadb-galera-1`, etc. with your actual pod names):

```bash
# Pod 0
echo "=== Checking mariadb-galera-0 ==="
oc exec -it mariadb-galera-0 -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_cluster_state_uuid');"

# Pod 1
echo "=== Checking mariadb-galera-1 ==="
oc exec -it mariadb-galera-1 -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_cluster_state_uuid');"

# Pod 2
echo "=== Checking mariadb-galera-2 ==="
oc exec -it mariadb-galera-2 -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_cluster_state_uuid');"

# Continue for all pods...
```

## Split-Brain Detection

### Healthy Cluster Indicators

A healthy cluster should show:

- **wsrep_cluster_status**: `Primary`
- **wsrep_local_state_comment**: `Synced`
- **wsrep_cluster_size**: Same number across all pods (e.g., `5`)
- **wsrep_cluster_state_uuid**: Same UUID across all pods

### Split-Brain Indicators

🚨 **Split-brain detected if you see:**

- Different `wsrep_cluster_state_uuid` values across pods
- Different `wsrep_cluster_size` values across pods
- Some pods showing `wsrep_cluster_status` as `non-Primary`
- Some pods showing `wsrep_local_state_comment` as `Disconnected`

### Example Split-Brain Output

```
Pod 1: wsrep_cluster_state_uuid = 12345-abcd, wsrep_cluster_size = 2
Pod 2: wsrep_cluster_state_uuid = 12345-abcd, wsrep_cluster_size = 2
Pod 3: wsrep_cluster_state_uuid = 67890-efgh, wsrep_cluster_size = 3
Pod 4: wsrep_cluster_state_uuid = 67890-efgh, wsrep_cluster_size = 3
Pod 5: wsrep_cluster_state_uuid = 67890-efgh, wsrep_cluster_size = 3
```

This output shows two separate cluster partitions with different UUIDs.

## Manual Split-Brain Resolution

### ⚠️ Important Warnings

- **ALWAYS take a database backup before attempting resolution**
- **Coordinate with your team** - ensure no other maintenance is happening
- **Document the issue** - note which pods were affected and the symptoms
- **Monitor closely** - watch the process and be ready to roll back

### Step 1: Create a Database Backup

```bash
# Find a healthy pod for backup (one that shows 'Primary' and 'Synced')
# Use plain `oc exec` (no -t) when redirecting output, so TTY control characters
# do not corrupt the dump file
oc exec mariadb-galera-0 -- mysqldump -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" --all-databases --single-transaction > galera-backup-$(date +%Y%m%d-%H%M).sql

# Verify backup was created
ls -la galera-backup-*.sql
```

### Step 2: Identify the StatefulSet

```bash
# Get StatefulSet information
oc get statefulset -l "app.kubernetes.io/name=mariadb-galera"

# Get current replica count
oc get statefulset mariadb-galera -o jsonpath='{.spec.replicas}'
```

### Step 3: Scale Down to 1 Replica (Establish Primary)

```bash
# Get original replica count first
ORIGINAL_REPLICAS=$(oc get statefulset mariadb-galera -o jsonpath='{.spec.replicas}')
echo "Original replica count: $ORIGINAL_REPLICAS"

# Scale down to 1 replica
# (StatefulSets remove the highest ordinals first, so mariadb-galera-0 remains)
oc scale statefulset mariadb-galera --replicas=1

# Wait for scaling to complete (this may take a few minutes)
oc get pods -l "app.kubernetes.io/name=mariadb-galera" -w
```

### Step 4: Verify Single Node is Healthy

```bash
# Wait for the remaining pod to be ready
oc wait --for=condition=ready pod -l "app.kubernetes.io/name=mariadb-galera" --timeout=300s

# Check the remaining pod's status
REMAINING_POD=$(oc get pods -l "app.kubernetes.io/name=mariadb-galera" --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
echo "Remaining pod: $REMAINING_POD"

# Verify it's healthy and in Primary state
oc exec -it $REMAINING_POD -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size');"
```

Expected output should show:

- wsrep_cluster_status: `Primary`
- wsrep_local_state_comment: `Synced`
- wsrep_cluster_size: `1`

### Step 5: Scale Back Up to Original Replica Count

```bash
# Scale back up to original size
oc scale statefulset mariadb-galera --replicas=$ORIGINAL_REPLICAS

# Monitor the scaling process
oc get pods -l "app.kubernetes.io/name=mariadb-galera" -w
```

### Step 6: Verify Cluster Recovery

Wait for all pods to be Running, then check cluster health:

```bash
# Wait for all pods to be ready
oc wait --for=condition=ready pod -l "app.kubernetes.io/name=mariadb-galera" --timeout=600s

# Get all pod names
GALERA_PODS=$(oc get pods -l "app.kubernetes.io/name=mariadb-galera" --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}')

# Check each pod's status
for pod in $GALERA_PODS; do
  echo "=== Checking $pod ==="
  oc exec -it $pod -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size', 'wsrep_cluster_state_uuid');"
  echo ""
done
```

### Step 7: Validate Full Recovery

All pods should now show:

- **wsrep_cluster_status**: `Primary`
- **wsrep_local_state_comment**: `Synced`
- **wsrep_cluster_size**: Same value (your original replica count)
- **wsrep_cluster_state_uuid**: Same UUID across all pods

## Pod Log Analysis

### Check for Error Patterns

```bash
# Check recent logs for MariaDB errors
oc logs mariadb-galera-0 --tail=50
oc logs mariadb-galera-1 --tail=50
# oc logs mariadb-galera-2, 3, 4, 5, etc...

# Check for PHP application errors
# First get the pod names
oc get pods -l "deployment=php" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}"

# Then check logs for each pod individually (replace <pod-name> with actual names from above)
oc logs <pod-name> --tail=50

# Check Redis Proxy errors
# First get the pod names
oc get pods -l "app=redis-proxy" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}"

# Then check logs for each pod individually (replace <pod-name> with actual names from above)
oc logs <pod-name> --tail=50 | findstr /i "err:"
```

### Restart Problematic Pods

If you find pods with errors:

```bash
# Restart a specific pod by deleting it (it will be recreated)
oc delete pod <pod-name>

# Example: Restart a PHP pod with errors
oc delete pod php-deployment-12345-abcde

# Monitor the restart
oc get pods -w
```

### PowerShell Log Analysis

For more complex log analysis on Windows, you can use PowerShell:

```powershell
# PowerShell method to check multiple PHP pods for errors
$phpPods = (oc get pods -l "deployment=php" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}") -split " "
foreach ($pod in $phpPods) {
  Write-Host "=== Checking $pod ==="
  oc logs $pod --tail=50 | Select-String -Pattern "error|critical" -CaseSensitive:$false
}

# PowerShell method to check Redis Proxy pods
$redisPods = (oc get pods -l "app=redis-proxy" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}") -split " "
foreach ($pod in $redisPods) {
  Write-Host "=== Checking Redis pod $pod ==="
  oc logs $pod --tail=50 | Select-String -Pattern "err:" -CaseSensitive:$false
}
```

### Command Prompt Alternative

If using Command Prompt (not PowerShell), use this step-by-step approach:

```cmd
REM Get PHP pod names first
oc get pods -l "deployment=php" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}"

REM Copy each pod name and check logs manually (replace <pod-name> with actual names)
oc logs <pod-name> --tail=50 | findstr /i "error critical"
oc logs <pod-name> --tail=50 | findstr /i "error critical"

REM Same for Redis pods
oc get pods -l "app=redis-proxy" --field-selector=status.phase=Running -o jsonpath="{.items[*].metadata.name}"
oc logs <pod-name> --tail=50 | findstr /i "err:"
oc logs <pod-name> --tail=50 | findstr /i "err:"
```

## Troubleshooting Common Issues

### Issue: Pods Stuck in Pending or Init State

```bash
# Check pod events for scheduling issues
oc describe pod <pod-name>

# Check resource availability
oc describe nodes

# Check PVC status
oc get pvc
```

### Issue: Pods Keep Restarting

```bash
# Check restart count
oc get pods -l "app.kubernetes.io/name=mariadb-galera"

# Get detailed restart reason
oc describe pod <pod-name>

# Check resource limits
oc get statefulset mariadb-galera -o yaml | grep -A 10 resources
```

### Issue: MySQL Connection Refused

```bash
# Check if MySQL process is running in pod
oc exec -it <pod-name> -- ps aux | grep mysql

# Check MySQL error logs
oc exec -it <pod-name> -- tail -50 /var/log/mysql/error.log

# Check network connectivity between pods
oc exec -it <pod-name> -- ping <other-pod-name>
```

## Emergency Procedures

### Complete Cluster Failure Recovery

If all pods are failing and split-brain resolution doesn't work:

```bash
# 1. Scale down to 0 (DANGEROUS - only if cluster is completely broken)
oc scale statefulset mariadb-galera --replicas=0

# 2. Wait for all pods to terminate
oc get pods -l "app.kubernetes.io/name=mariadb-galera" -w

# 3. Backup any recoverable data from PVCs if possible
oc get pvc

# 4. Scale back up with fresh cluster
oc scale statefulset mariadb-galera --replicas=1

# 5. Wait for first pod to initialize
oc wait --for=condition=ready pod -l "app.kubernetes.io/name=mariadb-galera" --timeout=600s

# 6. Restore from backup (use -i, not -it, when feeding a file on stdin)
# oc exec -i <pod-name> -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" < galera-backup-YYYYMMDD-HHMM.sql

# 7. Scale up to full size
oc scale statefulset mariadb-galera --replicas=$ORIGINAL_REPLICAS
```

## Prevention and Best Practices

### Regular Health Checks

Run these commands regularly to catch issues early:

```bash
# Weekly health check script
echo "=== Galera Cluster Health Check - $(date) ==="
GALERA_PODS=$(oc get pods -l "app.kubernetes.io/name=mariadb-galera" --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}')
for pod in $GALERA_PODS; do
  echo "Checking $pod:"
  oc exec -it $pod -- mysql -u "$MARIADB_USER" -p"$MARIADB_PASSWORD" -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_cluster_size');" 2>/dev/null || echo "  ERROR: Cannot connect to MySQL on $pod"
  echo ""
done
```

> **Note**: This manual process mirrors the automated monitoring and healing to be implemented in the pod health monitoring system. The automated system performs these same checks every 60 seconds and auto-heals when safe to do so. Manual intervention should only be necessary when the automated system cannot resolve the issue.
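
### Example: Scripting the UUID Comparison

The split-brain indicators above can also be checked mechanically. The sketch below is illustrative only — `detect_split_brain` is a hypothetical helper, not part of the monitoring scripts — and assumes you have already collected one `pod-name cluster-uuid` line per pod (for example, by looping over the Galera pods with `oc exec` and extracting each pod's `wsrep_cluster_state_uuid`):

```shell
#!/usr/bin/env bash
# Illustrative helper (an assumption, not the real monitoring script):
# read "pod-name cluster-uuid" lines on stdin and flag a split-brain
# when more than one distinct wsrep_cluster_state_uuid appears.
detect_split_brain() {
  local distinct
  # Count distinct UUIDs in the second column
  distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
  if [ "$distinct" -gt 1 ]; then
    echo "SPLIT-BRAIN: $distinct distinct wsrep_cluster_state_uuid values"
    return 1
  fi
  echo "OK: single cluster UUID"
}

# Example input format (one line per pod, collected via oc exec):
#   mariadb-galera-0 12345-abcd
#   mariadb-galera-1 67890-efgh
```

Piping the example split-brain output from the detection section through this function would report two distinct UUIDs and return a non-zero status, which makes the check easy to wire into a cron job or monitoring probe.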