Commit 0f081bc

Merge pull request #20 from xataio/fix-switchover-support
Fix switchover support
2 parents 9a565e3 + 20d8ea6 commit 0f081bc

6 files changed: +192 -109 lines changed
.pre-commit-config.yaml

Lines changed: 6 additions & 0 deletions
@@ -16,3 +16,9 @@ repos:
     hooks:
       - id: golangci-lint-full
         args: ["--timeout=10m", "--config=.golangci.yml"]
+  - repo: local
+    hooks:
+      - id: generate-manifest
+        language: system
+        name: generate manifest
+        entry: make manifest

README.md

Lines changed: 30 additions & 8 deletions
@@ -4,15 +4,16 @@ A [CNPG-I](https://github.com/cloudnative-pg/cnpg-i) plugin that automatically h

 ## Overview

-This plugin monitors PostgreSQL database activity and automatically scales clusters down to zero replicas when they've been inactive for a configurable period. It injects a monitoring sidecar into the primary PostgreSQL pod that tracks database connections and query activity, then hibernates the cluster by setting the `cnpg.io/hibernation` annotation when the inactivity threshold is reached.
+This plugin monitors PostgreSQL database activity and automatically scales clusters down to zero replicas when they've been inactive for a configurable period. It injects a monitoring sidecar into all pods of the PostgreSQL cluster. Only the primary pod actively monitors database connections and manages hibernation, while replica pods run the sidecar in passive mode until promoted to primary.

 ### How It Works

-1. **Sidecar Injection**: Automatically adds a monitoring sidecar to the primary PostgreSQL pod
-2. **Activity Monitoring**: The sidecar periodically checks for active database connections and recent queries
-3. **Automatic Hibernation**: When the cluster is inactive for the configured duration, it sets the hibernation annotation
-4. **Scheduled Backup Management**: Automatically pauses scheduled backups when the cluster is hibernated to prevent backup failures
-5. **Resource Optimization**: Inactive clusters are scaled to zero, freeing up cluster resources
+1. **Sidecar Injection**: Automatically adds a monitoring sidecar to all PostgreSQL pods in the cluster
+2. **Primary-Only Monitoring**: Only the primary pod actively monitors database connections and query activity
+3. **Passive Replicas**: Replica pods run the sidecar container but remain in passive mode (no monitoring)
+4. **Automatic Hibernation**: When the cluster is inactive for the configured duration, the primary sidecar sets the hibernation annotation
+5. **Scheduled Backup Management**: The primary pod automatically pauses scheduled backups when the cluster is hibernated to prevent backup failures
+6. **Switchover Handling**: During switchovers, the new primary automatically takes over monitoring duties while the old primary becomes passive

 ## Installation

@@ -176,7 +177,8 @@ These resource configurations apply to all sidecar containers injected by the pl
 The plugin provides logging to help monitor its operation:

 - Sidecar injection events are logged during pod creation
-- Activity monitoring status is logged at each check interval
+- Activity monitoring status is logged at each check interval (primary pod only)
+- Primary/replica role transitions are logged when pods change status
 - Hibernation events are logged when clusters are scaled down
 - Scheduled backup pause operations are logged

@@ -189,9 +191,15 @@ kubectl logs -n cnpg-system deployment/cnpg-i-scale-to-zero-plugin
 And monitor the sidecar logs in the PostgreSQL pods:

 ```shell
-kubectl logs <pod-name> -c scale-to-zero
+# View logs from the primary pod's sidecar (active monitoring)
+kubectl logs <primary-pod-name> -c scale-to-zero
+
+# View logs from replica pods' sidecars (passive mode)
+kubectl logs <replica-pod-name> -c scale-to-zero
 ```

+**Note**: Primary pod sidecars will show active monitoring logs, while replica pod sidecars will show minimal passive mode logs.
+
 ## Development

 For local development and building from source:
@@ -214,3 +222,17 @@ make kind-deploy-dev
 This plugin uses the [pluginhelper](https://github.com/cloudnative-pg/cnpg-i-machinery/tree/main/pkg/pluginhelper) from [`cnpg-i-machinery`](https://github.com/cloudnative-pg/cnpg-i-machinery) to simplify the plugin's implementation.

 For additional details on the plugin implementation, refer to the [development documentation](doc/development.md).
+
+## Limitations
+
+### Primary-Only Activity Tracking
+
+Currently, the plugin only monitors database activity on the **primary instance**. This means:
+
+- **Read-only workloads on replicas are not tracked** - If your application connects directly to replica instances for read queries, this activity will not prevent hibernation
+- **Replica-only traffic** - Clusters with active read traffic exclusively on replicas may be hibernated despite being in use
+- **Connection pooling to replicas** - Applications using connection poolers that direct read traffic to replicas will not be detected as active
+
+**Workaround**: Ensure critical read workloads also maintain at least one connection to the primary instance, or configure longer inactivity periods to account for replica-only usage patterns.
+
+**Future Enhancement**: Replica activity monitoring may be added in future versions to provide more comprehensive activity detection across the entire cluster.
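
Reviewer note: the README changes above describe a per-pod loop in which replicas stay passive and a newly promoted primary starts with a clean inactivity window. The sketch below is a simplified, standalone model of that behaviour, not the plugin's code. The `monitor` type and the `tick`, `minutes`, and `at` helpers are hypothetical, and treating the first check after promotion as the start of the window is an assumption, since the diff does not show how `lastActive` is initialised.

```go
package main

import (
	"fmt"
	"time"
)

// monitor is a hypothetical, simplified stand-in for the sidecar's state.
type monitor struct {
	lastActive time.Time // zero value means "no activity recorded yet"
}

// tick runs one periodic check and reports whether the cluster should be
// hibernated. The caller supplies the pod's current role and the number of
// open connections observed at time now.
func (m *monitor) tick(isPrimary bool, openConns int, threshold time.Duration, now time.Time) bool {
	// Passive mode: a replica drops any tracked state, so a later promotion
	// starts from a clean slate, and it never triggers hibernation.
	if !isPrimary {
		m.lastActive = time.Time{}
		return false
	}
	// Observed activity, or the first check after promotion (assumption),
	// restarts the inactivity window.
	if openConns > 0 || m.lastActive.IsZero() {
		m.lastActive = now
		return false
	}
	return now.Sub(m.lastActive) >= threshold
}

// minutes converts a whole number of minutes into a time.Duration.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// at returns a fixed reference time shifted by n minutes.
func at(n int) time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).Add(minutes(n))
}

func main() {
	m := &monitor{}
	fmt.Println(m.tick(false, 0, minutes(30), at(0))) // replica: stays passive
	fmt.Println(m.tick(true, 0, minutes(30), at(40))) // just promoted: window starts now
	fmt.Println(m.tick(true, 0, minutes(30), at(70))) // idle for the full threshold
}
```

The limitation documented above follows directly from this shape: connections served by replicas never reach the `openConns` value seen by the primary, so they cannot reset the window.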

doc/development.md

Lines changed: 12 additions & 17 deletions
@@ -87,8 +87,9 @@ The `OperatorLifecycleServer` interface requires several methods:
 The scale-to-zero plugin specifically:

 - Monitors Pod creation events
-- Injects a sidecar container into the primary PostgreSQL pod only
-- The sidecar monitors database activity and hibernates inactive clusters
+- Injects a sidecar container into all PostgreSQL cluster pods
+- The sidecar on the primary monitors database activity and hibernates inactive clusters
+- The sidecar on the replicas remains passive until they are promoted to primary
 - Manages scheduled backups by pausing them during hibernation

 ### Sidecar Implementation
@@ -108,25 +109,22 @@ The sidecar manager handles the startup and configuration of the sidecar process

 #### Scale-to-Zero Logic (`scale_to_zero.go`)

-The main scale-to-zero functionality monitors database activity and hibernates
-inactive clusters:
+The main scale-to-zero functionality monitors database activity and hibernates inactive clusters:

-- **Activity Monitoring**: Connects to PostgreSQL to check for active connections
-  and recent query activity
+- **Activity Monitoring**: Connects to PostgreSQL to check for open connections
+- **Switchover Handling**: Automatically detects primary changes and transfers monitoring responsibility
 - **Configurable Inactivity Threshold**: Uses the `xata.io/scale-to-zero-inactivity-minutes`
   annotation to determine when a cluster should be hibernated (defaults to 30 minutes)
 - **Hibernation**: Sets the `cnpg.io/hibernation` annotation to scale the cluster to zero
-- **Scheduled Backup Management**: Automatically pauses scheduled backups when hibernating
-  clusters to prevent backup failures on inactive clusters
-- **Primary-Only Operation**: Only runs on the primary PostgreSQL instance
+- **Scheduled Backup Management**: Automatically pauses scheduled backups when hibernating clusters to prevent backup failures on inactive clusters

 Key features:

 - Periodic checks at configurable intervals (default: 1 minute)
 - PostgreSQL connection pooling for activity monitoring
 - Graceful shutdown on context cancellation
-- Error handling for replica instances (stops monitoring if not primary)
 - Automatic scheduled backup pause operations
+- Switchover support

 #### Environment Variables

@@ -166,16 +164,13 @@ are inactive for a specified period. Here's how it operates:
 1. **Sidecar Injection**: When a PostgreSQL pod is created, the plugin injects a
    sidecar container that monitors database activity.

-2. **Primary Pod Only**: The sidecar only runs monitoring on the primary PostgreSQL
-   instance to avoid conflicts and ensure consistent behavior.
+2. **Activity Monitoring**: The sidecar periodically connects to PostgreSQL to check open database connections.

-3. **Activity Monitoring**: The sidecar periodically connects to PostgreSQL to check open database connections.
-
-4. **Hibernation**: When the cluster has been inactive for the configured duration,
-   the sidecar sets the `cnpg.io/hibernation` annotation on the cluster, causing
+3. **Hibernation**: When the cluster has been inactive for the configured duration,
+   the primary sidecar sets the `cnpg.io/hibernation` annotation on the cluster, causing
    CloudNativePG to scale it down to zero replicas.

-5. **Scheduled Backup Management**: After hibernating a cluster, the sidecar automatically
+4. **Scheduled Backup Management**: After hibernating a cluster, the sidecar automatically
    pauses any associated scheduled backups to prevent backup operations from failing
    on hibernated clusters.
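
Reviewer note: the development docs above say the inactivity threshold comes from the `xata.io/scale-to-zero-inactivity-minutes` annotation and defaults to 30 minutes. A minimal sketch of that lookup could look like the following. The function name `inactivityMinutes` is hypothetical, and falling back to the default for unparsable or non-positive values is an assumption. The diff does not show how the plugin handles invalid input.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	// Annotation name and default are taken from the documentation above.
	inactivityAnnotation     = "xata.io/scale-to-zero-inactivity-minutes"
	defaultInactivityMinutes = 30
)

// inactivityMinutes reads the threshold from a cluster's annotations.
// Assumption: missing, unparsable, or non-positive values fall back to the default.
func inactivityMinutes(annotations map[string]string) int {
	raw, ok := annotations[inactivityAnnotation]
	if !ok {
		return defaultInactivityMinutes
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultInactivityMinutes
	}
	return n
}

func main() {
	fmt.Println(inactivityMinutes(nil))
	fmt.Println(inactivityMinutes(map[string]string{inactivityAnnotation: "45"}))
	fmt.Println(inactivityMinutes(map[string]string{inactivityAnnotation: "soon"}))
}
```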

internal/plugin/lifecycle/lifecycle.go

Lines changed: 0 additions & 5 deletions
@@ -101,11 +101,6 @@ func (impl Implementation) reconcileMetadata(
 		return nil, err
 	}

-	if cluster.Status.CurrentPrimary != "" && pod.Name != cluster.Status.CurrentPrimary {
-		logger.Info("pod is not the current primary, skipping sidecar injection", "pod", pod.Name, "primary", cluster.Status.CurrentPrimary)
-		return &lifecycle.OperatorLifecycleResponse{}, nil
-	}
-
 	mutatedPod := pod.DeepCopy()

 	sidecarContainer := &corev1.Container{

internal/sidecar/scale_to_zero.go

Lines changed: 23 additions & 9 deletions
@@ -90,12 +90,23 @@ func (s *scaleToZero) Start(ctx context.Context) error {
 		case <-ctx.Done():
 			return nil
 		case <-ticker.C:
-			scaleToZeroConfig, err := s.getClusterScaleToZeroConfig(ctx)
+			cluster, err := s.client.getCluster(ctx, doNotForceUpdate)
 			if err != nil {
-				contextLogger.Error(err, "failed to get scale to zero configuration")
+				contextLogger.Error(err, "failed to get cluster")
 				continue
 			}

+			// only the primary keeps track of activity and hibernation
+			if !s.isPrimary(cluster) {
+				// reset last active time when it's not the primary to make sure
+				// when there's a switchover, the new primary has a clean state
+				s.lastActive = time.Time{}
+				contextLogger.Info("running on non-primary pod, skipping activity monitoring", "primary", cluster.Status.CurrentPrimary)
+				continue
+			}
+
+			scaleToZeroConfig := s.getClusterScaleToZeroConfig(ctx, cluster)
+
 			if !scaleToZeroConfig.enabled {
 				// reset last active time if scale to zero is disabled. This
 				// prevents old activity tracking from kicking in when scale to
@@ -110,6 +121,7 @@ func (s *scaleToZero) Start(ctx context.Context) error {
 				contextLogger.Error(err, "failed to check cluster activity")
 				continue
 			}
+
 			if !isActive {
 				if err := s.hibernate(ctx); err != nil {
 					contextLogger.Error(err, "hibernation failed")
@@ -154,6 +166,13 @@ func (s *scaleToZero) initQuerier(ctx context.Context) error {
 	return err
 }

+func (s *scaleToZero) isPrimary(cluster *cnpgv1.Cluster) bool {
+	// when the cluster is first initialised, the current primary might not be
+	// set yet. Assume it's the primary if it's not set to avoid blocking the
+	// scale to zero checks.
+	return cluster.Status.CurrentPrimary == "" || (cluster.Status.CurrentPrimary == s.currentPodName)
+}
+
 // isClusterActive checks if the cluster has any open connections.
 func (s *scaleToZero) isClusterActive(ctx context.Context, inactivityMinutes int) (bool, error) {
 	openConns, err := s.openConnections(ctx)
@@ -241,12 +260,7 @@ func (s *scaleToZero) hibernate(ctx context.Context) error {
 // getClusterScaleToZeroConfig retrieves the scale to zero configuration from
 // the cluster annotations. It returns the enabled status and inactivity
 // minutes. If the annotation is not set, it uses default values.
-func (s *scaleToZero) getClusterScaleToZeroConfig(ctx context.Context) (*scaleToZeroConfig, error) {
-	cluster, err := s.client.getCluster(ctx, doNotForceUpdate)
-	if err != nil {
-		return nil, fmt.Errorf("failed to get cluster: %w", err)
-	}
-
+func (s *scaleToZero) getClusterScaleToZeroConfig(ctx context.Context, cluster *cnpgv1.Cluster) *scaleToZeroConfig {
 	enabled := false
 	inactivityMinutes := defaultInactivityMinutes
@@ -266,7 +280,7 @@ func (s *scaleToZero) getClusterScaleToZeroConfig(ctx context.Context) (*scaleTo
 	return &scaleToZeroConfig{
 		enabled:           enabled,
 		inactivityMinutes: inactivityMinutes,
-	}, nil
+	}
 }

 func (s *scaleToZero) pauseScheduledBackup(ctx context.Context) error {
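
Reviewer note: the role check added in `isPrimary` can be exercised in isolation. The sketch below restates the same comparison as a free function over plain strings. The two-argument signature and the pod names are illustrative only and do not appear in the commit.

```go
package main

import "fmt"

// isPrimary reports whether podName should act as the active monitor.
// An empty currentPrimary is treated as "this pod is the primary", mirroring
// the fallback in the diff for clusters that are still initialising.
func isPrimary(currentPrimary, podName string) bool {
	return currentPrimary == "" || currentPrimary == podName
}

func main() {
	cases := []struct{ primary, pod string }{
		{"", "cluster-1"},          // primary not reported yet
		{"cluster-1", "cluster-1"}, // this pod is the primary
		{"cluster-2", "cluster-1"}, // another pod is the primary
	}
	for _, c := range cases {
		fmt.Printf("primary=%q pod=%q active=%t\n", c.primary, c.pod, isPrimary(c.primary, c.pod))
	}
}
```

One consequence of the empty-primary fallback is that, until `CurrentPrimary` is populated, every pod running this check considers itself the primary.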
