CNPG Storage Manager

A Kubernetes controller for automated storage management of CloudNativePG (CNPG) clusters. It monitors storage usage, automatically expands PVCs when thresholds are breached, performs WAL cleanup in emergencies, and sends alerts through multiple channels.

Features

- Storage Monitoring: Continuously monitors PVC usage across CNPG clusters
- Automated PVC Expansion: Automatically expands PVCs when configurable thresholds are breached
- WAL Cleanup: Performs PostgreSQL WAL file cleanup in emergency situations
- Multi-Channel Alerting: Sends alerts via Prometheus Alertmanager, Slack, and PagerDuty
- Circuit Breaker Protection: Prevents action loops with configurable failure thresholds
- Dry-Run Mode: Test policies without taking actual actions
- Per-Cluster Cooldowns: Configurable cooldown periods between operations
- Prometheus Metrics: Comprehensive metrics for monitoring and observability

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    CNPG Storage Manager                         │
├─────────────────────────────────────────────────────────────────┤
│  StoragePolicy Controller                                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │  Discovery  │→ │  Metrics    │→ │  Policy Evaluator       │ │
│  │  (CNPG)     │  │  Collector  │  │  (Threshold Checks)     │ │
│  └─────────────┘  └─────────────┘  └───────────┬─────────────┘ │
│                                                 │               │
│  ┌─────────────────────────────────────────────▼─────────────┐ │
│  │                    Action Handlers                        │ │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐  │ │
│  │  │  Expansion   │ │  WAL Cleanup │ │  Alert Manager   │  │ │
│  │  │  Engine      │ │  Engine      │ │  (Multi-Channel) │  │ │
│  │  └──────────────┘ └──────────────┘ └──────────────────┘  │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Getting Started

Prerequisites

- Kubernetes v1.26+
- CloudNativePG operator installed
- Go v1.24+ (for development)
- Docker 17.03+
- kubectl v1.26+

Installation

Install the CRDs:

make install

Deploy the controller:

make deploy IMG=ghcr.io/supporttools/cnpg-storage-manager:latest

Quick Start

Create a StoragePolicy to monitor your CNPG clusters:

apiVersion: cnpg.supporttools.io/v1alpha1
kind: StoragePolicy
metadata:
  name: production-postgres
  namespace: database
spec:
  selector:
    matchLabels:
      environment: production

  thresholds:
    warning: 70      # Alert at 70%
    critical: 80     # Critical alert at 80%
    expansion: 85    # Auto-expand at 85%
    emergency: 90    # WAL cleanup at 90%

  expansion:
    enabled: true
    percentage: 50        # Expand by 50%
    minIncrementGi: 5     # Minimum 5Gi expansion
    maxSize: 100Gi        # Maximum size limit
    cooldownMinutes: 30

  walCleanup:
    enabled: true
    retainCount: 10
    requireArchived: true
    cooldownMinutes: 15

  alerting:
    channels:
      - type: alertmanager
        endpoint: "http://alertmanager:9093"
      - type: slack
        webhookSecret: "default/slack-webhook"
        channel: "#db-alerts"

Apply the policy:

kubectl apply -f storagepolicy.yaml

Monitor with:

kubectl get storagepolicies
kubectl get storageevents
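
To make the expansion settings in the policy above concrete, here is a rough sketch of how a target size could be derived from `percentage`, `minIncrementGi`, and `maxSize`. It is an illustration under stated assumptions, not the controller's implementation: `ExpansionConfig` and `NextSizeGi` are hypothetical names, sizes are whole Gi, the percentage increment is rounded up, and the larger of the percentage increment and the minimum increment is applied before capping at the maximum.

```go
package main

import "fmt"

// ExpansionConfig mirrors the expansion fields of a StoragePolicy, with
// sizes in whole Gi for simplicity. Names are illustrative only.
type ExpansionConfig struct {
	Percentage     int64 // grow by this percent of the current size
	MinIncrementGi int64 // never grow by less than this many Gi
	MaxSizeGi      int64 // hard ceiling; 0 means no limit (assumption)
}

// NextSizeGi returns the target size and whether any growth is possible.
func NextSizeGi(currentGi int64, c ExpansionConfig) (int64, bool) {
	// Round the percentage increment up to a whole Gi.
	inc := (currentGi*c.Percentage + 99) / 100
	if inc < c.MinIncrementGi {
		inc = c.MinIncrementGi
	}
	next := currentGi + inc
	if c.MaxSizeGi > 0 && next > c.MaxSizeGi {
		next = c.MaxSizeGi
	}
	return next, next > currentGi
}

func main() {
	// Values from the Quick Start policy: 50%, minimum 5Gi, maximum 100Gi.
	c := ExpansionConfig{Percentage: 50, MinIncrementGi: 5, MaxSizeGi: 100}
	for _, cur := range []int64{8, 40, 80, 100} {
		next, ok := NextSizeGi(cur, c)
		fmt.Printf("current=%dGi next=%dGi canExpand=%t\n", cur, next, ok)
	}
}
```

Under these assumptions an 8Gi volume grows by the 5Gi minimum rather than by 4Gi, and a volume already at 100Gi cannot grow further, which is the point at which alerting becomes the only remaining safeguard.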

Testing with Dry-Run Mode

Before enabling remediation, you can deploy the controller in global dry-run mode to validate its behavior without changing any storage.

Using Helm with Dry-Run

# Deploy with dry-run mode enabled
helm install cnpg-storage-manager ./charts/cnpg-storage-manager \
  --namespace cnpg-storage-manager \
  --create-namespace \
  --set dryRun=true \
  --set logging.level=debug \
  --set logging.development=true

# Or use the pre-configured dry-run values file
helm install cnpg-storage-manager ./charts/cnpg-storage-manager \
  --namespace cnpg-storage-manager \
  --create-namespace \
  -f ./charts/cnpg-storage-manager/values-dryrun.yaml

Using kubectl/kustomize

Set the --dry-run flag or the DRY_RUN=true environment variable on the manager container:

# In your manager deployment
containers:
  - name: manager
    args:
      - --dry-run
    # Or via environment variable
    env:
      - name: DRY_RUN
        value: "true"

What Dry-Run Mode Does

When dry-run is enabled (globally or per-policy), the controller still:

- Discovers and monitors CNPG clusters
- Collects storage metrics from kubelet
- Evaluates thresholds against configured policies
- Logs the actions it would have taken
- Sends alerts
- Updates Prometheus metrics

It does not:

- Expand PVCs (no actual size changes)
- Delete WAL files (no actual file deletions)
- Create StorageEvent audit records for remediation

Monitoring Dry-Run Behavior

Watch the controller logs to see what actions would be taken:

kubectl logs -f deployment/cnpg-storage-manager -n cnpg-storage-manager

Example dry-run log output:

INFO  DryRun: Would expand PVCs  {"cluster": "my-postgres", "globalDryRun": true, "policyDryRun": false}
INFO  DryRun: Would cleanup WAL  {"cluster": "my-postgres", "globalDryRun": true, "policyDryRun": false}
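
The `globalDryRun` and `policyDryRun` fields in these log lines suggest that either flag is enough to suppress an action. A minimal sketch of that gating follows. `Action` and `RunOrLog` are hypothetical names, and the printed message only approximates the log format shown above.

```go
package main

import "fmt"

// Action is any remediation step, such as expanding a PVC.
type Action func() error

// RunOrLog executes the action unless either dry-run flag is set, in which
// case it only reports what would have happened. It returns true when the
// action actually ran. This is an illustrative helper, not controller code.
func RunOrLog(name string, globalDryRun, policyDryRun bool, act Action) (bool, error) {
	if globalDryRun || policyDryRun {
		fmt.Printf("DryRun: Would %s (globalDryRun=%t, policyDryRun=%t)\n",
			name, globalDryRun, policyDryRun)
		return false, nil
	}
	return true, act()
}

func main() {
	expand := func() error {
		fmt.Println("expanding PVCs")
		return nil
	}
	// Global dry-run on: only a log line is produced.
	RunOrLog("expand PVCs", true, false, expand)
	// Both flags off: the action runs.
	RunOrLog("expand PVCs", false, false, expand)
}
```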

Transitioning to Live Mode

Once you're confident the controller is behaving as expected:

# Upgrade with dry-run disabled
helm upgrade cnpg-storage-manager ./charts/cnpg-storage-manager \
  --namespace cnpg-storage-manager \
  --set dryRun=false

Configuration

StoragePolicy Spec

| Field | Description | Default |
|-------|-------------|---------|
| `selector` | Label selector for matching CNPG clusters | Required |
| `thresholds.warning` | Warning alert threshold (%) | 70 |
| `thresholds.critical` | Critical alert threshold (%) | 80 |
| `thresholds.expansion` | Auto-expansion threshold (%) | 85 |
| `thresholds.emergency` | WAL cleanup threshold (%) | 90 |
| `expansion.enabled` | Enable automatic PVC expansion | true |
| `expansion.percentage` | Percentage to expand by | 50 |
| `expansion.minIncrementGi` | Minimum expansion size (Gi) | 5 |
| `expansion.maxSize` | Maximum PVC size limit | - |
| `expansion.cooldownMinutes` | Time between expansions (minutes) | 30 |
| `walCleanup.enabled` | Enable WAL cleanup | true |
| `walCleanup.retainCount` | Minimum WAL files to keep | 10 |
| `walCleanup.requireArchived` | Only clean archived WALs | true |
| `circuitBreaker.maxFailures` | Failures before circuit opens | 3 |
| `circuitBreaker.resetMinutes` | Time before circuit resets (minutes) | 60 |
| `dryRun` | Enable dry-run mode | false |
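
The `circuitBreaker.maxFailures` and `circuitBreaker.resetMinutes` fields describe a conventional circuit breaker. The sketch below shows one way such a breaker could behave. It assumes the count tracks consecutive failures and that the reset window is measured from the most recent failure; neither detail is confirmed by this README, and the `Breaker` type is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Breaker opens after MaxFailures consecutive failures and closes again once
// ResetAfter has elapsed since the last failure. Illustrative only.
type Breaker struct {
	MaxFailures int
	ResetAfter  time.Duration

	failures    int
	lastFailure time.Time
}

// Allow reports whether an action may run at the given time.
func (b *Breaker) Allow(now time.Time) bool {
	if b.failures < b.MaxFailures {
		return true
	}
	if now.Sub(b.lastFailure) >= b.ResetAfter {
		b.failures = 0 // reset window elapsed: close the circuit
		return true
	}
	return false
}

// RecordFailure counts a failed action at the given time.
func (b *Breaker) RecordFailure(now time.Time) {
	b.failures++
	b.lastFailure = now
}

// RecordSuccess clears the failure count.
func (b *Breaker) RecordSuccess() {
	b.failures = 0
}

func main() {
	// Defaults from the table: 3 failures, 60-minute reset.
	b := &Breaker{MaxFailures: 3, ResetAfter: 60 * time.Minute}
	start := time.Now()
	for i := 0; i < 3; i++ {
		b.RecordFailure(start)
	}
	fmt.Println("allowed after 10m:", b.Allow(start.Add(10*time.Minute)))
	fmt.Println("allowed after 61m:", b.Allow(start.Add(61*time.Minute)))
}
```

Passing the clock in as an argument rather than calling `time.Now()` inside the breaker keeps the behavior deterministic in tests.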

Alert Channels

Alertmanager:

- type: alertmanager
  endpoint: "http://alertmanager:9093"

Slack:

- type: slack
  webhookSecret: "namespace/secret-name"  # Secret with 'webhook-url' key
  channel: "#alerts"

PagerDuty:

- type: pagerduty
  routingKeySecret: "namespace/secret-name"  # Secret with 'routing-key' key
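
Both `webhookSecret` and `routingKeySecret` use a "namespace/secret-name" reference. As an illustration of how such a reference might be validated before the Secret is looked up, here is a small sketch. `ParseSecretRef` is a hypothetical helper, and rejecting references with extra slashes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ParseSecretRef splits a "namespace/secret-name" reference into its parts.
// It is a hypothetical helper, not part of the controller's API.
func ParseSecretRef(ref string) (namespace, name string, err error) {
	namespace, name, ok := strings.Cut(ref, "/")
	if !ok || namespace == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("invalid secret reference %q: want namespace/secret-name", ref)
	}
	return namespace, name, nil
}

func main() {
	for _, ref := range []string{"default/slack-webhook", "slack-webhook"} {
		ns, name, err := ParseSecretRef(ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("namespace=%s name=%s\n", ns, name)
	}
}
```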

Metrics

The controller exposes Prometheus metrics on :8080/metrics:

| Metric | Description |
|--------|-------------|
| `cnpg_storage_manager_pvc_usage_bytes` | Current PVC usage in bytes |
| `cnpg_storage_manager_pvc_capacity_bytes` | Total PVC capacity in bytes |
| `cnpg_storage_manager_pvc_usage_percent` | PVC usage percentage |
| `cnpg_storage_manager_wal_directory_bytes` | WAL directory size in bytes |
| `cnpg_storage_manager_wal_files_count` | Number of WAL files |
| `cnpg_storage_manager_expansion_total` | Total expansion operations |
| `cnpg_storage_manager_wal_cleanup_total` | Total WAL cleanup operations |
| `cnpg_storage_manager_alerts_sent_total` | Total alerts sent |
| `cnpg_storage_manager_circuit_breaker_open` | Circuit breaker state |
| `cnpg_storage_manager_threshold_breaches_total` | Threshold breach count |

Storage Events

The controller creates StorageEvent resources to track all operations:

# List all events
kubectl get storageevents -A

# View expansion events
kubectl get storageevents -l cnpg.supporttools.io/event-type=expansion

# View WAL cleanup events
kubectl get storageevents -l cnpg.supporttools.io/event-type=wal-cleanup

Annotations

Override policy settings per-cluster using annotations:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-cluster
  annotations:
    # Override thresholds
    cnpg.supporttools.io/warning-threshold: "75"
    cnpg.supporttools.io/expansion-threshold: "90"

    # Disable specific features
    cnpg.supporttools.io/expansion-enabled: "false"
    cnpg.supporttools.io/wal-cleanup-enabled: "false"

    # Set expansion parameters
    cnpg.supporttools.io/expansion-percentage: "100"
    cnpg.supporttools.io/max-size: "200Gi"
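
Annotation values are always strings, so an override such as `warning-threshold: "75"` has to be parsed and validated before it can replace a policy default. The sketch below shows one plausible approach. `ThresholdOverride` is a hypothetical function, and both the 1-100 range check and the fallback to the policy default on invalid input are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// annotationPrefix is the annotation domain used in the example above.
const annotationPrefix = "cnpg.supporttools.io/"

// ThresholdOverride returns the percentage stored in the given annotation,
// or the policy default when the annotation is absent. On invalid input it
// returns the default together with an error. Illustrative only.
func ThresholdOverride(annotations map[string]string, key string, def int) (int, error) {
	raw, ok := annotations[annotationPrefix+key]
	if !ok {
		return def, nil
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return def, fmt.Errorf("annotation %s%s: %w", annotationPrefix, key, err)
	}
	if v < 1 || v > 100 {
		return def, fmt.Errorf("annotation %s%s: %d is outside 1-100", annotationPrefix, key, v)
	}
	return v, nil
}

func main() {
	annotations := map[string]string{
		"cnpg.supporttools.io/warning-threshold":   "75",
		"cnpg.supporttools.io/expansion-threshold": "90",
	}
	warning, _ := ThresholdOverride(annotations, "warning-threshold", 70)
	critical, _ := ThresholdOverride(annotations, "critical-threshold", 80)
	fmt.Println("warning:", warning, "critical:", critical)
}
```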

Development

Building

# Build binary
make build

# Run tests
make test

# Run tests with coverage
make test-coverage

# Run locally
make run-local

Testing

# Unit tests
make test

# Quick tests (no manifests regeneration)
make test-quick

# E2E tests (requires Kind)
make test-e2e

Uninstall

# Remove sample resources
make undeploy-samples

# Remove controller
make undeploy

# Remove CRDs
make uninstall

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run `make test` and `make lint`
5. Submit a pull request

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
