Implement Hetzner Certificate Rotation Service #204

@jmgilman

Overview

Implement a lightweight, long-running daemon that manages the certificate lifecycle on bare-metal machines in Hetzner data centers. The service handles automatic certificate renewal, secure local storage, and reloading of dependent services, and it reaches AWS-hosted services over a Tailscale VPN.

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Certificate   │    │   Hetzner Cert   │    │   Keycloak      │
│   API Service   │◄──►│  Rotation Service│◄──►│   (JWT Auth)    │
│   (AWS)         │    │   (Hetzner)      │    │   (AWS)         │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │  Local Storage   │
                       │ (/etc/certs/)    │
                       └──────────────────┘

                    Connectivity via Tailscale VPN

Core Responsibilities

  • Bootstrap with initial dual-purpose certificates for machine identity
  • Authenticate with Keycloak using mTLS to exchange certificates for JWTs
  • Automatically renew certificates before expiration using the Certificate API
  • Manage local certificate storage and service reloading
  • Provide graceful degradation during connectivity issues
  • Maintain audit trails for certificate operations

Project Structure

services/certificates/refresher/
├── cmd/
│   └── rotator/
│       └── main.go                 # Service entry point
├── internal/
│   ├── api/
│   │   ├── certificate.go         # Certificate API client
│   │   └── client.go              # HTTP client with retry logic
│   ├── auth/
│   │   ├── keycloak.go           # Keycloak mTLS auth
│   │   └── jwt.go                # JWT token management
│   ├── bootstrap/
│   │   └── bootstrap.go          # Initial certificate setup
│   ├── config/
│   │   └── config.go             # Configuration management
│   ├── rotation/
│   │   ├── scheduler.go          # Certificate rotation scheduler
│   │   ├── renewer.go            # Certificate renewal logic
│   │   └── validator.go          # Certificate validation
│   ├── storage/
│   │   ├── filesystem.go         # Local certificate storage
│   │   └── permissions.go        # File permission management
│   ├── monitoring/
│   │   ├── metrics.go            # Prometheus metrics
│   │   ├── health.go             # Health reporting
│   │   └── logging.go            # Structured logging
│   └── services/
│       ├── reload.go             # Service reload management
│       └── registry.go           # Service discovery
├── pkg/
│   ├── crypto/
│   │   └── utils.go              # Certificate utilities
│   ├── network/
│   │   └── connectivity.go       # Network health checking
│   └── errors/
│       └── errors.go             # Custom error types
├── deployments/
│   ├── systemd/
│   │   └── hetzner-cert-rotation.service
│   └── docker/                   # For testing
│       └── Dockerfile
├── configs/
│   ├── config.yaml.example
│   └── bootstrap.yaml.example
├── scripts/
│   ├── install.sh                # Installation script
│   └── bootstrap.sh              # Bootstrap script
├── docs/
│   └── operations.md             # Operational procedures
├── go.mod
├── go.sum
└── README.md

Key Requirements

Certificate Management

  • 7-day certificate validity period
  • Renewal when 30% of the lifetime remains (2.1 days before expiry, i.e. at day 4.9 of 7)
  • Support for both bootstrap certificates and API-managed certificates
  • Atomic certificate replacement with backup retention (keep 3 versions)
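
The renewal threshold above can be sketched as a small calculation. This is a minimal illustration, not the service's actual scheduler: the `validity` type and the `renewalLead`, `renewAt`, and `needsRenewal` names are hypothetical, and the sketch assumes the lifetime is taken from the certificate's NotBefore and NotAfter fields.

```go
package main

import "time"

// renewalPercent is the share of the certificate lifetime that must remain
// when renewal starts (30%, per the requirement above).
const renewalPercent = 30

// validity mirrors the NotBefore/NotAfter fields of x509.Certificate.
// (Hypothetical type, used only to keep this sketch small.)
type validity struct {
	NotBefore time.Time
	NotAfter  time.Time
}

// renewalLead returns how long before expiry renewal should begin.
// Dividing before multiplying avoids overflow for very long lifetimes.
func renewalLead(lifetime time.Duration) time.Duration {
	return lifetime / 100 * renewalPercent
}

// renewAt returns the instant at which renewal should begin.
func (v validity) renewAt() time.Time {
	lifetime := v.NotAfter.Sub(v.NotBefore)
	return v.NotAfter.Add(-renewalLead(lifetime))
}

// needsRenewal reports whether now is at or past the renewal instant.
func (v validity) needsRenewal(now time.Time) bool {
	return !now.Before(v.renewAt())
}
```

For a 7-day certificate, `renewalLead` yields 50.4 hours, so `needsRenewal` stays false until day 4.9 and is true from then on.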

Authentication

  • mTLS authentication with Keycloak for JWT token exchange
  • JWT tokens valid for 1 hour, refreshed 10 minutes before expiry
  • Cached JWT tokens for resilience during network issues
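
One way to read the token rules above is as a small cache that separates "still usable" from "due for refresh". The sketch below assumes that split; `tokenCache`, `store`, and `get` are hypothetical names, and how the token is actually obtained from Keycloak is left out.

```go
package main

import (
	"sync"
	"time"
)

// refreshMargin is how long before expiry a token should be refreshed
// (10 minutes, per the requirement above).
const refreshMargin = 10 * time.Minute

// tokenCache holds the most recent JWT and its expiry. (Hypothetical type.)
type tokenCache struct {
	mu        sync.Mutex
	token     string
	expiresAt time.Time
}

// store records a freshly issued token.
func (c *tokenCache) store(token string, expiresAt time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.token, c.expiresAt = token, expiresAt
}

// get returns the cached token. usable is false once the token has expired;
// refresh is true once the token is inside the refresh margin or missing.
func (c *tokenCache) get(now time.Time) (token string, usable, refresh bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.token == "" {
		return "", false, true
	}
	usable = now.Before(c.expiresAt)
	refresh = !now.Before(c.expiresAt.Add(-refreshMargin))
	return c.token, usable, refresh
}
```

Keeping `usable` and `refresh` separate lets the caller keep using a cached token during a short network outage while a refresh is retried in the background.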

Network & Resilience

  • All AWS connectivity via Tailscale VPN
  • Graceful degradation during network outages
  • Exponential backoff for retries: 1 min, 5 min, 15 min
  • Alert after 3-5 consecutive renewal failures
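
The retry and alerting rules above might be tracked with a small failure counter. In this sketch the delays come from a table because the listed values do not follow a single fixed multiplier. `retryState` and its methods are hypothetical names, and `alertThreshold = 3` is an assumed value from the stated 3-5 range.

```go
package main

import "time"

// retrySchedule lists the delays from the requirement above. Once the
// schedule is exhausted, the last delay is reused.
var retrySchedule = []time.Duration{
	time.Minute,
	5 * time.Minute,
	15 * time.Minute,
}

// alertThreshold is an assumed value inside the 3-5 range stated above.
const alertThreshold = 3

// retryState tracks consecutive renewal failures. (Hypothetical type.)
type retryState struct {
	failures int
}

// fail records one failure and returns the delay before the next attempt
// and whether the alert threshold has been reached.
func (s *retryState) fail() (delay time.Duration, alert bool) {
	s.failures++
	idx := s.failures - 1
	if idx >= len(retrySchedule) {
		idx = len(retrySchedule) - 1
	}
	return retrySchedule[idx], s.failures >= alertThreshold
}

// succeed resets the counter after a successful renewal.
func (s *retryState) succeed() {
	s.failures = 0
}
```

With these values, three consecutive failures produce delays of 1, 5, and 15 minutes and raise the alert on the third; any success resets the sequence.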

Security

  • Private keys generated locally, never transmitted
  • Restrictive file permissions (certs: 0644, keys: 0600)
  • TLS 1.3 for all external communications
  • Audit logging for all certificate operations
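
The "generated locally, never transmitted" rule is usually met by sending only a certificate signing request (CSR) to the issuer. The sketch below shows that pattern with the Go standard library. The `newKeyAndCSR` name is hypothetical, and the ECDSA P-256 key type and the CSR-based flow are assumptions, since the issue does not specify either.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
)

// newKeyAndCSR generates a private key on the local machine and returns the
// PEM-encoded key together with a PEM-encoded CSR. Only the CSR would leave
// the machine; the key stays on disk with mode 0600.
// ECDSA P-256 is an assumption, not a requirement from the issue.
func newKeyAndCSR(commonName string) (keyPEM, csrPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	// The CSR carries the public key and a signature, never the private key.
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{CommonName: commonName},
	}
	csrDER, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, nil, err
	}
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: keyDER})
	csrPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrDER})
	return keyPEM, csrPEM, nil
}
```

A caller would write `keyPEM` to local storage and submit only `csrPEM` to the Certificate API.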

Performance

  • Maximum 50 MB resident memory usage
  • <1% CPU utilization during normal operation
  • Certificate renewal completed within 30 seconds

Acceptance Criteria

  1. Bootstrap Process

    • Service can start with a manually provisioned bootstrap certificate
    • Successfully transitions from the bootstrap certificate to an API-managed certificate
    • Never attempts to renew the bootstrap certificate
  2. Certificate Lifecycle

    • Automatically renews certificates at 30% lifetime remaining
    • Handles both initial issuance and renewal endpoints correctly
    • Maintains certificate backups and performs atomic replacements
  3. Authentication Flow

    • Successfully authenticates with Keycloak using mTLS
    • Exchanges certificates for JWT tokens
    • Caches and refreshes JWT tokens appropriately
  4. Service Integration

    • Reloads nginx, HAProxy, and other configured services after rotation
    • Provides health check endpoints for monitoring
    • Exports Prometheus metrics for observability
  5. Resilience

    • Operates in degraded mode during network outages
    • Retries failed operations with exponential backoff
    • Maintains service availability with existing certificates during failures
  6. Security & Compliance

    • Follows all specified permission requirements
    • Generates audit logs matching the specified schema
    • Never exposes private keys or sensitive data
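
Criterion 2 asks for atomic replacement with backups. A common approach is to write a temporary file in the same directory and rename it over the target. The sketch below follows that approach. `replaceAtomically` and `readVersion` are hypothetical helpers, and the `.1`/`.2`/`.3` backup suffixes are an assumed layout that the issue does not define.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// maxBackups is the number of previous versions to keep (3, per the issue).
const maxBackups = 3

// replaceAtomically writes data to path via a temp file and rename, after
// shifting older versions to path.1, path.2, path.3 (assumed naming).
func replaceAtomically(path string, data []byte, mode os.FileMode) error {
	// Create the temp file next to the target so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rotate backups: .2 -> .3, then .1 -> .2; the oldest is overwritten.
	for i := maxBackups - 1; i >= 1; i-- {
		old := fmt.Sprintf("%s.%d", path, i)
		if _, err := os.Stat(old); err == nil {
			if err := os.Rename(old, fmt.Sprintf("%s.%d", path, i+1)); err != nil {
				return err
			}
		}
	}
	// Hard-link the current version to .1 so path never disappears.
	if _, err := os.Stat(path); err == nil {
		if err := os.Link(path, path+".1"); err != nil {
			return err
		}
	}
	return os.Rename(tmp.Name(), path)
}

// readVersion returns the current file (n == 0) or the n-th backup.
func readVersion(path string, n int) ([]byte, error) {
	if n > 0 {
		path = fmt.Sprintf("%s.%d", path, n)
	}
	return os.ReadFile(path)
}
```

The `mode` argument would be 0644 for certificates and 0600 for keys. Because the final step is a single rename, a reader of `path` sees either the old file or the new one, never a partial write.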

Testing Requirements

  • Unit tests for all core components
  • Integration tests for end-to-end flows
  • Network failure simulation tests
  • Performance benchmarks to validate resource usage
  • Security validation for file permissions and key management

Documentation Deliverables

  • Operational procedures document
  • Installation and bootstrap scripts
  • Configuration examples
  • Troubleshooting guide
