# VPN Project DevOps Practices

This document outlines the DevOps practices implemented in the VPN project, covering CI/CD, monitoring, security, and operational excellence.

## Contents
- CI/CD Pipeline
- Infrastructure as Code
- Containerization Strategy
- Monitoring and Observability
- Security Practices
- Deployment Strategies
- Backup and Disaster Recovery
- Performance Optimization
- Compliance and Governance
- Team Collaboration
## CI/CD Pipeline

Our CI/CD pipeline is built with GitHub Actions and follows the GitOps methodology:
```text
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Code     │    │    Build    │    │    Test     │    │   Deploy    │
│   Commit    │───▶│   & Scan    │───▶│    Suite    │───▶│  Pipeline   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                  │                  │                  │
       ▼                  ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│     Git     │    │   Docker    │    │  Security   │    │     K8s     │
│    Hooks    │    │    Build    │    │  Scanning   │    │   Deploy    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
```
**Quality gates:**

- Static Code Analysis: ESLint, Black, isort, MyPy
- Security Scanning: Trivy, Bandit, Snyk
- Dependency Scanning: OWASP Dependency Check
- License Compliance: FOSSA, WhiteSource
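The same gates can be run locally before pushing. Below is a minimal sketch of such a runner; the `run_checks` helper and the exact command list are illustrative, not the project's actual tooling:

```python
# Hypothetical local quality-gate runner mirroring the CI checks.
# The tool names come from the gates above; the helper is a sketch.
import subprocess

CHECKS = [
    ["black", "--check", "src/"],
    ["isort", "--check-only", "src/"],
    ["mypy", "src/"],
    ["bandit", "-r", "src/"],
]

def run_checks(checks, runner=subprocess.run):
    """Run each check command; return a list of (tool, passed) pairs."""
    results = []
    for cmd in checks:
        proc = runner(cmd)
        results.append((cmd[0], proc.returncode == 0))
    return results

def all_passed(results):
    """True only if every gate succeeded."""
    return all(ok for _, ok in results)
```

Injecting `runner` keeps the helper testable without the tools installed.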
**Build artifacts:**

- Multi-stage Docker builds for optimization
- Multi-architecture support (AMD64, ARM64)
- Container registry with vulnerability scanning
- Artifact signing for supply chain security
**Testing strategy:**

- Unit Tests: pytest with 90%+ coverage
- Integration Tests: API and database testing
- End-to-End Tests: VPN connectivity testing
- Performance Tests: Load and stress testing
- Security Tests: Penetration testing
**Deployment practices:**

- Blue-Green Deployment for zero downtime
- Canary Releases for gradual rollouts
- Rollback Capabilities for quick recovery
- Environment Promotion (dev → staging → prod)
```yaml
# .github/workflows/ci-cd.yml
name: VPN CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master

  code-quality:
    runs-on: ubuntu-latest
    steps:
      - name: Run Black formatter check
        run: black --check src/

  build-test:
    runs-on: ubuntu-latest
    needs: [security-scan, code-quality]
    steps:
      - name: Build Docker images
        run: docker build -t vpn-api:latest .

  deploy-staging:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: staging
    steps:
      - name: Deploy to staging
        run: kubectl apply -f k8s/
```

## Infrastructure as Code

Our infrastructure is organized into reusable modules:
```text
infrastructure/
├── main.tf          # Main configuration
├── variables.tf     # Input variables
├── outputs.tf       # Output values
└── modules/
    ├── vpc/         # Network infrastructure
    ├── kubernetes/  # K8s cluster
    ├── security/    # Security groups, WAF
    └── monitoring/  # Monitoring stack
```
```hcl
# main.tf
module "aws_infrastructure" {
  count  = var.cloud_provider == "aws" ? 1 : 0
  source = "./modules/aws"

  region        = var.region
  cluster_name  = var.cluster_name
  node_count    = var.node_count
  instance_type = var.instance_type
}

module "gcp_infrastructure" {
  count  = var.cloud_provider == "gcp" ? 1 : 0
  source = "./modules/gcp"

  project_id   = var.gcp_project_id
  region       = var.region
  cluster_name = var.cluster_name
}
```

Environments are managed with Terraform workspaces:

```bash
# Development
terraform workspace select dev
terraform apply -var-file="environments/dev.tfvars"

# Staging
terraform workspace select staging
terraform apply -var-file="environments/staging.tfvars"

# Production
terraform workspace select prod
terraform apply -var-file="environments/prod.tfvars"
```

## Containerization Strategy

```dockerfile
# Build stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY . .
CMD ["python", "app.py"]
```

**Container security practices:**

- Non-root user execution
- Minimal base images (Alpine Linux)
- Security scanning in CI/CD
- Image signing and verification
- Runtime security monitoring
```yaml
# Container registry configuration
registry:
  type: "ghcr.io"
  scanning:
    enabled: true
    tools: ["trivy", "snyk"]
  signing:
    enabled: true
    key: "cosign"
```

## Monitoring and Observability

Metrics collection with Prometheus:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vpn-api'
    static_configs:
      - targets: ['vpn-api:8080']
    metrics_path: /metrics
    scrape_interval: 30s
```

Log aggregation with the ELK stack:

```conf
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "vpn-api" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "vpn-logs-%{+YYYY.MM.dd}"
  }
}
```

Distributed tracing with Jaeger:

```yaml
# jaeger configuration
jaeger:
  collector:
    endpoint: "http://jaeger-collector:14268/api/traces"
  sampler:
    type: "const"
    param: 1
```

**Key metrics:**

- VPN Performance: Connection count, throughput, latency
- System Resources: CPU, memory, disk, network
- Security Events: Failed logins, suspicious activity
- Business Metrics: User growth, usage patterns
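For reference, Prometheus scrapes `/metrics` as plain text in its exposition format. Here is a stdlib-only sketch of rendering it; a real service would use the `prometheus_client` library, and the metric names below are assumptions:

```python
# Minimal sketch of the Prometheus text exposition format served
# from a /metrics endpoint. Metric names are illustrative.
class MetricsRegistry:
    def __init__(self):
        self._gauges = {}

    def set_gauge(self, name, value, help_text=""):
        self._gauges[name] = (value, help_text)

    def render(self):
        """Render all gauges in Prometheus text format."""
        lines = []
        for name, (value, help_text) in sorted(self._gauges.items()):
            if help_text:
                lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} gauge")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"

registry = MetricsRegistry()
registry.set_gauge("vpn_active_connections", 42, "Current VPN connections")
registry.set_gauge("vpn_throughput_bytes_per_second", 1.5e6)
```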
Alerting rules:

```yaml
# prometheus-rules.yml
groups:
  - name: vpn.rules
    rules:
      - alert: VPNServerDown
        expr: up{job="wireguard"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "VPN server is down"
```

## Security Practices

Network segmentation with Kubernetes network policies:

```yaml
# Network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vpn-network-policy
spec:
  podSelector:
    matchLabels:
      app: vpn-wireguard
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: vpn-api
      ports:
        - protocol: UDP
          port: 51820
```

Pod-level hardening (note: PodSecurityPolicy was removed in Kubernetes 1.25 in favor of Pod Security Admission):

```yaml
# Pod security policy
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: vpn-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  allowedCapabilities:
    - NET_ADMIN
```

Secrets management:

```yaml
# Kubernetes secrets
apiVersion: v1
kind: Secret
metadata:
  name: vpn-secrets
type: Opaque
data:
  jwt-secret: <base64-encoded>
  database-url: <base64-encoded>
```

Container and code scanning:

```bash
# Trivy vulnerability scanning
trivy image vpn-api:latest

# Snyk container scanning
snyk container test vpn-api:latest

# Bandit security linter
bandit -r src/ -f json -o bandit-report.json

# Semgrep static analysis
semgrep --config=auto src/
```

**SOC 2 trust criteria:**

- Security: Access controls, encryption
- Availability: Uptime monitoring, backup
- Processing Integrity: Data validation, error handling
- Confidentiality: Data encryption, access controls
- Privacy: Data handling, user consent
**GDPR requirements:**

- Data Minimization: Collect only necessary data
- Right to Erasure: User data deletion
- Data Portability: Export user data
- Consent Management: User consent tracking
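The erasure and portability requirements above can be sketched against an in-memory store; the `UserDataStore` class and its fields are hypothetical, not the project's actual data layer:

```python
# Illustrative GDPR hooks: data portability (export) and right to
# erasure (delete) over a hypothetical in-memory user store.
import json

class UserDataStore:
    def __init__(self):
        self._users = {}

    def save(self, user_id, record):
        self._users[user_id] = record

    def export_user(self, user_id):
        """Data portability: return the user's data as JSON."""
        return json.dumps(self._users.get(user_id, {}), sort_keys=True)

    def erase_user(self, user_id):
        """Right to erasure: delete all stored data for the user."""
        return self._users.pop(user_id, None) is not None

store = UserDataStore()
store.save("u1", {"email": "user@example.com", "devices": 2})
```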
## Deployment Strategies

Blue-green deployment with Argo Rollouts:

```yaml
# Blue-green deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vpn-api
spec:
  strategy:
    blueGreen:
      activeService: vpn-api-active
      previewService: vpn-api-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 30
```

Canary releases:

```yaml
# Canary deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vpn-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}
```

GitOps delivery with Argo CD:

```yaml
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vpn-app
spec:
  project: default
  source:
    repoURL: https://github.com/org/vpn
    targetRevision: HEAD
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: vpn-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

## Backup and Disaster Recovery

```bash
# Database backup
kubectl exec -it vpn-api-pod -- pg_dump vpn_db > backup.sql

# Configuration backup
kubectl get all -n vpn-system -o yaml > k8s-backup.yaml

# Secrets backup
kubectl get secrets -n vpn-system -o yaml > secrets-backup.yaml

# Terraform state backup
terraform state pull > terraform-state.json

# Container image backup
docker save vpn-api:latest | gzip > vpn-api.tar.gz
```

**Recovery objectives:**

- Recovery Time Objective (RTO): < 1 hour
- Recovery Point Objective (RPO): < 15 minutes
**Recovery procedure:**

1. Infrastructure Recovery

   ```bash
   # Provision new infrastructure
   terraform apply

   # Restore Kubernetes cluster
   kubectl apply -f k8s-backup.yaml
   ```

2. Data Recovery

   ```bash
   # Restore database
   kubectl exec -it vpn-api-pod -- psql vpn_db < backup.sql

   # Restore secrets
   kubectl apply -f secrets-backup.yaml
   ```

3. Service Recovery

   ```bash
   # Restart services
   kubectl rollout restart deployment/vpn-api

   # Verify health
   kubectl get pods -n vpn-system
   ```
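A freshness check against the 15-minute RPO can be expressed in a few lines; `rpo_satisfied` is an illustrative helper, not part of the project's tooling:

```python
# Sketch: verify the newest backup satisfies the 15-minute RPO.
# Timestamps are supplied by the caller.
from datetime import datetime, timedelta

RPO = timedelta(minutes=15)

def rpo_satisfied(last_backup_at, now, rpo=RPO):
    """True if the newest backup is recent enough to meet the RPO."""
    return (now - last_backup_at) <= rpo

now = datetime(2024, 1, 1, 12, 0)
assert rpo_satisfied(datetime(2024, 1, 1, 11, 50), now)      # 10 min old
assert not rpo_satisfied(datetime(2024, 1, 1, 11, 30), now)  # 30 min old
```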
## Performance Optimization

Redis caching for expensive API calls:

```python
# Redis caching
import json
from functools import wraps

import redis

redis_client = redis.Redis(host='redis', port=6379, db=0)

def cache_result(expiration=300):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            cache_key = f"{func.__name__}:{hash(str(args) + str(kwargs))}"
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, expiration, json.dumps(result))
            return result
        return wrapper
    return decorator
```

Database connection pooling with SQLAlchemy:

```python
# Connection pooling
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # persistent connections kept open
    max_overflow=30,     # extra connections allowed under load
    pool_pre_ping=True   # validate connections before use
)
```

Horizontal scaling with Kubernetes:

```yaml
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vpn-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpn-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

```yaml
# Resource requests and limits
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"
```

## Compliance and Governance

**Change management:**

- Code Review: Required for all changes
- Automated Testing: Must pass all tests
- Security Scanning: No high/critical vulnerabilities
- Documentation: Update relevant docs
**Release management:**

- Semantic Versioning: MAJOR.MINOR.PATCH
- Release Notes: Document all changes
- Rollback Plan: Prepared for each release
- Communication: Notify stakeholders
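The MAJOR.MINOR.PATCH rule can be captured in a small helper; `bump` is illustrative, not the project's release tooling:

```python
# Sketch of semantic-version bumping per the MAJOR.MINOR.PATCH rule.
def bump(version, part):
    """Bump a 'MAJOR.MINOR.PATCH' version string by the given part."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")
```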
Audit trail for sensitive actions:

```python
# Audit logging
import logging
from datetime import datetime

audit_logger = logging.getLogger('audit')

def audit_log(action, user, resource, details=None):
    audit_logger.info({
        'timestamp': datetime.utcnow().isoformat(),
        'action': action,
        'user': user,
        'resource': resource,
        'details': details
    })
```

**Compliance frameworks:**

- SOC 2: Security controls monitoring
- GDPR: Data privacy compliance
- ISO 27001: Information security management
- PCI DSS: Payment card industry compliance
## Team Collaboration

Feature branch workflow:

```bash
# Feature branch workflow
git checkout -b feature/new-vpn-client
git add .
git commit -m "feat: add new VPN client management"
git push origin feature/new-vpn-client
# Create pull request
```

**Coding standards:**

- Python: PEP 8, Black formatter
- YAML: Consistent indentation
- Docker: Multi-stage builds
- Terraform: Consistent formatting
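The `feat:`-style commit messages used in the workflow above follow the Conventional Commits style. A sketch of a message check; the accepted type list is an assumption:

```python
# Sketch of a Conventional Commits message validator.
# The set of accepted types is illustrative.
import re

COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)(\([\w-]+\))?: .+"
)

def valid_commit_message(message):
    """True if the message matches 'type(scope)?: description'."""
    return bool(COMMIT_RE.match(message))
```

This could run as a `commit-msg` git hook to reject malformed messages early.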
**Documentation and knowledge sharing:**

- Architecture: System design and components
- API Documentation: OpenAPI/Swagger specs
- Deployment Guide: Step-by-step instructions
- Troubleshooting: Common issues and solutions
- Runbooks: Incident response procedures
- Playbooks: Standard operating procedures
- Knowledge Base: FAQs and best practices
- Training Materials: Onboarding and skill development
**Incident management:**

- Slack Channels: #vpn-alerts, #vpn-incidents
- PagerDuty: On-call rotation
- Post-mortems: Learn from incidents
- Status Page: Public communication
**Team practices:**

- Daily Standups: Progress and blockers
- Sprint Planning: Feature prioritization
- Retrospectives: Process improvement
- Architecture Reviews: Technical decisions
**Operational metrics:**

- Uptime: 99.9% availability target
- Response Time: < 200ms API response
- Throughput: VPN connections per second
- Error Rate: < 0.1% error rate
**Security metrics:**

- Vulnerability Count: Zero high/critical
- Security Incidents: Response time < 1 hour
- Compliance Score: 100% compliance
- Access Reviews: Quarterly reviews
**User experience metrics:**

- User Satisfaction: NPS score > 8
- Support Tickets: < 5% of users
- Feature Adoption: 80% adoption rate
- Performance: User-reported issues < 1%
**Delivery metrics (DORA):**

- Deployment Frequency: Daily deployments
- Lead Time: < 1 day from commit to production
- MTTR: < 30 minutes mean time to recovery
- Change Failure Rate: < 5% failure rate
**Learning and development:**

- Training Budget: $2000 per engineer per year
- Conference Attendance: 2 conferences per year
- Certification: Cloud and security certifications
- Internal Tech Talks: Monthly knowledge sharing
**Process improvement:**

- Retrospectives: Monthly process reviews
- Metrics Analysis: Quarterly KPI reviews
- Tool Evaluation: Annual tool assessment
- Best Practices: Industry standard adoption
**Innovation:**

- Proof of Concepts: Monthly POCs
- Technology Evaluation: Quarterly assessments
- Open Source Contribution: Community involvement
- Research and Development: 20% time for innovation
**Automation goals:**

- Infrastructure Automation: 100% IaC
- Testing Automation: 90% test coverage
- Deployment Automation: Zero-touch deployments
- Monitoring Automation: Self-healing systems
## Conclusion

This guide ensures that our VPN infrastructure is built, deployed, and maintained to a high standard of quality, security, and reliability. The practices outlined here provide a solid foundation for scaling the system and supporting business growth while maintaining operational excellence.