🎉 Overview
This release introduces a comprehensive, production-ready automated deployment system for Cortex (Prometheus long-term storage) on Google Cloud Platform. The system provides enterprise-grade monitoring with multi-cluster support, secure authentication, and scalable storage using GCS buckets.
🚀 Key Features
🏗️ Automated Infrastructure Deployment
- One-click main cluster setup with Cortex, Consul, and NGINX
- Automated worker cluster deployment with Prometheus remote write
- Kind cluster support for local development and testing
- Server deployment compatibility for production environments
🔐 Enterprise Security
- Multi-layered authentication with NGINX basic auth and Cortex multi-tenancy
- Secure credential management with template-based approach
- Git security protection preventing accidental credential commits
- Service account isolation with least-privilege GCP permissions
📊 Monitoring & Observability
- Complete ServiceMonitor setup for all Cortex components
- Prometheus operator integration with custom resources
- Real-time metrics collection from worker clusters
- Built-in alerting capabilities with Alertmanager integration
☁️ Cloud-Native Storage
- Google Cloud Storage backend for all data types
- Separate bucket architecture for blocks, rules, and alerts
- Automatic data lifecycle management
- Cost-optimized storage configuration
📦 Components Included
Main Cluster Stack
| Component | Purpose | Configuration |
|---|---|---|
| Cortex | Long-term storage & querying | Multi-tenant, GCS backend |
| Consul | Service discovery | HashiCorp Consul cluster |
| NGINX | Load balancer & auth proxy | Basic auth + org isolation |
| Prometheus | Metrics collection | kube-prometheus-stack |
| Alertmanager | Alert management | GCS-backed storage |
Worker Cluster Stack
| Component | Purpose | Configuration |
|---|---|---|
| Prometheus Server | Metrics collection | Remote write to Cortex |
| Node Exporter | System metrics | Host-level monitoring |
| kube-state-metrics | Kubernetes metrics | Cluster state monitoring |
| Alertmanager | Local alerting | Optional deployment |
🛠️ Technical Specifications
System Requirements
- Kubernetes: 1.20+ (Kind 0.11+ for local)
- Helm: 3.5+
- kubectl: 1.20+
- Docker: 20.10+ (for Kind)
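The tool requirements above can be checked before running the deployment scripts. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and it only checks that each tool is on the PATH, not which version is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools named under "System Requirements".
missing=$(missing_tools kubectl helm docker kind)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    printf 'All required tools found.\n'
fi
```

Versions still need a manual comparison against the minimums listed, for example with `kubectl version --client`, `helm version --short`, `docker --version`, and `kind version`.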
GCP Requirements
- Project: Active GCP project with billing
- Storage: 3 GCS buckets (blocks, rules, alerts)
- IAM: Service account with Storage Admin permissions
- Network: Internet connectivity for Helm charts
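The GCP prerequisites above could be provisioned roughly as follows. This sketch only prints the `gsutil` and `gcloud` commands so they can be reviewed first. The project ID, location, service account name, and bucket names are placeholders, and `print_gcp_setup` is a hypothetical helper rather than something shipped with this release.

```shell
#!/bin/sh
# Print (do not run) the commands for the GCP prerequisites.
# Usage: print_gcp_setup PROJECT LOCATION SA_NAME BUCKET...
print_gcp_setup() {
    project=$1; location=$2; sa_name=$3
    shift 3
    sa_email="${sa_name}@${project}.iam.gserviceaccount.com"

    # One bucket per data type (blocks, rules, alerts).
    for bucket in "$@"; do
        echo "gsutil mb -p ${project} -l ${location} gs://${bucket}"
    done

    # Service account, the Storage Admin role named in the requirements, and a key file.
    echo "gcloud iam service-accounts create ${sa_name} --project ${project}"
    echo "gcloud projects add-iam-policy-binding ${project} --member serviceAccount:${sa_email} --role roles/storage.admin"
    echo "gcloud iam service-accounts keys create main_cluster/key.json --iam-account ${sa_email}"
}

# Placeholder values; replace them with your own before use.
print_gcp_setup my-project us-central1 cortex-storage \
    my-cortex-blocks my-cortex-rules my-cortex-alerts
```

Once the printed commands look right, they can be run one at a time or piped to `sh`.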
Resource Specifications
Main Cluster Resources
Cortex Components:
- CPU: 100m-1000m per component
- Memory: 512Mi-2Gi per component
- Storage: GCS buckets (unlimited)
Supporting Services:
- Consul: 100m CPU, 128Mi RAM
- NGINX: 50m CPU, 64Mi RAM
- Prometheus: 500m CPU, 1Gi RAM
Worker Cluster Resources
Prometheus Server:
- CPU: 100m-500m
- Memory: 512Mi-1Gi
- Storage: 8Gi persistent volume
- Retention: 15 days (configurable)
🔧 Installation & Setup
Quick Start
# 1. Clone repository
git clone <repository-url>
cd GOOGLE_GCS
# 2. Setup GCP credentials
cp main_cluster/key.json.example main_cluster/key.json
# Edit key.json with your GCP service account key
# 3. Configure environment
vi main_cluster/env_variables.sh
# Update bucket names and credentials
# 4. Deploy main cluster
cd main_cluster/
./cortexMainCluster.sh
# 5. Deploy worker clusters
cd ../worker_clusters/
./workerCluster.sh
Deployment Options
| Option | Use Case | How to Select |
|---|---|---|
| Local Development | Testing, development | Choose 'Y' when prompted |
| Server Production | Production deployment | Choose 'N' when prompted |
| Multi-Region | Geographic distribution | Deploy workers in different regions |
🔐 Security Features
Credential Protection
- ✅ Template-based key management - No real credentials in git
- ✅ Comprehensive .gitignore - Prevents accidental commits
- ✅ Validation checks - Ensures proper credential setup
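- ✅ Base64 encoding - Passwords are stored base64-encoded; base64 is a reversible encoding, not encryption, so access to the stored values must still be restricted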
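A credential validation check of the kind listed above might look like the sketch below. It is an illustration under stated assumptions, not the check the deployment scripts actually perform: `validate_key_file` is a hypothetical helper, and it assumes the template sits next to the key with an `.example` suffix, as in `main_cluster/key.json.example`.

```shell
#!/bin/sh
# Hypothetical check: succeed only if KEY_FILE looks like a real service account key.
# Usage: validate_key_file KEY_FILE
validate_key_file() {
    key_file=$1
    template="${key_file}.example"

    if [ ! -f "$key_file" ]; then
        echo "missing: $key_file" >&2
        return 1
    fi

    # An unedited copy of the template is not a real credential.
    if [ -f "$template" ] && cmp -s "$key_file" "$template"; then
        echo "$key_file is identical to the template" >&2
        return 1
    fi

    # GCP service account keys declare this type and embed a PEM private key.
    if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file" ||
       ! grep -q 'BEGIN PRIVATE KEY' "$key_file"; then
        echo "$key_file does not look like a service account key" >&2
        return 1
    fi
    return 0
}

# Example call (commented out so the sketch has no side effects):
# validate_key_file main_cluster/key.json && echo "key looks valid"
```

The check is structural only. It cannot tell whether the key is still active or has the required permissions.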
Network Security
- ✅ NGINX proxy - Centralized access control
- ✅ Basic authentication - Username/password protection
- ✅ Org-ID isolation - Multi-tenant data separation
- ✅ TLS support - HTTPS-ready configuration
Access Control
- ✅ RBAC configuration - Kubernetes role-based access
- ✅ Service account isolation - Minimal privilege principle
- ✅ Namespace separation - Logical resource isolation
📊 Monitoring Capabilities
Metrics Collection
- System Metrics: CPU, memory, disk, network from all nodes
- Kubernetes Metrics: Pod status, deployments, services, ingress
- Application Metrics: Custom application metrics via remote write
- Cortex Metrics: Internal component health and performance
Data Retention
- Worker Clusters: 15 days local retention
- Cortex Storage: Unlimited long-term storage in GCS
- Query Performance: Optimized for historical data analysis
Alerting
- Built-in Rules: Kubernetes and system-level alerts
- Custom Rules: Support for custom alerting rules
- Multi-channel Notifications: Email, Slack, webhook support
- Alert Persistence: GCS-backed alert state management
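As an illustration of a custom rule, the sketch below prints a small Prometheus alerting rule group. The group name, alert name, threshold, and severity label are examples chosen here, not rules shipped with this release, and `print_example_rule` is a hypothetical helper.

```shell
#!/bin/sh
# Print an example Prometheus alerting rule group in YAML.
print_example_rule() {
    # The quoted delimiter stops the shell from expanding $labels in the template.
    cat <<'EOF'
groups:
  - name: example-custom-rules
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} has been down for 5 minutes"
EOF
}

print_example_rule
```

Redirecting the output to a file and running `promtool check rules` on it is one way to validate the syntax before loading the rule.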
🌐 Networking & Connectivity
Service Endpoints
Main Cluster:
- NGINX Proxy: Port 80 (LoadBalancer)
- Prometheus: Port 9092 (LoadBalancer)
- Consul: Port 8500 (Internal)
Worker Clusters:
- Prometheus: Port 9090 (LoadBalancer)
- Alertmanager: Port 9093 (LoadBalancer)
Remote Write Configuration
- Endpoint: http://cortex-nginx.cortex/api/prom/push
- Authentication: Basic auth with configurable credentials
- Reliability: Queue configuration with retry logic
- Performance: Batch sending with configurable timeouts
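The remote write settings described above correspond to a Prometheus `remote_write` block along these lines. The sketch renders one from the shell. The field names are standard Prometheus configuration keys, but the queue and timeout values are illustrative assumptions rather than the values this release deploys, and `render_remote_write` is a hypothetical helper.

```shell
#!/bin/sh
# Emit a Prometheus remote_write fragment for the Cortex endpoint.
# Usage: render_remote_write URL USERNAME PASSWORD
render_remote_write() {
    cat <<EOF
remote_write:
  - url: $1
    remote_timeout: 30s
    basic_auth:
      username: $2
      password: $3
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
EOF
}

render_remote_write http://cortex-nginx.cortex/api/prom/push openuser openuser
```

Passwords containing YAML special characters would need quoting. In practice the credentials are better supplied from a secret than written inline.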
🧪 Testing & Validation
Automated Tests
- ✅ Connectivity validation - Tests Cortex endpoint accessibility
- ✅ Authentication verification - Validates credential setup
- ✅ Service health checks - Monitors component status
- ✅ Data flow validation - Confirms metrics ingestion
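A connectivity and authentication check like those listed above can be approximated by polling the `/ready` endpoint through the NGINX proxy. This is a sketch under assumptions, not the release's own test code: `wait_ready` is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
#!/bin/sh
# Poll URL until it returns HTTP 200 or the attempts run out.
# Usage: wait_ready URL USER:PASSWORD [ATTEMPTS] [DELAY_SECONDS]
wait_ready() {
    url=$1; creds=$2; attempts=${3:-10}; delay=${4:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
        code=$(curl -s -o /dev/null -w '%{http_code}' -u "$creds" "$url")
        if [ "$code" = "200" ]; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        echo "attempt $i: HTTP ${code:-none}" >&2
        i=$((i + 1))
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example call (commented out because it needs a reachable cluster):
# wait_ready "http://<nginx-ip>/ready" openuser:openuser
```

A 401 or 403 status in the attempt log points at the basic auth setup, while 000 usually means the endpoint was unreachable.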
Manual Testing Commands
# Test main cluster
kubectl get pods -n cortex
curl -u openuser:openuser http://<nginx-ip>/ready
# Test worker cluster
kubectl config use-context kind-monitoring
kubectl logs deployment/prometheus-server | grep remote_write
# Test data ingestion
curl -u openuser:openuser "http://<nginx-ip>/api/v1/query?query=up"
🔄 Upgrade & Maintenance
Component Updates
- Helm Chart Updates: helm repo update refreshes the local chart index; new chart versions take effect on the next helm upgrade or redeployment
- Container Images: Configurable in values files
- Configuration Changes: Version-controlled YAML templates
Backup Procedures
# Backup configurations ("all" does not include ConfigMaps, so request them explicitly)
kubectl get all,configmaps -n cortex -o yaml > cortex-backup.yaml
# Backup GCS data (automatic via GCS versioning)
gsutil versioning set on gs://your-cortex-blocks-bucket
Monitoring Health
- Component Status: ServiceMonitor integration
- Storage Usage: GCS bucket monitoring
- Performance Metrics: Built-in Cortex dashboards
🐛 Troubleshooting Guide
Common Issues & Solutions
| Issue | Symptoms | Solution |
|---|---|---|
| Key validation failed | Script exits with credential error | Update key.json with real GCP credentials |
| Pod startup issues | Pods in Pending/Error state | Check GCS permissions and bucket existence |
| Remote write failure | No data in Cortex | Verify network connectivity and credentials |
| Authentication errors | 401/403 responses | Check basic auth configuration |
Debug Commands
# Check all components
kubectl get pods -n cortex
kubectl get services -n cortex
# View logs
kubectl logs -n cortex deployment/cortex-distributor
kubectl logs deployment/prometheus-server
# Test connectivity
kubectl exec -it <pod> -- wget -O- http://consul-server:8500
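When several pods are failing, it helps to narrow the list to the unhealthy ones first. The sketch below filters the default `kubectl get pods` columns (NAME, READY, STATUS, RESTARTS, AGE). `unhealthy_pods` is a hypothetical helper, not part of the repository.

```shell
#!/bin/sh
# List pods in NAMESPACE whose STATUS column is neither Running nor Completed.
# Usage: unhealthy_pods NAMESPACE
unhealthy_pods() {
    kubectl get pods -n "$1" --no-headers |
        awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example follow-up (commented out because it needs cluster access):
# for pod in $(unhealthy_pods cortex); do
#     kubectl describe pod -n cortex "$pod" | tail -n 20
# done
```

The tail of `kubectl describe pod` shows recent events, which is usually where scheduling, image pull, or mount errors appear.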
📈 Performance Metrics
Scalability Targets
- Ingestion Rate: 100K+ samples/second per Cortex cluster
- Query Performance: Sub-second response for recent data
- Storage Efficiency: ~1.5 bytes per sample with compression
- Retention: Unlimited via GCS lifecycle policies
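The ingestion and storage-efficiency targets above imply a rough daily storage volume: 100,000 samples/second at about 1.5 bytes per sample is roughly 13 GB per day. The sketch below does that arithmetic. It is a back-of-the-envelope estimate that ignores index overhead and replication, it assumes the shell has 64-bit integer arithmetic, and `bytes_per_day` is a hypothetical helper.

```shell
#!/bin/sh
# Estimate bytes written per day from ingestion rate and bytes per sample.
# Shell arithmetic is integer-only, so bytes per sample is passed multiplied
# by ten (1.5 bytes becomes 15).
# Usage: bytes_per_day SAMPLES_PER_SEC BYTES_PER_SAMPLE_X10
bytes_per_day() {
    samples_per_sec=$1; bytes_x10=$2
    echo $(( samples_per_sec * bytes_x10 * 86400 / 10 ))
}

daily=$(bytes_per_day 100000 15)
echo "bytes/day:  $daily"
echo "GB/day:     $(( daily / 1000000000 ))"
echo "GB/30 days: $(( daily * 30 / 1000000000 ))"
```

Substituting a measured ingestion rate gives a starting point for estimating GCS costs.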
Resource Utilization
- Main Cluster: 2-4 CPU cores, 4-8GB RAM (minimum)
- Worker Clusters: 0.5-1 CPU core, 1-2GB RAM per cluster
- Network: ~10-50Mbps per worker cluster (depending on metrics volume)
🤝 Support & Community
Getting Help
- Issues: Submit GitHub issues for bugs and feature requests
- Documentation: Comprehensive README and setup guides
- Examples: Working configuration samples included
Contributing
- Code: Submit pull requests with improvements
- Documentation: Help improve guides and examples
- Testing: Share deployment experiences and edge cases
📄 License & Attribution
- License: [Specify your license]
- Dependencies: All Helm charts and container images retain their original licenses
- Attribution: Built with Cortex, Prometheus, Consul, and NGINX
📋 Changelog
v1.0.0 (December 2024)
- ✨ NEW: Complete automated deployment system
- ✨ NEW: Multi-cluster support with worker nodes
- ✨ NEW: GCP integration with service account management
- ✨ NEW: Security-first credential management
- ✨ NEW: Comprehensive monitoring and alerting
- ✨ NEW: Production-ready NGINX proxy with authentication
- ✨ NEW: Automated validation and testing framework
- 🔧 IMPROVED: Enhanced error handling and user guidance
- 🔧 IMPROVED: Resource optimization and scalability
- 🔧 IMPROVED: Documentation and setup instructions
- 🛡️ SECURITY: Git protection against credential leaks
- 🛡️ SECURITY: Multi-layered authentication system
🎉 Thank you for using Cortex Multi-Cluster Monitoring System!
For questions, issues, or suggestions, please visit our repository or contact the development team.