Release Cortex Multi-Cluster Monitoring System v1.0.0 · umesh-khatiwada/Cortex-Multi-Cluster-Monitoring-Alert-System-Setup-Using-GCP-bucket

🎉 Overview

This release introduces an automated deployment system for Cortex (horizontally scalable long-term storage for Prometheus) on Google Cloud Platform. It supports monitoring across multiple clusters, puts authentication in front of every Cortex endpoint, and stores all data in GCS buckets.

🚀 Key Features

🏗️ Automated Infrastructure Deployment

Single-script main cluster setup with Cortex, Consul, and NGINX
Automated worker cluster deployment with Prometheus remote write
Kind cluster support for local development and testing
Server deployment compatibility for production environments

🔐 Enterprise Security

Multi-layered authentication with NGINX basic auth and Cortex multi-tenancy
Secure credential management with template-based approach
Git security protection preventing accidental credential commits
Service account isolation with least-privilege GCP permissions

📊 Monitoring & Observability

Complete ServiceMonitor setup for all Cortex components
Prometheus operator integration with custom resources
Real-time metrics collection from worker clusters
Built-in alerting capabilities with Alertmanager integration

☁️ Cloud-Native Storage

Google Cloud Storage backend for all data types
Separate bucket architecture for blocks, rules, and alerts
Automatic data lifecycle management
Cost-optimized storage configuration

📦 Components Included

Main Cluster Stack

Component	Purpose	Configuration
Cortex	Long-term storage & querying	Multi-tenant, GCS backend
Consul	Service discovery	HashiCorp Consul cluster
NGINX	Load balancer & auth proxy	Basic auth + org isolation
Prometheus	Metrics collection	kube-prometheus-stack
Alertmanager	Alert management	GCS-backed storage

Worker Cluster Stack

Component	Purpose	Configuration
Prometheus Server	Metrics collection	Remote write to Cortex
Node Exporter	System metrics	Host-level monitoring
kube-state-metrics	Kubernetes metrics	Cluster state monitoring
Alertmanager	Local alerting	Optional deployment

🛠️ Technical Specifications

System Requirements

Kubernetes: 1.20+ (Kind 0.11+ for local)
Helm: 3.5+
kubectl: 1.20+
Docker: 20.10+ (for Kind)

GCP Requirements

Project: Active GCP project with billing
Storage: 3 GCS buckets (blocks, rules, alerts)
IAM: Service account with Storage Admin permissions
Network: Internet connectivity for Helm charts
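
The prerequisites above can be prepared with the standard gsutil and gcloud CLIs. The following is a minimal sketch that only prints the commands so they can be reviewed first. The project ID, service account name, and bucket prefix are placeholders chosen for illustration, not names used by this release.

```shell
# Prints (does not run) the commands for the three buckets and the service account.
# Arguments: <project-id> <service-account-name> <bucket-prefix> (all hypothetical values).
print_gcp_setup() {
  project="$1"; sa_name="$2"; prefix="$3"
  # Service account emails follow the pattern NAME@PROJECT_ID.iam.gserviceaccount.com
  sa_email="${sa_name}@${project}.iam.gserviceaccount.com"

  # One bucket each for blocks, rules, and alerts
  for kind in blocks rules alerts; do
    echo "gsutil mb -p ${project} gs://${prefix}-${kind}"
  done

  # Service account, Storage Admin binding, and a JSON key for main_cluster/key.json
  echo "gcloud iam service-accounts create ${sa_name} --project=${project}"
  echo "gcloud projects add-iam-policy-binding ${project} --member=serviceAccount:${sa_email} --role=roles/storage.admin"
  echo "gcloud iam service-accounts keys create main_cluster/key.json --iam-account=${sa_email}"
}

# Example: print_gcp_setup my-project cortex-sa my-cortex
```

Printing first makes it easy to adjust bucket names or swap in a narrower role before anything is created.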

Resource Specifications

Main Cluster Resources

Cortex Components:
  - CPU: 100m-1000m per component
  - Memory: 512Mi-2Gi per component
  - Storage: GCS buckets (unlimited)

Supporting Services:
  - Consul: 100m CPU, 128Mi RAM
  - NGINX: 50m CPU, 64Mi RAM
  - Prometheus: 500m CPU, 1Gi RAM

Worker Cluster Resources

Prometheus Server:
  - CPU: 100m-500m
  - Memory: 512Mi-1Gi  
  - Storage: 8Gi persistent volume
  - Retention: 15 days (configurable)

🔧 Installation & Setup

Quick Start

# 1. Clone repository
git clone <repository-url>
cd GOOGLE_GCS

# 2. Setup GCP credentials
cp main_cluster/key.json.example main_cluster/key.json
# Edit key.json with your GCP service account key

# 3. Configure environment
vi main_cluster/env_variables.sh
# Update bucket names and credentials

# 4. Deploy main cluster
cd main_cluster/
./cortexMainCluster.sh

# 5. Deploy worker clusters
cd ../worker_clusters/
./workerCluster.sh
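
Before running step 4, it can help to confirm that the values edited in step 3 are actually set. This is a sketch only: the variable names below are hypothetical, and the real main_cluster/env_variables.sh may use different ones.

```shell
# Returns non-zero if any of the named variables is unset or empty.
# The body runs in a subshell, so its helper variables do not leak to the caller.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Example (hypothetical variable names):
#   . main_cluster/env_variables.sh
#   require_vars BLOCKS_BUCKET RULES_BUCKET ALERTS_BUCKET || exit 1
```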

Deployment Options

Option	Use Case	How to Select
Local Development	Testing, development	Answer 'Y' when the deployment script prompts
Server Production	Production deployment	Answer 'N' when the deployment script prompts
Multi-Region	Geographic distribution	Run workerCluster.sh against clusters in each region

🔐 Security Features

Credential Protection

✅ Template-based key management - No real credentials in git
✅ Comprehensive .gitignore - Prevents accidental commits
✅ Validation checks - Ensures proper credential setup
✅ Base64 encoding - Passwords are stored Base64-encoded (an encoding, not encryption, so access to them must still be restricted)
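
A validation check of this kind could look like the sketch below. It is not the check shipped in cortexMainCluster.sh; it only tests for two fields that every real GCP service account key contains, so an unedited template is rejected.

```shell
# Returns 0 if the file looks like a real GCP service account key, 1 otherwise.
validate_key_file() {
  key_file="$1"
  # -s: file exists and is not empty
  if [ ! -s "$key_file" ]; then
    echo "error: $key_file is missing or empty" >&2
    return 1
  fi
  # Real keys declare "type": "service_account"
  if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "error: $key_file is not a service account key" >&2
    return 1
  fi
  # Real keys embed a PEM block; a placeholder value will not contain this marker
  if ! grep -q 'BEGIN PRIVATE KEY' "$key_file"; then
    echo "error: $key_file has no private key (still the template?)" >&2
    return 1
  fi
  return 0
}

# Example: validate_key_file main_cluster/key.json || exit 1
```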

Network Security

✅ NGINX proxy - Centralized access control
✅ Basic authentication - Username/password protection
✅ Org-ID isolation - Multi-tenant data separation
✅ TLS support - HTTPS-ready configuration
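
To make the two layers concrete, the sketch below builds the headers a client would send: the HTTP Basic credential checked by NGINX and the X-Scope-OrgID header that Cortex uses to select a tenant. Whether the proxy in this deployment injects the tenant header itself is not stated here, so sending it from the client is an assumption.

```shell
# Builds the Authorization header for HTTP Basic auth.
# Base64 is a reversible encoding: anyone who sees the header can decode it,
# which is why Basic auth should travel over TLS.
basic_auth_header() {
  token=$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')
  echo "Authorization: Basic ${token}"
}

# Builds the Cortex tenant header.
tenant_header() {
  echo "X-Scope-OrgID: $1"
}

basic_auth_header user pass   # prints: Authorization: Basic dXNlcjpwYXNz

# Example request (placeholders: <nginx-ip>, tenant name):
#   curl -H "$(basic_auth_header openuser openuser)" \
#        -H "$(tenant_header tenant-a)" "http://<nginx-ip>/ready"
```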

Access Control

✅ RBAC configuration - Kubernetes role-based access
✅ Service account isolation - Minimal privilege principle
✅ Namespace separation - Logical resource isolation

📊 Monitoring Capabilities

Metrics Collection

System Metrics: CPU, memory, disk, network from all nodes
Kubernetes Metrics: Pod status, deployments, services, ingress
Application Metrics: Custom application metrics via remote write
Cortex Metrics: Internal component health and performance

Data Retention

Worker Clusters: 15 days local retention
Cortex Storage: Unlimited long-term storage in GCS
Query Performance: Optimized for historical data analysis

Alerting

Built-in Rules: Kubernetes and system-level alerts
Custom Rules: Support for custom alerting rules
Multi-channel Notifications: Email, Slack, webhook support
Alert Persistence: GCS-backed alert state management

🌐 Networking & Connectivity

Service Endpoints

Main Cluster:
  - NGINX Proxy: Port 80 (LoadBalancer)
  - Prometheus: Port 9092 (LoadBalancer)  
  - Consul: Port 8500 (Internal)

Worker Clusters:
  - Prometheus: Port 9090 (LoadBalancer)
  - Alertmanager: Port 9093 (LoadBalancer)

Remote Write Configuration

Endpoint: http://cortex-nginx.cortex/api/prom/push
Authentication: Basic auth with configurable credentials
Reliability: Queue configuration with retry logic
Performance: Batch sending with configurable timeouts
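
The settings above map onto Prometheus's remote_write section. The sketch below renders one from shell. The field names are standard Prometheus configuration keys, but the queue and timeout values are illustrative rather than the values shipped with this release, and the tenant header is only needed if the proxy does not set it.

```shell
# Prints a remote_write section for a worker cluster's Prometheus.
# Arguments: <push-url> <username> <password> <tenant-id>
render_remote_write() {
  url="$1"; user="$2"; pass="$3"; tenant="$4"
  cat <<EOF
remote_write:
  - url: ${url}
    remote_timeout: 30s
    basic_auth:
      username: ${user}
      password: ${pass}
    headers:
      X-Scope-OrgID: ${tenant}
    queue_config:
      capacity: 2500
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
EOF
}

# Example (tenant-a and the password are placeholders):
#   render_remote_write http://cortex-nginx.cortex/api/prom/push openuser 'secret' tenant-a
```

Here max_samples_per_send and batch_send_deadline control batching, while min_backoff and max_backoff bound the retry delay after a failed send.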

🧪 Testing & Validation

Automated Tests

✅ Connectivity validation - Tests Cortex endpoint accessibility
✅ Authentication verification - Validates credential setup
✅ Service health checks - Monitors component status
✅ Data flow validation - Confirms metrics ingestion
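
Checks like these usually need to wait for pods and load balancers to come up. The sketch below is a generic retry helper, not the release's own test code: it runs any command up to a fixed number of times with a delay between attempts.

```shell
# retry <attempts> <delay-seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (<nginx-ip> is a placeholder): poll the readiness endpoint for up to a minute
#   retry 12 5 curl -fsS -u openuser:openuser "http://<nginx-ip>/ready"
```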

Manual Testing Commands

# Test main cluster
kubectl get pods -n cortex
curl -u openuser:openuser "http://<nginx-ip>/ready"

# Test worker cluster
kubectl config use-context kind-monitoring
kubectl logs deployment/prometheus-server | grep -i remote

# Test data ingestion (URL quoted so the shell does not interpret '?')
# If this path returns 404, try Cortex's default prefix: /prometheus/api/v1/query
curl -u openuser:openuser "http://<nginx-ip>/api/v1/query?query=up"

🔄 Upgrade & Maintenance

Component Updates

Helm Chart Updates: Run helm repo update to refresh the chart indexes, then helm upgrade to apply new chart versions
Container Images: Configurable in values files
Configuration Changes: Version-controlled YAML templates

Backup Procedures

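# Backup configurations ('get all' omits ConfigMaps and Secrets, so export them separately)
kubectl get all -n cortex -o yaml > cortex-backup.yaml
kubectl get configmap,secret -n cortex -o yaml > cortex-config-backup.yaml  # contains credentials; store securely

# Protect GCS data by enabling object versioning (keeps overwritten and deleted objects)
gsutil versioning set on gs://your-cortex-blocks-bucket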

Monitoring Health

Component Status: ServiceMonitor integration
Storage Usage: GCS bucket monitoring
Performance Metrics: Built-in Cortex dashboards

🐛 Troubleshooting Guide

Common Issues & Solutions

Issue	Symptoms	Solution
Key validation failed	Script exits with credential error	Update key.json with real GCP credentials
Pod startup issues	Pods in Pending/Error state	Check GCS permissions and bucket existence
Remote write failure	No data in Cortex	Verify network connectivity and credentials
Authentication errors	401/403 responses	Check basic auth configuration
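
The table above can be turned into a quick first-pass check on the proxy. This sketch maps an HTTP status code to the matching hint. The mapping mirrors the table, and treating 000 as a connectivity failure relies on curl reporting 000 when no HTTP response was received.

```shell
# Maps an HTTP status code from the NGINX proxy to a troubleshooting hint.
explain_status() {
  case "$1" in
    200)     echo "ok" ;;
    401|403) echo "check basic auth configuration" ;;
    000)     echo "verify network connectivity" ;;
    *)       echo "unexpected status $1: inspect component logs" ;;
  esac
}

# Example (<nginx-ip> is a placeholder):
#   code=$(curl -s -o /dev/null -w '%{http_code}' -u openuser:openuser "http://<nginx-ip>/ready")
#   explain_status "$code"
```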

Debug Commands

# Check all components
kubectl get pods -n cortex
kubectl get services -n cortex

# View logs
kubectl logs -n cortex deployment/cortex-distributor
kubectl logs deployment/prometheus-server

# Test connectivity
kubectl exec -it <pod> -- wget -O- http://consul-server:8500/v1/status/leader

📈 Performance Metrics

Scalability Targets

Ingestion Rate: 100K+ samples/second per Cortex cluster
Query Performance: Sub-second response for recent data
Storage Efficiency: ~1.5 bytes per sample with compression
Retention: Unlimited unless capped by GCS lifecycle policies
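
As a rough worked example, the ingestion and storage figures above can be combined into a daily storage estimate. This is back-of-the-envelope arithmetic only: it ignores index overhead, replication, and series churn, all of which affect real usage.

```shell
# bytes per day = samples_per_second * bytes_per_sample * 86400 seconds
# bytes_per_sample is passed in tenths (15 means 1.5) because shell
# arithmetic is integer-only.
bytes_per_day() {
  echo $(( $1 * $2 * 86400 / 10 ))
}

bytes_per_day 100000 15   # prints 12960000000
```

At 100K samples/second and ~1.5 bytes per sample this comes to roughly 13 GB per day, or about 4.7 TB per year if nothing is deleted.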

Resource Utilization

Main Cluster: 2-4 CPU cores, 4-8GB RAM (minimum)
Worker Clusters: 0.5-1 CPU core, 1-2GB RAM per cluster
Network: ~10-50Mbps per worker cluster (depending on metrics volume)

🛣️ Roadmap & Future Enhancements

Planned Features

Grafana Integration - Pre-configured dashboards
Multi-cloud Support - AWS S3 and Azure Blob storage
Automated Scaling - HPA for Cortex components
Enhanced Security - mTLS and OAuth2 integration
Observability - Distributed tracing with Jaeger

Community Contributions

Terraform Modules - Infrastructure as Code
Ansible Playbooks - Configuration management
GitHub Actions - CI/CD workflows
Documentation - Video tutorials and examples

🤝 Support & Community

Getting Help

Issues: Submit GitHub issues for bugs and feature requests
Documentation: Comprehensive README and setup guides
Examples: Working configuration samples included

Contributing

Code: Submit pull requests with improvements
Documentation: Help improve guides and examples
Testing: Share deployment experiences and edge cases

📄 License & Attribution

License: [Specify your license]
Dependencies: All Helm charts and container images retain their original licenses
Attribution: Built with Cortex, Prometheus, Consul, and NGINX

🔗 References & Links

Cortex Documentation: https://cortexmetrics.io/
Prometheus Documentation: https://prometheus.io/docs/
Consul Documentation: https://www.consul.io/docs
GCP Storage Documentation: https://cloud.google.com/storage/docs
Kind Documentation: https://kind.sigs.k8s.io/

📋 Changelog

v1.0.0 (December 2024)

✨ NEW: Complete automated deployment system
✨ NEW: Multi-cluster support with worker nodes
✨ NEW: GCP integration with service account management
✨ NEW: Security-first credential management
✨ NEW: Comprehensive monitoring and alerting
✨ NEW: Production-ready NGINX proxy with authentication
✨ NEW: Automated validation and testing framework
🔧 IMPROVED: Enhanced error handling and user guidance
🔧 IMPROVED: Resource optimization and scalability
🔧 IMPROVED: Documentation and setup instructions
🛡️ SECURITY: Git protection against credential leaks
🛡️ SECURITY: Multi-layered authentication system

🎉 Thank you for using Cortex Multi-Cluster Monitoring System!

For questions, issues, or suggestions, please visit our repository or contact the development team.
