diff --git a/docs/operating/index.md b/docs/operating/index.md
index d783f6329..6be192e62 100644
--- a/docs/operating/index.md
+++ b/docs/operating/index.md
@@ -1,5 +1,5 @@
 ---
-title: Operating
+title: Operating Prometheus in Production
 sort_rank: 5
 nav_icon: settings
 ---
diff --git a/docs/operating/monitoring-prometheus.md b/docs/operating/monitoring-prometheus.md
new file mode 100644
index 000000000..48657f108
--- /dev/null
+++ b/docs/operating/monitoring-prometheus.md
@@ -0,0 +1,514 @@
+---
+title: Monitoring Prometheus
+sort_rank: 2
+---
+
+# Monitoring Prometheus
+
+Meta-monitoring (monitoring your monitoring system) is critical for production reliability. This guide covers essential metrics, alerting rules, and dashboards for monitoring the health of your Prometheus infrastructure.
+
+## Essential Prometheus Metrics
+
+### Memory and Performance Metrics
+
+```promql
+# Ingestion volume and memory pressure indicators
+prometheus_tsdb_head_samples_appended_total
+prometheus_tsdb_symbol_table_size_bytes
+
+# Active series and cardinality
+prometheus_tsdb_head_series
+prometheus_tsdb_head_chunks
+
+# Storage utilization
+prometheus_tsdb_blocks_loaded
+prometheus_tsdb_compactions_total
+prometheus_tsdb_compactions_failed_total
+```
+
+### Query Performance Monitoring
+
+```promql
+# Query latency. This metric is a summary, not a histogram, so the
+# quantiles are pre-computed and exposed via the "quantile" label.
+max by (slice) (
+  prometheus_engine_query_duration_seconds{quantile="0.9"}
+)
+
+# Concurrent queries
+prometheus_engine_queries_concurrent_max
+prometheus_engine_queries
+
+# Average query duration per slice
+rate(prometheus_engine_query_duration_seconds_sum[5m])
+  / rate(prometheus_engine_query_duration_seconds_count[5m])
+```
+
+### Ingestion and Scraping Health
+
+```promql
+# Samples ingested per second
+rate(prometheus_tsdb_head_samples_appended_total[5m])
+
+# Failed scrapes
+up == 0
+
+# Sample-limit violations and scrape duration
+prometheus_target_scrapes_exceeded_sample_limit_total
+prometheus_target_scrape_duration_seconds
+```
+
+### Storage Health
+
+```promql
+# WAL disk usage
+prometheus_tsdb_wal_fsync_duration_seconds +prometheus_tsdb_wal_corruptions_total + +# Compaction metrics +rate(prometheus_tsdb_compactions_total[5m]) +prometheus_tsdb_compactions_failed_total + +# Block loading issues +prometheus_tsdb_blocks_loaded +prometheus_tsdb_head_truncations_failed_total +``` + +## Critical Alerting Rules + +### **Prometheus Monitoring Mixins** + +Instead of maintaining alerting rules inline (which can become outdated), we recommend using the official Prometheus monitoring mixins that are maintained alongside the codebase: + +**📋 Official Prometheus Monitoring Mixin** +- **Repository**: [prometheus/prometheus](https://github.com/prometheus/prometheus/tree/main/documentation/prometheus-mixin) +- **Maintained**: Versioned with Prometheus releases +- **Coverage**: Production-ready alerts for Prometheus infrastructure health +- **Installation**: Follow the mixin documentation for your environment + +**Key Alert Categories Covered**: +- Prometheus instance health and availability +- High memory usage and resource constraints +- Query performance and latency issues +- Storage and WAL-related problems +- Target scraping failures and connectivity + +**🔗 Additional Community Mixins**: +- [monitoring-mixins/prometheus-mixin](https://monitoring.mixins.dev/prometheus/) - Community-maintained alerts +- [grafana/jsonnet-libs](https://github.com/grafana/jsonnet-libs) - Grafana Labs mixins + +### **Example Custom Alerting Rules** + +For organizations needing custom alerts beyond the mixins, here are example patterns. 
**Note**: These are templates that should be adapted and tested for your specific environment: + +```yaml +# Example: Custom capacity planning alerts +# ⚠️ Disclaimer: Test thoroughly in your environment before production use +groups: +- name: prometheus.capacity.examples + rules: + - alert: PrometheusHighMemoryUsageCustom + expr: | + ( + process_resident_memory_bytes{job="prometheus"} / + (1024^3) # Convert to GB + ) > 8 # Adjust threshold for your deployment + for: 15m + labels: + severity: warning + annotations: + summary: "Prometheus {{ $labels.instance }} memory usage is high" + description: "Memory usage is {{ $value }}GB, consider scaling or optimization." + + - alert: PrometheusIngestionRateIncreasing + expr: | + predict_linear( + rate(prometheus_tsdb_head_samples_appended_total[1h])[4h:], + 24*3600 + ) > 50000 # Adjust based on your capacity + for: 30m + labels: + severity: warning + annotations: + summary: "Prometheus ingestion rate trending high" + description: "Predicted to exceed 50k samples/sec within 24 hours." 
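+
+  # Additional example template (not from the official mixin): WAL corruptions
+  # should stay at zero, so any increase is worth investigating immediately.
+  # Adapt the lookback window and severity to your environment.
+  - alert: PrometheusWALCorruptionsCustom
+    expr: increase(prometheus_tsdb_wal_corruptions_total[3h]) > 0
+    for: 0m
+    labels:
+      severity: critical
+    annotations:
+      summary: "Prometheus {{ $labels.instance }} reported WAL corruptions"
+      description: "{{ $value }} WAL corruption(s) observed in the last 3 hours. Check disk health and the Prometheus logs."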
+``` + +**📝 Important Notes**: +- These are **example templates** - adapt thresholds for your environment +- Test thoroughly before deploying to production +- Consider contributing improvements back to the official mixins + +## Monitoring Dashboard + +### Grafana Dashboard JSON + +```json +{ + "dashboard": { + "title": "Prometheus Overview", + "panels": [ + { + "title": "Prometheus Instances Status", + "type": "stat", + "targets": [ + { + "expr": "up{job=\"prometheus\"}", + "legendFormat": "{{ instance }}" + } + ] + }, + { + "title": "Memory Usage", + "type": "graph", + "targets": [ + { + "expr": "process_resident_memory_bytes{job=\"prometheus\"}", + "legendFormat": "RSS Memory - {{ instance }}" + }, + { + "expr": "process_virtual_memory_bytes{job=\"prometheus\"}", + "legendFormat": "Virtual Memory - {{ instance }}" + } + ] + }, + { + "title": "Query Performance", + "type": "graph", + "targets": [ + { + "expr": "histogram_quantile(0.95, rate(prometheus_engine_query_duration_seconds_bucket[5m]))", + "legendFormat": "95th percentile" + }, + { + "expr": "histogram_quantile(0.50, rate(prometheus_engine_query_duration_seconds_bucket[5m]))", + "legendFormat": "50th percentile" + } + ] + }, + { + "title": "Active Series", + "type": "graph", + "targets": [ + { + "expr": "prometheus_tsdb_head_series", + "legendFormat": "{{ instance }}" + } + ] + }, + { + "title": "Ingestion Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])", + "legendFormat": "Samples/sec - {{ instance }}" + } + ] + }, + { + "title": "Storage Usage", + "type": "graph", + "targets": [ + { + "expr": "prometheus_tsdb_blocks_loaded", + "legendFormat": "Blocks Loaded - {{ instance }}" + } + ] + } + ] + } +} +``` + +## Health Check Endpoints + +### **Example HTTP Health Checks** + +The following are example scripts for monitoring Prometheus health endpoints. 
**⚠️ Disclaimer**: These are templates that should be tested and adapted for your specific environment - no CI validates these scripts. + +```bash +#!/bin/bash +# example-prometheus-health-check.sh +# ⚠️ Test thoroughly in your environment before production use + +PROMETHEUS_URL="http://localhost:9090" + +# Basic health check +echo "=== Basic Health Check ===" +curl -s "$PROMETHEUS_URL/-/healthy" || echo "Health check failed" + +# Readiness check +echo "=== Readiness Check ===" +curl -s "$PROMETHEUS_URL/-/ready" || echo "Readiness check failed" + +# Configuration reload status +echo "=== Configuration Status ===" +CONFIG_STATUS=$(curl -s "$PROMETHEUS_URL/api/v1/status/config" | jq '.status') +echo "Config reload status: $CONFIG_STATUS" + +# Target status +echo "=== Target Status ===" +UP_TARGETS=$(curl -s "$PROMETHEUS_URL/api/v1/targets" | jq '.data.activeTargets | map(select(.health == "up")) | length') +TOTAL_TARGETS=$(curl -s "$PROMETHEUS_URL/api/v1/targets" | jq '.data.activeTargets | length') +echo "Healthy targets: $UP_TARGETS/$TOTAL_TARGETS" + +# Runtime information +echo "=== Runtime Information ===" +curl -s "$PROMETHEUS_URL/api/v1/status/runtimeinfo" | jq '.' 
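+
+# TSDB head statistics (series count, chunk count, top metrics by
+# cardinality) via the /api/v1/status/tsdb endpoint. Note: this endpoint
+# requires a reasonably recent Prometheus release; drop this check if your
+# version does not expose it.
+echo "=== TSDB Head Stats ==="
+curl -s "$PROMETHEUS_URL/api/v1/status/tsdb" | jq '.data.headStats'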
+``` + +**📝 Usage Notes**: +- Requires `curl` and `jq` to be installed +- Adjust `PROMETHEUS_URL` for your deployment +- Consider adding authentication headers if Prometheus is secured +- Test timeout and error handling for your environment + +### Kubernetes Health Checks + +```yaml +# Kubernetes probes for Prometheus StatefulSet +livenessProbe: + httpGet: + path: /-/healthy + port: 9090 + initialDelaySeconds: 30 + periodSeconds: 15 + timeoutSeconds: 10 + failureThreshold: 3 + +readinessProbe: + httpGet: + path: /-/ready + port: 9090 + initialDelaySeconds: 30 + periodSeconds: 5 + timeoutSeconds: 5 + failureThreshold: 3 +``` + +## Performance Monitoring Queries + +### Memory Analysis + +```promql +# Top metrics by memory usage +topk(10, + prometheus_tsdb_symbol_table_size_bytes + + prometheus_tsdb_head_chunks_bytes +) + +# Memory usage by component +sum by (job) (process_resident_memory_bytes{job="prometheus"}) + +# Memory growth rate +increase(process_resident_memory_bytes{job="prometheus"}[1h]) +``` + +### Query Analysis + +```promql +# Most expensive queries by duration +topk(10, + rate(prometheus_engine_query_duration_seconds_sum[5m]) / + rate(prometheus_engine_query_duration_seconds_count[5m]) +) + +# Query concurrency +prometheus_engine_queries_concurrent_max + +# Failed queries +rate(prometheus_engine_queries_total{result="error"}[5m]) +``` + +### Storage Analysis + +```promql +# WAL size growth +increase(prometheus_tsdb_wal_segment_current[1h]) + +# Compaction duration +prometheus_tsdb_compaction_duration_seconds + +# Block size distribution +histogram_quantile(0.95, prometheus_tsdb_compaction_chunk_size_bytes_bucket) +``` + +## Automated Monitoring Scripts + +### Daily Health Report + +```bash +#!/bin/bash +# daily-prometheus-report.sh + +PROMETHEUS_URL="http://localhost:9090" +REPORT_DATE=$(date +%Y-%m-%d) +REPORT_FILE="/var/log/prometheus/daily-report-$REPORT_DATE.txt" + +echo "Prometheus Daily Health Report - $REPORT_DATE" > $REPORT_FILE +echo 
"================================================" >> $REPORT_FILE + +# Instance status +echo "Instance Status:" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=up{job=\"prometheus\"}" | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Memory usage +echo -e "\nMemory Usage (GB):" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=process_resident_memory_bytes{job=\"prometheus\"}/1024/1024/1024" | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Active series +echo -e "\nActive Series:" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=prometheus_tsdb_head_series" | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Query performance +echo -e "\nQuery Performance (95th percentile, seconds):" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=histogram_quantile(0.95, rate(prometheus_engine_query_duration_seconds_bucket[24h]))" | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Failed scrapes +echo -e "\nFailed Scrapes:" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=count by (job) (up == 0)" | \ + jq -r '.data.result[] | "\(.metric.job): \(.value[1])"' >> $REPORT_FILE + +echo "Report generated: $REPORT_FILE" +``` + +### Capacity Planning Script + +```bash +#!/bin/bash +# capacity-planning.sh + +PROMETHEUS_URL="http://localhost:9090" + +echo "Prometheus Capacity Planning Report" +echo "==================================" + +# Current metrics +CURRENT_SERIES=$(curl -s "$PROMETHEUS_URL/api/v1/query?query=prometheus_tsdb_head_series" | jq '.data.result[0].value[1] | tonumber') +CURRENT_MEMORY=$(curl -s "$PROMETHEUS_URL/api/v1/query?query=process_resident_memory_bytes{job=\"prometheus\"}" | jq '.data.result[0].value[1] | tonumber') +INGESTION_RATE=$(curl -s "$PROMETHEUS_URL/api/v1/query?query=rate(prometheus_tsdb_head_samples_appended_total[1h])" | jq 
'.data.result[0].value[1] | tonumber') + +echo "Current active series: $CURRENT_SERIES" +echo "Current memory usage: $(echo "$CURRENT_MEMORY / 1024 / 1024 / 1024" | bc) GB" +echo "Current ingestion rate: $(echo "$INGESTION_RATE" | bc) samples/sec" + +# Projected growth (30 days) +PROJECTED_SERIES=$(echo "$CURRENT_SERIES * 1.1" | bc) # 10% growth +PROJECTED_MEMORY=$(echo "$CURRENT_MEMORY * 1.1" | bc) + +echo -e "\nProjected in 30 days (10% growth):" +echo "Projected series: $PROJECTED_SERIES" +echo "Projected memory: $(echo "$PROJECTED_MEMORY / 1024 / 1024 / 1024" | bc) GB" + +# Recommendations +if (( $(echo "$CURRENT_SERIES > 500000" | bc -l) )); then + echo -e "\nRecommendation: Consider horizontal scaling or series optimization" +fi + +if (( $(echo "$CURRENT_MEMORY > 8589934592" | bc -l) )); then # 8GB + echo -e "\nRecommendation: Monitor memory usage closely, consider memory optimization" +fi +``` + +## Log Analysis + +### Important Log Patterns + +```bash +# Monitor Prometheus logs for issues +tail -f /var/log/prometheus/prometheus.log | grep -E "(error|warn|panic|fatal)" + +# Common error patterns to watch for: +# - "out of memory" +# - "too many open files" +# - "context deadline exceeded" +# - "compaction failed" +# - "WAL corruption" +``` + +### Log Aggregation Query (if using Loki) + +```logql +# Prometheus error analysis +{job="prometheus"} |= "error" | json | line_format "{{ .level }}: {{ .msg }}" + +# Memory pressure indicators +{job="prometheus"} |~ "memory|OOM|out of memory" + +# Query performance issues +{job="prometheus"} |~ "slow|timeout|deadline exceeded" +``` + +## Troubleshooting Playbook + +### High Memory Usage + +1. **Check active series**: `prometheus_tsdb_head_series` +2. **Identify high-cardinality metrics**: Use cardinality analysis queries +3. **Review scrape configurations**: Look for unnecessary labels +4. **Consider series dropping**: Use `metric_relabel_configs` + +### Slow Queries + +1. 
**Enable query logging**: set the `query_log_file` option in the `global` section of the configuration file (this is a reloadable config option, not a command-line flag)
+2. **Analyze query patterns**: Review the most expensive queries
+3. **Optimize query structure**: Use recording rules for complex queries
+4. **Increase the query timeout**: `--query.timeout`, if appropriate
+
+### Storage Issues
+
+1. **Check disk space**: Monitor filesystem usage
+2. **Review retention settings**: Adjust retention time/size
+3. **Monitor compaction**: Check for failed compactions
+4. **WAL monitoring**: Watch WAL size growth
+
+## Integration with External Monitoring
+
+### Exporting Metrics to Another Prometheus
+
+```yaml
+# Remote write configuration for meta-monitoring
+remote_write:
+  - url: "http://meta-prometheus:9090/api/v1/write"
+    queue_config:
+      capacity: 10000
+      max_samples_per_send: 1000
+    write_relabel_configs:
+      - source_labels: [__name__]
+        regex: "prometheus_.*"
+        action: keep
+```
+
+### Alertmanager Integration
+
+```yaml
+# Alertmanager configuration for Prometheus alerts
+route:
+  group_by: ['alertname', 'instance']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'prometheus-alerts'
+  routes:
+    - match:
+        severity: critical
+      receiver: 'prometheus-critical'
+
+receivers:
+- name: 'prometheus-alerts'
+  slack_configs:
+    - api_url: 'YOUR_SLACK_WEBHOOK'
+      channel: '#prometheus-alerts'
+
+- name: 'prometheus-critical'
+  pagerduty_configs:
+    - service_key: 'YOUR_PAGERDUTY_KEY'
+```
+
+---
+
+This setup keeps your Prometheus infrastructure itself observable. Regularly reviewing these metrics and alerts will help you maintain reliable monitoring for your production environments.
\ No newline at end of file diff --git a/docs/operating/production-deployment.md b/docs/operating/production-deployment.md new file mode 100644 index 000000000..78339ff2c --- /dev/null +++ b/docs/operating/production-deployment.md @@ -0,0 +1,497 @@ +--- +title: Production Deployment Guide +sort_rank: 1 +--- + +# Production Deployment Guide + +This guide provides comprehensive recommendations for deploying Prometheus in production environments. It covers hardware requirements, high availability patterns, configuration best practices, and operational considerations for running Prometheus at scale. + +## Hardware and Infrastructure Requirements + +### Server Specifications + +**Memory Requirements** +- **Minimum**: 4GB RAM for small deployments (< 10k active series) +- **Recommended**: 16-32GB RAM for medium deployments (10k-100k active series) +- **Large Scale**: 64GB+ RAM for large deployments (100k+ active series) + +**CPU Requirements** +- **Minimum**: 2 CPU cores +- **Recommended**: 4-8 CPU cores for most production workloads +- **Large Scale**: 16+ CPU cores for high-cardinality environments + +**Storage Requirements** +- **SSD strongly recommended** for data directory +- **Disk space calculation**: `retention_days * daily_ingestion_rate * compression_ratio` + - Typical compression ratio: 1.5-3x + - Example: 30 days * 1GB/day * 2 = 60GB storage needed +- **Separate disk** for WAL (Write-Ahead Log) recommended for high-throughput deployments + +### Network Considerations + +```yaml +# Recommended firewall rules +ingress: + - port: 9090 # Prometheus web UI and API + protocol: TCP + sources: ["monitoring-subnet", "admin-subnet"] + + - port: 9091 # Pushgateway (if used) + protocol: TCP + sources: ["application-subnets"] + +egress: + - port: 80/443 # Scraping HTTP/HTTPS targets + protocol: TCP + destinations: ["0.0.0.0/0"] + + - port: 9100 # Node exporter + protocol: TCP + destinations: ["infrastructure-subnets"] +``` + +## High Availability Deployment Patterns + 
+### Active-Active Configuration + +Deploy multiple identical Prometheus instances scraping the same targets: + +```yaml +# prometheus-1.yml +global: + scrape_interval: 15s + evaluation_interval: 15s + external_labels: + replica: 'prometheus-1' + cluster: 'production' + +scrape_configs: + - job_name: 'application-servers' + static_configs: + - targets: ['app1:8080', 'app2:8080', 'app3:8080'] +``` + +```yaml +# prometheus-2.yml +global: + scrape_interval: 15s + evaluation_interval: 15s + external_labels: + replica: 'prometheus-2' + cluster: 'production' + +scrape_configs: + - job_name: 'application-servers' + static_configs: + - targets: ['app1:8080', 'app2:8080', 'app3:8080'] +``` + +**Benefits:** +- No single point of failure +- Load distribution for queries +- Natural data redundancy + +**Considerations:** +- Requires deduplication in query layer (Thanos, Cortex, or VictoriaMetrics) +- Double storage requirements +- Alert rule evaluation happens on both instances + +### Federation for Hierarchical Scaling + +```yaml +# Global Prometheus configuration +scrape_configs: + - job_name: 'prometheus-federation' + scrape_interval: 15s + honor_labels: true + metrics_path: '/federate' + params: + 'match[]': + - '{job=~"prometheus|node|kubernetes-.*"}' + - 'up' + - 'prometheus_build_info' + static_configs: + - targets: + - 'prometheus-region-us-east:9090' + - 'prometheus-region-us-west:9090' + - 'prometheus-region-eu:9090' +``` + +## Production Configuration Best Practices + +### Storage Configuration + +```yaml +# Command line flags for storage optimization +--storage.tsdb.path=/prometheus/data +--storage.tsdb.retention.time=30d +--storage.tsdb.retention.size=100GB +--storage.tsdb.wal-compression +--storage.tsdb.no-lockfile +--web.enable-lifecycle +--web.enable-admin-api +``` + +### Memory Optimization + +```yaml +# Limit memory usage and optimize for large deployments +--storage.tsdb.head-chunks-write-queue-size=10000 +--query.max-concurrency=20 +--query.timeout=2m 
+--query.max-samples=50000000 +``` + +### Sample Configuration File + +```yaml +# /etc/prometheus/prometheus.yml +global: + scrape_interval: 30s + scrape_timeout: 10s + evaluation_interval: 30s + external_labels: + environment: 'production' + datacenter: 'us-east-1' + +rule_files: + - "/etc/prometheus/rules/*.yml" + +alerting: + alertmanagers: + - static_configs: + - targets: + - alertmanager-1:9093 + - alertmanager-2:9093 + timeout: 10s + +scrape_configs: + # Prometheus itself + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + scrape_interval: 30s + metrics_path: /metrics + + # Node exporter for system metrics + - job_name: 'node-exporter' + static_configs: + - targets: + - 'node1:9100' + - 'node2:9100' + - 'node3:9100' + scrape_interval: 30s + + # Application metrics + - job_name: 'application' + static_configs: + - targets: + - 'app1:8080' + - 'app2:8080' + scrape_interval: 15s + metrics_path: /metrics + scrape_timeout: 10s + +# Remote write for long-term storage (optional) +remote_write: + - url: "https://remote-storage-endpoint/api/v1/write" + queue_config: + capacity: 2500 + max_shards: 200 + min_shards: 1 + max_samples_per_send: 500 + batch_send_deadline: 5s +``` + +## Container Deployment + +### **Official Deployment Examples** + +For production-ready deployment configurations, we recommend using the official examples that are maintained and tested: + +**📁 Prometheus Examples Repository** +- **Location**: [prometheus/prometheus/documentation/examples](https://github.com/prometheus/prometheus/tree/main/documentation/examples) +- **Maintained**: Versioned with Prometheus releases +- **Tested**: Validated configurations for various deployment scenarios + +### **Docker Configuration** + +**📋 Basic Docker Setup Example** + +```dockerfile +# Example Dockerfile for production Prometheus +FROM prom/prometheus:latest + +# Copy configuration +COPY prometheus.yml /etc/prometheus/ +COPY rules/ /etc/prometheus/rules/ + +# Set proper ownership 
+USER root +RUN chown -R prometheus:prometheus /etc/prometheus/ +USER prometheus + +# Expose metrics port +EXPOSE 9090 + +# Use proper entrypoint with production flags +ENTRYPOINT ["/bin/prometheus", \ + "--config.file=/etc/prometheus/prometheus.yml", \ + "--storage.tsdb.path=/prometheus", \ + "--storage.tsdb.retention.time=30d", \ + "--storage.tsdb.wal-compression", \ + "--web.console.libraries=/etc/prometheus/console_libraries", \ + "--web.console.templates=/etc/prometheus/consoles", \ + "--web.enable-lifecycle", \ + "--web.external-url=https://prometheus.company.com"] +``` + +### **Kubernetes Deployment** + +**📋 Recommended Approach**: Use official Helm charts or kustomize examples + +**Official Resources**: +- **Prometheus Community Helm Chart**: [prometheus-community/helm-charts](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) +- **Prometheus Operator**: [prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator) +- **Official Examples**: [prometheus/prometheus examples](https://github.com/prometheus/prometheus/tree/main/documentation/examples) + +**📝 Key Kubernetes Considerations**: +- Use StatefulSets for data persistence +- Configure proper resource requests and limits +- Set up horizontal pod autoscaling carefully +- Use persistent volumes for data storage +- Configure proper security contexts +- Set up monitoring and alerting for the Kubernetes deployment itself + +**Example Resource Requirements**: +```yaml +# Example resource configuration - adjust for your needs +resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "2" +``` + +### **High Availability with Helm** + +For production HA deployments, consider the prometheus-community Helm chart with these key configurations: + +```bash +# Example Helm installation with HA configuration +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts +helm repo update + +# Install 
with custom values for HA
helm install prometheus prometheus-community/prometheus \
  --set server.replicaCount=2 \
  --set server.persistentVolume.size=100Gi \
  --set server.retention=30d \
  --namespace monitoring \
  --create-namespace
+```
+
+**📋 Important**: Always customize the values.yaml file for your specific requirements. See the [official chart documentation](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) for all available options.
+
+## Security Hardening
+
+### Authentication and Authorization
+
+```yaml
+# Web configuration file (passed to Prometheus via --web.config.file)
+
+# Basic auth users (bcrypt-hashed passwords)
+basic_auth_users:
+  admin: $2a$10$hYoOolb6tZyZQkEJ8T8jIuJ6U.4FK/8e8cDatYQ8F5U0QKa.4QKyC
+  readonly: $2a$10$ZoOJlGqEEzOz5T8uFX5c8elZeT3cxBE8XuqD8qJ2z9F5x8c4U6Ty6
+
+# TLS configuration
+tls_server_config:
+  cert_file: /etc/prometheus/tls/server.crt
+  key_file: /etc/prometheus/tls/server.key
+  client_ca_file: /etc/prometheus/tls/ca.crt
+  client_auth_type: RequireAndVerifyClientCert
+```
+
+### Network Security
+
+```bash
+# Firewall rules using iptables
+# Allow Prometheus web interface from monitoring subnet only
+iptables -A INPUT -p tcp --dport 9090 -s 10.0.1.0/24 -j ACCEPT
+iptables -A INPUT -p tcp --dport 9090 -j DROP
+
+# Allow scraping from Prometheus to targets
+iptables -A OUTPUT -p tcp --dport 9100 -d 10.0.0.0/16 -j ACCEPT
+iptables -A OUTPUT -p tcp --dport 8080 -d 10.0.0.0/16 -j ACCEPT
+```
+
+## Monitoring Prometheus Performance
+
+Essential metrics to monitor for Prometheus health:
+
+```promql
+# Ingestion and memory
+prometheus_tsdb_head_samples_appended_total
+prometheus_tsdb_symbol_table_size_bytes
+
+# Storage metrics
+prometheus_tsdb_blocks_loaded
+prometheus_tsdb_compactions_total
+prometheus_tsdb_head_series
+
+# Query performance
+prometheus_engine_query_duration_seconds
+prometheus_engine_queries_concurrent_max
+```
+
+## Backup and Disaster Recovery
+
+### Snapshot-based Backup
+
+```bash
+#!/bin/bash
+# 
backup-prometheus.sh + +PROMETHEUS_URL="http://localhost:9090" +BACKUP_DIR="/backup/prometheus" +DATE=$(date +%Y%m%d_%H%M%S) + +# Create snapshot +curl -XPOST $PROMETHEUS_URL/api/v1/admin/tsdb/snapshot + +# Get snapshot name +SNAPSHOT=$(ls -t /prometheus/snapshots/ | head -1) + +# Copy snapshot to backup location +mkdir -p $BACKUP_DIR/$DATE +cp -r /prometheus/snapshots/$SNAPSHOT $BACKUP_DIR/$DATE/ + +# Compress backup +tar -czf $BACKUP_DIR/prometheus_backup_$DATE.tar.gz -C $BACKUP_DIR/$DATE . + +# Clean up old backups (keep 30 days) +find $BACKUP_DIR -name "*.tar.gz" -mtime +30 -delete + +echo "Backup completed: $BACKUP_DIR/prometheus_backup_$DATE.tar.gz" +``` + +### Recovery Procedure + +```bash +#!/bin/bash +# restore-prometheus.sh + +BACKUP_FILE="$1" +PROMETHEUS_DATA_DIR="/prometheus" + +if [ -z "$BACKUP_FILE" ]; then + echo "Usage: $0 " + exit 1 +fi + +# Stop Prometheus +systemctl stop prometheus + +# Backup current data +mv $PROMETHEUS_DATA_DIR $PROMETHEUS_DATA_DIR.backup.$(date +%s) + +# Extract backup +mkdir -p $PROMETHEUS_DATA_DIR +tar -xzf $BACKUP_FILE -C $PROMETHEUS_DATA_DIR + +# Set proper permissions +chown -R prometheus:prometheus $PROMETHEUS_DATA_DIR + +# Start Prometheus +systemctl start prometheus + +echo "Recovery completed from $BACKUP_FILE" +``` + +## Performance Tuning + +### Memory Optimization + +```bash +# JVM-style memory flags for Go garbage collection +export GOGC=100 # Default garbage collection target +export GOMEMLIMIT=8GiB # Set memory limit (Go 1.19+) + +# Start Prometheus with memory optimizations +prometheus \ + --storage.tsdb.head-chunks-write-queue-size=10000 \ + --query.max-concurrency=20 \ + --storage.tsdb.min-block-duration=2h \ + --storage.tsdb.max-block-duration=2h +``` + +### Storage Optimization + +```yaml +# Reduce cardinality by dropping unnecessary labels +metric_relabel_configs: + - source_labels: [__name__] + regex: 'go_.*' + action: drop + - source_labels: [instance] + regex: '(.*):[0-9]+' + target_label: instance + 
replacement: '${1}'
+```
+
+## Troubleshooting Common Issues
+
+### High Memory Usage
+
+```promql
+# Check for high cardinality series
+topk(10, count by (__name__)({__name__=~".+"}))
+
+# Identify sources of cardinality
+prometheus_tsdb_symbol_table_size_bytes
+prometheus_tsdb_head_series
+```
+
+### Slow Queries
+
+```promql
+# Monitor query performance
+rate(prometheus_engine_query_duration_seconds_sum[5m]) /
+rate(prometheus_engine_query_duration_seconds_count[5m])
+
+# Check for expensive queries
+prometheus_engine_queries_concurrent_max
+```
+
+### Storage Issues
+
+```bash
+# Check disk space
+df -h /prometheus
+
+# Monitor WAL size
+du -sh /prometheus/wal/
+
+# Check for missing or corrupted blocks: compare the value of the
+# prometheus_tsdb_blocks_loaded metric against the number of block
+# directories (ULID-named) on disk
+ls -d /prometheus/*/ | grep -vE 'wal|chunks_head|snapshots' | wc -l
+```
+
+## Next Steps
+
+After deploying Prometheus in production:
+
+1. Set up [monitoring of Prometheus itself](monitoring-prometheus/)
+2. Configure [alerting rules](../practices/alerting.md)
+3. Implement [backup procedures](backup-recovery/)
+4. Review [security configurations](security.md)
+5. Plan for [scaling and performance tuning](performance-tuning/)
+
+---
+
+**Additional Resources:**
+- [Prometheus Configuration Reference](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
+- [Storage Documentation](https://prometheus.io/docs/prometheus/latest/storage/)
+- [Best Practices](../practices/)
\ No newline at end of file