22
33This Helm chart deploys a comprehensive monitoring stack for the AI Event Concepter application using Prometheus, Grafana, and related monitoring components.
44
5+ ---
6+
7+ ## 🏗️ **Architecture**
8+
9+ ```
10+ ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
11+ │   Application   │    │   Prometheus    │    │  Alertmanager   │
12+ │    Services     │───▶│    (Metrics)    │───▶│    (Alerts)     │
13+ └─────────────────┘    └─────────────────┘    └─────────────────┘
14+                                 │                      │
15+                                 ▼                      ▼
16+                        ┌─────────────────┐    ┌─────────────────┐
17+                        │     Grafana     │    │   Email/Slack   │
18+                        │  (Dashboards)   │    │ (Notifications) │
19+                        └─────────────────┘    └─────────────────┘
20+ ```
21+
22+ ---
23+
524## Components
625
726### Core Monitoring
8- - **Prometheus**: Metrics collection and storage
9- - **Grafana**: Metrics visualization and dashboards
10- - **Alertmanager**: Alert routing and notification management
27+ - **Prometheus**: Metrics collection and storage with persistent volume (10GB)
28+ - **Grafana**: Metrics visualization and dashboards with persistent volume (5GB)
29+ - **Alertmanager**: Alert routing and notification management with email notifications
30+
31+ ### Persistent Storage
32+ - **Prometheus**: 10GB PVC for metrics retention
33+ - **Grafana**: 5GB PVC for dashboard configs
1134
1235### Metrics Exporters
1336- **Node Exporter**: System and hardware metrics
14- - **cAdvisor**: Container metrics
15- - **PostgreSQL Exporter**: Database metrics
37+ - **PostgreSQL Exporter**: Database metrics for both user and concept databases
38+ - **Blackbox Exporter**: External service availability monitoring
1639- **Spring Boot Actuator**: Application metrics (built-in)
1740
1841### Monitored Services
19- - Gateway Service (Spring Boot)
20- - User Service (Spring Boot)
21- - Concept Service (Spring Boot)
22- - GenAI Service (Flask)
23- - PostgreSQL Database
24-
25- ## Configuration
26-
27- ### Values.yaml
28-
29- The main configuration file `values.yaml` contains settings for all monitoring components:
30-
31- ```yaml
32- # Prometheus configuration
33- prometheus:
34-   image:
35-     repository: prom/prometheus
36-     tag: v2.52.0
37-   persistence:
38-     enabled: true
39-     size: 10Gi
40-   retention:
41-     time: "15d"
42-     size: "10GB"
43-
44- # Node Exporter for system metrics
45- nodeExporter:
46-   enabled: true
47-   image:
48-     repository: prom/node-exporter
49-     tag: v1.6.1
50-
51- # cAdvisor for container metrics
52- cadvisor:
53-   enabled: true
54-   image:
55-     repository: gcr.io/cadvisor/cadvisor
56-     tag: v0.47.2
57-
58- # PostgreSQL Exporter
59- postgresExporter:
60-   enabled: true
61-   database:
62-     host: postgres
63-     port: 5432
64-     name: postgres
65-     user: postgres
66-     password: password
67-
68- # Alertmanager
69- alertmanager:
70-   enabled: true
71-   persistence:
72-     enabled: true
73-     size: 1Gi
74- ```
42+ - **Gateway Service** (Spring Boot): API gateway with metrics at `/actuator/prometheus`
43+ - **User Service** (Spring Boot): User management with metrics at `/actuator/prometheus`
44+ - **Concept Service** (Spring Boot): Concept management with metrics at `/actuator/prometheus`
45+ - **GenAI Service** (Flask): AI processing with metrics at `/metrics`
46+ - **PostgreSQL Databases**: User database (eventdb) and concept database (conceptdb)
47+ - **MinIO Object Storage**: File storage metrics
48+ - **Weaviate Vector Database**: Vector search metrics
49+ - **T2V Transformers**: Text-to-vector processing metrics
50+
51+ ### Service Discovery Strategy
52+ 1. **Declarative Configuration**: ServiceMonitor resources define monitoring requirements
53+ 2. **Automatic Discovery**: Prometheus Operator automatically finds and monitors services
54+ 3. **Cross-Namespace Support**: RBAC enables monitoring across multiple namespaces
55+ 4. **Rich Metadata**: Relabeling provides context-rich metrics
56+ 5. **Zero-Configuration**: New services are automatically monitored when properly labeled
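
The discovery strategy above could look roughly like the following ServiceMonitor. This is a minimal sketch, not the chart's actual template: the resource name, the `release` label, the application namespace, the `app: gateway` Service label, and the `http` port name are all assumptions made for illustration. Only the monitoring namespace and the `/actuator/prometheus` path come from this README.

```yaml
# Hypothetical ServiceMonitor for the gateway service.
# Every name and label marked "assumed" is a placeholder, not a value from this chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gateway-monitor                    # assumed resource name
  namespace: team-git-push-force-monitor   # monitoring namespace used in the install command
  labels:
    release: monitor                       # assumed label that the Prometheus resource selects on
spec:
  # Cross-namespace discovery: look for Services outside the monitoring namespace.
  namespaceSelector:
    matchNames:
      - app-namespace                      # assumed application namespace
  # Declarative selection: any Service carrying this label is scraped.
  selector:
    matchLabels:
      app: gateway                         # assumed Service label
  endpoints:
    - port: http                           # assumed name of the Service port
      path: /actuator/prometheus           # Spring Boot Actuator metrics path
      interval: 30s
      # Relabeling: copy the discovered namespace onto every scraped series.
      relabelings:
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```

With a selector like this, a new Service only needs the matching label and a port with the expected name to be picked up, which is what the "zero-configuration" item refers to.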
7557
76- ## Deployment
7758
78- ### Prerequisites
79- - Kubernetes cluster with Helm 3.x
80- - Ingress controller (nginx-ingress)
81- - Storage class for persistent volumes
8259
83- ### Installation
60+ ## Deployment
8461
85621. **Deploy the monitoring stack:**
8663 ```bash
87- helm install monitor ./helm/monitor
64+ helm install monitor ./helm/monitor --namespace team-git-push-force-monitor
8865 ```
8966
90672. **Update with custom values:**
9168 ```bash
92- helm upgrade monitor ./helm/monitor -f custom-values.yaml
93- ```
69+ helm upgrade monitor ./helm/monitor --namespace team-git-push-force-monitor
70+ ```
9471
95723. **Uninstall:**
9673 ```bash
@@ -105,158 +82,84 @@ After deployment, the monitoring components will be available at:
10582- **Grafana**: `https://grafana.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
10683- **Alertmanager**: `https://alertmanager.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
10784
108- ## Dashboards
10985
110- The monitoring stack includes two comprehensive dashboards:
11186
112- ### 1. Application Overview Dashboard
87+ ## Grafana Dashboard
88+
89+ ### **Credentials**
90+ - **Username**: admin
91+ - **Password**: strongpassword
92+
93+ The monitoring stack includes a comprehensive application dashboard with detailed metrics across all system layers:
94+
95+ ### Application Overview Dashboard
11396**Title**: AI Event Concepter - Application Overview
114- **UID**: `ai-event-concepter`
115-
116- **Panels**:
117- - **Service Health Overview**: Real-time status of all services
118- - **Request Rate**: HTTP request rates by service, method, and endpoint
119- - **Response Time (95th percentile)**: Application performance metrics
120- - **Error Rate**: 4xx and 5xx error rates by service
121- - **Memory Usage**: JVM memory utilization
122- - **CPU Usage**: Process CPU consumption
123- - **Database Connections**: HikariCP connection pool metrics
124- - **Service Version Overview**: Application version information
125-
126- ### 2. Infrastructure Overview Dashboard
127- **Title**: AI Event Concepter - Infrastructure Overview
128- **UID**: `ai-event-concepter-infrastructure`
129-
130- **Panels**:
131- - **Node CPU Usage**: Host CPU utilization
132- - **Node Memory Usage**: Host memory consumption
133- - **Disk Usage**: Filesystem utilization
134- - **Network Traffic**: Network I/O metrics
135- - **Container CPU Usage**: Container-level CPU metrics
136- - **Container Memory Usage**: Container memory consumption
137- - **PostgreSQL Active Connections**: Database connection monitoring
138- - **PostgreSQL Transaction Rate**: Database transaction metrics
139- - **System Load Average**: System load (1m, 5m, 15m)
140-
141- ## Metrics Collection
142-
143- ### Spring Boot Services
144- Spring Boot services expose metrics via Actuator endpoints:
145- - Metrics path: `/actuator/prometheus`
146- - Default port: `8080`
147- - Services: gateway, user-svc, concept-svc
148-
149- ### GenAI Service
150- Flask service exposes Prometheus metrics:
151- - Metrics path: `/metrics`
152- - Port: `8083`
153-
154- ### System Metrics
155- - **Node Exporter**: Host system metrics (CPU, memory, disk, network)
156- - **cAdvisor**: Container and Kubernetes metrics
157- - **PostgreSQL Exporter**: Database performance metrics
158-
159- ## Alerting Rules
160-
161- The monitoring stack includes predefined alerting rules for:
162-
163- ### Infrastructure Alerts
164- - Service availability (ServiceDown)
165- - High memory usage (>85%)
166- - High CPU usage (>80%)
167- - Disk space filling up (>90%)
168-
169- ### Application Alerts
170- - Spring Boot high error rate (>0.1 errors/sec)
171- - Spring Boot high response time (>2s 95th percentile)
172- - PostgreSQL high connections (>80)
173-
174- ### Container Alerts
175- - Container high memory usage (>85% of limit)
97+ **UID**: `ai-event-concepter-example`
17698
177- ## Customization
99+ **Dashboard Sections**:
178100
179- ### Adding Custom Alerts
180- Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
101+ #### 1. Service Health
102+ - **Service Health Overview**: Real-time status of all monitored services with color-coded indicators
103+ - **All Services Healthy?**: Aggregate health status across all services
104+
105+ #### 2. Traffic & Errors
106+ - **Request Count**: HTTP request rates by service (Spring Boot + Flask services)
107+ - **HTTP Success Rate (%)**: Percentage of successful requests (non-5xx responses)
108+ - **Error Rate**: 4xx and 5xx error rates by service with detailed breakdown
109+
110+ #### 3. Latency
111+ - **Response Time (95th percentile)**: Application performance metrics for all services
112+ - **P50 & P99**: Median and 99th percentile response times
113+ - **Max Observed Latency**: Peak latency observations with threshold indicators
181114
182- ### Modifying Service Discovery
183- Update the Prometheus configuration in `helm/monitor/templates/prometheus-configmap.yaml` to modify service discovery and scraping rules.
115+ #### 4. Resource Usage
116+ - **Memory Usage**: JVM memory utilization for Spring Boot services + Python memory for Flask
117+ - **CPU Usage**: Process CPU consumption across all application services
184118
185- ### Database Configuration
186- Update the PostgreSQL exporter configuration in `values.yaml` to match your database settings.
119+ #### 5. Database
120+ - **Database Connections**: Active PostgreSQL connections by database and state
121+ - **DB Connection Saturation**: Connection pool utilization percentage
122+ - **Database Transaction Rate**: Commit and rollback rates by database
123+
124+ **Key Features**:
125+ - **Multi-Service Support**: Monitors Spring Boot (gateway, user-svc, concept-svc) and Flask (genai-svc) services
126+ - **Real-time Metrics**: 5-minute rate calculations for responsive monitoring
127+ - **Threshold Indicators**: Color-coded alerts for performance degradation
128+ - **Database Monitoring**: Comprehensive PostgreSQL metrics for both user and concept databases
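
To make the "5-minute rate calculations" concrete, here is a hedged sketch of recording rules that compute request rate, HTTP success rate, and 95th percentile latency for the Spring Boot services. It is not taken from the dashboard JSON. The group name and the `record:` names are hypothetical, and the expressions assume the standard Micrometer series `http_server_requests_seconds_count` and `http_server_requests_seconds_bucket`; the bucket series exist only if histogram buckets are enabled in the services. The Flask service would need its own metric names.

```yaml
# Illustrative recording rules; names are placeholders, not rules shipped with this chart.
groups:
  - name: dashboard-queries                  # hypothetical group name
    rules:
      # Requests per second per scrape job, averaged over 5 minutes.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_server_requests_seconds_count[5m]))

      # Share of non-5xx responses, as a percentage.
      - record: job:http_success:percent5m
        expr: |
          100 * sum by (job) (rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
            / sum by (job) (rate(http_server_requests_seconds_count[5m]))

      # 95th percentile latency in seconds (assumes histogram buckets are exported).
      - record: job:http_latency_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_server_requests_seconds_bucket[5m])))
```

The same expressions can be pasted into a Grafana panel directly; recording them is only worthwhile if several panels or alerts reuse the result.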
187129
188130### Customizing Dashboards
189131To modify or add new dashboards:
190132
191- 1. **Edit existing dashboards**:
192- - Application dashboard: `helm/monitor/templates/grafana-dashboards-configmap.yaml`
193- - Infrastructure dashboard: `helm/monitor/templates/grafana-infrastructure-dashboard.yaml`
133+ Edit the existing dashboard:
134+ - Dashboard JSON: `helm/monitor/dashboards/AI Event Concepter - Application Overview.json`
194135
195- 2. **Add new dashboards**:
196- - Create a new ConfigMap with the `grafana_dashboard: "1"` label
197- - Include the dashboard JSON in the ConfigMap data
198- - The dashboard will be automatically loaded by Grafana
136+ Then upload the edited JSON to Grafana.
199137
200- 3. **Dashboard JSON format**:
201- - Export dashboards from Grafana UI as JSON
202- - Include required metadata (`__inputs`, `__requires`)
203- - Set appropriate UID and title
204138
205- ## Troubleshooting
139+ ## Alertmanager
140+
141+ The monitoring stack includes **18 comprehensive alert rules** across 6 categories for proactive issue detection:
142+ - **Service Availability**: 2 rules (ServiceDown, ServiceSlow)
143+ - **Infrastructure**: 4 rules (Memory, CPU usage)
144+ - **Application Performance**: 4 rules (Error rates, Response times)
145+ - **JVM Monitoring**: 3 rules (Heap usage, GC frequency)
146+ - **Database**: 3 rules (Connections, Query performance)
147+ - **Business Logic**: 2 rules (Request volume, Client errors)
148+
149+ **Notification** via Gmail SMTP:
150+ - **Email**: teamgitpushforce@gmail.com
151+
152+ **Alert Features**:
153+ - **Progressive Severity**: Warning → Critical escalation
154+ - **Team Labels**: All alerts tagged with `team: ai-event-concepter`
155+ - **Detailed Descriptions**: Clear explanations with runbook URLs
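
For orientation, an Alertmanager configuration that sends email through Gmail SMTP might look like the sketch below. It is an assumption about shape, not the chart's actual config: the receiver name, grouping, and timing values are illustrative, and the password is a placeholder that would normally come from a Kubernetes Secret rather than plain text.

```yaml
# Illustrative Alertmanager config for Gmail SMTP; not copied from this chart.
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'teamgitpushforce@gmail.com'
  smtp_auth_username: 'teamgitpushforce@gmail.com'
  smtp_auth_password: '<gmail-app-password>'   # placeholder; inject from a Secret
  smtp_require_tls: true

route:
  receiver: team-email                         # hypothetical receiver name
  group_by: ['alertname', 'team']
  group_wait: 30s                              # assumed timing values
  repeat_interval: 4h

receivers:
  - name: team-email
    email_configs:
      - to: 'teamgitpushforce@gmail.com'
        send_resolved: true                    # also notify when the alert clears
```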
156+
157+ ## Customization
158+
159+ ### Adding Custom Alerts
160+ Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
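
A custom rule added to that file could follow the pattern below. This is a sketch under assumptions: the group name, alert name, `job="gateway"` selector, and runbook URL are hypothetical, and the metric name assumes the standard Micrometer series exposed by Spring Boot Actuator. The `team` label matches the tagging convention described for the existing alerts.

```yaml
# Illustrative alerting rule; adapt names and thresholds before use.
groups:
  - name: custom-alerts                        # hypothetical group name
    rules:
      - alert: GatewayHighErrorRate            # hypothetical alert name
        # Fires when the gateway returns more than 0.1 5xx responses per second.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="gateway", status=~"5.."}[5m])) > 0.1
        for: 5m                                # condition must hold for 5 minutes
        labels:
          severity: warning
          team: ai-event-concepter
        annotations:
          summary: "Gateway 5xx rate above 0.1 req/s"
          description: "The gateway has returned 5xx responses above the threshold for 5 minutes."
          runbook_url: "https://example.com/runbooks/gateway-errors"   # placeholder URL
```

A second rule with the same expression, a higher threshold, and `severity: critical` would give the warning-to-critical escalation mentioned under Alert Features.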
206161
207- ### Check Pod Status
208- ```bash
209- kubectl get pods -l app=prometheus
210- kubectl get pods -l app=grafana
211- kubectl get pods -l app=alertmanager
212- ```
213162
214- ### View Logs
215- ```bash
216- kubectl logs -l app=prometheus
217- kubectl logs -l app=grafana
218- kubectl logs -l app=alertmanager
219- ```
220163
221- ### Check Metrics Endpoints
222- ```bash
223- # Test Prometheus metrics endpoint
224- kubectl port-forward svc/prometheus 9090:9090
225164
226- # Test service metrics
227- kubectl port-forward svc/gateway 8080:8080
228- curl http://localhost:8080/actuator/prometheus
229- ```
230165
231- ### Storage Issues
232- If persistent volumes are not being created:
233- 1. Check available storage classes: `kubectl get storageclass`
234- 2. Update `storageClassName` in `values.yaml`
235- 3. Ensure sufficient storage capacity
236-
237- ## Security Considerations
238-
239- - Change default passwords in production
240- - Use secrets for sensitive configuration
241- - Enable RBAC for service accounts
242- - Configure network policies
243- - Use TLS for all external access
244-
245- ## Performance Tuning
246-
247- ### Prometheus
248- - Adjust retention settings based on storage capacity
249- - Configure scrape intervals based on metrics importance
250- - Use recording rules for frequently used queries
251-
252- ### Resource Limits
253- Monitor resource usage and adjust limits in `values.yaml`:
254- ```yaml
255- resources:
256-   requests:
257-     memory: "256Mi"
258-     cpu: "100m"
259-   limits:
260-     memory: "1Gi"
261-     cpu: "500m"
262- ```