Commit 83541a6

Merge remote-tracking branch 'origin/dev' into dev
2 parents 27578c3 + d7bfb39 commit 83541a6

13 files changed: +1528 -1538 lines changed

genai-svc/.openapi-generator/FILES

Lines changed: 0 additions & 3 deletions
@@ -36,9 +36,6 @@ genai_models/models/upload_and_process_documents200_response.py
3636
genai_models/models/user_preferences.py
3737
genai_models/openapi/openapi.yaml
3838
genai_models/test/__init__.py
39-
genai_models/test/test_chat_interface_controller.py
40-
genai_models/test/test_document_processing_controller.py
41-
genai_models/test/test_health_controller.py
4239
genai_models/typing_utils.py
4340
genai_models/util.py
4441
setup.py

genai-svc/test-requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
pytest~=7.1.0
2+
pytest-cov>=2.8.1
3+
pytest-randomly>=1.2.3
4+
Flask-Testing==0.8.1
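
The `pytest~=7.1.0` line above uses the "compatible release" operator, which is equivalent to `>=7.1.0, ==7.1.*`. The following is a minimal sketch of that rule in Python, assuming plain numeric release versions only (no pre-release, post-release, or local segments); the function names are hypothetical and this is not how pip itself is implemented.

```python
def parse_release(version):
    """Split a plain numeric version such as '7.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate, spec):
    """Check `candidate ~= spec` for plain numeric versions.

    `~=7.1.0` means: at least 7.1.0, and the same as the spec in every
    segment except the last one (so any 7.1.x, but not 7.2.0).
    """
    spec_parts = parse_release(spec)
    if len(spec_parts) < 2:
        raise ValueError("~= needs at least two release segments")
    cand_parts = parse_release(candidate)

    # Pad the shorter version with zeros so that 7.1 compares equal to 7.1.0.
    width = max(len(cand_parts), len(spec_parts))
    cand_padded = cand_parts + (0,) * (width - len(cand_parts))
    spec_padded = spec_parts + (0,) * (width - len(spec_parts))

    prefix = spec_parts[:-1]
    return cand_padded >= spec_padded and cand_padded[:len(prefix)] == prefix


if __name__ == "__main__":
    for version in ("7.0.9", "7.1.0", "7.1.3", "7.2.0"):
        print(version, satisfies_compatible_release(version, "7.1.0"))
```

By contrast, `Flask-Testing==0.8.1` pins one exact version, and the `>=` entries set only a lower bound.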

genai-svc/tox.ini

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ skipsdist=True
44

55
[testenv]
66
deps=-r{toxinidir}/requirements.txt
7-
-r{toxinidir}/requirements-dev.txt
7+
-r{toxinidir}/test-requirements.txt
88
{toxinidir}
99

1010
commands=

helm/monitor/README.md

Lines changed: 109 additions & 206 deletions
@@ -2,95 +2,72 @@
22

33
This Helm chart deploys a comprehensive monitoring stack for the AI Event Concepter application using Prometheus, Grafana, and related monitoring components.
44

5+
---
6+
7+
## 🏗️ **Architecture**
8+
9+
```
10+
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
11+
│ Application │ │ Prometheus │ │ Alertmanager │
12+
│ Services │───▶│ (Metrics) │───▶│ (Alerts) │
13+
└─────────────────┘ └─────────────────┘ └─────────────────┘
14+
│ │
15+
▼ ▼
16+
┌─────────────────┐ ┌─────────────────┐
17+
│ Grafana │ │ Email/Slack │
18+
│ (Dashboards) │ │ (Notifications) │
19+
└─────────────────┘ └─────────────────┘
20+
```
21+
22+
---
23+
524
## Components
625

726
### Core Monitoring
8-
- **Prometheus**: Metrics collection and storage
9-
- **Grafana**: Metrics visualization and dashboards
10-
- **Alertmanager**: Alert routing and notification management
27+
- **Prometheus**: Metrics collection and storage with persistent volume (10GB)
28+
- **Grafana**: Metrics visualization and dashboards with persistent volume (5GB)
29+
- **Alertmanager**: Alert routing and notification management with email notifications
30+
31+
### Persistent Storage
32+
- **Prometheus**: 10GB PVC for metrics retention
33+
- **Grafana**: 5GB PVC for dashboard configs
1134

1235
### Metrics Exporters
1336
- **Node Exporter**: System and hardware metrics
14-
- **cAdvisor**: Container metrics
15-
- **PostgreSQL Exporter**: Database metrics
37+
- **PostgreSQL Exporter**: Database metrics for both user and concept databases
38+
- **Blackbox Exporter**: External service availability monitoring
1639
- **Spring Boot Actuator**: Application metrics (built-in)
1740

1841
### Monitored Services
19-
- Gateway Service (Spring Boot)
20-
- User Service (Spring Boot)
21-
- Concept Service (Spring Boot)
22-
- GenAI Service (Flask)
23-
- PostgreSQL Database
24-
25-
## Configuration
26-
27-
### Values.yaml
28-
29-
The main configuration file `values.yaml` contains settings for all monitoring components:
30-
31-
```yaml
32-
# Prometheus configuration
33-
prometheus:
34-
image:
35-
repository: prom/prometheus
36-
tag: v2.52.0
37-
persistence:
38-
enabled: true
39-
size: 10Gi
40-
retention:
41-
time: "15d"
42-
size: "10GB"
43-
44-
# Node Exporter for system metrics
45-
nodeExporter:
46-
enabled: true
47-
image:
48-
repository: prom/node-exporter
49-
tag: v1.6.1
50-
51-
# cAdvisor for container metrics
52-
cadvisor:
53-
enabled: true
54-
image:
55-
repository: gcr.io/cadvisor/cadvisor
56-
tag: v0.47.2
57-
58-
# PostgreSQL Exporter
59-
postgresExporter:
60-
enabled: true
61-
database:
62-
host: postgres
63-
port: 5432
64-
name: postgres
65-
user: postgres
66-
password: password
67-
68-
# Alertmanager
69-
alertmanager:
70-
enabled: true
71-
persistence:
72-
enabled: true
73-
size: 1Gi
74-
```
42+
- **Gateway Service** (Spring Boot): API gateway with metrics at `/actuator/prometheus`
43+
- **User Service** (Spring Boot): User management with metrics at `/actuator/prometheus`
44+
- **Concept Service** (Spring Boot): Concept management with metrics at `/actuator/prometheus`
45+
- **GenAI Service** (Flask): AI processing with metrics at `/metrics`
46+
- **PostgreSQL Databases**: User database (eventdb) and concept database (conceptdb)
47+
- **MinIO Object Storage**: File storage metrics
48+
- **Weaviate Vector Database**: Vector search metrics
49+
- **T2V Transformers**: Text-to-vector processing metrics
50+
51+
### Service Discovery Strategy
52+
1. **Declarative Configuration**: ServiceMonitor resources define monitoring requirements
53+
2. **Automatic Discovery**: Prometheus Operator automatically finds and monitors services
54+
3. **Cross-Namespace Support**: RBAC enables monitoring across multiple namespaces
55+
4. **Rich Metadata**: Relabeling provides context-rich metrics
56+
5. **Zero-Configuration**: New services are automatically monitored when properly labeled
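
As a rough illustration of the declarative approach in the list above, the following Python sketch builds a ServiceMonitor manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` also accepts. Only the `/actuator/prometheus` path and the `team-git-push-force-monitor` namespace come from this README; the resource name, the `app` label, the `http` port name, the `30s` interval, and the application namespace are assumptions made for the example and may not match the chart's actual templates.

```python
import json


def build_service_monitor(name, namespace, app_label, metrics_path,
                          target_namespaces, port_name="http", interval="30s"):
    """Return a ServiceMonitor manifest (Prometheus Operator CRD) as a dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select Services that carry this label (hypothetical label key).
            "selector": {"matchLabels": {"app": app_label}},
            # Look for matching Services in other namespaces; this is the
            # part that needs cross-namespace RBAC for Prometheus.
            "namespaceSelector": {"matchNames": list(target_namespaces)},
            "endpoints": [
                {"port": port_name, "path": metrics_path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_service_monitor(
        name="gateway",
        namespace="team-git-push-force-monitor",
        app_label="gateway",
        metrics_path="/actuator/prometheus",
        target_namespaces=["team-git-push-force"],  # assumed app namespace
    )
    print(json.dumps(manifest, indent=2))
```

With a resource like this in place, a new Service that carries the selected label would be picked up without touching the Prometheus configuration, which is the behaviour point 5 describes.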
7557

76-
## Deployment
7758

78-
### Prerequisites
79-
- Kubernetes cluster with Helm 3.x
80-
- Ingress controller (nginx-ingress)
81-
- Storage class for persistent volumes
8259

83-
### Installation
60+
## Deployment
8461

8562
1. **Deploy the monitoring stack:**
8663
```bash
87-
helm install monitor ./helm/monitor
64+
helm install monitor ./helm/monitor --namespace team-git-push-force-monitor
8865
```
8966

9067
2. **Update with custom values:**
9168
```bash
92-
helm upgrade monitor ./helm/monitor -f custom-values.yaml
93-
```
69+
helm upgrade monitor ./helm/monitor --namespace team-git-push-force-monitor
70+
```
9471

9572
3. **Uninstall:**
9673
```bash
@@ -105,158 +82,84 @@ After deployment, the monitoring components will be available at:
10582
- **Grafana**: `https://grafana.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
10683
- **Alertmanager**: `https://alertmanager.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
10784

108-
## Dashboards
10985

110-
The monitoring stack includes two comprehensive dashboards:
11186

112-
### 1. Application Overview Dashboard
87+
## Grafana Dashboard
88+
89+
### **Credentials**
90+
- **Username**: admin
91+
- **Password**: strongpassword
92+
93+
The monitoring stack includes a comprehensive application dashboard with detailed metrics across all system layers:
94+
95+
### Application Overview Dashboard
11396
**Title**: AI Event Concepter - Application Overview
114-
**UID**: `ai-event-concepter`
115-
116-
**Panels**:
117-
- **Service Health Overview**: Real-time status of all services
118-
- **Request Rate**: HTTP request rates by service, method, and endpoint
119-
- **Response Time (95th percentile)**: Application performance metrics
120-
- **Error Rate**: 4xx and 5xx error rates by service
121-
- **Memory Usage**: JVM memory utilization
122-
- **CPU Usage**: Process CPU consumption
123-
- **Database Connections**: HikariCP connection pool metrics
124-
- **Service Version Overview**: Application version information
125-
126-
### 2. Infrastructure Overview Dashboard
127-
**Title**: AI Event Concepter - Infrastructure Overview
128-
**UID**: `ai-event-concepter-infrastructure`
129-
130-
**Panels**:
131-
- **Node CPU Usage**: Host CPU utilization
132-
- **Node Memory Usage**: Host memory consumption
133-
- **Disk Usage**: Filesystem utilization
134-
- **Network Traffic**: Network I/O metrics
135-
- **Container CPU Usage**: Container-level CPU metrics
136-
- **Container Memory Usage**: Container memory consumption
137-
- **PostgreSQL Active Connections**: Database connection monitoring
138-
- **PostgreSQL Transaction Rate**: Database transaction metrics
139-
- **System Load Average**: System load (1m, 5m, 15m)
140-
141-
## Metrics Collection
142-
143-
### Spring Boot Services
144-
Spring Boot services expose metrics via Actuator endpoints:
145-
- Metrics path: `/actuator/prometheus`
146-
- Default port: `8080`
147-
- Services: gateway, user-svc, concept-svc
148-
149-
### GenAI Service
150-
Flask service exposes Prometheus metrics:
151-
- Metrics path: `/metrics`
152-
- Port: `8083`
153-
154-
### System Metrics
155-
- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
156-
- **cAdvisor**: Container and Kubernetes metrics
157-
- **PostgreSQL Exporter**: Database performance metrics
158-
159-
## Alerting Rules
160-
161-
The monitoring stack includes predefined alerting rules for:
162-
163-
### Infrastructure Alerts
164-
- Service availability (ServiceDown)
165-
- High memory usage (>85%)
166-
- High CPU usage (>80%)
167-
- Disk space filling up (>90%)
168-
169-
### Application Alerts
170-
- Spring Boot high error rate (>0.1 errors/sec)
171-
- Spring Boot high response time (>2s 95th percentile)
172-
- PostgreSQL high connections (>80)
173-
174-
### Container Alerts
175-
- Container high memory usage (>85% of limit)
97+
**UID**: `ai-event-concepter-example`
17698

177-
## Customization
99+
**Dashboard Sections**:
178100

179-
### Adding Custom Alerts
180-
Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
101+
#### 1. Service Health
102+
- **Service Health Overview**: Real-time status of all monitored services with color-coded indicators
103+
- **All Services Healthy?**: Aggregate health status across all services
104+
105+
#### 2. Traffic & Errors
106+
- **Request Count**: HTTP request rates by service (Spring Boot + Flask services)
107+
- **HTTP Success Rate (%)**: Percentage of successful requests (non-5xx responses)
108+
- **Error Rate**: 4xx and 5xx error rates by service with detailed breakdown
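
The arithmetic behind the success-rate and error-rate panels can be sketched in a few lines of Python. This is only an illustration of the definitions given above (success means any non-5xx response); the dashboard's actual PromQL queries are not shown in this README, and the sample counts and function names are made up for the example.

```python
def success_rate(requests_by_status):
    """Percentage of requests whose status code is not 5xx."""
    total = sum(requests_by_status.values())
    if total == 0:
        return None  # no traffic: the rate is undefined, not 0% or 100%
    server_errors = sum(
        count for status, count in requests_by_status.items()
        if 500 <= status <= 599
    )
    return 100.0 * (total - server_errors) / total


def error_rates(requests_by_status, window_seconds):
    """Return 4xx and 5xx errors per second over the given window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    client = sum(c for s, c in requests_by_status.items() if 400 <= s <= 499)
    server = sum(c for s, c in requests_by_status.items() if 500 <= s <= 599)
    return {"4xx": client / window_seconds, "5xx": server / window_seconds}


if __name__ == "__main__":
    # Hypothetical request counts observed in one 5-minute (300 s) window.
    counts = {200: 940, 404: 40, 500: 15, 503: 5}
    print(success_rate(counts))      # 98.0
    print(error_rates(counts, 300))
```

Note that 4xx responses count as successes under this definition, so a burst of client errors shows up in the Error Rate panel but not in the success percentage.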
109+
110+
#### 3. Latency
111+
- **Response Time (95th percentile)**: Application performance metrics for all services
112+
- **P50 & P99**: Median and 99th percentile response times
113+
- **Max Observed Latency**: Peak latency observations with threshold indicators
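
To make the P50, P95, and P99 panels concrete, here is a minimal nearest-rank percentile sketch in Python. It works on raw latency samples, which is an assumption for illustration: Prometheus normally estimates quantiles from histogram buckets rather than from individual observations, so dashboard values will be approximations of what this computes. The sample data is invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds: 10, 20, ..., 200.
    latencies_ms = list(range(10, 210, 10))
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
    print(f"max: {max(latencies_ms)} ms")
```

The gap between P50 and P99 is usually the interesting signal: a stable median with a rising P99 points at a slow minority of requests rather than a general slowdown.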
181114

182-
### Modifying Service Discovery
183-
Update the Prometheus configuration in `helm/monitor/templates/prometheus-configmap.yaml` to modify service discovery and scraping rules.
115+
#### 4. Resource Usage
116+
- **Memory Usage**: JVM memory utilization for Spring Boot services + Python memory for Flask
117+
- **CPU Usage**: Process CPU consumption across all application services
184118

185-
### Database Configuration
186-
Update the PostgreSQL exporter configuration in `values.yaml` to match your database settings.
119+
#### 5. Database
120+
- **Database Connections**: Active PostgreSQL connections by database and state
121+
- **DB Connection Saturation**: Connection pool utilization percentage
122+
- **Database Transaction Rate**: Commit and rollback rates by database
123+
124+
**Key Features**:
125+
- **Multi-Service Support**: Monitors Spring Boot (gateway, user-svc, concept-svc) and Flask (genai-svc) services
126+
- **Real-time Metrics**: 5-minute rate calculations for responsive monitoring
127+
- **Threshold Indicators**: Color-coded alerts for performance degradation
128+
- **Database Monitoring**: Comprehensive PostgreSQL metrics for both user and concept databases
187129

188130
### Customizing Dashboards
189131
To modify or add new dashboards:
190132

191-
1. **Edit existing dashboards**:
192-
- Application dashboard: `helm/monitor/templates/grafana-dashboards-configmap.yaml`
193-
- Infrastructure dashboard: `helm/monitor/templates/grafana-infrastructure-dashboard.yaml`
133+
Edit the existing dashboard:
134+
- Dashboard JSON: `helm/monitor/dashboards/AI Event Concepter - Application Overview.json`
194135

195-
2. **Add new dashboards**:
196-
- Create a new ConfigMap with the `grafana_dashboard: "1"` label
197-
- Include the dashboard JSON in the ConfigMap data
198-
- The dashboard will be automatically loaded by Grafana
136+
Then upload the edited JSON to Grafana.
199137

200-
3. **Dashboard JSON format**:
201-
- Export dashboards from Grafana UI as JSON
202-
- Include required metadata (`__inputs`, `__requires`)
203-
- Set appropriate UID and title
204138

205-
## Troubleshooting
139+
## Alertmanager
140+
141+
The monitoring stack includes **18 comprehensive alert rules** across 6 categories for proactive issue detection:
142+
- **Service Availability**: 2 rules (ServiceDown, ServiceSlow)
143+
- **Infrastructure**: 4 rules (Memory, CPU usage)
144+
- **Application Performance**: 4 rules (Error rates, Response times)
145+
- **JVM Monitoring**: 3 rules (Heap usage, GC frequency)
146+
- **Database**: 3 rules (Connections, Query performance)
147+
- **Business Logic**: 2 rules (Request volume, Client errors)
148+
149+
**Notification** via Gmail SMTP:
150+
- **Email**: teamgitpushforce@gmail.com
151+
152+
**Alert Features**:
153+
- **Progressive Severity**: Warning → Critical escalation
154+
- **Team Labels**: All alerts tagged with `team: ai-event-concepter`
155+
- **Detailed Descriptions**: Clear explanations with runbook URLs
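
The warning-to-critical escalation can be pictured as two thresholds on the same measurement. The Python sketch below is an illustration only: the `team: ai-event-concepter` label comes from this README, while the alert name, the 85 and 95 percent thresholds, and the function names are hypothetical and do not describe the rules actually shipped in `prometheus-rules-configmap.yaml`.

```python
def classify(value, warning_at, critical_at):
    """Return 'critical', 'warning', or None for a measured value."""
    if critical_at < warning_at:
        raise ValueError("critical_at must not be below warning_at")
    if value >= critical_at:
        return "critical"
    if value >= warning_at:
        return "warning"
    return None


def build_alert(name, value, warning_at, critical_at):
    """Return an alert as a dict with severity and team labels, or None."""
    severity = classify(value, warning_at, critical_at)
    if severity is None:
        return None
    return {
        "alertname": name,
        "labels": {"severity": severity, "team": "ai-event-concepter"},
        "value": value,
    }


if __name__ == "__main__":
    # Hypothetical memory-usage readings in percent.
    for reading in (70.0, 88.5, 97.0):
        print(reading, build_alert("HighMemoryUsage", reading, 85.0, 95.0))
```

In Prometheus itself this is typically expressed as two separate rules with different `severity` labels, so that Alertmanager can route each level independently.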
156+
157+
## Customization
158+
159+
### Adding Custom Alerts
160+
Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
206161

207-
### Check Pod Status
208-
```bash
209-
kubectl get pods -l app=prometheus
210-
kubectl get pods -l app=grafana
211-
kubectl get pods -l app=alertmanager
212-
```
213162

214-
### View Logs
215-
```bash
216-
kubectl logs -l app=prometheus
217-
kubectl logs -l app=grafana
218-
kubectl logs -l app=alertmanager
219-
```
220163

221-
### Check Metrics Endpoints
222-
```bash
223-
# Test Prometheus metrics endpoint
224-
kubectl port-forward svc/prometheus 9090:9090
225164

226-
# Test service metrics
227-
kubectl port-forward svc/gateway 8080:8080
228-
curl http://localhost:8080/actuator/prometheus
229-
```
230165

231-
### Storage Issues
232-
If persistent volumes are not being created:
233-
1. Check available storage classes: `kubectl get storageclass`
234-
2. Update `storageClassName` in `values.yaml`
235-
3. Ensure sufficient storage capacity
236-
237-
## Security Considerations
238-
239-
- Change default passwords in production
240-
- Use secrets for sensitive configuration
241-
- Enable RBAC for service accounts
242-
- Configure network policies
243-
- Use TLS for all external access
244-
245-
## Performance Tuning
246-
247-
### Prometheus
248-
- Adjust retention settings based on storage capacity
249-
- Configure scrape intervals based on metrics importance
250-
- Use recording rules for frequently used queries
251-
252-
### Resource Limits
253-
Monitor resource usage and adjust limits in `values.yaml`:
254-
```yaml
255-
resources:
256-
requests:
257-
memory: "256Mi"
258-
cpu: "100m"
259-
limits:
260-
memory: "1Gi"
261-
cpu: "500m"
262-
```
