AET-DevOps25
diff --git a/genai-svc/.openapi-generator/FILES b/genai-svc/.openapi-generator/FILES
Lines changed: 0 additions & 3 deletions
diff --git a/genai-svc/test-requirements.txt b/genai-svc/test-requirements.txt
Lines changed: 4 additions & 0 deletions
diff --git a/genai-svc/tox.ini b/genai-svc/tox.ini
Lines changed: 1 addition & 1 deletion
diff --git a/helm/monitor/README.md b/helm/monitor/README.md
Lines changed: 109 additions & 206 deletions
@@ -36,9 +36,6 @@ genai_models/models/upload_and_process_documents200_response.py
 genai_models/models/user_preferences.py
 genai_models/openapi/openapi.yaml
 genai_models/test/__init__.py
-genai_models/test/test_chat_interface_controller.py
-genai_models/test/test_document_processing_controller.py
-genai_models/test/test_health_controller.py
 genai_models/typing_utils.py
 genai_models/util.py
 setup.py
 
@@ -0,0 +1,4 @@
+pytest~=7.1.0
+pytest-cov>=2.8.1
+pytest-randomly>=1.2.3
+Flask-Testing==0.8.1
@@ -4,7 +4,7 @@ skipsdist=True
 
 [testenv]
 deps=-r{toxinidir}/requirements.txt
-     -r{toxinidir}/requirements-dev.txt
+     -r{toxinidir}/test-requirements.txt
      {toxinidir}
 
 commands=
 
@@ -2,95 +2,72 @@
 
 This Helm chart deploys a comprehensive monitoring stack for the AI Event Concepter application using Prometheus, Grafana, and related monitoring components.
 
+---
+
+## 🏗️ **Architecture**
+
+```
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│   Application   │    │   Prometheus    │    │   Alertmanager  │
+│   Services      │───▶│   (Metrics)     │───▶│   (Alerts)      │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+                                │                        │
+                                ▼                        ▼
+                       ┌─────────────────┐    ┌─────────────────┐
+                       │    Grafana      │    │   Email/Slack   │
+                       │ (Dashboards)    │    │ (Notifications) │
+                       └─────────────────┘    └─────────────────┘
+```
+
+---
+
 ## Components
 
 ### Core Monitoring
-- **Prometheus**: Metrics collection and storage
-- **Grafana**: Metrics visualization and dashboards
-- **Alertmanager**: Alert routing and notification management
+- **Prometheus**: Metrics collection and storage with persistent volume (10GB)
+- **Grafana**: Metrics visualization and dashboards with persistent volume (5GB)
+- **Alertmanager**: Alert routing and notification management with email notifications
+
+### Persistent Storage
+- **Prometheus**: 10GB PVC for metrics retention
+- **Grafana**: 5GB PVC for dashboard configs
 
 ### Metrics Exporters
 - **Node Exporter**: System and hardware metrics
-- **cAdvisor**: Container metrics
-- **PostgreSQL Exporter**: Database metrics
+- **PostgreSQL Exporter**: Database metrics for both user and concept databases
+- **Blackbox Exporter**: External service availability monitoring
 - **Spring Boot Actuator**: Application metrics (built-in)
 
 ### Monitored Services
-- Gateway Service (Spring Boot)
-- User Service (Spring Boot)
-- Concept Service (Spring Boot)
-- GenAI Service (Flask)
-- PostgreSQL Database
-
-## Configuration
-
-### Values.yaml
-
-The main configuration file `values.yaml` contains settings for all monitoring components:
-
-```yaml
-# Prometheus configuration
-prometheus:
-  image:
-    repository: prom/prometheus
-    tag: v2.52.0
-  persistence:
-    enabled: true
-    size: 10Gi
-  retention:
-    time: "15d"
-    size: "10GB"
-
-# Node Exporter for system metrics
-nodeExporter:
-  enabled: true
-  image:
-    repository: prom/node-exporter
-    tag: v1.6.1
-
-# cAdvisor for container metrics
-cadvisor:
-  enabled: true
-  image:
-    repository: gcr.io/cadvisor/cadvisor
-    tag: v0.47.2
-
-# PostgreSQL Exporter
-postgresExporter:
-  enabled: true
-  database:
-    host: postgres
-    port: 5432
-    name: postgres
-    user: postgres
-    password: password
-
-# Alertmanager
-alertmanager:
-  enabled: true
-  persistence:
-    enabled: true
-    size: 1Gi
-```
+- **Gateway Service** (Spring Boot): API gateway with metrics at `/actuator/prometheus`
+- **User Service** (Spring Boot): User management with metrics at `/actuator/prometheus`
+- **Concept Service** (Spring Boot): Concept management with metrics at `/actuator/prometheus`
+- **GenAI Service** (Flask): AI processing with metrics at `/metrics`
+- **PostgreSQL Databases**: User database (eventdb) and concept database (conceptdb)
+- **MinIO Object Storage**: File storage metrics
+- **Weaviate Vector Database**: Vector search metrics
+- **T2V Transformers**: Text-to-vector processing metrics
+
+### Service Discovery Strategy
+1. **Declarative Configuration**: ServiceMonitor resources define monitoring requirements
+2. **Automatic Discovery**: Prometheus Operator automatically finds and monitors services
+3. **Cross-Namespace Support**: RBAC enables monitoring across multiple namespaces
+4. **Rich Metadata**: Relabeling provides context-rich metrics
+5. **Zero-Configuration**: New services are automatically monitored when properly labeled
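+
+The declarative approach above can be illustrated with a minimal ServiceMonitor sketch. The resource name, the `app: gateway` label, the `http` port name, the application namespace, and the 30s interval are assumptions for illustration, not values taken from this chart:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: gateway                        # hypothetical resource name
+  namespace: team-git-push-force-monitor
+spec:
+  namespaceSelector:
+    matchNames:
+      - team-git-push-force            # assumed application namespace
+  selector:
+    matchLabels:
+      app: gateway                     # assumed label on the Service
+  endpoints:
+    - port: http                       # assumed named port on the Service
+      path: /actuator/prometheus       # Spring Boot Actuator metrics path
+      interval: 30s                    # assumed scrape interval
+      relabelings:
+        # Copy the node name from discovery metadata onto each series
+        - sourceLabels: [__meta_kubernetes_pod_node_name]
+          targetLabel: node
+```
+
+For the Flask-based GenAI service, the same shape would likely apply with `path: /metrics`.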
 
-## Deployment
 
-### Prerequisites
-- Kubernetes cluster with Helm 3.x
-- Ingress controller (nginx-ingress)
-- Storage class for persistent volumes
 
-### Installation
+## Deployment
 
 1. **Deploy the monitoring stack:**
    ```bash
-   helm install monitor ./helm/monitor
+   helm install monitor ./helm/monitor --namespace team-git-push-force-monitor
    ```
 
 2. **Update with custom values:**
    ```bash
-   helm upgrade monitor ./helm/monitor -f custom-values.yaml
-   ```
+   helm upgrade monitor ./helm/monitor --namespace team-git-push-force-monitor -f custom-values.yaml
+   ```
 
 3. **Uninstall:**
    ```bash
@@ -105,158 +82,84 @@ After deployment, the monitoring components will be available at:
 - **Grafana**: `https://grafana.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
 - **Alertmanager**: `https://alertmanager.dev-aieventconcepter.student.k8s.aet.cit.tum.de`
 
-## Dashboards
 
-The monitoring stack includes two comprehensive dashboards:
 
-### 1. Application Overview Dashboard
+## Grafana Dashboard
+
+### **Credentials**
+- **Username**: admin
+- **Password**: strongpassword
+
+The monitoring stack includes a comprehensive application dashboard with detailed metrics across all system layers:
+
+### Application Overview Dashboard
 **Title**: AI Event Concepter - Application Overview  
-**UID**: `ai-event-concepter`
-
-**Panels**:
-- **Service Health Overview**: Real-time status of all services
-- **Request Rate**: HTTP request rates by service, method, and endpoint
-- **Response Time (95th percentile)**: Application performance metrics
-- **Error Rate**: 4xx and 5xx error rates by service
-- **Memory Usage**: JVM memory utilization
-- **CPU Usage**: Process CPU consumption
-- **Database Connections**: HikariCP connection pool metrics
-- **Service Version Overview**: Application version information
-
-### 2. Infrastructure Overview Dashboard
-**Title**: AI Event Concepter - Infrastructure Overview  
-**UID**: `ai-event-concepter-infrastructure`
-
-**Panels**:
-- **Node CPU Usage**: Host CPU utilization
-- **Node Memory Usage**: Host memory consumption
-- **Disk Usage**: Filesystem utilization
-- **Network Traffic**: Network I/O metrics
-- **Container CPU Usage**: Container-level CPU metrics
-- **Container Memory Usage**: Container memory consumption
-- **PostgreSQL Active Connections**: Database connection monitoring
-- **PostgreSQL Transaction Rate**: Database transaction metrics
-- **System Load Average**: System load (1m, 5m, 15m)
-
-## Metrics Collection
-
-### Spring Boot Services
-Spring Boot services expose metrics via Actuator endpoints:
-- Metrics path: `/actuator/prometheus`
-- Default port: `8080`
-- Services: gateway, user-svc, concept-svc
-
-### GenAI Service
-Flask service exposes Prometheus metrics:
-- Metrics path: `/metrics`
-- Port: `8083`
-
-### System Metrics
-- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
-- **cAdvisor**: Container and Kubernetes metrics
-- **PostgreSQL Exporter**: Database performance metrics
-
-## Alerting Rules
-
-The monitoring stack includes predefined alerting rules for:
-
-### Infrastructure Alerts
-- Service availability (ServiceDown)
-- High memory usage (>85%)
-- High CPU usage (>80%)
-- Disk space filling up (>90%)
-
-### Application Alerts
-- Spring Boot high error rate (>0.1 errors/sec)
-- Spring Boot high response time (>2s 95th percentile)
-- PostgreSQL high connections (>80)
-
-### Container Alerts
-- Container high memory usage (>85% of limit)
+**UID**: `ai-event-concepter-example`
 
-## Customization
+**Dashboard Sections**:
 
-### Adding Custom Alerts
-Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
+#### 1. Service Health
+- **Service Health Overview**: Real-time status of all monitored services with color-coded indicators
+- **All Services Healthy?**: Aggregate health status across all services
+
+#### 2. Traffic & Errors
+- **Request Count**: HTTP request rates by service (Spring Boot + Flask services)
+- **HTTP Success Rate (%)**: Percentage of successful requests (non-5xx responses)
+- **Error Rate**: 4xx and 5xx error rates by service with detailed breakdown
+
+#### 3. Latency
+- **Response Time (95th percentile)**: Application performance metrics for all services
+- **P50 & P99**: Median and 99th percentile response times
+- **Max Observed Latency**: Peak latency observations with threshold indicators
 
-### Modifying Service Discovery
-Update the Prometheus configuration in `helm/monitor/templates/prometheus-configmap.yaml` to modify service discovery and scraping rules.
+#### 4. Resource Usage
+- **Memory Usage**: JVM memory utilization for Spring Boot services + Python memory for Flask
+- **CPU Usage**: Process CPU consumption across all application services
 
-### Database Configuration
-Update the PostgreSQL exporter configuration in `values.yaml` to match your database settings.
+#### 5. Database
+- **Database Connections**: Active PostgreSQL connections by database and state
+- **DB Connection Saturation**: Connection pool utilization percentage
+- **Database Transaction Rate**: Commit and rollback rates by database
+
+**Key Features**:
+- **Multi-Service Support**: Monitors Spring Boot (gateway, user-svc, concept-svc) and Flask (genai-svc) services
+- **Real-time Metrics**: Rates are calculated over 5-minute windows for responsive monitoring
+- **Threshold Indicators**: Color-coded alerts for performance degradation
+- **Database Monitoring**: Comprehensive PostgreSQL metrics for both user and concept databases
 
 ### Customizing Dashboards
 To modify or add new dashboards:
 
-1. **Edit existing dashboards**:
-   - Application dashboard: `helm/monitor/templates/grafana-dashboards-configmap.yaml`
-   - Infrastructure dashboard: `helm/monitor/templates/grafana-infrastructure-dashboard.yaml`
+Edit the existing dashboard:
+   - Dashboard JSON: `helm/monitor/dashboards/AI Event Concepter - Application Overview.json`
 
-2. **Add new dashboards**:
-   - Create a new ConfigMap with the `grafana_dashboard: "1"` label
-   - Include the dashboard JSON in the ConfigMap data
-   - The dashboard will be automatically loaded by Grafana
+Then upload the edited JSON to Grafana.
 
-3. **Dashboard JSON format**:
-   - Export dashboards from Grafana UI as JSON
-   - Include required metadata (`__inputs`, `__requires`)
-   - Set appropriate UID and title
 
-## Troubleshooting
+## Alertmanager
+
+The monitoring stack includes **18 comprehensive alert rules** across 6 categories for proactive issue detection:
+- **Service Availability**: 2 rules (ServiceDown, ServiceSlow)
+- **Infrastructure**: 4 rules (Memory, CPU usage)
+- **Application Performance**: 4 rules (Error rates, Response times)
+- **JVM Monitoring**: 3 rules (Heap usage, GC frequency)
+- **Database**: 3 rules (Connections, Query performance)
+- **Business Logic**: 2 rules (Request volume, Client errors)
+
+**Notification** via Gmail SMTP:
+- **Email**: teamgitpushforce@gmail.com
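+
+An Alertmanager configuration for Gmail SMTP delivery might look like the sketch below. The receiver name, the grouping labels, and the placeholder password are assumptions; the chart's actual configuration may differ, and the password should come from a Secret rather than being committed:
+
+```yaml
+global:
+  smtp_smarthost: "smtp.gmail.com:587"
+  smtp_from: "teamgitpushforce@gmail.com"
+  smtp_auth_username: "teamgitpushforce@gmail.com"
+  smtp_auth_password: "REPLACE_ME"     # placeholder; inject from a Secret
+  smtp_require_tls: true
+route:
+  receiver: team-email                 # hypothetical receiver name
+  group_by: [alertname, team]          # assumed grouping labels
+receivers:
+  - name: team-email
+    email_configs:
+      - to: "teamgitpushforce@gmail.com"
+        send_resolved: true            # also notify when an alert clears
+```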
+
+**Alert Features**:
+- **Progressive Severity**: Warning → Critical escalation
+- **Team Labels**: All alerts tagged with `team: ai-event-concepter`
+- **Detailed Descriptions**: Clear explanations with runbook URLs
+
+## Customization
+
+### Adding Custom Alerts
+Edit `helm/monitor/templates/prometheus-rules-configmap.yaml` to add custom alerting rules.
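+
+A custom rule could follow the shape sketched here. The group name, alert name, `job="gateway"` selector, duration, and runbook URL are hypothetical and only illustrate the severity and team labelling described above:
+
+```yaml
+groups:
+  - name: custom-alerts                # hypothetical group name
+    rules:
+      - alert: GatewayDown             # hypothetical alert name
+        expr: up{job="gateway"} == 0   # assumes a scrape job named "gateway"
+        for: 2m                        # fire only after 2 minutes of failure
+        labels:
+          severity: critical
+          team: ai-event-concepter
+        annotations:
+          summary: "Gateway is not responding to scrapes"
+          description: "Prometheus has been unable to scrape the gateway for 2 minutes."
+          runbook_url: "https://example.com/runbooks/gateway-down"  # placeholder URL
+```
+
+Because the rules file lives under `templates/`, Helm renders it before Prometheus sees it, so any `{{ ... }}` alert templating added to annotations would need to be escaped.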
 
-### Check Pod Status
-```bash
-kubectl get pods -l app=prometheus
-kubectl get pods -l app=grafana
-kubectl get pods -l app=alertmanager
-```
 
-### View Logs
-```bash
-kubectl logs -l app=prometheus
-kubectl logs -l app=grafana
-kubectl logs -l app=alertmanager
-```
 
-### Check Metrics Endpoints
-```bash
-# Test Prometheus metrics endpoint
-kubectl port-forward svc/prometheus 9090:9090
 
-# Test service metrics
-kubectl port-forward svc/gateway 8080:8080
-curl http://localhost:8080/actuator/prometheus
-```
 
-### Storage Issues
-If persistent volumes are not being created:
-1. Check available storage classes: `kubectl get storageclass`
-2. Update `storageClassName` in `values.yaml`
-3. Ensure sufficient storage capacity
-
-## Security Considerations
-
-- Change default passwords in production
-- Use secrets for sensitive configuration
-- Enable RBAC for service accounts
-- Configure network policies
-- Use TLS for all external access
-
-## Performance Tuning
-
-### Prometheus
-- Adjust retention settings based on storage capacity
-- Configure scrape intervals based on metrics importance
-- Use recording rules for frequently used queries
-
-### Resource Limits
-Monitor resource usage and adjust limits in `values.yaml`:
-```yaml
-resources:
-  requests:
-    memory: "256Mi"
-    cpu: "100m"
-  limits:
-    memory: "1Gi"
-    cpu: "500m"
-```