feat: [#238] add Prometheus integration documentation (Phase 8)

josecelano · josecelano · commit 2a820e2ca403 · 2025-12-15T18:16:45.000Z
- Create ADR documenting Prometheus integration architectural decisions:
  - Enabled-by-default with opt-out approach (monitoring best practice)
  - Independent template rendering pattern (each service renders own templates)
  - ServiceValidation struct for extensible E2E testing (supports future services)
  - Document alternatives considered and consequences
- Update user guide with Prometheus configuration section:
  - Document prometheus.scrape_interval configuration
  - Explain enabled-by-default behavior and opt-out pattern
  - Add Prometheus UI access instructions (port 9090)
  - Link to manual verification guide for detailed testing
- Add technical terms to project dictionary:
  - Alertmanager, entr, flatlined, promtool, tulpn
- All linters passing, all tests passing (1507+ tests)

Documentation completes Phase 8 of issue #238 implementation.
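
Reviewer note: the three decisions recorded in the ADR (opt-out via `Option`, independent rendering steps, and validation flags) can be pictured together in a small std-only sketch. It is illustrative only and not the project's actual code: `PrometheusConfig`'s field type, the `Default` impl, the `From` conversion, and the `release_steps` helper are all assumptions introduced here; only the `UserInputs.prometheus` and `ServiceValidation.prometheus` field names come from the ADR.

```rust
/// Hypothetical stand-in for the `prometheus` section of the environment config.
#[derive(Debug, Clone, PartialEq)]
pub struct PrometheusConfig {
    /// Scrape interval in seconds (the type is an assumption).
    pub scrape_interval: u32,
}

/// Mirrors the ADR's `UserInputs`: `Some` by default, `None` when the section is removed.
#[derive(Debug, Clone, PartialEq)]
pub struct UserInputs {
    pub prometheus: Option<PrometheusConfig>,
}

impl Default for UserInputs {
    /// Assumed default: generated templates include Prometheus with a 15 s interval.
    fn default() -> Self {
        Self {
            prometheus: Some(PrometheusConfig { scrape_interval: 15 }),
        }
    }
}

/// Validation flags as described in the ADR; deriving `Default` lets callers
/// write `ServiceValidation { prometheus: true, ..Default::default() }`, so
/// adding a field later does not break them.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ServiceValidation {
    pub prometheus: bool,
}

impl From<&UserInputs> for ServiceValidation {
    /// Hypothetical conversion: validate exactly the services that are enabled.
    fn from(inputs: &UserInputs) -> Self {
        Self {
            prometheus: inputs.prometheus.is_some(),
        }
    }
}

/// Hypothetical helper listing the release steps from the ADR's diagram.
/// The Prometheus steps are their own group, separate from Docker Compose rendering.
pub fn release_steps(inputs: &UserInputs) -> Vec<&'static str> {
    let mut steps = vec![
        "create tracker storage",
        "render tracker templates",
        "deploy tracker configs",
    ];
    if inputs.prometheus.is_some() {
        steps.extend([
            "create Prometheus storage",
            "render Prometheus templates",
            "deploy Prometheus configs",
        ]);
    }
    steps.extend(["render Docker Compose templates", "deploy compose files"]);
    steps
}
```

Under these assumptions, `release_steps(&UserInputs::default())` yields the eight steps shown in the ADR, while `UserInputs { prometheus: None }` yields five and converts to a `ServiceValidation` with `prometheus: false`.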
diff --git a/docs/decisions/prometheus-integration-pattern.md b/docs/decisions/prometheus-integration-pattern.md
@@ -0,0 +1,216 @@
+# Decision: Prometheus Integration Pattern - Enabled by Default with Opt-Out
+
+## Status
+
+Accepted
+
+## Date
+
+2025-01-22
+
+## Context
+
+The tracker deployment system needed to add Prometheus as a metrics collection service. Several design decisions were required:
+
+1. **Enablement Strategy**: Should Prometheus be mandatory, opt-in, or enabled-by-default?
+2. **Template Rendering**: How should Prometheus templates be rendered in the release workflow?
+3. **Service Validation**: How should E2E tests validate optional services like Prometheus?
+
+The decision impacts:
+
+- User experience (ease of getting started with monitoring)
+- System architecture (template rendering patterns)
+- Testing patterns (extensibility for future optional services)
+
+## Decision
+
+### 1. Enabled-by-Default with Opt-Out
+
+Prometheus is **included by default** in generated environment templates but can be disabled by removing the configuration section.
+
+**Implementation**:
+
+```rust
+pub struct UserInputs {
+    pub prometheus: Option<PrometheusConfig>, // Some by default, None to disable
+}
+```
+
+**Configuration**:
+
+```json
+{
+  "prometheus": {
+    "scrape_interval": 15
+  }
+}
+```
+
+**Disabling**: Remove the entire `prometheus` section from the environment config.
+
+**Rationale**:
+
+- Monitoring is a best practice - users should get it by default
+- Opt-out is simple - just remove the config section
+- No complex feature flags or enablement parameters needed
+- Follows principle of least surprise (monitoring expected for production deployments)
+
+### 2. Independent Template Rendering Pattern
+
+Each service renders its templates **independently** in the release handler, not from within another service's template rendering.
+
+**Architecture**:
+
+```text
+ReleaseCommandHandler::execute()
+├─ Step 1: Create tracker storage
+├─ Step 2: Render tracker templates (tracker/*.toml)
+├─ Step 3: Deploy tracker configs
+├─ Step 4: Create Prometheus storage (if enabled)
+├─ Step 5: Render Prometheus templates (prometheus.yml) - INDEPENDENT STEP
+├─ Step 6: Deploy Prometheus configs
+├─ Step 7: Render Docker Compose templates (docker-compose.yml)
+└─ Step 8: Deploy compose files
+```
+
+**Rationale**:
+
+- Each service is responsible for its own template rendering
+- Docker Compose templates only define service orchestration, not content generation
+- Environment configuration is the source of truth for which services are enabled
+- Follows Single Responsibility Principle (each step does one thing)
+- Makes it easy to add future services (Grafana, Alertmanager, etc.)
+
+**Anti-Pattern Avoided**: Rendering Prometheus templates from within the Docker Compose template rendering step.
+
+### 3. ServiceValidation Struct for Extensible Testing
+
+E2E validation uses a `ServiceValidation` struct with one boolean flag per service instead of separate boolean function parameters.
+
+**Implementation**:
+
+```rust
+pub struct ServiceValidation {
+    pub prometheus: bool,
+    // Future: pub grafana: bool,
+    // Future: pub alertmanager: bool,
+}
+
+pub fn run_release_validation(
+    socket_addr: SocketAddr,
+    ssh_credentials: &SshCredentials,
+    services: Option<ServiceValidation>,
+) -> Result<(), String>
+```
+
+**Rationale**:
+
+- Extensible for future services without changing the function signature
+- More semantic than boolean parameters
+- Clear intent: `ServiceValidation { prometheus: true }`
+- Follows Open-Closed Principle (open for extension, closed for modification)
+
+**Anti-Pattern Avoided**: `run_release_validation_with_prometheus_check(addr, creds, true)` - too specific and not extensible.
+
+## Consequences
+
+### Positive
+
+1. **Better User Experience**:
+
+   - Users get monitoring by default without manual setup
+   - Simple opt-out (remove config section)
+   - Production-ready deployments out of the box
+
+2. **Cleaner Architecture**:
+
+   - Each service manages its own templates independently
+   - Clear separation of concerns in release handler
+   - Easy to add future services (Grafana, Alertmanager, Loki, etc.)
+
+3. **Extensible Testing**:
+
+   - ServiceValidation struct easily extended for new services
+   - Consistent pattern for optional service validation
+   - Type-safe validation configuration
+
+4. **Maintenance Benefits**:
+   - Independent template rendering simplifies debugging
+   - Each service's templates can be modified independently
+   - Clear workflow steps make issues easier to trace
+
+### Negative
+
+1. **Default Overhead**:
+
+   - Users who don't want monitoring must manually remove the section
+   - Prometheus container always included in default deployments
+   - Slightly more disk/memory usage for minimal deployments
+
+2. **Configuration Discovery**:
+   - Users must learn that removing the section disables the service
+   - Not immediately obvious from JSON schema alone
+   - Requires documentation of the opt-out pattern
+
+### Risks
+
+1. **Breaking Changes**: Future Prometheus config schema changes require careful migration planning
+2. **Service Dependencies**: Adding services that depend on Prometheus requires proper ordering logic
+3. **Template Complexity**: As the number of services grows, independent rendering must not lead to duplicated logic
+
+## Alternatives Considered
+
+### Alternative 1: Mandatory Prometheus
+
+**Approach**: Always deploy Prometheus, no opt-out.
+
+**Rejected Because**:
+
+- Forces monitoring on users who don't want it
+- Increases minimum resource requirements
+- Violates principle of least astonishment for minimal deployments
+
+### Alternative 2: Opt-In with Feature Flag
+
+**Approach**: Prometheus disabled by default, enabled with `"prometheus": { "enabled": true }`.
+
+**Rejected Because**:
+
+- Requires users to discover and enable monitoring manually
+- Most production deployments should have monitoring - opt-in makes it less likely
+- Adds complexity with enabled/disabled flags
+
+### Alternative 3: Render Prometheus Templates from Docker Compose Step
+
+**Approach**: Docker Compose template rendering step also renders Prometheus templates.
+
+**Rejected Because**:
+
+- Violates Single Responsibility Principle
+- Makes Docker Compose step dependent on Prometheus internals
+- Harder to add future services independently
+- Couples service orchestration with service configuration
+
+### Alternative 4: Boolean Parameters for Service Validation
+
+**Approach**: `run_release_validation(addr, creds, check_prometheus: bool)`.
+
+**Rejected Because**:
+
+- Not extensible - adding Grafana requires API change
+- Less semantic - what does `true` mean?
+- Becomes unwieldy with multiple services
+- Violates Open-Closed Principle
+
+## Related Decisions
+
+- [Template System Architecture](../technical/template-system-architecture.md) - Project Generator pattern
+- [Environment Variable Injection](environment-variable-injection-in-docker-compose.md) - Configuration passing
+- [DDD Layer Placement](../contributing/ddd-layer-placement.md) - Module organization
+
+## References
+
+- Issue: [#238 - Prometheus Slice - Release and Run Commands](../issues/238-prometheus-slice-release-run-commands.md)
+- Manual Testing Guide: [Prometheus Verification](../e2e-testing/manual/prometheus-verification.md)
+- Prometheus Documentation: https://prometheus.io/docs/
+- torrust-demo Reference: Existing Prometheus integration patterns
diff --git a/docs/issues/238-prometheus-slice-release-run-commands.md b/docs/issues/238-prometheus-slice-release-run-commands.md
@@ -92,32 +92,40 @@ This task adds Prometheus as a metrics collection service for the Torrust Tracke
   - **Pattern**: Independent Prometheus deployment following tracker pattern
 
 - ✅ **Phase 6**: Ansible Deployment (commit: 9c1b91a)
-- ✅ **Phase 7**: Testing & Verification (commit: pending)
 
-  - Added E2E test validation for Prometheus configuration files
+- ✅ **Phase 7**: Testing & Verification (commit: a257fcf)
+
+  - Refactored validation with `ServiceValidation` struct for extensibility
+    - Replaces boolean parameter with flags struct for future services (Grafana, etc.)
+    - Supports selective validation based on enabled services
   - Created `PrometheusConfigValidator` to verify prometheus.yml deployment
-  - Created `ServiceValidation` struct for extensible service validation flags
-  - Added `run_release_validation()` function with optional service validation
-  - Updated e2e-deployment-workflow-tests to validate Prometheus files when enabled
+    - Validates file exists at `/opt/torrust/storage/prometheus/etc/prometheus.yml`
+    - Checks file permissions and ownership via SSH
+  - Updated e2e-deployment-workflow-tests to use ServiceValidation pattern
   - Created test environment configs:
     - `envs/e2e-deployment.json` - With Prometheus enabled (scrape_interval: 15)
     - `envs/e2e-deployment-no-prometheus.json` - Without Prometheus (disabled scenario)
   - E2E tests validate:
-    - Prometheus configuration file exists at `/opt/torrust/storage/prometheus/etc/prometheus.yml`
+    - Prometheus configuration file exists at correct path
     - Docker Compose files are deployed correctly
     - File permissions and ownership are correct
-  - Manual E2E testing verified (environment: manual-test-prometheus):
+  - Manual E2E testing completed (environment: manual-test-prometheus):
     - ✅ Prometheus container running (`docker ps` shows prom/prometheus:v3.0.1)
     - ✅ Prometheus scraping both tracker endpoints successfully
       - `/api/v1/stats` endpoint: health="up", scraping every 15s
       - `/api/v1/metrics` endpoint: health="up", scraping every 15s
     - ✅ Prometheus UI accessible at `http://<vm-ip>:9090`
     - ✅ Tracker metrics available and being collected
     - ✅ Configuration file correctly deployed with admin token and port
-  - All linters passing, all E2E tests passing
-  - **Architecture validated**: Each service renders templates independently, Prometheus fully functional
-
-- ⏳ **Phase 8**: Documentation (pending)
+  - Created comprehensive manual testing documentation:
+    - `docs/e2e-testing/manual/prometheus-verification.md` (450+ lines)
+    - Documents 7 verification steps with exact commands and expected outputs
+    - Includes troubleshooting guide for common issues
+    - Provides success criteria checklist
+  - All linters passing, all E2E tests passing (1507+ tests)
+  - **Architecture validated**: Independent service rendering pattern working correctly
+
+- ⏳ **Phase 8**: Documentation (in progress)
 
 ## 🏗️ Architecture Requirements
 
diff --git a/docs/user-guide/README.md b/docs/user-guide/README.md
@@ -236,6 +236,9 @@ The environment configuration file is in JSON format:
     "public_key_path": "/path/to/public/key",
     "username": "ssh-username",
     "port": 22
+  },
+  "prometheus": {
+    "scrape_interval": 15
   }
 }
 ```
@@ -271,6 +274,56 @@ The environment configuration file is in JSON format:
 - SSH port number
 - Default: `22`
 
+**prometheus.scrape_interval** (optional):
+
+- Metrics collection interval in seconds
+- Default: `15` (included in generated templates)
+- Prometheus service enabled by default for monitoring
+- To disable: Remove the entire `prometheus` section from config
+
+### Monitoring with Prometheus
+
+The deployer includes Prometheus for metrics collection by default. Prometheus automatically scrapes metrics from the tracker's HTTP API endpoints.
+
+**Default Behavior**:
+
+- Prometheus is **enabled by default** in generated environment templates
+- Metrics collected from both `/api/v1/stats` and `/api/v1/metrics` endpoints
+- Accessible via web UI on port `9090`
+
+**Configuration**:
+
+```json
+{
+  "prometheus": {
+    "scrape_interval": 15
+  }
+}
+```
+
+**Disabling Prometheus**:
+
+To deploy without Prometheus monitoring, remove the entire `prometheus` section from your environment config:
+
+```json
+{
+  "environment": { "name": "my-env" },
+  "ssh_credentials": { ... }
+  // No prometheus section = monitoring disabled
+}
+```
+
+**Accessing Prometheus**:
+
+After deployment, access the Prometheus UI at `http://<vm-ip>:9090` where you can:
+
+- View current metrics from tracker endpoints
+- Query historical data
+- Check target health status
+- Explore available metrics
+
+See [Prometheus Verification Guide](../e2e-testing/manual/prometheus-verification.md) for detailed verification steps.
+
 ### Logging Configuration
 
 Control logging output with command-line options:
diff --git a/project-words.txt b/project-words.txt
@@ -2,6 +2,7 @@ AAAAB
 AAAAC
 AAAAI
 AGENTS
+Alertmanager
 Ashburn
 Avalonia
 CIFS
@@ -86,13 +87,15 @@ ehthumbs
 elif
 endfor
 endraw
+entr
 epel
 eprint
 eprintln
 equalto
 executability
 exfiltration
 exitcode
+flatlined
 frontends
 getent
 getopt
@@ -169,6 +172,7 @@ preconfigured
 preinstalls
 prereq
 println
+promtool
 publickey
 pytest
 readlink
@@ -234,6 +238,7 @@ tmpfiles
 tmpfs
 tmptu
 torrust
+tulpn
 tulnp
 turbofish
 tést