| 1 | +# Decision: Prometheus Integration Pattern - Enabled by Default with Opt-Out |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Accepted |
| 6 | + |
| 7 | +## Date |
| 8 | + |
| 9 | +2025-01-22 |
| 10 | + |
| 11 | +## Context |
| 12 | + |
| 13 | +The tracker deployment system needed to add Prometheus as a metrics collection service. Several design decisions were required: |
| 14 | + |
| 15 | +1. **Enablement Strategy**: Should Prometheus be mandatory, opt-in, or enabled-by-default? |
| 16 | +2. **Template Rendering**: How should Prometheus templates be rendered in the release workflow? |
| 17 | +3. **Service Validation**: How should E2E tests validate optional services like Prometheus? |
| 18 | + |
| 19 | +The decision impacts: |
| 20 | + |
| 21 | +- User experience (ease of getting started with monitoring) |
| 22 | +- System architecture (template rendering patterns) |
| 23 | +- Testing patterns (extensibility for future optional services) |
| 24 | + |
| 25 | +## Decision |
| 26 | + |
| 27 | +### 1. Enabled-by-Default with Opt-Out |
| 28 | + |
| 29 | +Prometheus is **included by default** in generated environment templates but can be disabled by removing the configuration section. |
| 30 | + |
| 31 | +**Implementation**: |
| 32 | + |
| 33 | +```rust |
| 34 | +pub struct UserInputs { |
| 35 | + pub prometheus: Option<PrometheusConfig>, // Some by default, None to disable |
| 36 | +} |
| 37 | +``` |
| 38 | + |
| 39 | +**Configuration**: |
| 40 | + |
| 41 | +```json |
| 42 | +{ |
| 43 | + "prometheus": { |
| 44 | + "scrape_interval": 15 |
| 45 | + } |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +**Disabling**: Remove the entire `prometheus` section from the environment config. |
| 50 | + |
| 51 | +**Rationale**: |
| 52 | + |
| 53 | +- Monitoring is a best practice - users should get it by default |
| 54 | +- Opt-out is simple - just remove the config section |
| 55 | +- No complex feature flags or enablement parameters needed |
| 56 | +- Follows principle of least surprise (monitoring expected for production deployments) |
| 57 | + |
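The opt-out behaviour can be illustrated with a minimal, standard-library-only sketch. The `Default` impl, the `without_prometheus` constructor, the `prometheus_enabled` helper, and the `u32` type of `scrape_interval` are assumptions made for illustration; they are not taken from the project code.

```rust
/// Hypothetical shape of the Prometheus section (field type is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct PrometheusConfig {
    pub scrape_interval: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct UserInputs {
    /// `Some` when the `prometheus` section is present, `None` when it was removed.
    pub prometheus: Option<PrometheusConfig>,
}

impl Default for UserInputs {
    /// Mirrors the generated template: Prometheus is present unless removed.
    fn default() -> Self {
        Self {
            prometheus: Some(PrometheusConfig { scrape_interval: 15 }),
        }
    }
}

impl UserInputs {
    /// Models an environment config whose `prometheus` section was deleted.
    pub fn without_prometheus() -> Self {
        Self { prometheus: None }
    }

    /// The presence of the section is the only enablement signal; no flag is needed.
    pub fn prometheus_enabled(&self) -> bool {
        self.prometheus.is_some()
    }
}
```

Under these assumptions, `UserInputs::default().prometheus_enabled()` is `true` and `UserInputs::without_prometheus().prometheus_enabled()` is `false`, so no separate `enabled` field is required.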
| 58 | +### 2. Independent Template Rendering Pattern |
| 59 | + |
| 60 | +Each service renders its templates **independently** in the release handler, not from within another service's template rendering. |
| 61 | + |
| 62 | +**Architecture**: |
| 63 | + |
| 64 | +```text |
| 65 | +ReleaseCommandHandler::execute() |
| 66 | +├─ Step 1: Create tracker storage |
| 67 | +├─ Step 2: Render tracker templates (tracker/*.toml) |
| 68 | +├─ Step 3: Deploy tracker configs |
| 69 | +├─ Step 4: Create Prometheus storage (if enabled) |
| 70 | +├─ Step 5: Render Prometheus templates (prometheus.yml) - INDEPENDENT STEP |
| 71 | +├─ Step 6: Deploy Prometheus configs |
| 72 | +├─ Step 7: Render Docker Compose templates (docker-compose.yml) |
| 73 | +└─ Step 8: Deploy compose files |
| 74 | +``` |
| 75 | + |
| 76 | +**Rationale**: |
| 77 | + |
| 78 | +- Each service is responsible for its own template rendering |
| 79 | +- Docker Compose templates only define service orchestration, not content generation |
| 80 | +- Environment configuration is the source of truth for which services are enabled |
| 81 | +- Follows Single Responsibility Principle (each step does one thing) |
| 82 | +- Makes it easy to add future services (Grafana, Alertmanager, etc.) |
| 83 | + |
| 84 | +**Anti-Pattern Avoided**: Rendering Prometheus templates from within the Docker Compose template rendering step. |
| 85 | + |
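One way to picture the independent steps is as a planned sequence in which the Prometheus steps are added or skipped as a group. The sketch below is illustrative only: `ReleaseStep` and `plan_release_steps` are hypothetical names, and it assumes that steps 4 to 6 are all skipped when the `prometheus` section is absent (the workflow listing marks only step 4 as conditional).

```rust
/// Hypothetical enumeration of the eight workflow steps listed in this decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReleaseStep {
    CreateTrackerStorage,
    RenderTrackerTemplates,
    DeployTrackerConfigs,
    CreatePrometheusStorage,
    RenderPrometheusTemplates,
    DeployPrometheusConfigs,
    RenderDockerComposeTemplates,
    DeployComposeFiles,
}

/// Builds the ordered step list. The Prometheus steps form their own group
/// and are never nested inside the Docker Compose rendering step.
pub fn plan_release_steps(prometheus_enabled: bool) -> Vec<ReleaseStep> {
    let mut steps = vec![
        ReleaseStep::CreateTrackerStorage,
        ReleaseStep::RenderTrackerTemplates,
        ReleaseStep::DeployTrackerConfigs,
    ];

    // Assumption: all three Prometheus steps depend on the same condition.
    if prometheus_enabled {
        steps.extend_from_slice(&[
            ReleaseStep::CreatePrometheusStorage,
            ReleaseStep::RenderPrometheusTemplates,
            ReleaseStep::DeployPrometheusConfigs,
        ]);
    }

    steps.extend_from_slice(&[
        ReleaseStep::RenderDockerComposeTemplates,
        ReleaseStep::DeployComposeFiles,
    ]);
    steps
}
```

In this shape, a future service such as Grafana would contribute its own conditional group, and the Docker Compose steps would not need to change.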
| 86 | +### 3. ServiceValidation Struct for Extensible Testing |
| 87 | + |
| 88 | +E2E validation uses a `ServiceValidation` struct with one boolean flag per optional service instead of separate boolean function parameters. |
| 89 | + |
| 90 | +**Implementation**: |
| 91 | + |
| 92 | +```rust |
| 93 | +pub struct ServiceValidation { |
| 94 | + pub prometheus: bool, |
| 95 | + // Future: pub grafana: bool, |
| 96 | + // Future: pub alertmanager: bool, |
| 97 | +} |
| 98 | + |
| 99 | +pub fn run_release_validation( |
| 100 | + socket_addr: SocketAddr, |
| 101 | + ssh_credentials: &SshCredentials, |
| 102 | + services: Option<ServiceValidation>, |
| 103 | +) -> Result<(), String> |
| 104 | +``` |
| 105 | + |
| 106 | +**Rationale**: |
| 107 | + |
| 108 | +- Extensible for future services without API changes |
| 109 | +- More semantic than boolean parameters |
| 110 | +- Clear intent: `ServiceValidation { prometheus: true }` |
| 111 | +- Follows Open-Closed Principle (open for extension, closed for modification) |
| 112 | + |
| 113 | +**Anti-Pattern Avoided**: `run_release_validation_with_prometheus_check(addr, creds, true)` - too specific and not extensible. |
| 114 | + |
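The extensibility argument can be sketched as follows. This is not the project's validation code: `planned_checks` is a hypothetical helper that only reports which checks would run, the baseline `"tracker"` check is an assumption, and deriving `Default` on `ServiceValidation` is an illustrative choice.

```rust
/// One flag per optional service. `Default` (all flags `false`) is an assumption.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ServiceValidation {
    pub prometheus: bool,
    // Future: pub grafana: bool,
}

/// Hypothetical helper: lists the checks a validation run would perform.
/// `None` is treated the same as "no optional services requested".
pub fn planned_checks(services: Option<ServiceValidation>) -> Vec<&'static str> {
    // Assumption: the tracker itself is always validated.
    let mut checks = vec!["tracker"];

    let services = services.unwrap_or_default();
    if services.prometheus {
        checks.push("prometheus");
    }
    checks
}
```

A call such as `planned_checks(Some(ServiceValidation { prometheus: true }))` names the service being validated at the call site. If the struct derives `Default` as assumed here, callers can also write `ServiceValidation { prometheus: true, ..Default::default() }`, which keeps compiling when new flags are added later.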
| 115 | +## Consequences |
| 116 | + |
| 117 | +### Positive |
| 118 | + |
| 119 | +1. **Better User Experience**: |
| 120 | + |
| 121 | + - Users get monitoring by default without manual setup |
| 122 | + - Simple opt-out (remove config section) |
| 123 | + - Production-ready deployments out of the box |
| 124 | + |
| 125 | +2. **Cleaner Architecture**: |
| 126 | + |
| 127 | + - Each service manages its own templates independently |
| 128 | + - Clear separation of concerns in release handler |
| 129 | + - Easy to add future services (Grafana, Alertmanager, Loki, etc.) |
| 130 | + |
| 131 | +3. **Extensible Testing**: |
| 132 | + |
| 133 | + - ServiceValidation struct easily extended for new services |
| 134 | + - Consistent pattern for optional service validation |
| 135 | + - Type-safe validation configuration |
| 136 | + |
| 137 | +4. **Maintenance Benefits**: |
| 138 | + - Independent template rendering simplifies debugging |
| 139 | + - Each service's templates can be modified independently |
| 140 | + - Clear workflow steps make issues easier to trace |
| 141 | + |
| 142 | +### Negative |
| 143 | + |
| 144 | +1. **Default Overhead**: |
| 145 | + |
| 146 | + - Users who don't want monitoring must manually remove the section |
| 147 | + - Prometheus container always included in default deployments |
| 148 | + - Slightly more disk/memory usage for minimal deployments |
| 149 | + |
| 150 | +2. **Configuration Discovery**: |
| 151 | + - Users must learn that removing the section disables the service |
| 152 | + - Not immediately obvious from JSON schema alone |
| 153 | + - Requires documentation of the opt-out pattern |
| 154 | + |
| 155 | +### Risks |
| 156 | + |
| 157 | +1. **Breaking Changes**: Future Prometheus config schema changes require careful migration planning |
| 158 | +2. **Service Dependencies**: Adding services that depend on Prometheus requires proper ordering logic |
| 159 | +3. **Template Complexity**: As the number of services grows, independent rendering must be kept from duplicating logic across services |
| 160 | + |
| 161 | +## Alternatives Considered |
| 162 | + |
| 163 | +### Alternative 1: Mandatory Prometheus |
| 164 | + |
| 165 | +**Approach**: Always deploy Prometheus, no opt-out. |
| 166 | + |
| 167 | +**Rejected Because**: |
| 168 | + |
| 169 | +- Forces monitoring on users who don't want it |
| 170 | +- Increases minimum resource requirements |
| 171 | +- Violates principle of least astonishment for minimal deployments |
| 172 | + |
| 173 | +### Alternative 2: Opt-In with Feature Flag |
| 174 | + |
| 175 | +**Approach**: Prometheus disabled by default, enabled with `"prometheus": { "enabled": true }`. |
| 176 | + |
| 177 | +**Rejected Because**: |
| 178 | + |
| 179 | +- Requires users to discover and enable monitoring manually |
| 180 | +- Most production deployments should have monitoring - opt-in makes it less likely |
| 181 | +- Adds complexity with enabled/disabled flags |
| 182 | + |
| 183 | +### Alternative 3: Render Prometheus Templates from Docker Compose Step |
| 184 | + |
| 185 | +**Approach**: Docker Compose template rendering step also renders Prometheus templates. |
| 186 | + |
| 187 | +**Rejected Because**: |
| 188 | + |
| 189 | +- Violates Single Responsibility Principle |
| 190 | +- Makes Docker Compose step dependent on Prometheus internals |
| 191 | +- Harder to add future services independently |
| 192 | +- Couples service orchestration with service configuration |
| 193 | + |
| 194 | +### Alternative 4: Boolean Parameters for Service Validation |
| 195 | + |
| 196 | +**Approach**: `run_release_validation(addr, creds, check_prometheus: bool)`. |
| 197 | + |
| 198 | +**Rejected Because**: |
| 199 | + |
| 200 | +- Not extensible - adding Grafana requires API change |
| 201 | +- Less semantic - what does `true` mean? |
| 202 | +- Becomes unwieldy with multiple services |
| 203 | +- Violates Open-Closed Principle |
| 204 | + |
| 205 | +## Related Decisions |
| 206 | + |
| 207 | +- [Template System Architecture](../technical/template-system-architecture.md) - Project Generator pattern |
| 208 | +- [Environment Variable Injection](environment-variable-injection-in-docker-compose.md) - Configuration passing |
| 209 | +- [DDD Layer Placement](../contributing/ddd-layer-placement.md) - Module organization |
| 210 | + |
| 211 | +## References |
| 212 | + |
| 213 | +- Issue: [#238 - Prometheus Slice - Release and Run Commands](../issues/238-prometheus-slice-release-run-commands.md) |
| 214 | +- Manual Testing Guide: [Prometheus Verification](../e2e-testing/manual/prometheus-verification.md) |
| 215 | +- Prometheus Documentation: https://prometheus.io/docs/ |
| 216 | +- torrust-demo Reference: Existing Prometheus integration patterns |