
Commit 2a820e2

feat: [#238] add Prometheus integration documentation (Phase 8)
- Create ADR documenting Prometheus integration architectural decisions:
  - Enabled-by-default with opt-out approach (monitoring best practice)
  - Independent template rendering pattern (each service renders own templates)
  - ServiceValidation struct for extensible E2E testing (supports future services)
  - Document alternatives considered and consequences
- Update user guide with Prometheus configuration section:
  - Document prometheus.scrape_interval configuration
  - Explain enabled-by-default behavior and opt-out pattern
  - Add Prometheus UI access instructions (port 9090)
  - Link to manual verification guide for detailed testing
- Add technical terms to project dictionary:
  - Alertmanager, entr, flatlined, promtool, tulpn
- All linters passing, all tests passing (1507+ tests)

Documentation completes Phase 8 of issue #238 implementation.
1 parent a257fcf commit 2a820e2

4 files changed: +293 -11 lines changed

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# Decision: Prometheus Integration Pattern - Enabled by Default with Opt-Out

## Status

Accepted

## Date

2025-01-22

## Context

The tracker deployment system needed to add Prometheus as a metrics collection service. Several design decisions were required:

1. **Enablement Strategy**: Should Prometheus be mandatory, opt-in, or enabled-by-default?
2. **Template Rendering**: How should Prometheus templates be rendered in the release workflow?
3. **Service Validation**: How should E2E tests validate optional services like Prometheus?

The decision impacts:

- User experience (ease of getting started with monitoring)
- System architecture (template rendering patterns)
- Testing patterns (extensibility for future optional services)

## Decision

### 1. Enabled-by-Default with Opt-Out

Prometheus is **included by default** in generated environment templates but can be disabled by removing the configuration section.

**Implementation**:

```rust
pub struct UserInputs {
    pub prometheus: Option<PrometheusConfig>, // Some by default, None to disable
}
```

**Configuration**:

```json
{
  "prometheus": {
    "scrape_interval": 15
  }
}
```

**Disabling**: Remove the entire `prometheus` section from the environment config.

**Rationale**:

- Monitoring is a best practice - users should get it by default
- Opt-out is simple - just remove the config section
- No complex feature flags or enablement parameters needed
- Follows principle of least surprise (monitoring expected for production deployments)
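
The opt-out rule above can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not the project's code: `PrometheusConfig` is reduced to the one field shown in the JSON example, and `generated_template`, `without_prometheus`, and `prometheus_enabled` are hypothetical helper names introduced here.

```rust
/// Prometheus settings from the environment config (reduced to one field).
#[derive(Debug, Clone, PartialEq)]
pub struct PrometheusConfig {
    /// Scrape interval in seconds.
    pub scrape_interval: u32,
}

/// Mirrors the `UserInputs` struct above, limited to the relevant field.
#[derive(Debug, Clone, PartialEq)]
pub struct UserInputs {
    pub prometheus: Option<PrometheusConfig>,
}

impl UserInputs {
    /// Hypothetical: what a generated template maps to (section present).
    pub fn generated_template() -> Self {
        Self {
            prometheus: Some(PrometheusConfig { scrape_interval: 15 }),
        }
    }

    /// Hypothetical: what a config with the section removed maps to.
    pub fn without_prometheus() -> Self {
        Self { prometheus: None }
    }

    /// The presence of the section is the only enablement signal.
    pub fn prometheus_enabled(&self) -> bool {
        self.prometheus.is_some()
    }
}
```

With this shape there is no separate `enabled` flag to keep in sync: `Some(..)` means "deploy Prometheus with these settings" and `None` means "skip it".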

### 2. Independent Template Rendering Pattern

Each service renders its templates **independently** in the release handler, not from within another service's template rendering.

**Architecture**:

```text
ReleaseCommandHandler::execute()
  ├─ Step 1: Create tracker storage
  ├─ Step 2: Render tracker templates (tracker/*.toml)
  ├─ Step 3: Deploy tracker configs
  ├─ Step 4: Create Prometheus storage (if enabled)
  ├─ Step 5: Render Prometheus templates (prometheus.yml) - INDEPENDENT STEP
  ├─ Step 6: Deploy Prometheus configs
  ├─ Step 7: Render Docker Compose templates (docker-compose.yml)
  └─ Step 8: Deploy compose files
```

**Rationale**:

- Each service is responsible for its own template rendering
- Docker Compose templates only define service orchestration, not content generation
- Environment configuration is the source of truth for which services are enabled
- Follows Single Responsibility Principle (each step does one thing)
- Makes it easy to add future services (Grafana, Alertmanager, etc.)

**Anti-Pattern Avoided**: Rendering Prometheus templates from within the Docker Compose template rendering step.
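
As a rough sketch of the step ordering above, the release workflow can be modelled as a flat list of steps in which the Prometheus steps are added only when the service is enabled. The enum and the `plan_release_steps` function are hypothetical names for illustration; the ADR does not show the handler's real types. The sketch also assumes that steps 5 and 6 are skipped together with step 4 when Prometheus is disabled, which the diagram only states explicitly for step 4.

```rust
/// Hypothetical step identifiers matching the eight steps in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReleaseStep {
    CreateTrackerStorage,
    RenderTrackerTemplates,
    DeployTrackerConfigs,
    CreatePrometheusStorage,
    RenderPrometheusTemplates,
    DeployPrometheusConfigs,
    RenderDockerComposeTemplates,
    DeployComposeFiles,
}

/// Builds the ordered step list. Each service contributes its own steps;
/// the Docker Compose steps never render another service's templates.
pub fn plan_release_steps(prometheus_enabled: bool) -> Vec<ReleaseStep> {
    use ReleaseStep::*;

    let mut steps = vec![
        CreateTrackerStorage,
        RenderTrackerTemplates,
        DeployTrackerConfigs,
    ];
    if prometheus_enabled {
        // Assumption: all three Prometheus steps are conditional.
        steps.extend([
            CreatePrometheusStorage,
            RenderPrometheusTemplates,
            DeployPrometheusConfigs,
        ]);
    }
    steps.extend([RenderDockerComposeTemplates, DeployComposeFiles]);
    steps
}
```

Under this model, adding a future service such as Grafana would mean adding its own conditional group of steps, leaving the tracker and Docker Compose groups untouched.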

### 3. ServiceValidation Struct for Extensible Testing

E2E validation uses a `ServiceValidation` struct with boolean flags instead of function parameters.

**Implementation**:

```rust
pub struct ServiceValidation {
    pub prometheus: bool,
    // Future: pub grafana: bool,
    // Future: pub alertmanager: bool,
}

pub fn run_release_validation(
    socket_addr: SocketAddr,
    ssh_credentials: &SshCredentials,
    services: Option<ServiceValidation>,
) -> Result<(), String>
```

**Rationale**:

- Extensible for future services without API changes
- More semantic than boolean parameters
- Clear intent: `ServiceValidation { prometheus: true }`
- Follows Open-Closed Principle (open for extension, closed for modification)

**Anti-Pattern Avoided**: `run_release_validation_with_prometheus_check(addr, creds, true)` - too specific and not extensible.
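
A minimal sketch of how the flags struct might drive selective validation follows. The `Default` derive and the `selected_checks` helper are assumptions made for this example and are not taken from the project; the check names are placeholder strings.

```rust
/// Flags struct as in the ADR, with `Default` derived (an assumption here)
/// so that every service is off unless explicitly requested.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ServiceValidation {
    pub prometheus: bool,
}

/// Hypothetical helper: lists the checks that would run for the given flags.
/// `None` is treated the same as "no optional services".
pub fn selected_checks(services: Option<ServiceValidation>) -> Vec<&'static str> {
    let services = services.unwrap_or_default();

    // Always-on check, followed by one entry per enabled optional service.
    let mut checks = vec!["docker-compose files"];
    if services.prometheus {
        checks.push("prometheus.yml");
    }
    checks
}
```

One design point worth noting: a plain struct literal such as `ServiceValidation { prometheus: true }` stops compiling when a new field is added. If call sites instead write `ServiceValidation { prometheus: true, ..Default::default() }`, new flags can be introduced without touching existing callers, which is closer to the "without API changes" goal stated above.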

## Consequences

### Positive

1. **Better User Experience**:

   - Users get monitoring by default without manual setup
   - Simple opt-out (remove config section)
   - Production-ready deployments out of the box

2. **Cleaner Architecture**:

   - Each service manages its own templates independently
   - Clear separation of concerns in release handler
   - Easy to add future services (Grafana, Alertmanager, Loki, etc.)

3. **Extensible Testing**:

   - ServiceValidation struct easily extended for new services
   - Consistent pattern for optional service validation
   - Type-safe validation configuration

4. **Maintenance Benefits**:
   - Independent template rendering simplifies debugging
   - Each service's templates can be modified independently
   - Clear workflow steps make issues easier to trace

### Negative

1. **Default Overhead**:

   - Users who don't want monitoring must manually remove the section
   - Prometheus container always included in default deployments
   - Slightly more disk/memory usage for minimal deployments

2. **Configuration Discovery**:
   - Users must learn that removing the section disables the service
   - Not immediately obvious from JSON schema alone
   - Requires documentation of the opt-out pattern

### Risks

1. **Breaking Changes**: Future Prometheus config schema changes require careful migration planning
2. **Service Dependencies**: Adding services that depend on Prometheus requires proper ordering logic
3. **Template Complexity**: As the number of services grows, independent rendering must not lead to duplicated logic

## Alternatives Considered

### Alternative 1: Mandatory Prometheus

**Approach**: Always deploy Prometheus, no opt-out.

**Rejected Because**:

- Forces monitoring on users who don't want it
- Increases minimum resource requirements
- Violates principle of least astonishment for minimal deployments

### Alternative 2: Opt-In with Feature Flag

**Approach**: Prometheus disabled by default, enabled with `"prometheus": { "enabled": true }`.

**Rejected Because**:

- Requires users to discover and enable monitoring manually
- Most production deployments should have monitoring - opt-in makes it less likely
- Adds complexity with enabled/disabled flags

### Alternative 3: Render Prometheus Templates from Docker Compose Step

**Approach**: The Docker Compose template rendering step also renders Prometheus templates.

**Rejected Because**:

- Violates Single Responsibility Principle
- Makes Docker Compose step dependent on Prometheus internals
- Harder to add future services independently
- Couples service orchestration with service configuration

### Alternative 4: Boolean Parameters for Service Validation

**Approach**: `run_release_validation(addr, creds, check_prometheus: bool)`.

**Rejected Because**:

- Not extensible - adding Grafana requires API change
- Less semantic - what does `true` mean?
- Becomes unwieldy with multiple services
- Violates Open-Closed Principle

## Related Decisions

- [Template System Architecture](../technical/template-system-architecture.md) - Project Generator pattern
- [Environment Variable Injection](environment-variable-injection-in-docker-compose.md) - Configuration passing
- [DDD Layer Placement](../contributing/ddd-layer-placement.md) - Module organization

## References

- Issue: [#238 - Prometheus Slice - Release and Run Commands](../issues/238-prometheus-slice-release-run-commands.md)
- Manual Testing Guide: [Prometheus Verification](../e2e-testing/manual/prometheus-verification.md)
- Prometheus Documentation: https://prometheus.io/docs/
- torrust-demo Reference: Existing Prometheus integration patterns

docs/issues/238-prometheus-slice-release-run-commands.md

Lines changed: 19 additions & 11 deletions
@@ -92,32 +92,40 @@ This task adds Prometheus as a metrics collection service for the Torrust Tracke
 - **Pattern**: Independent Prometheus deployment following tracker pattern

 -**Phase 6**: Ansible Deployment (commit: 9c1b91a)
--**Phase 7**: Testing & Verification (commit: pending)

-- Added E2E test validation for Prometheus configuration files
+-**Phase 7**: Testing & Verification (commit: a257fcf)
+
+- Refactored validation with `ServiceValidation` struct for extensibility
+- Replaces boolean parameter with flags struct for future services (Grafana, etc.)
+- Supports selective validation based on enabled services
 - Created `PrometheusConfigValidator` to verify prometheus.yml deployment
-- Created `ServiceValidation` struct for extensible service validation flags
-- Added `run_release_validation()` function with optional service validation
-- Updated e2e-deployment-workflow-tests to validate Prometheus files when enabled
+- Validates file exists at `/opt/torrust/storage/prometheus/etc/prometheus.yml`
+- Checks file permissions and ownership via SSH
+- Updated e2e-deployment-workflow-tests to use ServiceValidation pattern
 - Created test environment configs:
 - `envs/e2e-deployment.json` - With Prometheus enabled (scrape_interval: 15)
 - `envs/e2e-deployment-no-prometheus.json` - Without Prometheus (disabled scenario)
 - E2E tests validate:
-- Prometheus configuration file exists at `/opt/torrust/storage/prometheus/etc/prometheus.yml`
+- Prometheus configuration file exists at correct path
 - Docker Compose files are deployed correctly
 - File permissions and ownership are correct
-- Manual E2E testing verified (environment: manual-test-prometheus):
+- Manual E2E testing completed (environment: manual-test-prometheus):
 - ✅ Prometheus container running (`docker ps` shows prom/prometheus:v3.0.1)
 - ✅ Prometheus scraping both tracker endpoints successfully
 - `/api/v1/stats` endpoint: health="up", scraping every 15s
 - `/api/v1/metrics` endpoint: health="up", scraping every 15s
 - ✅ Prometheus UI accessible at `http://<vm-ip>:9090`
 - ✅ Tracker metrics available and being collected
 - ✅ Configuration file correctly deployed with admin token and port
-- All linters passing, all E2E tests passing
-- **Architecture validated**: Each service renders templates independently, Prometheus fully functional
-
--**Phase 8**: Documentation (pending)
+- Created comprehensive manual testing documentation:
+- `docs/e2e-testing/manual/prometheus-verification.md` (450+ lines)
+- Documents 7 verification steps with exact commands and expected outputs
+- Includes troubleshooting guide for common issues
+- Provides success criteria checklist
+- All linters passing, all E2E tests passing (1507+ tests)
+- **Architecture validated**: Independent service rendering pattern working correctly
+
+-**Phase 8**: Documentation (in progress)

 ## 🏗️ Architecture Requirements

docs/user-guide/README.md

Lines changed: 53 additions & 0 deletions
@@ -236,6 +236,9 @@ The environment configuration file is in JSON format:
    "public_key_path": "/path/to/public/key",
    "username": "ssh-username",
    "port": 22
  },
  "prometheus": {
    "scrape_interval": 15
  }
}
```
@@ -271,6 +274,56 @@ The environment configuration file is in JSON format:
- SSH port number
- Default: `22`

**prometheus.scrape_interval** (optional):

- Metrics collection interval in seconds
- Default: `15` (included in generated templates)
- The Prometheus service is enabled by default for monitoring
- To disable: Remove the entire `prometheus` section from config

### Monitoring with Prometheus

The deployer includes Prometheus for metrics collection by default. Prometheus automatically scrapes metrics from the tracker's HTTP API endpoints.

**Default Behavior**:

- Prometheus is **enabled by default** in generated environment templates
- Metrics are collected from both the `/api/v1/stats` and `/api/v1/metrics` endpoints
- The web UI is accessible on port `9090`

**Configuration**:

```json
{
  "prometheus": {
    "scrape_interval": 15
  }
}
```
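
To make the unit explicit: the config stores `scrape_interval` as a number of seconds, while Prometheus itself expects a duration string such as `15s` under `global.scrape_interval` in `prometheus.yml`. The function below is a hedged sketch of that conversion; `render_global_section` is a hypothetical name, and the deployer's actual template is not shown in this guide.

```rust
/// Hypothetical sketch: renders the `global` section of a prometheus.yml
/// from the integer number of seconds stored in the environment config.
pub fn render_global_section(scrape_interval_secs: u32) -> String {
    // Prometheus durations carry a unit suffix; `s` means seconds.
    format!("global:\n  scrape_interval: {}s\n", scrape_interval_secs)
}
```

For the default value of `15`, this produces a `scrape_interval: 15s` entry, which matches the "scraping every 15s" behavior reported during manual verification.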

**Disabling Prometheus**:

To deploy without Prometheus monitoring, remove the entire `prometheus` section from your environment config:

```json
{
  "environment": { "name": "my-env" },
  "ssh_credentials": { ... }
  // No prometheus section = monitoring disabled
}
```

**Accessing Prometheus**:

After deployment, access the Prometheus UI at `http://<vm-ip>:9090` where you can:

- View current metrics from tracker endpoints
- Query historical data
- Check target health status
- Explore available metrics

See [Prometheus Verification Guide](../e2e-testing/manual/prometheus-verification.md) for detailed verification steps.
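
Target health can also be checked without the web UI, because Prometheus exposes an HTTP API endpoint at `/api/v1/targets`. The helper below only builds that URL from the VM address and the port documented above; `prometheus_targets_url` is a name invented for this sketch and is not part of the deployer.

```rust
use std::net::{IpAddr, SocketAddr};

/// Hypothetical helper: URL of the Prometheus targets API on the deployed VM.
/// `SocketAddr` formatting adds the brackets that IPv6 addresses need.
pub fn prometheus_targets_url(vm_ip: IpAddr) -> String {
    let addr = SocketAddr::new(vm_ip, 9090);
    format!("http://{addr}/api/v1/targets")
}
```

The resulting URL can be fetched with any HTTP client; the response is JSON that lists each scrape target together with its health.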

### Logging Configuration

Control logging output with command-line options:

project-words.txt

Lines changed: 5 additions & 0 deletions
@@ -2,6 +2,7 @@ AAAAB
 AAAAC
 AAAAI
 AGENTS
+Alertmanager
 Ashburn
 Avalonia
 CIFS
@@ -86,13 +87,15 @@ ehthumbs
 elif
 endfor
 endraw
+entr
 epel
 eprint
 eprintln
 equalto
 executability
 exfiltration
 exitcode
+flatlined
 frontends
 getent
 getopt
@@ -169,6 +172,7 @@ preconfigured
 preinstalls
 prereq
 println
+promtool
 publickey
 pytest
 readlink
@@ -234,6 +238,7 @@ tmpfiles
 tmpfs
 tmptu
 torrust
+tulpn
 tulnp
 turbofish
 tést
