Commit b32a043

feat: add platform monitoring guide
1 parent 7327627 commit b32a043

1 file changed: +217 -0 lines changed

@@ -0,0 +1,217 @@
---
title: Seqera Platform Monitoring
headline: "Seqera Platform Monitoring"
description: "A guide on relevant platform metrics"
---

# Seqera Platform Monitoring

## Enabling Observability Metrics

The Seqera Platform Backend has built-in observability metrics, which you enable by adding `prometheus` to the `MICRONAUT_ENVIRONMENTS` environment variable. Doing so exposes a Prometheus endpoint at `/prometheus` on the default listen port (for example, `http://localhost:8080/prometheus`).
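As a minimal sketch of this configuration step, the helper below appends `prometheus` to an existing `MICRONAUT_ENVIRONMENTS` value without duplicating it. It assumes the variable holds a comma-separated list of environment names (the usual Micronaut convention); the function name `with_prometheus` is hypothetical and not part of the platform.

```python
def with_prometheus(current: str) -> str:
    """Return a MICRONAUT_ENVIRONMENTS value that includes `prometheus`.

    Assumption: the variable is a comma-separated list of names.
    Names are trimmed, empty entries are dropped, order is preserved.
    """
    envs = [name.strip() for name in current.split(",") if name.strip()]
    if "prometheus" not in envs:
        envs.append("prometheus")  # add it only once
    return ",".join(envs)
```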

Combine these metrics with infrastructure monitoring tools such as Node Exporter to monitor relevant metrics across your deployment.

---

## Key Metrics to Monitor

### JVM Memory Metrics

| Metric                         | Description                                              |
| ------------------------------ | -------------------------------------------------------- |
| `jvm_buffer_memory_used_bytes` | Memory used by JVM buffer pools (direct, mapped)         |
| `jvm_memory_used_bytes`        | Amount of used memory by area (heap/non-heap) and region |
| `jvm_memory_committed_bytes`   | Memory committed for JVM use                             |
| `jvm_memory_max_bytes`         | Maximum memory available for memory management           |
| `jvm_gc_live_data_size_bytes`  | Size of long-lived heap memory pool after reclamation    |
| `jvm_gc_max_data_size_bytes`   | Max size of long-lived heap memory pool                  |

### JVM Garbage Collection

| Metric                                | Description                               |
| ------------------------------------- | ----------------------------------------- |
| `jvm_gc_pause_seconds_sum`            | Total time spent in GC pauses             |
| `jvm_gc_pause_seconds_count`          | Number of GC pause events                 |
| `jvm_gc_pause_seconds_max`            | Maximum GC pause duration                 |
| `jvm_gc_memory_allocated_bytes_total` | Total bytes allocated in young generation |
| `jvm_gc_memory_promoted_bytes_total`  | Bytes promoted to old generation          |

### JVM Threads

| Metric                       | Description                                                       |
| ---------------------------- | ----------------------------------------------------------------- |
| `jvm_threads_live_threads`   | Current number of live threads (daemon + non-daemon)              |
| `jvm_threads_daemon_threads` | Current number of daemon threads                                  |
| `jvm_threads_peak_threads`   | Peak thread count since JVM start                                 |
| `jvm_threads_states_threads` | Thread count by state (runnable, blocked, waiting, timed-waiting) |

### JVM Classes

| Metric                               | Description                            |
| ------------------------------------ | -------------------------------------- |
| `jvm_classes_loaded_classes`         | Currently loaded classes               |
| `jvm_classes_unloaded_classes_total` | Total classes unloaded since JVM start |

### HTTP Server Requests

| Metric                                     | Description                                       |
| ------------------------------------------ | ------------------------------------------------- |
| `http_server_requests_seconds_count`       | Total request count by method, status, and URI    |
| `http_server_requests_seconds_sum`         | Total request duration by method, status, and URI |
| `http_server_requests_seconds_max`         | Maximum request duration                          |
| `http_server_requests_seconds` (quantiles) | Request latency percentiles (p50, p95, p99, p999) |

### HTTP Client Requests

| Metric                               | Description                       |
| ------------------------------------ | --------------------------------- |
| `http_client_requests_seconds_count` | Outbound request count            |
| `http_client_requests_seconds_sum`   | Total outbound request duration   |
| `http_client_requests_seconds_max`   | Maximum outbound request duration |

### Process Metrics

| Metric                       | Description                          |
| ---------------------------- | ------------------------------------ |
| `process_cpu_usage`          | Recent CPU usage for the JVM process |
| `process_cpu_time_ns_total`  | Total CPU time used by the JVM       |
| `process_files_open_files`   | Open file descriptor count           |
| `process_files_max_files`    | Maximum file descriptor limit        |
| `process_uptime_seconds`     | JVM uptime                           |
| `process_start_time_seconds` | Process start time (unix epoch)      |

### System Metrics

| Metric                   | Description                           |
| ------------------------ | ------------------------------------- |
| `system_cpu_usage`       | System-wide CPU usage                 |
| `system_cpu_count`       | Number of processors available to JVM |
| `system_load_average_1m` | 1-minute load average                 |

### Executor Thread Pools

| Metric                           | Description                                                |
| -------------------------------- | ---------------------------------------------------------- |
| `executor_active_threads`        | Currently active threads by pool (io, blocking, scheduled) |
| `executor_pool_size_threads`     | Current thread pool size                                   |
| `executor_pool_max_threads`      | Maximum allowed threads in pool                            |
| `executor_queued_tasks`          | Tasks queued for execution                                 |
| `executor_completed_tasks_total` | Total completed tasks                                      |
| `executor_seconds_sum`           | Total execution time                                       |

### Cache Metrics

| Metric                  | Description                         |
| ----------------------- | ----------------------------------- |
| `cache_size`            | Number of entries in cache          |
| `cache_gets_total`      | Cache hits and misses by cache name |
| `cache_puts_total`      | Cache entries added                 |
| `cache_evictions_total` | Cache eviction count                |

### Hibernate/Database Metrics

| Metric                                   | Description                                          |
| ---------------------------------------- | ---------------------------------------------------- |
| `hibernate_sessions_open_total`          | Total sessions opened                                |
| `hibernate_sessions_closed_total`        | Total sessions closed                                |
| `hibernate_connections_obtained_total`   | Database connections obtained                        |
| `hibernate_query_executions_total`       | Total queries executed                               |
| `hibernate_query_executions_max_seconds` | Slowest query time                                   |
| `hibernate_entities_inserts_total`       | Entity insert operations                             |
| `hibernate_entities_updates_total`       | Entity update operations                             |
| `hibernate_entities_deletes_total`       | Entity delete operations                             |
| `hibernate_entities_loads_total`         | Entity load operations                               |
| `hibernate_transactions_total`           | Transaction count                                    |
| `hibernate_flushes_total`                | Session flush count                                  |
| `hibernate_optimistic_failures_total`    | Optimistic lock failures (StaleObjectStateException) |

### Seqera Platform-Specific Metrics

#### Workflow Metrics

| Metric                                    | Description         |
| ----------------------------------------- | ------------------- |
| `credits_estimation_workflow_added_total` | Workflows added     |
| `credits_estimation_workflow_ended_total` | Workflows completed |
| `credits_estimation_task_started_total`   | Tasks started       |
| `credits_estimation_task_ended_total`     | Tasks ended         |

#### Data Studio Metrics

| Metric                                           | Description                          |
| ------------------------------------------------ | ------------------------------------ |
| `data_studio_startup_time_failure_seconds_sum`   | Time for failed Data Studio startups |
| `data_studio_startup_time_failure_seconds_count` | Failed Data Studio startup count     |

#### Error Tracking

| Metric                         | Description               |
| ------------------------------ | ------------------------- |
| `tower_logs_errors_10secCount` | Errors in last 10 seconds |
| `tower_logs_errors_1minCount`  | Errors in last minute     |
| `tower_logs_errors_5minCount`  | Errors in last 5 minutes  |

### Logging Metrics

| Metric                 | Description                                           |
| ---------------------- | ----------------------------------------------------- |
| `logback_events_total` | Log events by level (debug, info, warn, error, trace) |

---

## Recommended Alerting Thresholds

### Critical Alerts

- `jvm_memory_used_bytes{area="heap"}` > 90% of `jvm_memory_max_bytes`
- `process_files_open_files` > 90% of `process_files_max_files`
- `logback_events_total{level="error"}` rate > threshold
- `tower_logs_errors_1minCount` > 0
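
To make the two ratio-based thresholds above concrete, here is a minimal Python sketch that parses text scraped from the `/prometheus` endpoint and computes heap and file-descriptor usage. The 0.9 limit mirrors the 90% figure in the list. The function names and the simplified parser (it assumes `name{labels} value` lines with no trailing timestamps) are illustrative assumptions, not part of the Seqera Platform.

```python
from collections import defaultdict

CRITICAL_RATIO = 0.9  # the 90% threshold from the list above


def parse_samples(text: str) -> dict:
    """Group (raw label string, value) pairs by metric name.

    Simplified: assumes each sample line ends with its value
    (no trailing timestamp).
    """
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        head, value = line.rsplit(" ", 1)
        name, _, labels = head.partition("{")
        samples[name].append((labels.rstrip("}"), float(value)))
    return samples


def total(samples: dict, name: str, label: str = "") -> float:
    """Sum a metric over series whose label string contains `label`.

    Negative values are skipped, on the assumption that the JVM
    reports -1 for a pool whose maximum is undefined.
    """
    return sum(v for labels, v in samples.get(name, [])
               if label in labels and v >= 0)


def ratio(samples: dict, used: str, limit: str, label: str = "") -> float:
    """Return used/limit, or 0.0 when the limit is missing or zero."""
    limit_total = total(samples, limit, label)
    return total(samples, used, label) / limit_total if limit_total > 0 else 0.0


def critical_alerts(text: str) -> list:
    """Return the names of the ratio-based critical alerts that fire."""
    s = parse_samples(text)
    fired = []
    if ratio(s, "jvm_memory_used_bytes", "jvm_memory_max_bytes",
             'area="heap"') > CRITICAL_RATIO:
        fired.append("heap")
    if ratio(s, "process_files_open_files",
             "process_files_max_files") > CRITICAL_RATIO:
        fired.append("open_files")
    return fired
```

In a real deployment these checks would normally be expressed as Prometheus alerting rules; the sketch only shows how the ratios are derived from the raw samples.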

### Warning Alerts

- `jvm_gc_pause_seconds_sum` rate increasing significantly
- `executor_queued_tasks` > threshold
- `hibernate_optimistic_failures_total` rate increasing
- `http_server_requests_seconds` p99 > acceptable latency

---

## Example PromQL Queries

### Request Rate (requests per second)

```promql
rate(http_server_requests_seconds_count[5m])
```

### Average Request Latency

```promql
rate(http_server_requests_seconds_sum[5m]) / rate(http_server_requests_seconds_count[5m])
```
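
The division above works because both series are monotonically increasing counters: over a window, the increase in `_sum` is the total time spent serving requests and the increase in `_count` is the number of requests served, so the window length cancels out. A rough Python sketch of the same arithmetic from two scrapes follows. The `Scrape` type is a hypothetical helper, and unlike PromQL's `rate()` this sketch does not handle counter resets or extrapolation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scrape:
    """One observation of the two request counters (hypothetical type)."""
    seconds_sum: float    # http_server_requests_seconds_sum
    seconds_count: float  # http_server_requests_seconds_count


def average_latency(earlier: Scrape, later: Scrape) -> Optional[float]:
    """Mean seconds per request between two scrapes.

    Returns None when no requests arrived in between (or when a
    counter reset makes the difference non-positive).
    """
    requests = later.seconds_count - earlier.seconds_count
    if requests <= 0:
        return None
    return (later.seconds_sum - earlier.seconds_sum) / requests
```

For example, if `_sum` grows from 10 s to 16 s while `_count` grows from 100 to 120, the average latency over that interval is 6 / 20 = 0.3 s.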

### JVM Heap Usage Percentage

```promql
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) * 100
```

### GC Pause Rate (seconds of pause per second)

```promql
rate(jvm_gc_pause_seconds_sum[5m])
```

### Error Rate

```promql
rate(logback_events_total{level="error"}[5m])
```

### Thread Pool Utilization

```promql
executor_active_threads / executor_pool_size_threads * 100
```
