Commit f87a9e6

feat: add queue metrics and improve concurrency
- Added comprehensive queue metrics for monitoring concurrency system health
- Implemented periodic queue validation to detect and repair inconsistencies
- Enhanced queue manager with validation and repair capabilities
- Updated documentation with new metrics and monitoring guidance
- Improved semaphore implementation with better locking and error handling
- Added test coverage for new queue validation functionality

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
1 parent e7a2b7b commit f87a9e6

10 files changed, +1198 -89 lines changed

docs/content/docs/install/metrics.md

Lines changed: 156 additions & 6 deletions
@@ -10,12 +10,19 @@ The metrics for pipelines-as-code can be accessed through the `pipelines-as-code
 pipelines-as-code supports various exporters, such as Prometheus, Google Stackdriver, and more.
 You can configure these exporters by referring to the [observability configuration](../config/config-observability.yaml).

-| Name | Type | Description |
-|------------------------------------------------------|---------|--------------------------------------------------------------------|
-| `pipelines_as_code_git_provider_api_request_count` | Counter | Number of API requests submitted to git providers |
-| `pipelines_as_code_pipelinerun_count` | Counter | Number of pipelineruns created by pipelines-as-code |
-| `pipelines_as_code_pipelinerun_duration_seconds_sum` | Counter | Number of seconds all pipelineruns have taken in pipelines-as-code |
-| `pipelines_as_code_running_pipelineruns_count` | Gauge | Number of running pipelineruns in pipelines-as-code |
+## Core PipelineRun Metrics
+
+| Name | Type | Description | Labels |
+|------------------------------------------------------|---------|--------------------------------------------------------------------|---------|
+| `pipelines_as_code_pipelinerun_count` | Counter | Number of pipelineruns created by pipelines-as-code | `provider`, `event-type`, `namespace`, `repository` |
+| `pipelines_as_code_pipelinerun_duration_seconds_sum` | Counter | Number of seconds all pipelineruns have taken in pipelines-as-code | `namespace`, `repository`, `status`, `reason` |
+| `pipelines_as_code_running_pipelineruns_count` | Gauge | Number of running pipelineruns in pipelines-as-code | `namespace`, `repository` |
+
+## Git Provider API Metrics
+
+| Name | Type | Description | Labels |
+|------------------------------------------------------|---------|--------------------------------------------------------------------|---------|
+| `pipelines_as_code_git_provider_api_request_count` | Counter | Number of API requests submitted to git providers | `provider`, `event-type`, `namespace`, `repository` |

 **Note:** The metric `pipelines_as_code_git_provider_api_request_count`
 is emitted by both the Controller and the Watcher, since both services
@@ -26,3 +33,146 @@ combine both services' metrics. For example, using PromQL:
 - `sum (rate(pac_controller_pipelines_as_code_git_provider_api_request_count[1m]) or rate(pac_watcher_pipelines_as_code_git_provider_api_request_count[1m]))`

 ![Prometheus query for git provider API usage metrics combined from both the Watcher and the Controller](/images/git-api-usage-metrics-prometheus-query.png)
+
+## Queue Concurrency Metrics
+
+The following metrics are available for monitoring the concurrency queue system that manages PipelineRun execution:
+
+| Name | Type | Description | Labels |
+|-----------------------------------------|---------|--------------------------------------------------------------------|---------|
+| `pac_queue_validation_errors_total` | Gauge | Number of queue validation errors per repository | `repository`, `namespace` |
+| `pac_queue_validation_warnings_total` | Gauge | Number of queue validation warnings per repository | `repository`, `namespace` |
+| `pac_queue_repair_operations_total` | Counter | Number of queue repair operations | `repository`, `namespace`, `status` |
+| `pac_queue_state` | Gauge | Current state of concurrency queues | `repository`, `namespace`, `state` |
+| `pac_queue_utilization_percentage` | Gauge | Queue utilization as percentage of concurrency limit | `repository`, `namespace` |
+| `pac_queue_recovery_duration_seconds` | Histogram | Time taken to recover queue state | `repository`, `namespace` |
+
+### Queue Metrics Details
+
+#### Validation Metrics
+
+- **`pac_queue_validation_errors_total`**: Tracks the number of validation errors found during periodic queue consistency checks. High values indicate queue inconsistencies that need attention.
+- **`pac_queue_validation_warnings_total`**: Tracks the number of validation warnings found during periodic queue consistency checks. Warnings indicate potential issues but are less severe than errors.
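
The diff does not show which conditions count as an error versus a warning. As a rough illustration only, a consistency check over a queue snapshot might look like the sketch below. The `queueSnapshot` type, the `validateQueue` function, and the three rules are all hypothetical and are not taken from this commit.

```go
package main

import "fmt"

// queueSnapshot is a hypothetical view of one repository's queue.
// It is an assumption for illustration, not a type from this commit.
type queueSnapshot struct {
	Running []string // keys of PipelineRuns currently executing
	Pending []string // keys of PipelineRuns waiting in the queue
	Limit   int      // configured concurrency limit (0 means unlimited)
}

// validateQueue applies three illustrative rules and returns the
// findings split into errors and warnings.
func validateQueue(q queueSnapshot) (errs, warns []string) {
	running := make(map[string]bool, len(q.Running))
	for _, key := range q.Running {
		running[key] = true
	}

	// Rule 1 (error): a run cannot be running and pending at once.
	for _, key := range q.Pending {
		if running[key] {
			errs = append(errs, fmt.Sprintf("%s is both running and pending", key))
		}
	}

	// Rule 2 (error): more runs are executing than the limit allows.
	if q.Limit > 0 && len(q.Running) > q.Limit {
		errs = append(errs, fmt.Sprintf("%d running exceeds limit %d", len(q.Running), q.Limit))
	}

	// Rule 3 (warning): runs are waiting although capacity is free.
	if q.Limit > 0 && len(q.Pending) > 0 && len(q.Running) < q.Limit {
		warns = append(warns, "pending runs exist while capacity is free")
	}
	return errs, warns
}
```

Under these assumed rules, the lengths of the two returned slices would be the values fed into the errors and warnings gauges for a repository.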
+
+#### Repair Metrics
+
+- **`pac_queue_repair_operations_total`**: Counts the number of repair operations performed. The `status` label indicates whether the repair was `success` or `failed`.
+
+#### State Metrics
+
+- **`pac_queue_state`**: Shows the current state of queues with the following `state` labels:
+  - `running`: Number of PipelineRuns currently executing
+  - `pending`: Number of PipelineRuns waiting in the queue
+
+#### Utilization Metrics
+
+- **`pac_queue_utilization_percentage`**: Shows queue utilization as a percentage of the configured concurrency limit. Values close to 100% indicate high queue usage.
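
The description suggests the gauge is the running count divided by the concurrency limit, times 100. The sketch below assumes exactly that formula. The function name `utilizationPercent` is hypothetical, and returning 0 when no limit is configured is an assumption, since the diff does not say how that case is handled.

```go
package main

// utilizationPercent returns running/limit as a percentage.
// Hypothetical helper: the formula is inferred from the metric
// description, not copied from the commit.
func utilizationPercent(running, limit int) float64 {
	if limit <= 0 {
		// Assumption: without a limit there is nothing to be
		// "utilized", so report 0 rather than dividing by zero.
		return 0
	}
	return float64(running) * 100 / float64(limit)
}
```

For example, 3 running PipelineRuns against a limit of 4 gives 75, which would stay below the 80 and 90 thresholds used in the queries and alerts in this file.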
+
+#### Performance Metrics
+
+- **`pac_queue_recovery_duration_seconds`**: Measures the time taken to recover queue state during initialization or repair operations. This helps identify performance issues.
+
+### Queue Metrics Use Cases
+
+#### Monitoring Queue Health
+
+```promql
+# Check for repositories with validation errors
+pac_queue_validation_errors_total > 0
+
+# Monitor queue utilization
+pac_queue_utilization_percentage > 80
+
+# Track repair success rate
+rate(pac_queue_repair_operations_total{status="success"}[5m]) / rate(pac_queue_repair_operations_total[5m])
+```
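
One caveat on the success-rate query: when no repair ran in the window, both rates are zero and the division yields NaN rather than a ratio. The same calculation with an explicit guard can be sketched as follows. `repairSuccessRatio` is a hypothetical helper that takes already-collected counts, not an API from this commit.

```go
package main

// repairSuccessRatio returns the share of successful repairs.
// ok is false when no repairs were recorded, so callers can tell
// "no data" apart from "0% success". Hypothetical helper.
func repairSuccessRatio(success, failed float64) (ratio float64, ok bool) {
	total := success + failed
	if total == 0 {
		return 0, false
	}
	return success / total, true
}
```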
+
+#### Alerting Examples
+
+```yaml
+# Alert on high validation errors
+- alert: QueueValidationErrors
+  expr: pac_queue_validation_errors_total > 5
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High queue validation errors detected"
+
+# Alert on high queue utilization
+- alert: HighQueueUtilization
+  expr: pac_queue_utilization_percentage > 90
+  for: 2m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Queue utilization is high"
+```
+
+## Metric Collection Frequency
+
+- **Core PipelineRun metrics**: Emitted in real-time as PipelineRuns are created and completed
+- **Git Provider API metrics**: Emitted for each API request to Git providers
+- **Queue metrics**: Collected every 1 minute during periodic queue validation
+
+## Configuration
+
+### Enabling Metrics
+
+Metrics are enabled by default. You can configure the metrics endpoint and collection settings through the observability configuration:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: config-observability
+  namespace: pipelines-as-code
+data:
+  metrics.backend-destination: prometheus
+  metrics.request-metrics-backend-destination: prometheus
+  prometheus-host: 0.0.0.0
+  prometheus-port: "9090"
+```
+
+### Customizing Queue Validation Frequency
+
+The queue validation frequency can be adjusted by modifying the controller code. Currently set to run every 1 minute:
+
+```go
+ticker := time.NewTicker(1 * time.Minute) // Run every 1 minute
+```
+
+## Troubleshooting
+
+### Common Issues
+
+1. **High validation errors**: Indicates queue inconsistencies, often due to controller restarts or partial failures
+2. **High queue utilization**: May indicate insufficient concurrency limits or high pipeline demand
+3. **Long recovery times**: Suggests performance issues during queue initialization or repair
+
+### Debugging Commands
+
+```bash
+# Start a proxy to access the metrics endpoint
+kubectl proxy &
+
+# Find the controller pod name
+kubectl get pod -n pipelines-as-code | grep controller
+
+# Access metrics via the proxy (replace POD_NAME with the actual name from above)
+curl http://127.0.0.1:8001/api/v1/namespaces/pipelines-as-code/pods/POD_NAME:9090/proxy/metrics
+
+# Filter queue metrics
+curl http://127.0.0.1:8001/api/v1/namespaces/pipelines-as-code/pods/POD_NAME:9090/proxy/metrics | grep pac_queue
+
+# Check specific repository metrics
+curl http://127.0.0.1:8001/api/v1/namespaces/pipelines-as-code/pods/POD_NAME:9090/proxy/metrics | grep "repository=\"your-repo-name\""
+```
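
The `grep pac_queue` filter above also matches `# HELP` and `# TYPE` comment lines and any line that merely contains the string. Where the filtering happens in Go instead (in a test or a small diagnostic tool, say), a stricter variant could look like this sketch. `filterMetrics` is a hypothetical helper operating on the text already fetched from the endpoint.

```go
package main

import (
	"bufio"
	"strings"
)

// filterMetrics returns the sample lines whose metric name starts
// with prefix, skipping blank lines and "#" comment lines.
// Hypothetical helper for illustration only.
func filterMetrics(exposition, prefix string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, prefix) {
			out = append(out, line)
		}
	}
	return out
}
```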
+
+## Best Practices
+
+1. **Monitor queue validation errors**: Set up alerts for repositories with persistent validation errors
+2. **Track utilization trends**: Monitor queue utilization to optimize concurrency limits
+3. **Watch repair operations**: High repair rates may indicate underlying issues
+4. **Set appropriate concurrency limits**: Balance resource usage with pipeline throughput
+5. **Regular metric review**: Periodically review metrics to identify optimization opportunities
