Commit 23f91e4

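feat(observability): add HTTP request latency metric collection
Implement an HTTP middleware that records request latency and exports it as Prometheus metrics. Add service information to the metrics collector. Unify code formatting and fix indentation. Add the Prometheus Adapter API documentation.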
1 parent e631114 commit 23f91e4

12 files changed, +519 -173 lines changed

docs/prometheus_adapter/API.md

Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
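# Prometheus Adapter API Documentation

## Overview

Prometheus Adapter provides a RESTful API for retrieving service QPS and average latency metrics from Prometheus. Queries can be filtered by service name and version.

> **Current status**
> - QPS metric: implemented, using the `system_network_qps` metric (based on network packet statistics)
> - Latency metric: implemented, using the `http.server.request.duration_seconds` metric (actual HTTP request latency)

## API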

### 1. Get service QPS metrics

**GET** `/v1/metrics/:service/qps`

Returns QPS (requests per second) time-series data for the specified service.

#### Path parameters
- `service` (string, required): service name

#### Query parameters
- `version` (string, optional): service version; if omitted, all versions are returned
- `start` (string, optional): start time (RFC3339 format, e.g. 2024-01-01T00:00:00Z)
- `end` (string, optional): end time (RFC3339 format, e.g. 2024-01-01T01:00:00Z)
- `step` (string, optional): time step (e.g. 1m, 5m, 1h); defaults to 1m

#### Request example
```bash
GET /v1/metrics/metadata-service/qps?version=1.0.0&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z&step=1m
```

#### Response example
```json
{
  "service": "metadata-service",
  "version": "1.0.0",
  "metric_type": "qps",
  "data": [
    {
      "timestamp": "2024-01-01T00:00:00Z",
      "value": 150.5
    },
    {
      "timestamp": "2024-01-01T00:01:00Z",
      "value": 148.2
    }
  ],
  "summary": {
    "min": 120.1,
    "max": 180.3,
    "avg": 152.8,
    "total_points": 60
  }
}
```
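
As a rough client-side sketch, the query parameters above could be validated and assembled into a request path as shown here. `RangeQuery` and `BuildQPSPath` are hypothetical names that do not come from the adapter, and the sketch assumes `step` is limited to values that Go's `time.ParseDuration` accepts (such as `1m`, `5m`, `1h`).

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// RangeQuery mirrors the optional query parameters of the QPS endpoint.
// Hypothetical type: empty fields are left out of the request.
type RangeQuery struct {
	Version string // service version, e.g. "1.0.0"
	Start   string // RFC3339 start time
	End     string // RFC3339 end time
	Step    string // time step such as "1m"
}

// BuildQPSPath validates q and returns the request path with its query string.
func BuildQPSPath(service string, q RangeQuery) (string, error) {
	if service == "" {
		return "", fmt.Errorf("service is required")
	}
	var start, end time.Time
	var err error
	if q.Start != "" {
		if start, err = time.Parse(time.RFC3339, q.Start); err != nil {
			return "", fmt.Errorf("invalid start: %w", err)
		}
	}
	if q.End != "" {
		if end, err = time.Parse(time.RFC3339, q.End); err != nil {
			return "", fmt.Errorf("invalid end: %w", err)
		}
	}
	if q.Start != "" && q.End != "" && end.Before(start) {
		return "", fmt.Errorf("end is before start")
	}
	if q.Step != "" {
		if d, err := time.ParseDuration(q.Step); err != nil || d <= 0 {
			return "", fmt.Errorf("invalid step %q", q.Step)
		}
	}

	// Only send the parameters that were actually provided.
	params := url.Values{}
	for key, value := range map[string]string{
		"version": q.Version, "start": q.Start, "end": q.End, "step": q.Step,
	} {
		if value != "" {
			params.Set(key, value)
		}
	}

	path := "/v1/metrics/" + url.PathEscape(service) + "/qps"
	if len(params) > 0 {
		path += "?" + params.Encode()
	}
	return path, nil
}

func main() {
	path, err := BuildQPSPath("metadata-service", RangeQuery{
		Version: "1.0.0",
		Start:   "2024-01-01T00:00:00Z",
		End:     "2024-01-01T01:00:00Z",
		Step:    "1m",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(path)
}
```

`url.Values.Encode` sorts keys and percent-encodes colons, so the generated query string is equivalent to the request example above without being byte-identical to it.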

### 2. Get service average latency metrics

**GET** `/v1/metrics/:service/latency`

Returns average response latency time-series data for the specified service (unit: seconds).

#### Path parameters
- `service` (string, required): service name

#### Query parameters
- `version` (string, optional): service version; if omitted, all versions are returned
- `start` (string, optional): start time (RFC3339 format)
- `end` (string, optional): end time (RFC3339 format)
- `step` (string, optional): time step; defaults to 1m
- `percentile` (string, optional): percentile (p50, p95, p99); defaults to p50

#### Request example
```bash
GET /v1/metrics/storage-service/latency?version=1.0.0&percentile=p95&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z
```

#### Response example
```json
{
  "service": "storage-service",
  "version": "1.0.0",
  "metric_type": "latency",
  "percentile": "p95",
  "data": [
    {
      "timestamp": "2024-01-01T00:00:00Z",
      "value": 125.8
    },
    {
      "timestamp": "2024-01-01T00:01:00Z",
      "value": 132.1
    }
  ],
  "summary": {
    "min": 98.5,
    "max": 201.2,
    "avg": 128.9,
    "total_points": 60
  }
}
```
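
The QPS and latency responses share the same shape, so a consumer might decode both into one set of types. The following is a minimal sketch under that assumption: the type names and the `Summarize` helper are hypothetical, and how the adapter itself computes `summary` is not shown in this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Point is one sample in the "data" array.
type Point struct {
	Timestamp string  `json:"timestamp"`
	Value     float64 `json:"value"`
}

// Summary matches the "summary" object in the response examples.
type Summary struct {
	Min         float64 `json:"min"`
	Max         float64 `json:"max"`
	Avg         float64 `json:"avg"`
	TotalPoints int     `json:"total_points"`
}

// MetricResponse covers both metric types; Percentile is only set for latency.
type MetricResponse struct {
	Service    string  `json:"service"`
	Version    string  `json:"version"`
	MetricType string  `json:"metric_type"`
	Percentile string  `json:"percentile,omitempty"`
	Data       []Point `json:"data"`
	Summary    Summary `json:"summary"`
}

// Summarize derives min, max, and mean from a series (hypothetical helper).
func Summarize(points []Point) Summary {
	if len(points) == 0 {
		return Summary{}
	}
	s := Summary{Min: points[0].Value, Max: points[0].Value, TotalPoints: len(points)}
	sum := 0.0
	for _, p := range points {
		if p.Value < s.Min {
			s.Min = p.Value
		}
		if p.Value > s.Max {
			s.Max = p.Value
		}
		sum += p.Value
	}
	s.Avg = sum / float64(len(points))
	return s
}

func main() {
	body := []byte(`{
	  "service": "storage-service",
	  "version": "1.0.0",
	  "metric_type": "latency",
	  "percentile": "p95",
	  "data": [
	    {"timestamp": "2024-01-01T00:00:00Z", "value": 125.8},
	    {"timestamp": "2024-01-01T00:01:00Z", "value": 132.1}
	  ]
	}`)

	var resp MetricResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s %s: %+v\n", resp.Service, resp.Percentile, Summarize(resp.Data))
}
```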

### 3. Get service overview metrics

**GET** `/v1/metrics/:service/overview`

Returns an overview of both the QPS and latency metrics for the specified service.

#### Path parameters
- `service` (string, required): service name

#### Query parameters
- `version` (string, optional): service version
- `start` (string, optional): start time (RFC3339 format)
- `end` (string, optional): end time (RFC3339 format)

#### Response example
```json
{
  "service": "queue-service",
  "version": "1.0.0",
  "time_range": {
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-01-01T01:00:00Z"
  },
  "metrics": {
    "qps": {
      "current": 152.8,
      "avg": 148.5,
      "max": 180.3,
      "min": 120.1
    },
    "latency": {
      "p50": 85.2,
      "p95": 128.9,
      "p99": 201.2
    }
  }
}
```

### 4. List available services

**GET** `/v1/services`

Returns the list of services in Prometheus that can be monitored.

#### Query parameters
- `prefix` (string, optional): filter by service name prefix

#### Response example
```json
{
  "services": [
    {
      "name": "metadata-service",
      "versions": ["1.0.0"],
      "active_versions": ["1.0.0"],
      "last_updated": "2024-01-01T01:00:00Z"
    },
    {
      "name": "storage-service",
      "versions": ["1.0.0"],
      "active_versions": ["1.0.0"],
      "last_updated": "2024-01-01T00:45:00Z"
    },
    {
      "name": "queue-service",
      "versions": ["1.0.0"],
      "active_versions": ["1.0.0"],
      "last_updated": "2024-01-01T00:30:00Z"
    },
    {
      "name": "third-party-service",
      "versions": ["1.0.0"],
      "active_versions": ["1.0.0"],
      "last_updated": "2024-01-01T00:20:00Z"
    },
    {
      "name": "mock-error-service",
      "versions": ["1.0.0"],
      "active_versions": ["1.0.0"],
      "last_updated": "2024-01-01T00:15:00Z"
    }
  ],
  "total": 5
}
```

## Error responses

All APIs return errors in a unified format:

```json
{
  "error": "error_code",
  "message": "detailed error description",
  "details": {
    "field": "the specific field in error"
  }
}
```

### Common error codes

- `400 Bad Request`: invalid request parameters
- `404 Not Found`: the service or version does not exist
- `500 Internal Server Error`: internal server error
- `503 Service Unavailable`: failed to connect to Prometheus
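
One way to keep the error body and the HTTP status in sync is a single helper that writes both. The sketch below uses only the standard library (the services in this repository use gin, which is not assumed here); the `APIError` type, the `WriteError` and `StatusFor` helpers, and the string error codes such as `service_not_found` are all hypothetical, since the document does not list the actual `error` values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// APIError matches the unified error format shown above.
type APIError struct {
	Code    string            `json:"error"`
	Message string            `json:"message"`
	Details map[string]string `json:"details,omitempty"`
}

// statusByCode maps hypothetical error codes to the documented HTTP statuses.
var statusByCode = map[string]int{
	"invalid_parameter":      http.StatusBadRequest,
	"service_not_found":      http.StatusNotFound,
	"prometheus_unavailable": http.StatusServiceUnavailable,
}

// StatusFor returns the HTTP status for a code, defaulting to 500.
func StatusFor(code string) int {
	if status, ok := statusByCode[code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

// WriteError sends the error as JSON with the matching status code.
func WriteError(w http.ResponseWriter, e APIError) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(StatusFor(e.Code))
	_ = json.NewEncoder(w).Encode(e)
}

func main() {
	// Exercise the helper against an in-memory recorder.
	rec := httptest.NewRecorder()
	WriteError(rec, APIError{
		Code:    "invalid_parameter",
		Message: "step must be a positive duration",
		Details: map[string]string{"field": "step"},
	})
	fmt.Println(rec.Code, rec.Body.String())
}
```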

## Implementation notes

### Prometheus query syntax

Examples of the Prometheus queries the API uses internally:

#### QPS queries
```promql
# Network-packet QPS (current implementation)
system_network_qps{exported_job="metadata-service",service_version="1.0.0"}

# 5-minute average QPS
# (assumes system_network_qps is a gauge that already holds a per-second value,
# so it is averaged with avg_over_time rather than passed to rate())
avg_over_time(system_network_qps{exported_job="metadata-service",service_version="1.0.0"}[5m])
```

#### Average latency queries
```promql
# Classic PromQL metric names cannot contain dots, so these queries assume the
# metric is exported to Prometheus as http_server_request_duration_seconds.

# P95 latency (95th percentile)
histogram_quantile(0.95, rate(http_server_request_duration_seconds_bucket{exported_job="metadata-service",service_version="1.0.0"}[5m]))

# P50 latency (median)
histogram_quantile(0.50, rate(http_server_request_duration_seconds_bucket{exported_job="metadata-service",service_version="1.0.0"}[5m]))

# P99 latency (99th percentile)
histogram_quantile(0.99, rate(http_server_request_duration_seconds_bucket{exported_job="metadata-service",service_version="1.0.0"}[5m]))

# Average latency
rate(http_server_request_duration_seconds_sum{exported_job="metadata-service",service_version="1.0.0"}[5m])
/
rate(http_server_request_duration_seconds_count{exported_job="metadata-service",service_version="1.0.0"}[5m])
```
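
To illustrate how the `percentile` parameter could map onto these queries, here is a small sketch that builds the `histogram_quantile` expression from its parts. The function names are hypothetical, the metric name is passed in by the caller rather than fixed, and this is not necessarily how the adapter constructs its queries.

```go
package main

import (
	"fmt"
	"strings"
)

// quantiles maps the API's percentile values to histogram_quantile arguments.
var quantiles = map[string]string{"p50": "0.50", "p95": "0.95", "p99": "0.99"}

// selector builds the label matcher; the version matcher is dropped when
// no version is requested, so the query covers all versions.
func selector(service, version string) string {
	labels := []string{fmt.Sprintf("exported_job=%q", service)}
	if version != "" {
		labels = append(labels, fmt.Sprintf("service_version=%q", version))
	}
	return "{" + strings.Join(labels, ",") + "}"
}

// LatencyQuery returns a PromQL percentile query over the histogram named
// metric (without the _bucket suffix) using the given rate window.
func LatencyQuery(metric, service, version, percentile, window string) (string, error) {
	q, ok := quantiles[percentile]
	if !ok {
		return "", fmt.Errorf("unsupported percentile %q", percentile)
	}
	return fmt.Sprintf("histogram_quantile(%s, rate(%s_bucket%s[%s]))",
		q, metric, selector(service, version), window), nil
}

func main() {
	// The metric name here is an assumption about how the histogram is exported.
	query, err := LatencyQuery("http_server_request_duration_seconds",
		"metadata-service", "1.0.0", "p95", "5m")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(query)
}
```

Rejecting unknown percentiles before building the query keeps user input out of the PromQL expression except through quoted label values.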
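
### Configuration requirements

The configuration file must specify:
- Prometheus server address: `http://10.210.10.33:9090`
- Query timeout: 30 seconds
- Default time range: the last 1 hour
- Service label mapping:
  - Service name: `exported_job` (exposed as a label on the metric)
  - Version: `service_version` (exposed as a label on the metric)
  - Instance ID: set through the OpenTelemetry `service.instance.id` attribute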
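
These settings could be modeled as a single struct with defaults taken from the list above. This is only a sketch: the `Config` type, its field names, and `DefaultWindow` are hypothetical, and the actual configuration file format is not shown in this commit.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the adapter settings listed above (hypothetical layout).
type Config struct {
	PrometheusURL string        // Prometheus server address
	QueryTimeout  time.Duration // per-query timeout
	DefaultRange  time.Duration // range used when start/end are omitted
	ServiceLabel  string        // label carrying the service name
	VersionLabel  string        // label carrying the service version
}

// DefaultConfig returns the values stated in the documentation.
func DefaultConfig() Config {
	return Config{
		PrometheusURL: "http://10.210.10.33:9090",
		QueryTimeout:  30 * time.Second,
		DefaultRange:  time.Hour,
		ServiceLabel:  "exported_job",
		VersionLabel:  "service_version",
	}
}

// DefaultWindow returns the [start, end] range ending at now.
func (c Config) DefaultWindow(now time.Time) (start, end time.Time) {
	return now.Add(-c.DefaultRange), now
}

func main() {
	cfg := DefaultConfig()
	start, end := cfg.DefaultWindow(time.Now().UTC())
	fmt.Println(cfg.PrometheusURL, cfg.QueryTimeout)
	fmt.Println(start.Format(time.RFC3339), "to", end.Format(time.RFC3339))
}
```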
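
### Supported services

Services currently supported in the mock/s3 environment:
- `metadata-service` - metadata management service (version: 1.0.0)
- `storage-service` - storage service (version: 1.0.0)
- `queue-service` - message queue service (version: 1.0.0)
- `third-party-service` - third-party integration service (version: 1.0.0)
- `mock-error-service` - error simulation service (version: 1.0.0)

Version information for all services is exposed through the `service_version` label.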

### Caching strategy

- Metric data cache TTL: 30 seconds
- Service list cache TTL: 5 minutes
- Supports ETag cache validation
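
The following sketch shows one possible shape for such a cache: entries expire after a fixed TTL, and each stored body gets an ETag derived from its content so a client's `If-None-Match` header can be compared against it. All names are hypothetical and the hashing scheme (a truncated SHA-256) is an assumption; the adapter's actual cache is not part of this commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// TTLs taken from the caching strategy above.
const (
	MetricsTTL     = 30 * time.Second
	ServiceListTTL = 5 * time.Minute
)

type entry struct {
	body    []byte
	etag    string
	expires time.Time
}

// Cache is a small in-memory TTL cache that is safe for concurrent use.
type Cache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, items: make(map[string]entry)}
}

// ETag returns a quoted tag derived from the response body.
func ETag(body []byte) string {
	sum := sha256.Sum256(body)
	return `"` + hex.EncodeToString(sum[:8]) + `"`
}

// Put stores body under key and returns its ETag.
func (c *Cache) Put(key string, body []byte) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	e := entry{body: body, etag: ETag(body), expires: time.Now().Add(c.ttl)}
	c.items[key] = e
	return e.etag
}

// Get returns the cached body and ETag, or ok=false if missing or expired.
func (c *Cache) Get(key string) (body []byte, etag string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, found := c.items[key]
	if !found || !time.Now().Before(e.expires) {
		delete(c.items, key)
		return nil, "", false
	}
	return e.body, e.etag, true
}

// NotModified reports whether the client's If-None-Match value matches etag,
// in which case the handler can answer 304 without a body.
func NotModified(ifNoneMatch, etag string) bool {
	return ifNoneMatch != "" && ifNoneMatch == etag
}

func main() {
	cache := NewCache(MetricsTTL)
	tag := cache.Put("qps:metadata-service", []byte(`{"metric_type":"qps"}`))
	_, cached, ok := cache.Get("qps:metadata-service")
	fmt.Println(ok, NotModified(tag, cached))
}
```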

mock/s3/services/mock-error/internal/handler/mock_error_handler.go

Lines changed: 13 additions & 13 deletions
@@ -85,28 +85,28 @@ func (h *MockErrorHandler) deleteMetricAnomaly(c *gin.Context) {

 // checkMetricInjection checks whether a metric anomaly should be injected
 func (h *MockErrorHandler) checkMetricInjection(c *gin.Context) {
-	ctx := c.Request.Context()
+	ctx := c.Request.Context()

-	var request struct {
-		Service string `json:"service" binding:"required"`
-		MetricName string `json:"metric_name" binding:"required"`
-		Instance string `json:"instance"`
-	}
+	var request struct {
+		Service string `json:"service" binding:"required"`
+		MetricName string `json:"metric_name" binding:"required"`
+		Instance string `json:"instance"`
+	}

 	if err := c.ShouldBindJSON(&request); err != nil {
 		h.logger.Error(ctx, "Failed to bind metric injection check request", observability.Error(err))
 		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
 		return
 	}

-	anomaly, shouldInject := h.errorService.ShouldInjectError(ctx, request.Service, request.MetricName, request.Instance)
+	anomaly, shouldInject := h.errorService.ShouldInjectError(ctx, request.Service, request.MetricName, request.Instance)

-	response := gin.H{
-		"should_inject": shouldInject,
-		"service": request.Service,
-		"metric_name": request.MetricName,
-		"instance": request.Instance,
-	}
+	response := gin.H{
+		"should_inject": shouldInject,
+		"service": request.Service,
+		"metric_name": request.MetricName,
+		"instance": request.Instance,
+	}

 	if shouldInject {
 		response["anomaly"] = anomaly

mock/s3/services/mock-error/internal/service/mock_error_service.go

Lines changed: 21 additions & 21 deletions
@@ -118,19 +118,19 @@ func (s *MockErrorService) ShouldInjectError(ctx context.Context, service, metri
 	s.stats.TotalRequests++
 	s.stats.LastUpdated = time.Now()

-	for _, rule := range s.rules {
-		if !rule.Enabled {
-			continue
-		}
-
-		// Check service match
-		if rule.Service != "" && rule.Service != service {
-			continue
-		}
-		// Check instance match (if the rule specifies an instance, it must match)
-		if rule.Instance != "" && rule.Instance != instance {
-			continue
-		}
+	for _, rule := range s.rules {
+		if !rule.Enabled {
+			continue
+		}
+
+		// Check service match
+		if rule.Service != "" && rule.Service != service {
+			continue
+		}
+		// Check instance match (if the rule specifies an instance, it must match)
+		if rule.Instance != "" && rule.Instance != instance {
+			continue
+		}

 		// Check metric name match
 		if rule.MetricName != "" && rule.MetricName != metricName {
@@ -167,14 +167,14 @@ func (s *MockErrorService) ShouldInjectError(ctx context.Context, service, metri
 			"rule_id": rule.ID,
 		}

-		s.logger.Info(ctx, "Metric anomaly injected",
-			observability.String("rule_id", rule.ID),
-			observability.String("service", service),
-			observability.String("instance", instance),
-			observability.String("metric_name", metricName),
-			observability.String("anomaly_type", rule.AnomalyType),
-			observability.Float64("target_value", rule.TargetValue),
-			observability.Int("triggered_count", rule.Triggered))
+		s.logger.Info(ctx, "Metric anomaly injected",
+			observability.String("rule_id", rule.ID),
+			observability.String("service", service),
+			observability.String("instance", instance),
+			observability.String("metric_name", metricName),
+			observability.String("anomaly_type", rule.AnomalyType),
+			observability.Float64("target_value", rule.TargetValue),
+			observability.Int("triggered_count", rule.Triggered))

 		return anomaly, true
 	}

mock/s3/shared/interfaces/error_injector.go

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ type MetricAnomalyService interface {
 	ListRules(ctx context.Context) ([]*models.MetricAnomalyRule, error)

 	// Core metric anomaly injection functionality
-	ShouldInjectError(ctx context.Context, service, metricName, instance string) (map[string]any, bool)
+	ShouldInjectError(ctx context.Context, service, metricName, instance string) (map[string]any, bool)
 }

 // MetricInjector is the HTTP metric anomaly injector interface
