Commit 54b8f4d

feat(prometheus_adapter): implement incremental updates for alert rules
ps: not working end to end yet
1 parent 855c514 commit 54b8f4d

10 files changed, +1071 -63 lines changed

cmd/prometheus_adapter/main.go

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+package main
+
+import (
+	"os"
+
+	"github.com/fox-gonic/fox"
+	"github.com/qiniu/zeroops/internal/config"
+	prometheusadapter "github.com/qiniu/zeroops/internal/prometheus_adapter"
+	"github.com/rs/zerolog"
+	"github.com/rs/zerolog/log"
+)
+
+func main() {
+	// Configure logging
+	log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
+
+	log.Info().Msg("Starting Prometheus Adapter server")
+
+	// Load configuration
+	cfg := &config.Config{
+		Server: config.ServerConfig{
+			BindAddr: ":9999", // default port
+		},
+	}
+
+	// Use the port from the ADAPTER_PORT environment variable if it is set
+	if port := os.Getenv("ADAPTER_PORT"); port != "" {
+		cfg.Server.BindAddr = ":" + port
+	}
+
+	// Create the Prometheus Adapter server
+	adapter, err := prometheusadapter.NewPrometheusAdapterServer(cfg)
+	if err != nil {
+		log.Fatal().Err(err).Msg("Failed to create Prometheus Adapter server")
+	}
+
+	// Create the router
+	router := fox.New()
+
+	// Set up the API
+	if err := adapter.UseApi(router); err != nil {
+		log.Fatal().Err(err).Msg("Failed to setup API routes")
+	}
+
+	// Start the server
+	log.Info().Msgf("Starting Prometheus Adapter on %s", cfg.Server.BindAddr)
+	if err := router.Run(cfg.Server.BindAddr); err != nil {
+		log.Fatal().Err(err).Msg("Failed to start server")
+	}
+}
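
The entrypoint above concatenates `ADAPTER_PORT` into the bind address without checking it. A minimal sketch of a stricter variant follows; `resolveBindAddr` and `defaultBindAddr` are hypothetical names that do not appear in the commit, and the 1 to 65535 range check is an assumption about what counts as a valid port here.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultBindAddr mirrors the ":9999" default used in main.go.
const defaultBindAddr = ":9999"

// resolveBindAddr (hypothetical helper) turns a raw ADAPTER_PORT value into a
// bind address. An empty value falls back to the default; anything that is not
// a port number in 1..65535 is rejected instead of being passed to the router.
func resolveBindAddr(port string) (string, error) {
	if port == "" {
		return defaultBindAddr, nil
	}
	n, err := strconv.Atoi(port)
	if err != nil || n < 1 || n > 65535 {
		return "", fmt.Errorf("invalid ADAPTER_PORT %q", port)
	}
	return ":" + strconv.Itoa(n), nil
}

func main() {
	for _, raw := range []string{"", "8080", "http"} {
		addr, err := resolveBindAddr(raw)
		fmt.Printf("%q -> %q (err: %v)\n", raw, addr, err)
	}
}
```

In `main`, the call would take `os.Getenv("ADAPTER_PORT")` as its argument, so a typo in the variable fails at startup with a clear message rather than at `router.Run`.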

docs/prometheus_adapter/README.md

Lines changed: 79 additions & 6 deletions
@@ -139,10 +139,11 @@ internal/prometheus_adapter/
 
 ### Alert rule sync
 
+#### 1. Full rule sync
 - Method and path: `POST /v1/alert-rules/sync`
 - Function: receives the complete rule list sent by the monitoring and alerting module, generates the Prometheus rule file, and triggers a reload (full sync)
 - Request body example:
-  ```
+  ```json
   {
     "rules": [
       {
@@ -155,23 +156,95 @@ internal/prometheus_adapter/
     ],
     "rule_metas": [
       {
-        "alert_name": "high_cpu_usage_storage_v1",
+        "alert_name": "high_cpu_usage", // must match the name field of the rule template
         "labels": "{\"service\":\"storage-service\",\"version\":\"1.0.0\"}",
         "threshold": 90,
-        "watch_time": 300,
-        "match_time": "5m"
+        "watch_time": 300
       }
     ]
   }
   ```
 - Response example:
-  ```
+  ```json
   {
     "status": "success",
     "message": "Rules synced to Prometheus"
   }
   ```
 
+#### 2. Update a single rule template
+- Method and path: `PUT /v1/alert-rules/:rule_name`
+- Function: updates the specified alert rule template; the system automatically finds all metadata entries that use the rule and regenerates the Prometheus rules
+- Path parameters:
+  - `rule_name`: rule name (e.g. `high_cpu_usage`)
+- Request body example:
+  ```json
+  {
+    "description": "Abnormal CPU usage alert (updated)",
+    "expr": "avg(system_cpu_usage_percent)",
+    "op": ">=",
+    "severity": "critical"
+  }
+  ```
+- Response example:
+  ```json
+  {
+    "status": "success",
+    "message": "Rule 'high_cpu_usage' updated and synced to Prometheus",
+    "affected_metas": 3 // number of affected metadata entries
+  }
+  ```
+
+#### 3. Update a single rule metadata entry
+- Method and path: `PUT /v1/alert-rules/meta`
+- Function: updates the metadata of the specified rule; the system regenerates the Prometheus rules from the corresponding rule template
+- Request body example:
+  ```json
+  {
+    "rule_name": "high_cpu_usage", // required; matches the name of the rule template
+    "labels": "{\"service\":\"storage-service\",\"version\":\"2.0.0\"}", // required; used for unique identification
+    "threshold": 85,
+    "watch_time": 600
+  }
+  ```
+- Response example:
+  ```json
+  {
+    "status": "success",
+    "message": "Rule meta updated and synced to Prometheus",
+    "rule_name": "high_cpu_usage",
+    "labels": "{\"service\":\"storage-service\",\"version\":\"2.0.0\"}"
+  }
+  ```
+
+#### Rule generation mechanism
+- **Linking rule templates and metadata**: linked through the `alert_name` field
+  - `AlertRule.name` = `AlertRuleMeta.alert_name`
+- **Unique metadata identifier**: the combination of `alert_name` + `labels` uniquely identifies a metadata record
+- **Prometheus alert generation**:
+  - All alerts based on the same rule template share the same `alert` name (the rule template's `name`)
+  - `labels` distinguish the different service instances
+
+#### Field reference
+- **AlertRule (rule template)**:
+  - `name`: rule name, used as the Prometheus alert name
+  - `description`: rule description, a human-readable title
+  - `expr`: PromQL expression, e.g. `sum(apitime) by (service, version)`; may include a time range
+  - `op`: comparison operator (`>`, `<`, `=`, `!=`)
+  - `severity`: alert severity, usually placed in the alert's labels.severity
+- **AlertRuleMeta (metadata)**:
+  - `alert_name`: name of the associated rule (matches alert_rules.name)
+  - `labels`: JSON-formatted labels used to select a specific service (e.g. `{"service":"s3","version":"v1"}`)
+  - `threshold`: alert threshold
+  - `watch_time`: duration in seconds, maps to the Prometheus `for` field
+
+#### Incremental update notes
+- **Incremental update**: the new endpoints support incremental updates; only the fields to be changed need to be sent
+- **Automatic matching**:
+  - When a rule template is updated, the system automatically finds all metadata entries whose `alert_name` matches and regenerates the rules
+  - When metadata is updated, the system looks up the entry by `alert_name` + `labels` and updates it
+- **Caching**: the system caches the current rules and metadata in memory to support fast incremental updates
+
 ## Alertmanager integration
 
 - Goal: forward alerts fired by Prometheus through Alertmanager to the monitoring and alerting module
@@ -203,7 +276,7 @@ receivers:
 - `metadata-service`
 - `storage-service`
 - `queue-service`
-- `third-party-service` (originally third-party-servrice; corrected)
+- `third-party-service`
 - `mock-error-service`
 
 Version information for all services is exposed through the `service_version` label.
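
The rule generation mechanism described in the README diff above (one template plus one metadata entry yields one Prometheus alert) can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter's actual generator: `promRule` and `renderRule` are hypothetical names, copying the metadata labels onto the rule's `labels` is a guess at how instances are distinguished, and mapping `=` to `==` is added because PromQL spells equality comparison as `==`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// AlertRule and AlertRuleMeta mirror the fields documented in the README diff.
type AlertRule struct {
	Name, Description, Expr, Op, Severity string
}

type AlertRuleMeta struct {
	AlertName string
	Labels    string // JSON object encoded as a string
	Threshold float64
	WatchTime int // seconds
}

// promRule is a hypothetical in-memory form of one Prometheus alerting rule.
type promRule struct {
	Alert  string
	Expr   string
	For    string
	Labels map[string]string
}

// renderRule combines a template and one metadata entry into a single rule:
// expr = "<template expr> <op> <threshold>", for = "<watch_time>s".
func renderRule(rule AlertRule, meta AlertRuleMeta) (promRule, error) {
	if rule.Name != meta.AlertName {
		return promRule{}, fmt.Errorf("meta %q does not belong to rule %q", meta.AlertName, rule.Name)
	}
	labels := map[string]string{}
	if meta.Labels != "" {
		if err := json.Unmarshal([]byte(meta.Labels), &labels); err != nil {
			return promRule{}, fmt.Errorf("parse labels: %w", err)
		}
	}
	labels["severity"] = rule.Severity

	op := rule.Op
	if op == "=" {
		op = "==" // PromQL uses == for equality comparison
	}
	threshold := strconv.FormatFloat(meta.Threshold, 'f', -1, 64)
	return promRule{
		Alert:  rule.Name,
		Expr:   fmt.Sprintf("%s %s %s", rule.Expr, op, threshold),
		For:    fmt.Sprintf("%ds", meta.WatchTime),
		Labels: labels,
	}, nil
}

func main() {
	rule := AlertRule{Name: "high_cpu_usage", Expr: "avg(system_cpu_usage_percent)", Op: ">", Severity: "critical"}
	meta := AlertRuleMeta{
		AlertName: "high_cpu_usage",
		Labels:    `{"service":"storage-service","version":"1.0.0"}`,
		Threshold: 90,
		WatchTime: 300,
	}
	out, err := renderRule(rule, meta)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```

Because every entry for a template shares the same `Alert` name, the labels are the only thing that tells two generated rules apart, which matches the README's description of `alert_name` + `labels` as the unique key.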

go.mod

Lines changed: 1 addition & 1 deletion
@@ -11,6 +11,7 @@ require (
 	github.com/prometheus/common v0.66.1
 	github.com/redis/go-redis/v9 v9.5.1
 	github.com/rs/zerolog v1.34.0
+	gopkg.in/yaml.v3 v3.0.1
 )
 
 require (
@@ -52,5 +53,4 @@ require (
 	golang.org/x/sys v0.35.0 // indirect
 	golang.org/x/text v0.28.0 // indirect
 	google.golang.org/protobuf v1.36.8 // indirect
-	gopkg.in/yaml.v3 v3.0.1 // indirect
 )

internal/prometheus_adapter/api/alert_api.go

Lines changed: 86 additions & 0 deletions
@@ -1,6 +1,7 @@
 package api
 
 import (
+	"fmt"
 	"net/http"
 
 	"github.com/fox-gonic/fox"
@@ -10,6 +11,8 @@ import (
 // setupAlertRouters registers the alert-related routes
 func (api *Api) setupAlertRouters(router *fox.Engine) {
 	router.POST("/v1/alert-rules/sync", api.SyncRules)
+	router.PUT("/v1/alert-rules/:rule_name", api.UpdateRule)
+	router.PUT("/v1/alert-rules/meta", api.UpdateRuleMeta)
 }
 
 // SyncRules syncs rules to Prometheus
@@ -34,3 +37,86 @@ func (api *Api) SyncRules(c *fox.Context) {
 		"message": "Rules synced to Prometheus",
 	})
 }
+
+// UpdateRule updates a single rule template.
+// Only the specified rule is updated; the system automatically finds all metadata entries that use the rule and regenerates them.
+func (api *Api) UpdateRule(c *fox.Context) {
+	ruleName := c.Param("rule_name")
+	if ruleName == "" {
+		SendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter,
+			"Rule name is required", nil)
+		return
+	}
+
+	var req model.UpdateAlertRuleRequest
+	if err := c.ShouldBindJSON(&req); err != nil {
+		SendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter,
+			"Invalid request body: "+err.Error(), nil)
+		return
+	}
+
+	// Build the complete rule object
+	rule := model.AlertRule{
+		Name:        ruleName,
+		Description: req.Description,
+		Expr:        req.Expr,
+		Op:          req.Op,
+		Severity:    req.Severity,
+	}
+
+	err := api.alertService.UpdateRule(rule)
+	if err != nil {
+		SendErrorResponse(c, http.StatusInternalServerError, model.ErrorCodeInternalError,
+			"Failed to update rule: "+err.Error(), nil)
+		return
+	}
+
+	// Get the number of affected metadata entries
+	affectedCount := api.alertService.GetAffectedMetas(ruleName)
+
+	c.JSON(http.StatusOK, map[string]interface{}{
+		"status":         "success",
+		"message":        fmt.Sprintf("Rule '%s' updated and synced to Prometheus", ruleName),
+		"affected_metas": affectedCount,
+	})
+}
+
+// UpdateRuleMeta updates a single rule metadata entry.
+// alert_name + labels uniquely identify a metadata record.
+func (api *Api) UpdateRuleMeta(c *fox.Context) {
+	var req model.UpdateAlertRuleMetaRequest
+	if err := c.ShouldBindJSON(&req); err != nil {
+		SendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter,
+			"Invalid request body: "+err.Error(), nil)
+		return
+	}
+
+	// alert_name and labels are required
+	if req.AlertName == "" || req.Labels == "" {
+		SendErrorResponse(c, http.StatusBadRequest, model.ErrorCodeInvalidParameter,
+			"alert_name and labels are required", nil)
+		return
+	}
+
+	// Build the complete metadata object
+	meta := model.AlertRuleMeta{
+		AlertName: req.AlertName,
+		Labels:    req.Labels,
+		Threshold: req.Threshold,
+		WatchTime: req.WatchTime,
+	}
+
+	err := api.alertService.UpdateRuleMeta(meta)
+	if err != nil {
+		SendErrorResponse(c, http.StatusInternalServerError, model.ErrorCodeInternalError,
+			"Failed to update rule meta: "+err.Error(), nil)
+		return
+	}
+
+	c.JSON(http.StatusOK, map[string]interface{}{
+		"status":     "success",
+		"message":    "Rule meta updated and synced to Prometheus",
+		"alert_name": req.AlertName,
+		"labels":     req.Labels,
+	})
+}
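
The handlers above call `api.alertService.UpdateRule`, `UpdateRuleMeta`, and `GetAffectedMetas`, whose bodies are not part of this excerpt. As a rough illustration of the in-memory cache the README mentions, the sketch below keys metadata by `alert_name` + `labels`. `ruleCache`, `metaKey`, and the method names are hypothetical, and rejecting metadata for an unknown template is an assumption rather than confirmed behavior.

```go
package main

import (
	"fmt"
	"sync"
)

type AlertRule struct {
	Name, Description, Expr, Op, Severity string
}

type AlertRuleMeta struct {
	AlertName string
	Labels    string
	Threshold float64
	WatchTime int
}

// metaKey is the composite identity of one metadata entry.
type metaKey struct {
	alertName string
	labels    string
}

// ruleCache (hypothetical) holds the current templates and metadata in memory.
type ruleCache struct {
	mu    sync.RWMutex
	rules map[string]AlertRule
	metas map[metaKey]AlertRuleMeta
}

func newRuleCache() *ruleCache {
	return &ruleCache{
		rules: map[string]AlertRule{},
		metas: map[metaKey]AlertRuleMeta{},
	}
}

// UpsertRule stores or replaces a template by name.
func (c *ruleCache) UpsertRule(rule AlertRule) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.rules[rule.Name] = rule
}

// UpsertMeta stores or replaces the entry identified by alert_name + labels.
// It fails when no template with that name is cached (an assumption).
func (c *ruleCache) UpsertMeta(meta AlertRuleMeta) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.rules[meta.AlertName]; !ok {
		return fmt.Errorf("no rule template named %q", meta.AlertName)
	}
	c.metas[metaKey{meta.AlertName, meta.Labels}] = meta
	return nil
}

// AffectedMetas counts the metadata entries that reference a template.
func (c *ruleCache) AffectedMetas(ruleName string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	n := 0
	for k := range c.metas {
		if k.alertName == ruleName {
			n++
		}
	}
	return n
}

func main() {
	cache := newRuleCache()
	cache.UpsertRule(AlertRule{Name: "high_cpu_usage", Expr: "avg(system_cpu_usage_percent)", Op: ">", Severity: "critical"})
	_ = cache.UpsertMeta(AlertRuleMeta{AlertName: "high_cpu_usage", Labels: `{"service":"storage-service"}`, Threshold: 90, WatchTime: 300})
	_ = cache.UpsertMeta(AlertRuleMeta{AlertName: "high_cpu_usage", Labels: `{"service":"queue-service"}`, Threshold: 80, WatchTime: 300})
	fmt.Println("affected metas:", cache.AffectedMetas("high_cpu_usage"))
}
```

Comparing `labels` as a raw string means two JSON objects with the same keys in a different order count as different entries, so callers of a cache like this would need to send labels in a consistent form.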

internal/prometheus_adapter/model/alert.go

Lines changed: 9 additions & 10 deletions
@@ -2,19 +2,18 @@ package model
 
 // AlertRule is the alert rule table; it defines alert rule templates
 type AlertRule struct {
-	Name        string `json:"name" gorm:"type:varchar(255);primaryKey"`
-	Description string `json:"description" gorm:"type:text"`
-	Expr        string `json:"expr" gorm:"type:text;not null"`
-	Op          string `json:"op" gorm:"type:enum('>', '<', '=', '!=');not null"`
-	Severity    string `json:"severity" gorm:"type:varchar(50);not null"`
+	Name        string `json:"name" gorm:"type:varchar(255);primaryKey"`   // primary key; alert rule name
+	Description string `json:"description" gorm:"type:text"`               // human-readable title; can be concatenated and rendered into a readable title
+	Expr        string `json:"expr" gorm:"type:text;not null"`             // left-hand business metric expression, e.g. sum(apitime) by (service, version)
+	Op          string `json:"op" gorm:"type:varchar(4);not null"`         // threshold comparison operator (>, <, =, !=)
+	Severity    string `json:"severity" gorm:"type:varchar(32);not null"`  // alert severity; usually placed in the alert's labels.severity
 }
 
 // AlertRuleMeta is the alert rule metadata table; it stores service-level alert configuration.
 // It is used to instantiate an alert rule template as a concrete service alert.
 type AlertRuleMeta struct {
-	AlertName string  `json:"alert_name" gorm:"type:varchar(255);primaryKey"`
-	Labels    string  `json:"labels" gorm:"type:text"`     // service labels in JSON format, e.g. {"service":"storage-service","version":"1.0.0"}
-	Threshold float64 `json:"threshold"`                   // alert threshold
-	WatchTime int     `json:"watch_time"`                  // duration in seconds; maps to the Prometheus for field
-	MatchTime string  `json:"match_time" gorm:"type:text"` // time range expression
+	AlertName string  `json:"alert_name" gorm:"type:varchar(255);index"` // references alert_rules.name
+	Labels    string  `json:"labels" gorm:"type:jsonb"`                  // applicable labels, e.g. {"service":"s3","version":"v1"}; empty means global
+	Threshold float64 `json:"threshold"`                                 // threshold (rendered as the threshold metric value of the specific rule)
+	WatchTime int     `json:"watch_time"`                                // duration (maps to for in the Prometheus rule)
 }
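
`AlertRuleMeta.Labels` is stored as a JSON string, and the README uses `alert_name` + `labels` as the identity of a metadata record. One possible way to make that identity independent of key order is to normalize the string before comparing it. The helper below is a sketch; `canonicalLabels` is a hypothetical name, and it relies on `encoding/json` writing map keys in sorted order when marshaling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// canonicalLabels (hypothetical helper) parses a JSON object of string labels
// and re-encodes it. json.Marshal emits map keys in sorted order, so two
// inputs that differ only in key order or whitespace produce the same output.
// An empty input is returned unchanged, matching the "empty means global" note.
func canonicalLabels(raw string) (string, error) {
	if raw == "" {
		return "", nil
	}
	var labels map[string]string
	if err := json.Unmarshal([]byte(raw), &labels); err != nil {
		return "", fmt.Errorf("labels must be a JSON object of strings: %w", err)
	}
	out, err := json.Marshal(labels)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	a, _ := canonicalLabels(`{"version":"v1", "service":"s3"}`)
	b, _ := canonicalLabels(`{"service":"s3","version":"v1"}`)
	fmt.Println(a == b, a)
}
```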

internal/prometheus_adapter/model/api.go

Lines changed: 8 additions & 14 deletions
@@ -42,16 +42,10 @@ type CreateAlertRuleRequest struct {
 
 // UpdateAlertRuleRequest is the request to update an alert rule
 type UpdateAlertRuleRequest struct {
-	Description *string `json:"description,omitempty"`
-	Expr        *string `json:"expr,omitempty"`
-	Op          *string `json:"op,omitempty" binding:"omitempty,oneof=> < = !="`
-	Severity    *string `json:"severity,omitempty"`
-
-	// Metadata fields (optional)
-	Labels    map[string]string `json:"labels,omitempty"`
-	Threshold *float64          `json:"threshold,omitempty"`
-	WatchTime *int              `json:"watch_time,omitempty"`
-	MatchTime *string           `json:"match_time,omitempty"`
+	Description string `json:"description,omitempty"`
+	Expr        string `json:"expr,omitempty"`
+	Op          string `json:"op,omitempty" binding:"omitempty,oneof=> < = !="`
+	Severity    string `json:"severity,omitempty"`
 }
 
 // CreateAlertRuleMetaRequest is the request to create alert rule metadata
@@ -65,10 +59,10 @@ type CreateAlertRuleMetaRequest struct {
 
 // UpdateAlertRuleMetaRequest is the request to update alert rule metadata
 type UpdateAlertRuleMetaRequest struct {
-	Labels    map[string]string `json:"labels,omitempty"`
-	Threshold *float64          `json:"threshold,omitempty"`
-	WatchTime *int              `json:"watch_time,omitempty"`
-	MatchTime *string           `json:"match_time,omitempty"`
+	AlertName string  `json:"alert_name" binding:"required"`
+	Labels    string  `json:"labels" binding:"required"`
+	Threshold float64 `json:"threshold"`
+	WatchTime int     `json:"watch_time"`
 }
 
 // SyncRulesRequest is the request to sync rules
