-
Notifications
You must be signed in to change notification settings - Fork 6
feat: Add comprehensive rule management tools for Prometheus #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add create_recording_rule and create_alerting_rule tools - Add update_rule and delete_rule tools for rule management - Add validate_rule tool for PromQL expression validation - Add list_rule_files and get_rule_file_content utilities - Add reload_config tool for triggering Prometheus configuration reloads - Implement comprehensive file-based rule management with YAML support - Add automatic rule group creation and cleanup - Add JSON-based label and annotation support - Include comprehensive test suite with 11 test cases - Add detailed documentation in RULE_MANAGEMENT_TOOLS.md This addresses the missing rule management capability identified in the original MCP server, providing a complete solution for creating, updating, and deleting both recording and alerting rules. Signed-off-by: Orcun Berkem <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I took a first pass and have a few thoughts.
Like we discussed on slack, considering Prometheus itself doesn't have any kind of rule management support in the API, we kinda need to resort to file management and triggered reloads to do manage rules. It's awkward, but we're working with what we have.
Philosophical notes:
- this approach currently side steps any form of gitops/configuration management style approaches, and may even conflict if the given instance is already under any kind of rule file management
- this will only work if the prometheus-mcp-server has write access to the same storage that the prometheus instance has access to. this means like it likely needs to run alongside prometheus and/or potentially have volume mounts adjusted etc
- this would only update the prometheus it is connected to (theoretically), yet it's very common to run prometheus in HA duplicate pairs, thoughts on how to handle that scenario? idk if I'd trust LLMs enough to re-make the exact same rules/edits against multiple prometheus instances if eg a user connects the prometheus-mcp-server multiple times on different ports with different prometheus urls configured, etc.
Technical notes:
- was this AI generated? things hinting at it to me:
- the doc style, and the addition of a doc file for just a few tools that discusses everything from usage to go tests, which feels AI summation-like
- regurgitation of
ioutil.ReadFile()
calls that have been deprecated for years (yet littered across the internet and making great training material for AI lol) - some scattered odd syntax/formattings that should've been caught by the linter, but apparently I have it enabled to approve workflow runs, so should've done that sooner 😅
- left some notes on the /-/reload endpoint support, but that needs some adjustments please
- a very general request to not re-invent the wheel and try to base off of and use upstream as much as possible. the beautiful part about having prometheus written in go is that we can literally use prometheus as a library, and they've already done a lot of the hard work for us. the rules and rulefmt packages will be extremely useful here. this should eliminate the need for a lot of the custom types here, dramatically reduce the test coverage needed since we benefit from prometheus' testing, etc.
func validateRuleToolHandler(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) { | ||
expr, err := request.RequireString("expr") | ||
if err != nil { | ||
return mcp.NewToolResultError("expr must be a string"), nil | ||
} | ||
|
||
// Use the existing parse query API call if available | ||
isValid, err := validatePromQLExpression(ctx, expr) | ||
if err != nil { | ||
return mcp.NewToolResultError(fmt.Sprintf("Failed to validate expression: %v", err)), nil | ||
} | ||
|
||
if isValid { | ||
return mcp.NewToolResultText(fmt.Sprintf("PromQL expression is valid: %s", expr)), nil | ||
} | ||
|
||
return mcp.NewToolResultText(fmt.Sprintf("PromQL expression is invalid: %s", expr)), nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we refactor this to better lean on upstream rule group validation?
https://pkg.go.dev/github.com/prometheus/[email protected]/model/rulefmt#RuleGroups.Validate
return &PrometheusRule{Groups: []RuleGroup{}}, nil | ||
} | ||
|
||
data, err := ioutil.ReadFile(filePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ioutil has been deprecated for years now
if prometheusURL == "" { | ||
// For now, we'll assume the user provides the URL | ||
// In a full implementation, you'd get this from the global configuration | ||
return fmt.Errorf("prometheus_url must be provided") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static error doesn't need formatting string -- linter should catch this
// Make HTTP POST request to reload endpoint | ||
req, err := http.NewRequestWithContext(ctx, "POST", prometheusURL, nil) | ||
if err != nil { | ||
return fmt.Errorf("failed to create request: %w", err) | ||
} | ||
|
||
client := &http.Client{} | ||
resp, err := client.Do(req) | ||
if err != nil { | ||
return fmt.Errorf("failed to make request: %w", err) | ||
} | ||
defer resp.Body.Close() | ||
|
||
if resp.StatusCode != http.StatusOK { | ||
return fmt.Errorf("reload request failed with status %d", resp.StatusCode) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should reuse the client/transport from the v1 API client -- that's what has any HTTP/auth configs set on it from the --http.config
flag. Creating a naked request and using the default client negates all of that. This will fail if a user has basic auth set, etc etc.
Better yet, we should discuss getting first party support for Prometheus' management API (which includes the /-/reload
endpoint) into prometheus/client_golang so we don't have to put together our own HTTP handling here. I've already brought this up and discussed with @bwplotka a bit on slack, but there's more work to do there:
https://cloud-native.slack.com/archives/C01AUBA4PFE/p1747158200164229
This addresses the missing rule management capability identified in the original MCP server, providing a complete solution for creating, updating, and deleting both recording and alerting rules.