Skip to content

Conversation

oberkem
Copy link

@oberkem oberkem commented Jul 18, 2025

  • Add create_recording_rule and create_alerting_rule tools
  • Add update_rule and delete_rule tools for rule management
  • Add validate_rule tool for PromQL expression validation
  • Add list_rule_files and get_rule_file_content utilities
  • Add reload_config tool for triggering Prometheus configuration reloads
  • Implement comprehensive file-based rule management with YAML support
  • Add automatic rule group creation and cleanup
  • Add JSON-based label and annotation support
  • Include comprehensive test suite with 11 test cases
  • Add detailed documentation in RULE_MANAGEMENT_TOOLS.md

This addresses the missing rule management capability identified in the original MCP server, providing a complete solution for creating, updating, and deleting both recording and alerting rules.

- Add create_recording_rule and create_alerting_rule tools
- Add update_rule and delete_rule tools for rule management
- Add validate_rule tool for PromQL expression validation
- Add list_rule_files and get_rule_file_content utilities
- Add reload_config tool for triggering Prometheus configuration reloads
- Implement comprehensive file-based rule management with YAML support
- Add automatic rule group creation and cleanup
- Add JSON-based label and annotation support
- Include comprehensive test suite with 11 test cases
- Add detailed documentation in RULE_MANAGEMENT_TOOLS.md

This addresses the missing rule management capability identified in the
original MCP server, providing a complete solution for creating,
updating, and deleting both recording and alerting rules.

Signed-off-by: Orcun Berkem <[email protected]>
Copy link
Owner

@tjhop tjhop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I took a first pass and have a few thoughts.

Like we discussed on slack, considering Prometheus itself doesn't have any kind of rule management support in the API, we kinda need to resort to file management and triggered reloads to do manage rules. It's awkward, but we're working with what we have.

Philosophical notes:

  • this approach currently side steps any form of gitops/configuration management style approaches, and may even conflict if the given instance is already under any kind of rule file management
  • this will only work if the prometheus-mcp-server has write access to the same storage that the prometheus instance has access to. this means like it likely needs to run alongside prometheus and/or potentially have volume mounts adjusted etc
  • this would only update the prometheus it is connected to (theoretically), yet it's very common to run prometheus in HA duplicate pairs, thoughts on how to handle that scenario? idk if I'd trust LLMs enough to re-make the exact same rules/edits against multiple prometheus instances if eg a user connects the prometheus-mcp-server multiple times on different ports with different prometheus urls configured, etc.

Technical notes:

  • was this AI generated? things hinting at it to me:
    • the doc style, and the addition of a doc file for just a few tools that discusses everything from usage to go tests, which feels AI summation-like
    • regurgitation of ioutil.ReadFile() calls that have been deprecated for years (yet littered across the internet and making great training material for AI lol)
    • some scattered odd syntax/formattings that should've been caught by the linter, but apparently I have it enabled to approve workflow runs, so should've done that sooner 😅
  • left some notes on the /-/reload endpoint support, but that needs some adjustments please
  • a very general request to not re-invent the wheel and try to base off of and use upstream as much as possible. the beautiful part about having prometheus written in go is that we can literally use prometheus as a library, and they've already done a lot of the hard work for us. the rules and rulefmt packages will be extremely useful here. this should eliminate the need for a lot of the custom types here, dramatically reduce the test coverage needed since we benefit from prometheus' testing, etc.

Comment on lines +346 to +363
func validateRuleToolHandler(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) {
expr, err := request.RequireString("expr")
if err != nil {
return mcp.NewToolResultError("expr must be a string"), nil
}

// Use the existing parse query API call if available
isValid, err := validatePromQLExpression(ctx, expr)
if err != nil {
return mcp.NewToolResultError(fmt.Sprintf("Failed to validate expression: %v", err)), nil
}

if isValid {
return mcp.NewToolResultText(fmt.Sprintf("PromQL expression is valid: %s", expr)), nil
}

return mcp.NewToolResultText(fmt.Sprintf("PromQL expression is invalid: %s", expr)), nil
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor this to better lean on upstream rule group validation?
https://pkg.go.dev/github.com/prometheus/[email protected]/model/rulefmt#RuleGroups.Validate

return &PrometheusRule{Groups: []RuleGroup{}}, nil
}

data, err := ioutil.ReadFile(filePath)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ioutil has been deprecated for years now

if prometheusURL == "" {
// For now, we'll assume the user provides the URL
// In a full implementation, you'd get this from the global configuration
return fmt.Errorf("prometheus_url must be provided")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static error doesn't need formatting string -- linter should catch this

Comment on lines +260 to +275
// Make HTTP POST request to reload endpoint
req, err := http.NewRequestWithContext(ctx, "POST", prometheusURL, nil)
if err != nil {
return fmt.Errorf("failed to create request: %w", err)
}

client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
return fmt.Errorf("failed to make request: %w", err)
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
return fmt.Errorf("reload request failed with status %d", resp.StatusCode)
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should reuse the client/transport from the v1 API client -- that's what has any HTTP/auth configs set on it from the --http.config flag. Creating a naked request and using the default client negates all of that. This will fail if a user has basic auth set, etc etc.

Better yet, we should discuss getting first party support for Prometheus' management API (which includes the /-/reload endpoint) into prometheus/client_golang so we don't have to put together our own HTTP handling here. I've already brought this up and discussed with @bwplotka a bit on slack, but there's more work to do there:
https://cloud-native.slack.com/archives/C01AUBA4PFE/p1747158200164229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants