Skip to content

Conversation

@yaguangtang
Copy link
Member

@yaguangtang yaguangtang commented Jan 26, 2026

Summary

  • Update ipmi-exporter to v1.10.1 from Docker Hub
  • Enable SEL events collection with regex patterns for critical hardware events
  • Add Prometheus alert rules for SEL events (critical and warning severity)
  • Add unit tests for the new alert rules

Test plan

  • Verify ipmi-exporter image pulls successfully
  • Confirm SEL events are collected on nodes with IPMI support
  • Validate Prometheus alerts fire correctly when SEL events occur
  • Run alert rule unit tests

Closes: A8E-89
Fixes: #2973

🤖 Generated with Claude Code

Update ipmi-exporter to v1.10.1 and enable SEL events collection with
regex patterns for critical hardware events including memory errors,
CPU failures, temperature events, fan/PSU failures, and storage issues.

Add Prometheus alert rules for SEL events with appropriate severity
levels (critical for memory/CPU/power failures, warning for fan/storage
issues) and include unit tests for the new alerts.

Closes: A8E-89

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: yaguang tang <yaguang.tang@vexxhost.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Prometheus alerts for SEL events

1 participant