Feature: Auto-remediation actions for passive check failures #21

@amgowda-oci

Summary

Add a feature that automatically remediates passive check failures on GPU nodes. The remediation action should be determined by the error code that the failing passive check reports.

Proposed Auto-remediation Actions

  • Reboot Node: For specific recoverable error codes, automatically reboot the affected node.
  • Terminate Node: For critical failures where a reboot is not sufficient, terminate the node.
  • Suggest SRE Ticket: For error codes that require manual intervention, suggest opening an SRE ticket with relevant details.

Requirements

  • Map error codes from passive checks to appropriate remediation actions.
  • Implement logic to trigger the remediation actions based on the error code encountered.
  • Ensure all actions are logged and can be audited.
  • Provide configuration to enable/disable specific remediation actions per error code.
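
To make these requirements concrete, here is a minimal sketch of how an error-code mapping with a per-code enable/disable switch and audit logging might look. Everything in it is an assumption for discussion: the error codes are placeholders (the real passive check codes are not listed in this issue), and `Action`, `Rule`, `REMEDIATION_RULES`, and `select_action` are hypothetical names. Falling back to an SRE ticket suggestion for unmapped or disabled codes is likewise only one possible policy.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("auto_remediation")


class Action(Enum):
    """The three remediation actions proposed in this issue."""
    REBOOT_NODE = "reboot_node"
    TERMINATE_NODE = "terminate_node"
    SUGGEST_SRE_TICKET = "suggest_sre_ticket"


@dataclass(frozen=True)
class Rule:
    """Maps one error code to an action; `enabled` is the per-code switch."""
    action: Action
    enabled: bool = True


# Placeholder error codes (assumed, not real passive check codes).
REMEDIATION_RULES = {
    "EXAMPLE-RECOVERABLE-01": Rule(Action.REBOOT_NODE),
    "EXAMPLE-CRITICAL-01": Rule(Action.TERMINATE_NODE, enabled=False),
    "EXAMPLE-MANUAL-01": Rule(Action.SUGGEST_SRE_TICKET),
}


def select_action(node_id, error_code, rules=REMEDIATION_RULES):
    """Return the action for an error code and log the decision for auditing.

    Assumed policy: unmapped codes and disabled rules fall back to
    suggesting an SRE ticket rather than taking a disruptive action.
    """
    rule = rules.get(error_code)
    if rule is None:
        action, reason = Action.SUGGEST_SRE_TICKET, "unmapped error code"
    elif not rule.enabled:
        action = Action.SUGGEST_SRE_TICKET
        reason = f"{rule.action.value} is disabled for this error code"
    else:
        action, reason = rule.action, "mapped error code"
    logger.info(
        "node=%s error_code=%s action=%s reason=%s",
        node_id, error_code, action.value, reason,
    )
    return action
```

In a real implementation the rules would more likely be loaded from a configuration file than hard-coded, and the returned action would be handed to whatever component actually reboots or terminates nodes.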

Acceptance Criteria

  • Passive check failures are auto-remediated based on error code mapping.
  • Documentation is updated to list error codes and associated remediation actions.
  • SRE ticket suggestion includes error details and relevant context when manual intervention is required.
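
As a rough illustration of the last criterion, the sketch below assembles a ticket suggestion from a failure record. The `CheckFailure` fields, the `suggest_sre_ticket` helper, and the text layout are all hypothetical; the issue does not specify which details or context a suggestion must carry, nor which ticketing system receives it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CheckFailure:
    """Assumed shape of a passive check failure record."""
    node_id: str
    error_code: str
    message: str
    detected_at: datetime
    context: dict = field(default_factory=dict)


def suggest_sre_ticket(failure):
    """Build a plain-text ticket suggestion with error details and context."""
    lines = [
        f"Title: [{failure.error_code}] Passive check failure on {failure.node_id}",
        f"Detected at: {failure.detected_at.isoformat()}",
        f"Error: {failure.message}",
        "Context:",
    ]
    # Sort keys so the same failure always renders the same text.
    for key in sorted(failure.context):
        lines.append(f"  {key}: {failure.context[key]}")
    return "\n".join(lines)


# Example with placeholder values (not real error codes or node names).
example = CheckFailure(
    node_id="gpu-node-example",
    error_code="EXAMPLE-MANUAL-01",
    message="placeholder failure description",
    detected_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    context={"check": "placeholder-check", "attempted_action": "none"},
)
ticket_text = suggest_sre_ticket(example)
```

The function only produces text; whether the suggestion is printed, logged, or sent to a ticketing system is left open.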

Labels: enhancement
Issue Type: Feature
