generated from oracle-quickstart/oci-quickstart-template
    
        
Open
Description
Summary
Add a feature to automatically remediate passive check failures on GPU nodes. The remediation action should be determined by the error code that the passive check reports.
Proposed Auto-remediation Actions
- Reboot Node: For specific recoverable error codes, automatically reboot the affected node.
- Terminate Node: For critical failures where reboot is not sufficient, terminate the node.
- Suggest SRE Ticket: For error codes that require manual intervention, suggest opening an SRE ticket with relevant details.
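The three actions above could be represented as an enumeration with a lookup table keyed by error code. This is a minimal sketch, not a proposed implementation: the error codes (`EXAMPLE-RECOVERABLE-001` and so on), the `action_for` helper, and the choice to fall back to an SRE ticket suggestion for unmapped codes are all assumptions made for illustration. The real codes would come from the passive check itself.

```python
from enum import Enum


class RemediationAction(Enum):
    """The three actions proposed in this issue."""
    REBOOT_NODE = "reboot_node"
    TERMINATE_NODE = "terminate_node"
    SUGGEST_SRE_TICKET = "suggest_sre_ticket"


# Hypothetical error codes, used only to show the shape of the mapping.
ERROR_CODE_ACTIONS = {
    "EXAMPLE-RECOVERABLE-001": RemediationAction.REBOOT_NODE,
    "EXAMPLE-CRITICAL-001": RemediationAction.TERMINATE_NODE,
    "EXAMPLE-MANUAL-001": RemediationAction.SUGGEST_SRE_TICKET,
}


def action_for(error_code, mapping=ERROR_CODE_ACTIONS):
    """Return the action mapped to error_code.

    Assumption: a code with no mapping falls back to the least
    destructive action, suggesting an SRE ticket.
    """
    return mapping.get(error_code, RemediationAction.SUGGEST_SRE_TICKET)
```

Defaulting unknown codes to the non-destructive action is one possible policy. A stricter alternative would be to raise an error so that unmapped codes are noticed during rollout.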
Requirements
- Map error codes from passive checks to appropriate remediation actions.
- Implement logic to trigger the remediation actions based on the error code encountered.
- Ensure all actions are logged and can be audited.
- Provide configuration to enable/disable specific remediation actions per error code.
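One way the requirements above might fit together is a single dispatch function that looks up a per-error-code rule, honors an enable/disable flag, and writes an audit record for every decision, including skipped ones. The sketch below is illustrative only: `Rule`, `remediate`, the outcome strings, and the handler callables are hypothetical names, and the handlers stand in for whatever actually reboots or terminates a node.

```python
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone

logger = logging.getLogger("auto_remediation")


@dataclass(frozen=True)
class Rule:
    """Per-error-code configuration (hypothetical shape)."""
    action: str           # key into the handlers dict
    enabled: bool = True  # lets operators switch a single rule off


def remediate(node_id, error_code, rules, handlers, audit_log):
    """Apply the rule for error_code and record what happened."""
    rule = rules.get(error_code)
    if rule is None:
        outcome = "skipped_unmapped"
    elif not rule.enabled:
        outcome = "skipped_disabled"
    else:
        handlers[rule.action](node_id)  # perform the remediation
        outcome = "executed"

    # Every decision is recorded, so skipped cases are auditable too.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node_id": node_id,
        "error_code": error_code,
        "action": rule.action if rule else None,
        "outcome": outcome,
    }
    audit_log.append(record)
    logger.info(json.dumps(record))
    return record
```

Passing the handlers in as a dictionary keeps the dispatch logic testable without touching real nodes. In this sketch the audit log is an in-memory list, which a real deployment would replace with durable storage.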
Acceptance Criteria
- Passive check failures are auto-remediated based on error code mapping.
- Documentation is updated to list error codes and associated remediation actions.
- SRE ticket suggestion includes error details and relevant context when manual intervention is required.
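For the last criterion, a ticket suggestion could be assembled from the failing node, the error code, the check's message, and any extra context. The following is a rough sketch under assumed names: `build_ticket_suggestion`, its parameters, and the field layout are hypothetical, and the function only formats text. It does not call any ticketing system.

```python
def build_ticket_suggestion(node_id, error_code, error_message, context=None):
    """Format a suggested SRE ticket as a title/body pair.

    context is an optional dict of extra details (for example the
    check name or instance shape); keys are sorted for stable output.
    """
    title = f"[GPU passive check] {error_code} on {node_id}"
    lines = [
        f"Node: {node_id}",
        f"Error code: {error_code}",
        f"Error details: {error_message}",
        "Suggested action: manual intervention required",
    ]
    for key in sorted(context or {}):
        lines.append(f"{key}: {context[key]}")
    return {"title": title, "body": "\n".join(lines)}
```

A call such as `build_ticket_suggestion("node-a", "EXAMPLE-MANUAL-001", "check failed", {"check": "example"})` returns a dictionary whose `body` lists each field on its own line, ready to paste into a ticket.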
Labels: enhancement
Issue Type: Feature