Commit b1eadf0

Copilot and phrocker authored
Add self-healing system for automatic error detection and repair (#59)
* Initial plan

* Add self-healing database schema, entities, services, and UI components

Co-authored-by: phrocker <[email protected]>

* Add GitHub integration service, orchestrator, and passing unit tests

Co-authored-by: phrocker <[email protected]>

* Add comprehensive documentation and API controller tests for self-healing feature

Co-authored-by: phrocker <[email protected]>

* Implement complete self-healing workflow with coding agent launcher and Docker image builder

Co-authored-by: phrocker <[email protected]>

* Add self-healing config to Helm ConfigMap and enforce GitHub integration requirement

Co-authored-by: phrocker <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: phrocker <[email protected]>
1 parent db977bf commit b1eadf0

27 files changed: +3245 -1 lines changed

README.md

Lines changed: 145 additions & 0 deletions
@@ -50,6 +50,9 @@ Key Features
REST API
Manage your SSH configurations, enclaves, security rules, and sessions programmatically using a well-documented REST API.

Self-Healing System
Automatically detects, analyzes, and repairs system errors through intelligent coding agents. Configure patching policies (immediate, off-hours, or never) per pod/service, with built-in security analysis to prevent healing of security-sensitive errors without manual review. When configured, the system can automatically create GitHub pull requests with fixes.

Custom SSH Server responds via Sentrius UI or terminals
![image](docs/images/ssh.png)

@@ -441,6 +444,148 @@ The JIRA integration provides secure proxy access to JIRA APIs for ticket manage
All JIRA requests are authenticated through Keycloak and validated against the user's permissions.

## Self-Healing System

Sentrius includes an intelligent self-healing system that automatically detects, analyzes, and repairs errors in your infrastructure.

### Key Features

- **Automatic Error Detection**: Continuously monitors the error output table and OpenTelemetry data for system errors
- **Security Analysis**: Automatically analyzes errors to determine if they pose security concerns before attempting repairs
- **Flexible Patching Policies**: Configure per-pod/service policies for when repairs should be applied:
  - **Immediate**: Apply fixes as soon as errors are detected
  - **Off-Hours**: Queue fixes to apply during configured maintenance windows (default: 10 PM - 6 AM)
  - **Never**: Disable self-healing for critical services that require manual intervention
- **Coding Agent Deployment**: Automatically launches isolated coding agent pods to analyze errors and generate fixes
- **Docker Image Building**: Spins up Kubernetes Jobs using Kaniko to build and push Docker images with the fixes
- **Complete Workflow Automation**: Coordinates agent launch, monitoring, image building, and optional GitHub PR creation
- **Read-Only Agent Monitoring**: View real-time agent activity and healing progress through the UI (non-security errors only)
- **GitHub Integration**: Optionally create pull requests with fixes when GitHub credentials are configured
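
The three patching policies and the default 10 PM to 6 AM window could be evaluated roughly as follows. This is a minimal sketch, not the Sentrius implementation: `PatchingPolicyEvaluator`, `HealingDecision`, and the choice to treat the end hour as exclusive are assumptions made for illustration. Only the policy names and the 22/6 defaults come from the feature list above.

```java
import java.time.LocalTime;

// Hypothetical names throughout; this only illustrates the policy semantics.
enum PatchingPolicy { IMMEDIATE, OFF_HOURS, NEVER }

enum HealingDecision { APPLY_NOW, QUEUE_FOR_WINDOW, SKIP }

final class PatchingPolicyEvaluator {
    private final int windowStartHour; // inclusive, e.g. 22 for 10 PM
    private final int windowEndHour;   // exclusive (assumption), e.g. 6 for 6 AM

    PatchingPolicyEvaluator(int windowStartHour, int windowEndHour) {
        this.windowStartHour = windowStartHour;
        this.windowEndHour = windowEndHour;
    }

    /** Returns true when the hour is inside the window, including windows that wrap past midnight. */
    boolean isOffHours(LocalTime now) {
        int hour = now.getHour();
        if (windowStartHour <= windowEndHour) {
            // Same-day window, e.g. 1 to 5. Equal start and end yields an empty window.
            return hour >= windowStartHour && hour < windowEndHour;
        }
        // Wrapping window, e.g. 22 to 6: late evening OR early morning.
        return hour >= windowStartHour || hour < windowEndHour;
    }

    HealingDecision decide(boolean enabled, PatchingPolicy policy, LocalTime now) {
        if (!enabled) {
            return HealingDecision.SKIP;
        }
        switch (policy) {
            case IMMEDIATE:
                return HealingDecision.APPLY_NOW;
            case OFF_HOURS:
                return isOffHours(now) ? HealingDecision.APPLY_NOW : HealingDecision.QUEUE_FOR_WINDOW;
            default: // NEVER
                return HealingDecision.SKIP;
        }
    }
}
```

With the defaults, `new PatchingPolicyEvaluator(22, 6).decide(true, PatchingPolicy.OFF_HOURS, LocalTime.of(12, 0))` queues the fix, while the same call at 23:00 applies it.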

### Configuration

Self-healing can be configured through the web UI or via API:

#### Web UI Configuration

1. Navigate to **Self-Healing Configuration** (`/sso/v1/self-healing/config`)
2. Click **Add Pod Configuration** to create a new policy
3. Set the pod name, type, and patching policy using the slider control
4. Enable or disable self-healing for the pod

#### API Configuration

```bash
# Create or update a self-healing configuration
curl -X POST http://localhost:8080/api/v1/self-healing/config \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <TOKEN>" \
  -d '{
    "podName": "sentrius-api",
    "podType": "api",
    "patchingPolicy": "OFF_HOURS",
    "enabled": true
  }'

# Get all configurations
curl http://localhost:8080/api/v1/self-healing/config \
  -H "Authorization: Bearer <TOKEN>"
```
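
For callers that prefer Java over `curl`, the same create-or-update request could be built with the JDK's `java.net.http` API (Java 11+). This is a sketch under assumptions: `SelfHealingConfigRequests` and its method names are hypothetical, and the JSON is assembled with `String.format` on the assumption that the values contain no characters that need JSON escaping. The endpoint path, headers, and field names mirror the `curl` example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; only the endpoint, headers, and JSON field names come from the README.
final class SelfHealingConfigRequests {

    /** Builds the JSON body. Assumes the inputs need no JSON escaping. */
    static String configJson(String podName, String podType, String patchingPolicy, boolean enabled) {
        return String.format(
            "{\"podName\":\"%s\",\"podType\":\"%s\",\"patchingPolicy\":\"%s\",\"enabled\":%b}",
            podName, podType, patchingPolicy, enabled);
    }

    /** Builds (but does not send) the POST request for creating or updating a configuration. */
    static HttpRequest upsertConfig(String baseUrl, String token, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/self-healing/config"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Bearer " + token)
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The response format is not described here, so the sketch stops at building the request.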

#### Application Properties

Self-healing configuration is managed through Helm values and automatically populated into the ConfigMap. Update `values.yaml`:

```yaml
selfHealing:
  enabled: true
  offHours:
    start: 22 # 10 PM
    end: 6 # 6 AM
  codingAgent:
    clientId: "coding-agents"
    clientSecret: "" # Set in secrets
  agentLauncher:
    url: "http://sentrius-agents-launcherservice:8080"
  builder:
    namespace: "dev"
    image: "gcr.io/kaniko-project/executor:latest"
    timeoutSeconds: 1800
    autoBuild: true
  docker:
    registry: "" # Leave empty for local registry
  github:
    enabled: false # Auto-enabled if GitHub integration exists
    apiUrl: "https://api.github.com"
    owner: ""
    repo: ""
```

**Important**: Self-healing requires GitHub integration to be configured in the integration tokens table. The system automatically detects whether a GitHub token exists and proceeds only if one is configured. To add a GitHub integration token, navigate to the Integration Settings in the UI and add a token with `connectionType: "github"`.
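
The GitHub requirement amounts to a precondition check before any healing work starts. The sketch below illustrates that idea only: `IntegrationToken` is a hypothetical stand-in for a row in the integration tokens table, and `GitHubIntegrationGate` is not a class from the Sentrius codebase. The `"github"` connection type is the one value taken from the note above.

```java
import java.util.List;

// Hypothetical stand-in for a row in the integration tokens table.
final class IntegrationToken {
    final String connectionType;
    final String token;

    IntegrationToken(String connectionType, String token) {
        this.connectionType = connectionType;
        this.token = token;
    }
}

// Hypothetical gate: healing proceeds only when a non-empty GitHub token exists.
final class GitHubIntegrationGate {

    static boolean isGitHubConfigured(List<IntegrationToken> tokens) {
        return tokens.stream().anyMatch(t ->
            "github".equals(t.connectionType)
                && t.token != null
                && !t.token.isBlank());
    }

    /** Throws if no GitHub token is present, mirroring the "trigger fails with a message" behavior. */
    static void requireGitHub(List<IntegrationToken> tokens) {
        if (!isGitHubConfigured(tokens)) {
            throw new IllegalStateException(
                "GitHub integration is not configured; add a GitHub integration token first");
        }
    }
}
```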

### Viewing Healing Sessions

Monitor active and completed healing sessions:

1. Navigate to **Self-Healing Sessions** (`/sso/v1/self-healing/sessions`)
2. Filter by status: All, Active, or Completed
3. View detailed information about each session including:
   - Agent activity and logs
   - Security analysis results
   - Docker build status
   - GitHub PR links (if created)
   - Error details and resolution

### How It Works

The self-healing workflow consists of several automated steps:

1. **Error Detection**: The system scans the `error_output` table every 5 minutes for new errors
2. **Policy Check**: Determines if healing is enabled for the affected pod and checks the patching policy
3. **Security Analysis**: Analyzes error logs for security-related keywords
4. **Agent Launch**: If not a security concern, launches a coding agent pod to analyze and fix the error
5. **Code Repair**: The coding agent examines the error, generates fixes, and commits changes
6. **Docker Build**: A Kubernetes Job is created to build a new Docker image with the fixes using Kaniko
7. **GitHub PR**: If configured, creates a pull request with the changes
8. **Completion**: Updates the healing session with results and status

The entire workflow is asynchronous and can handle multiple concurrent healing sessions.
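
One common way to run several healing sessions concurrently in Java is to submit each one to a bounded thread pool through `CompletableFuture`. The sketch below shows that pattern only. `ConcurrentHealingSketch`, `runSession`, and the `COMPLETED`/`FAILED` strings are placeholders, and nothing here claims to match how the Sentrius orchestrator schedules its work.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical illustration of running healing sessions in parallel.
final class ConcurrentHealingSketch {

    /** Placeholder for one session; a real one would launch an agent, build an image, and open a PR. */
    static String runSession(long errorId) {
        return "error-" + errorId + ":COMPLETED";
    }

    /** Runs one session per error on a fixed-size pool and returns results in input order. */
    static List<String> healAll(List<Long> errorIds, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<String>> futures = errorIds.stream()
                .map(id -> CompletableFuture
                    .supplyAsync(() -> runSession(id), pool)
                    // A failing session is recorded instead of aborting the others.
                    .exceptionally(ex -> "error-" + id + ":FAILED"))
                .collect(Collectors.toList());
            // join() blocks until each future finishes; list order matches errorIds.
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

Bounding the pool size keeps a burst of errors from launching an unbounded number of agent pods at once, which is the main reason to prefer a fixed pool over the common fork-join pool in a sketch like this.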

### Security Considerations

The self-healing system includes built-in safety mechanisms:

- **GitHub Integration Required**: Self-healing only proceeds if a GitHub integration token is configured in the system. This ensures all fixes can be tracked via pull requests.
- **Security Analysis**: Errors containing security-related keywords (authentication, authorization, vulnerability, etc.) are flagged and require manual review before healing proceeds.
- **Restricted Visibility**: Security-flagged errors are hidden from general users until cleared by administrators.
- **Audit Trail**: All healing attempts are logged and tracked in the `self_healing_session` table.
- **Isolated Execution**: Healing agents run in isolated Kubernetes pods with limited permissions.
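
Keyword-based flagging of this kind can be as simple as a case-insensitive substring scan. The following is a minimal sketch, assuming only the three keywords named above; the real keyword list is longer ("etc.") and is not reproduced here, and `SecurityKeywordAnalyzer` is a hypothetical name.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical analyzer; the keyword list is limited to the examples the README names.
final class SecurityKeywordAnalyzer {
    private static final List<String> KEYWORDS =
        List.of("authentication", "authorization", "vulnerability");

    /** Returns true when the log mentions any keyword, ignoring case. Null logs are not flagged. */
    static boolean isSecurityConcern(String errorLog) {
        if (errorLog == null) {
            return false;
        }
        String lower = errorLog.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(lower::contains);
    }
}
```

A substring scan is deliberately conservative: it will flag some harmless errors that merely mention a keyword, which fits a design where flagged errors go to manual review rather than being healed automatically.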

### Manual Triggering

You can manually trigger self-healing for specific errors (requires GitHub integration to be configured):

1. Navigate to **Error Logs** (`/sso/v1/notifications/error/log/get`)
2. Click **Trigger Self-Healing** on any error
3. Monitor progress in the Self-Healing Sessions view

Or via API:

```bash
curl -X POST http://localhost:8080/api/v1/self-healing/trigger/{errorId} \
  -H "Authorization: Bearer <TOKEN>"
```

**Note**: If GitHub integration is not configured, the trigger will fail with a message prompting you to add a GitHub integration token first.

### Database Schema

The self-healing system uses three main tables:

- `self_healing_config`: Stores patching policies per pod/service
- `self_healing_session`: Tracks each healing attempt and its status
- `error_output`: Extended with healing status and security analysis fields

## Custom Agents

Sentrius supports both Java and Python-based custom agents that can extend the platform's functionality for monitoring, automation, and user assistance.
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
package io.sentrius.agent.launcher.api;

import io.sentrius.agent.launcher.service.DockerImageBuilderService;
import io.sentrius.sso.config.ApiPaths;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.HashMap;
import java.util.Map;

@Slf4j
@RestController
@RequestMapping(ApiPaths.API_V1 + "/builder")
public class DockerImageBuilderController {

    @Autowired
    private DockerImageBuilderService dockerImageBuilderService;

    /**
     * Trigger a Docker image build
     */
    @PostMapping("/build")
    public ResponseEntity<Map<String, Object>> buildImage(@RequestBody Map<String, Object> buildRequest) {
        try {
            Long sessionId = ((Number) buildRequest.get("sessionId")).longValue();
            String podName = (String) buildRequest.get("podName");
            String dockerfilePath = (String) buildRequest.get("dockerfilePath");
            String contextPath = (String) buildRequest.get("contextPath");

            log.info("Received Docker build request for session {} pod {}", sessionId, podName);

            String jobName = dockerImageBuilderService.buildDockerImage(
                sessionId, podName, dockerfilePath, contextPath);

            Map<String, Object> response = new HashMap<>();
            if (jobName != null) {
                response.put("success", true);
                response.put("jobName", jobName);
                return ResponseEntity.ok(response);
            } else {
                response.put("success", false);
                response.put("message", "Failed to create build job");
                return ResponseEntity.internalServerError().body(response);
            }

        } catch (Exception e) {
            log.error("Error handling build request", e);
            Map<String, Object> response = new HashMap<>();
            response.put("success", false);
            response.put("message", "Error: " + e.getMessage());
            return ResponseEntity.internalServerError().body(response);
        }
    }

    /**
     * Check build status
     */
    @GetMapping("/status")
    public ResponseEntity<Map<String, String>> getBuildStatus(@RequestParam String jobName) {
        try {
            String status = dockerImageBuilderService.checkBuildStatus(jobName);

            Map<String, String> response = new HashMap<>();
            response.put("status", status);
            response.put("jobName", jobName);

            return ResponseEntity.ok(response);

        } catch (Exception e) {
            log.error("Error getting build status for job {}", jobName, e);
            return ResponseEntity.internalServerError().build();
        }
    }

    /**
     * Get build logs
     */
    @GetMapping("/logs")
    public ResponseEntity<Map<String, String>> getBuildLogs(@RequestParam String jobName) {
        try {
            String logs = dockerImageBuilderService.getBuildLogs(jobName);

            Map<String, String> response = new HashMap<>();
            response.put("logs", logs);
            response.put("jobName", jobName);

            return ResponseEntity.ok(response);

        } catch (Exception e) {
            log.error("Error getting build logs for job {}", jobName, e);
            return ResponseEntity.internalServerError().build();
        }
    }
}
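
In `buildImage`, a request without `sessionId` hits a `NullPointerException` on the cast and is reported as a 500 through the generic `catch`. One possible follow-up, sketched here and not part of this commit, is to parse the map into a small typed value first so malformed input can be told apart from server failures. `BuildRequest` is a hypothetical class. Treating `sessionId` and `podName` as required and the two paths as optional is an assumption, since the controller passes all four through without stating which may be null.

```java
import java.util.Map;

// Hypothetical typed view of the /build request body; not part of the committed code.
final class BuildRequest {
    final long sessionId;
    final String podName;
    final String dockerfilePath; // may be null (assumed optional)
    final String contextPath;    // may be null (assumed optional)

    private BuildRequest(long sessionId, String podName, String dockerfilePath, String contextPath) {
        this.sessionId = sessionId;
        this.podName = podName;
        this.dockerfilePath = dockerfilePath;
        this.contextPath = contextPath;
    }

    /** Validates and converts the raw JSON map; throws IllegalArgumentException on bad input. */
    static BuildRequest fromMap(Map<String, Object> body) {
        Object rawId = body.get("sessionId");
        if (!(rawId instanceof Number)) {
            throw new IllegalArgumentException("sessionId must be a number");
        }
        Object rawPod = body.get("podName");
        if (!(rawPod instanceof String) || ((String) rawPod).isBlank()) {
            throw new IllegalArgumentException("podName must be a non-empty string");
        }
        return new BuildRequest(
            ((Number) rawId).longValue(),
            (String) rawPod,
            optionalString(body, "dockerfilePath"),
            optionalString(body, "contextPath"));
    }

    /** Returns null when the key is absent; rejects values that are present but not strings. */
    private static String optionalString(Map<String, Object> body, String key) {
        Object value = body.get(key);
        if (value == null) {
            return null;
        }
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(key + " must be a string");
        }
        return (String) value;
    }
}
```

A controller using this could catch `IllegalArgumentException` separately and answer with `ResponseEntity.badRequest()`, keeping the 500 path for genuine build-service failures.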
