feat: Safety Guardrails & Alignment

## Purpose

Implement explicit safety systems so that NeoKai operates within acceptable boundaries and avoids actions that are harmful even when technically correct. This is essential for AGI-level autonomy because:

- **Risk prevention**: Blocking dangerous operations before execution
- **Accountability**: Clear audit trail of decisions and actions
- **User control**: Respecting user constraints and preferences
- **Trust**: Building confidence that NeoKai won't cause harm

Without explicit guardrails, NeoKai could take technically correct but practically harmful actions.

---

## Current State

NeoKai currently:
- Relies on Claude's built-in safety
- Has no explicit safety system
- Has no action classification
- Has no approval gates for destructive operations
- Has no audit logging

Safety is implicit and not under NeoKai's control.

---

## Proposed Approach

### Phase 1: Action Classification System

1. **Action Risk Levels**
   ```typescript
   type ActionRiskLevel = 
     | 'safe'           // No significant risk
     | 'low'            // Minor risk, easily reversible
     | 'medium'         // Moderate risk, some reversibility
     | 'high'           // Significant risk, difficult to reverse
     | 'critical';      // Irreversible or high-impact
   
   interface ActionClassification {
     action: Action;
     riskLevel: ActionRiskLevel;
     riskFactors: RiskFactor[];
     affectedResources: Resource[];
     reversibility: 'fully_reversible' | 'partially_reversible' | 'irreversible';
     blastRadius: string[];  // What could be affected
   }
   ```

2. **Classification Rules**
   ```typescript
   const classificationRules = {
     safe: {
       examples: ['read_file', 'search_code', 'analyze'],
       autoApprove: true
     },
     low: {
       examples: ['create_new_file', 'add_test', 'format_code'],
       autoApprove: true,
       notifyUser: false
     },
     medium: {
       examples: ['modify_existing_file', 'add_dependency', 'create_branch'],
       autoApprove: true,
       notifyUser: true
     },
     high: {
       examples: ['delete_file', 'force_push', 'modify_config'],
       autoApprove: false,
       requireApproval: true
     },
     critical: {
       examples: ['drop_database', 'delete_production_data', 'expose_secrets'],
       autoApprove: false,
       requireExplicitApproval: true,
       requireConfirmation: 2  // Double confirm
     }
   };
   ```

3. **Risk Assessment Engine**
   ```typescript
   interface RiskAssessor {
     // Assess risk of proposed action
     assess(action: Action): Promise<ActionClassification>;
     
     // Check for compound risks (multiple actions together)
     assessCompound(actions: Action[]): Promise<CompoundRiskAssessment>;
   }
   ```
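
The risk levels and rule table above could be wired together roughly as follows. This is a minimal sketch, assuming actions are identified by a plain operation name. `OPERATION_RISK`, `classifyOperation`, `needsApproval`, and `assessCompoundRisk` are hypothetical names, and both the fail-closed default for unknown operations and the "three medium actions escalate to high" rule are assumptions made for illustration, not part of the proposal.

```typescript
type RiskLevel = 'safe' | 'low' | 'medium' | 'high' | 'critical';

// Ordered from least to most risky so levels can be compared by index.
const RISK_ORDER: RiskLevel[] = ['safe', 'low', 'medium', 'high', 'critical'];

// Hypothetical lookup table built from the examples in the classification rules.
const OPERATION_RISK: Record<string, RiskLevel> = {
  read_file: 'safe',
  search_code: 'safe',
  create_new_file: 'low',
  format_code: 'low',
  modify_existing_file: 'medium',
  add_dependency: 'medium',
  delete_file: 'high',
  force_push: 'high',
  drop_database: 'critical',
};

// Assumption: unknown operations fail closed and are treated as high risk.
function classifyOperation(operation: string): RiskLevel {
  return OPERATION_RISK[operation] ?? 'high';
}

// 'high' and 'critical' are the levels whose rules set autoApprove to false.
function needsApproval(level: RiskLevel): boolean {
  return RISK_ORDER.indexOf(level) >= RISK_ORDER.indexOf('high');
}

// Assumption: a batch is at least as risky as its riskiest member, and
// three or more medium-risk actions together are escalated to high.
function assessCompoundRisk(operations: string[]): RiskLevel {
  const levels = operations.map(classifyOperation);
  let worst: RiskLevel = 'safe';
  for (const level of levels) {
    if (RISK_ORDER.indexOf(level) > RISK_ORDER.indexOf(worst)) worst = level;
  }
  const mediumCount = levels.filter((l) => l === 'medium').length;
  if (worst === 'medium' && mediumCount >= 3) return 'high';
  return worst;
}
```

Failing closed on unknown operations trades some friction for safety: a new action type must be added to the table before it can be auto-approved.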

### Phase 2: Constraint System

1. **Constraint Types**
   ```typescript
   type ConstraintType = 
     | 'file_pattern'     // Don't touch these files
     | 'operation_type'   // Don't do these operations
     | 'resource_limit'   // Stay within these limits
     | 'time_window'      // Only operate during these times
     | 'approval_gate'    // Require approval for these
     | 'rollback_plan';   // Must have rollback for these
   
   interface Constraint {
     id: string;
     type: ConstraintType;
     description: string;
     rule: ConstraintRule;
     severity: 'warning' | 'block' | 'escalate';
     override: boolean;  // Whether an authorized user (e.g., an admin) can override it
   }
   ```

2. **Built-in Constraints**
   ```typescript
   const builtinConstraints = {
     // Never modify these files
     protectedFiles: {
       patterns: ['.env', '*.key', '*.pem', 'credentials.*'],
       severity: 'block',
       override: false
     },
     
     // Block destructive operations on production branches (admin can override)
     productionProtection: {
       patterns: ['main', 'master', 'production'],
       operations: ['force_push', 'delete_branch'],
       severity: 'block',
       override: true  // Admin can override
     },
     
     // Don't expose secrets
     secretExposure: {
       patterns: ['api_key', 'password', 'token', 'secret'],
       operations: ['commit', 'push', 'log'],
       severity: 'block',
       override: false
     },
     
     // Rate limits
     rateLimits: {
       operations: {
         'file_delete': 10,      // Max 10 deletes per session
         'git_force_push': 1,     // Max 1 force push per session
         'dependency_add': 5      // Max 5 dependency additions
       },
       severity: 'warning',
       override: true
     }
   };
   ```

3. **Constraint Checker**
   ```typescript
   interface ConstraintChecker {
     // Check if action violates constraints
     check(action: Action): Promise<ConstraintResult>;
     
     // Get applicable constraints
     getApplicable(action: Action): Constraint[];
   }
   
   interface ConstraintResult {
     passes: boolean;
     violatedConstraints: ConstraintViolation[];
     warnings: ConstraintWarning[];
   }
   ```
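
As an illustration of how the `protectedFiles` patterns might be evaluated, here is a minimal sketch of a path check. It assumes patterns are simple globs where only `*` is special and that matching is done against the file name rather than the full path. `FileConstraint`, `CheckResult`, `globToRegExp`, and `checkPath` are hypothetical names, not an existing NeoKai API.

```typescript
type Severity = 'warning' | 'block' | 'escalate';

interface FileConstraint {
  id: string;
  patterns: string[]; // Simple globs; only `*` is supported in this sketch
  severity: Severity;
}

interface CheckResult {
  passes: boolean;
  violated: string[]; // IDs of matching constraints with severity 'block' or 'escalate'
  warnings: string[]; // IDs of matching constraints with severity 'warning'
}

// Escape regex metacharacters, then turn each `*` into `.*` and anchor both ends.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Match against the file name only, so `*.key` also catches `config/server.key`.
function baseName(path: string): string {
  return path.slice(path.lastIndexOf('/') + 1);
}

function checkPath(path: string, constraints: FileConstraint[]): CheckResult {
  const result: CheckResult = { passes: true, violated: [], warnings: [] };
  const file = baseName(path);
  for (const constraint of constraints) {
    const hit = constraint.patterns.some((p) => globToRegExp(p).test(file));
    if (!hit) continue;
    if (constraint.severity === 'warning') {
      result.warnings.push(constraint.id); // Warnings do not fail the check
    } else {
      result.violated.push(constraint.id);
      result.passes = false;
    }
  }
  return result;
}

const protectedFiles: FileConstraint = {
  id: 'protectedFiles',
  patterns: ['.env', '*.key', '*.pem', 'credentials.*'],
  severity: 'block',
};
```

With these definitions, `checkPath('config/server.key', [protectedFiles])` reports a blocking violation, while `checkPath('src/index.ts', [protectedFiles])` passes. Under this matching rule `.env.local` would not match the `.env` pattern as written, so the default pattern list may need a `.env.*` entry.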

### Phase 3: Approval Gates

1. **Approval Workflow**
   ```typescript
   interface ApprovalGate {
     // Request approval for action
     requestApproval(
       action: Action,
       classification: ActionClassification
     ): Promise<ApprovalRequest>;
     
     // Process approval response
     processResponse(
       requestId: string,
       response: ApprovalResponse
     ): Promise<ApprovalResult>;
   }
   
   interface ApprovalRequest {
     id: string;
     action: Action;
     classification: ActionClassification;
     justification: string;  // Why this action is needed
     alternatives: Alternative[];  // Safer alternatives if any
     expiresAt: Date;
   }
   ```

2. **Approval UI**
   ```typescript
   interface ApprovalPresenter {
     // Format approval request for user
     format(request: ApprovalRequest): ApprovalUI;
   }
   
   // Example approval request UI:
   const exampleApprovalUI = {
     summary: "Delete 3 files in src/auth/",
     risk: "HIGH - Irreversible operation",
     justification: "These files are no longer used after refactoring",
     files: ["src/auth/legacy-oauth.ts", "src/auth/old-session.ts", "src/auth/deprecated.ts"],
     alternatives: [
       "Move to archive/ instead of deleting",
       "Soft delete by renaming with .bak extension"
     ],
     actions: ['Approve', 'Reject', 'Approve with modifications', 'Request more info']
   };
   ```

3. **Approval Policies**
   ```typescript
   interface ApprovalPolicy {
     // Who can approve what
     approvals: {
       high: ['user', 'admin'],
       critical: ['admin'],  // Only admin can approve critical
     };
     
     // Timeout behavior
     timeout: {
       duration: Duration;
       defaultAction: 'reject' | 'escalate';
     };
     
     // Audit requirements
     auditLog: boolean;
   }
   ```
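
To make the interaction between approver roles and timeouts concrete, the sketch below resolves a single approval response against a policy. It is an illustration under stated assumptions: `PendingApproval`, `Policy`, and `resolveApproval` are hypothetical names, and the choice to escalate (rather than reject) when a responder lacks authority for the risk level is an assumption the proposal does not specify.

```typescript
type Role = 'user' | 'admin';
type GatedLevel = 'high' | 'critical';
type Verdict = 'approved' | 'rejected' | 'escalated';

interface PendingApproval {
  id: string;
  level: GatedLevel;
  expiresAt: Date;
}

interface Policy {
  approvers: Record<GatedLevel, Role[]>;
  timeoutAction: 'reject' | 'escalate';
}

const defaultPolicy: Policy = {
  approvers: { high: ['user', 'admin'], critical: ['admin'] },
  timeoutAction: 'reject',
};

// Resolve a response against the policy. `now` is passed in so the
// expiry check is deterministic and easy to test.
function resolveApproval(
  pending: PendingApproval,
  responder: Role,
  approve: boolean,
  policy: Policy,
  now: Date,
): Verdict {
  // Expired requests never execute; they fall back to the timeout action.
  if (now.getTime() > pending.expiresAt.getTime()) {
    return policy.timeoutAction === 'reject' ? 'rejected' : 'escalated';
  }
  // Assumption: a responder without authority for this level cannot decide,
  // so the request is escalated to someone who can.
  if (policy.approvers[pending.level].indexOf(responder) === -1) {
    return 'escalated';
  }
  return approve ? 'approved' : 'rejected';
}
```

Checking expiry before anything else means a late "approve" click can never execute a stale request, which matters when the underlying files or branches may have changed since the request was shown.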

### Phase 4: Rollback Planning

1. **Rollback Requirements**
   ```typescript
   interface RollbackPlanner {
     // Create rollback plan for action
     createPlan(action: Action): Promise<RollbackPlan>;
     
     // Verify rollback is possible
     verifyPossible(action: Action): Promise<boolean>;
     
     // Execute rollback
     execute(plan: RollbackPlan): Promise<RollbackResult>;
   }
   
   interface RollbackPlan {
     action: Action;
     rollbackSteps: RollbackStep[];
     verificationSteps: VerificationStep[];
     estimatedTime: Duration;
     successProbability: number;
   }
   ```

2. **Rollback Requirement Rules**
   ```typescript
   const rollbackRequirements = {
     // Require rollback plan for:
     requireFor: [
       'database_migrations',
       'production_deployments',
       'breaking_api_changes',
       'mass_file_operations'
     ],
     
     // Skip rollback plan for:
     skipFor: [
       'read_only_operations',
       'non_production_environments',
       'fully_reversible_changes'
     ]
   };
   ```
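
The `requireFor` and `skipFor` lists can overlap (for example, a database migration in a staging environment), and the proposal does not say which list wins. The sketch below shows one possible resolution order. `PlannedAction` and `requiresRollbackPlan` are hypothetical, and the precedence chosen here (read-only always skips, `requireFor` categories always require, everything else depends on environment and reversibility) is an assumption for discussion.

```typescript
type Reversibility = 'fully_reversible' | 'partially_reversible' | 'irreversible';

interface PlannedAction {
  category: string; // e.g. 'database_migrations'
  environment: 'production' | 'staging' | 'development';
  readOnly: boolean;
  reversibility: Reversibility;
}

// Categories taken from the requireFor list above.
const REQUIRE_FOR: string[] = [
  'database_migrations',
  'production_deployments',
  'breaking_api_changes',
  'mass_file_operations',
];

function requiresRollbackPlan(action: PlannedAction): boolean {
  // Read-only operations change nothing, so there is nothing to roll back.
  if (action.readOnly) return false;
  // Assumption: listed categories require a plan even outside production.
  if (REQUIRE_FOR.indexOf(action.category) !== -1) return true;
  // Remaining actions in non-production environments are skipped.
  if (action.environment !== 'production') return false;
  // Assumption: in production, anything not fully reversible needs a plan.
  return action.reversibility !== 'fully_reversible';
}
```

Under this ordering a staging database migration still needs a rollback plan, which may or may not be the intended behavior and is worth settling before implementation.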

### Phase 5: Audit Logging

1. **Audit Events**
   ```typescript
   interface AuditEvent {
     id: string;
     timestamp: Date;
     
     // Actor
     actor: 'neoKai' | 'user';
     sessionId: string;
     
     // Action
     action: Action;
     classification: ActionClassification;
     
     // Decision
     decision: 'approved' | 'rejected' | 'modified' | 'escalated';
     approver?: string;
     
     // Outcome
     outcome: 'success' | 'failed' | 'rolled_back';
     result?: unknown;
     
     // Context
     constraints: Constraint[];
     rollbackPlan?: RollbackPlan;
   }
   ```

2. **Audit Logger**
   ```typescript
   interface AuditLogger {
     // Log audit event
     log(event: AuditEvent): void;
     
     // Query audit log
     query(filters: AuditFilter): Promise<AuditEvent[]>;
     
     // Generate audit report
     report(options: ReportOptions): Promise<AuditReport>;
   }
   ```

3. **Audit Retention**
   ```typescript
   interface AuditRetention {
     defaultRetention: Duration;  // e.g., 90 days
     criticalRetention: Duration; // e.g., 1 year
     exportFormats: Array<'json' | 'csv'>;
   }
   ```
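
A minimal in-memory sketch of the logger and retention rules may help when discussing storage requirements. It uses a flattened, hypothetical `AuditRecord` instead of the full `AuditEvent`, and it assumes `criticalRetention` applies to events whose risk level is `critical`. `InMemoryAuditLog` and its `prune` method are illustrative names; a real implementation would need durable, tamper-resistant storage.

```typescript
type Risk = 'safe' | 'low' | 'medium' | 'high' | 'critical';
const RISKS: Risk[] = ['safe', 'low', 'medium', 'high', 'critical'];

// Hypothetical, flattened stand-in for the AuditEvent interface above.
interface AuditRecord {
  id: string;
  timestamp: Date;
  actor: 'neoKai' | 'user';
  operation: string;
  riskLevel: Risk;
  decision: 'approved' | 'rejected' | 'modified' | 'escalated';
}

const DAY_MS = 24 * 60 * 60 * 1000;

class InMemoryAuditLog {
  private records: AuditRecord[] = [];

  // Store a frozen shallow copy so later mutation of the caller's object
  // does not alter what was logged.
  log(record: AuditRecord): void {
    this.records.push(Object.freeze({ ...record }));
  }

  // Filter by actor and/or a minimum risk level; an empty filter returns everything.
  query(filter: { actor?: AuditRecord['actor']; minRisk?: Risk }): AuditRecord[] {
    return this.records.filter((r) => {
      if (filter.actor !== undefined && r.actor !== filter.actor) return false;
      if (
        filter.minRisk !== undefined &&
        RISKS.indexOf(r.riskLevel) < RISKS.indexOf(filter.minRisk)
      ) {
        return false;
      }
      return true;
    });
  }

  // Drop records older than their retention window; returns how many were removed.
  prune(now: Date, defaultDays: number, criticalDays: number): number {
    const before = this.records.length;
    this.records = this.records.filter((r) => {
      const days = r.riskLevel === 'critical' ? criticalDays : defaultDays;
      return now.getTime() - r.timestamp.getTime() <= days * DAY_MS;
    });
    return before - this.records.length;
  }
}
```

Calling `prune(now, 90, 365)` mirrors the example retention values in the comments above: ordinary events are dropped after 90 days while critical ones are kept for a year.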

### Phase 6: Value Alignment Verification

1. **Alignment Checks**
   ```typescript
   interface AlignmentChecker {
     // Check if action aligns with user/project values
     check(action: Action): Promise<AlignmentResult>;
   }
   
   interface AlignmentResult {
     aligned: boolean;
     conflicts: AlignmentConflict[];
     recommendations: string[];
   }
   
   interface AlignmentConflict {
     value: string;  // e.g., "security", "privacy", "user_experience"
     conflict: string;
     severity: 'minor' | 'moderate' | 'major';
   }
   ```

2. **Value Specification**
   ```typescript
   interface ValueSpecification {
     // User-defined values
     values: {
       security: 'high_priority',
       performance: 'medium_priority',
       backward_compatibility: 'high_priority',
       code_cleanliness: 'medium_priority'
     };
     
     // Derived from project context
     inferredValues: {
       test_coverage: 'required',
       documentation: 'encouraged',
       breaking_changes: 'avoid'
     };
   }
   ```
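
One way to connect value priorities to conflict severity is sketched below. It assumes each proposed change already carries a list of values it is believed to work against; how NeoKai would infer that list is the hard part and is not addressed here. `ProposedChange`, `checkAlignment`, the `low_priority` level, and the priority-to-severity mapping are all hypothetical.

```typescript
type Priority = 'high_priority' | 'medium_priority' | 'low_priority';
type ConflictSeverity = 'minor' | 'moderate' | 'major';

interface ValueConflict {
  value: string;
  conflict: string;
  severity: ConflictSeverity;
}

interface ProposedChange {
  description: string;
  // Hypothetical: values the change is believed to work against,
  // e.g. ['backward_compatibility'].
  impacts: string[];
}

// Assumption: the more important the value, the more severe the conflict.
const SEVERITY_BY_PRIORITY: Record<Priority, ConflictSeverity> = {
  high_priority: 'major',
  medium_priority: 'moderate',
  low_priority: 'minor',
};

function checkAlignment(
  change: ProposedChange,
  values: Record<string, Priority | undefined>,
): { aligned: boolean; conflicts: ValueConflict[] } {
  const conflicts: ValueConflict[] = [];
  for (const value of change.impacts) {
    const priority = values[value];
    if (priority === undefined) continue; // Value not declared, so nothing to conflict with
    conflicts.push({
      value,
      conflict: `"${change.description}" works against ${value}`,
      severity: SEVERITY_BY_PRIORITY[priority],
    });
  }
  // Assumption: only major conflicts make a change misaligned; lesser
  // conflicts are reported but do not fail the check.
  return { aligned: conflicts.every((c) => c.severity !== 'major'), conflicts };
}
```

For example, with `backward_compatibility` set to `high_priority`, a change that removes a public endpoint would come back as not aligned with one major conflict, while a change that only affects a `medium_priority` value would be reported but still pass.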

---

## Technical Considerations

### Performance Impact
- Minimizing overhead of safety checks
- Caching classification results
- Parallel constraint checking

### User Experience
- Not creating too much friction
- Clear communication about why actions are blocked
- Easy override process for legitimate cases

### Completeness
- Covering all action types
- Handling edge cases
- Keeping constraints up to date

### Audit Scalability
- Handling large audit logs
- Efficient querying
- Archival strategies

---

## Success Metrics

1. **Safety Incidents**: Number of harmful actions blocked before execution, tracked alongside the number that executed despite the guardrails
2. **False Positive Rate**: % of blocked actions that were actually safe
3. **Approval Latency**: Time from request to decision
4. **Audit Completeness**: % of actions that are logged

---

## Implementation Roadmap

1. **Phase 1**: Action classification system
2. **Phase 2**: Basic constraint system
3. **Phase 3**: Approval gates
4. **Phase 4**: Rollback planning
5. **Phase 5**: Audit logging
6. **Phase 6**: Value alignment verification

---

## Questions for Discussion

1. What should the default constraint set be?
2. How to balance safety with productivity?
3. Should all actions be logged or only high-risk ones?
4. How to handle emergencies where normal safety should be bypassed?

---

*Part of the AGI-Level Autonomy initiative*
