feat: Safety Guardrails & Alignment #125

@lsm

Purpose

Implement explicit safety systems so that NeoKai operates within acceptable boundaries and avoids actions that could be harmful even when technically correct. This is essential for AGI-level autonomy because it provides:

  • Risk prevention: Blocking dangerous operations before execution
  • Accountability: Clear audit trail of decisions and actions
  • User control: Respecting user constraints and preferences
  • Trust: Building confidence that NeoKai won't cause harm

Without explicit guardrails, NeoKai could take technically correct but practically harmful actions.


Current State

NeoKai currently:

  • Relies on Claude's built-in safety
  • Has no explicit safety system
  • Has no action classification
  • Has no approval gates for destructive operations
  • Has no audit logging

Safety is implicit and not under NeoKai's control.


Proposed Approach

Phase 1: Action Classification System

  1. Action Risk Levels

    type ActionRiskLevel = 
      | 'safe'           // No significant risk
      | 'low'            // Minor risk, easily reversible
      | 'medium'         // Moderate risk, some reversibility
      | 'high'           // Significant risk, difficult to reverse
      | 'critical';      // Irreversible or high-impact
    
    interface ActionClassification {
      action: Action;
      riskLevel: ActionRiskLevel;
      riskFactors: RiskFactor[];
      affectedResources: Resource[];
      reversibility: 'fully_reversible' | 'partially_reversible' | 'irreversible';
      blastRadius: string[];  // What could be affected
    }

  2. Classification Rules

    const classificationRules = {
      safe: {
        examples: ['read_file', 'search_code', 'analyze'],
        autoApprove: true
      },
      low: {
        examples: ['create_new_file', 'add_test', 'format_code'],
        autoApprove: true,
        notifyUser: false
      },
      medium: {
        examples: ['modify_existing_file', 'add_dependency', 'create_branch'],
        autoApprove: true,
        notifyUser: true
      },
      high: {
        examples: ['delete_file', 'force_push', 'modify_config'],
        autoApprove: false,
        requireApproval: true
      },
      critical: {
        examples: ['drop_database', 'delete_production_data', 'expose_secrets'],
        autoApprove: false,
        requireExplicitApproval: true,
        requireConfirmation: 2  // Double confirm
      }
    };

  3. Risk Assessment Engine

    interface RiskAssessor {
      // Assess risk of proposed action
      assess(action: Action): Promise<ActionClassification>;
      
      // Check for compound risks (multiple actions together)
      assessCompound(actions: Action[]): Promise<CompoundRiskAssessment>;
    }
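
As a rough illustration of how the risk levels and classification rules above could fit together, here is a minimal sketch. The lookup table, the `classify` / `classifyCompound` / `needsApproval` helpers, the fail-closed default for unknown actions, and the "riskiest member" rule for batches are all assumptions made for the example, not part of the proposal.

```typescript
// Hypothetical sketch: names below are illustrative, not an existing NeoKai API.
type ActionRiskLevel = 'safe' | 'low' | 'medium' | 'high' | 'critical';

// Ordered from least to most risky so two levels can be compared by index.
const RISK_ORDER: ActionRiskLevel[] = ['safe', 'low', 'medium', 'high', 'critical'];

// A few of the example action names from the classification rules above.
const RISK_BY_ACTION = new Map<string, ActionRiskLevel>([
  ['read_file', 'safe'],
  ['search_code', 'safe'],
  ['create_new_file', 'low'],
  ['modify_existing_file', 'medium'],
  ['delete_file', 'high'],
  ['force_push', 'high'],
  ['drop_database', 'critical'],
]);

// Assumption: unknown actions fail closed and are treated as 'high'.
function classify(actionName: string): ActionRiskLevel {
  return RISK_BY_ACTION.get(actionName) ?? 'high';
}

// Assumption: a batch is at least as risky as its riskiest member.
function classifyCompound(actionNames: string[]): ActionRiskLevel {
  let worst: ActionRiskLevel = 'safe';
  for (const name of actionNames) {
    const level = classify(name);
    if (RISK_ORDER.indexOf(level) > RISK_ORDER.indexOf(worst)) {
      worst = level;
    }
  }
  return worst;
}

// 'high' and 'critical' are the levels with autoApprove: false in the rules above.
function needsApproval(level: ActionRiskLevel): boolean {
  return level === 'high' || level === 'critical';
}
```

Taking the maximum is only a lower bound for compound risk. A real `assessCompound` would probably also need to detect combinations that are riskier together than any single action.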

Phase 2: Constraint System

  1. Constraint Types

    type ConstraintType = 
      | 'file_pattern'     // Don't touch these files
      | 'operation_type'   // Don't do these operations
      | 'resource_limit'   // Stay within these limits
      | 'time_window'      // Only operate during these times
      | 'approval_gate'    // Require approval for these
      | 'rollback_plan';   // Must have rollback for these
    
    interface Constraint {
      id: string;
      type: ConstraintType;
      description: string;
      rule: ConstraintRule;
      severity: 'warning' | 'block' | 'escalate';
      override: boolean;  // Can be overridden by user
    }

  2. Built-in Constraints

    const builtinConstraints = {
      // Never modify these files
      protectedFiles: {
        patterns: ['.env', '*.key', '*.pem', 'credentials.*'],
        severity: 'block',
        override: false
      },
      
      // Require approval for production changes
      productionProtection: {
        patterns: ['main', 'master', 'production'],
        operations: ['force_push', 'delete_branch'],
        severity: 'block',
        override: true  // Admin can override
      },
      
      // Don't expose secrets
      secretExposure: {
        patterns: ['api_key', 'password', 'token', 'secret'],
        operations: ['commit', 'push', 'log'],
        severity: 'block',
        override: false
      },
      
      // Rate limits
      rateLimits: {
        operations: {
          'delete_file': 10,     // Max 10 deletes per session
          'force_push': 1,       // Max 1 force push per session
          'add_dependency': 5    // Max 5 dependency additions per session
        },
        severity: 'warning',
        override: true
      }
    };

  3. Constraint Checker

    interface ConstraintChecker {
      // Check if action violates constraints
      check(action: Action): Promise<ConstraintResult>;
      
      // Get applicable constraints
      getApplicable(action: Action): Constraint[];
    }
    
    interface ConstraintResult {
      passes: boolean;
      violatedConstraints: ConstraintViolation[];
      warnings: ConstraintWarning[];
    }
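
To make the constraint check more concrete, here is a minimal sketch that evaluates the `protectedFiles` patterns against a file path. The `FileConstraint` and `CheckResult` shapes, the `globToRegExp` helper (which supports only `*`), and the choice to match against the file's base name are assumptions for illustration only.

```typescript
// Hypothetical sketch: simplified shapes, not the proposed Constraint interface.
type Severity = 'warning' | 'block' | 'escalate';

interface FileConstraint {
  id: string;
  patterns: string[];
  severity: Severity;
  override: boolean;
}

interface CheckResult {
  passes: boolean;
  violated: string[];  // ids of constraints that block or escalate
  warnings: string[];  // ids of constraints that only warn
}

// Convert a glob that uses only `*` into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Assumption: patterns are matched against the last path segment only.
function baseName(path: string): string {
  return path.slice(path.lastIndexOf('/') + 1);
}

function checkFile(path: string, constraints: FileConstraint[]): CheckResult {
  const result: CheckResult = { passes: true, violated: [], warnings: [] };
  const name = baseName(path);
  for (const c of constraints) {
    const hit = c.patterns.some((p) => globToRegExp(p).test(name));
    if (!hit) continue;
    if (c.severity === 'warning') {
      result.warnings.push(c.id);
    } else {
      result.violated.push(c.id);
      result.passes = false;
    }
  }
  return result;
}

// The protectedFiles entry from the built-in constraints above.
const protectedFiles: FileConstraint = {
  id: 'protectedFiles',
  patterns: ['.env', '*.key', '*.pem', 'credentials.*'],
  severity: 'block',
  override: false,
};
```

With whole-name matching as sketched here, `.env.local` would not be caught by the `.env` pattern, so the default pattern list may need entries such as `.env.*` depending on the matching semantics that are eventually chosen.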

Phase 3: Approval Gates

  1. Approval Workflow

    interface ApprovalGate {
      // Request approval for action
      requestApproval(
        action: Action,
        classification: ActionClassification
      ): Promise<ApprovalRequest>;
      
      // Process approval response
      processResponse(
        requestId: string,
        response: ApprovalResponse
      ): Promise<ApprovalResult>;
    }
    
    interface ApprovalRequest {
      id: string;
      action: Action;
      classification: ActionClassification;
      justification: string;  // Why this action is needed
      alternatives: Alternative[];  // Safer alternatives if any
      expiresAt: Date;
    }

  2. Approval UI

    interface ApprovalPresenter {
      // Format approval request for user
      format(request: ApprovalRequest): ApprovalUI;
    }
    
    // Example approval request UI:
    const exampleApprovalUI = {
      summary: "Delete 3 files in src/auth/",
      risk: "HIGH - Irreversible operation",
      justification: "These files are no longer used after refactoring",
      files: ["src/auth/legacy-oauth.ts", "src/auth/old-session.ts", "src/auth/deprecated.ts"],
      alternatives: [
        "Move to archive/ instead of deleting",
        "Soft delete by renaming with .bak extension"
      ],
      actions: ['Approve', 'Reject', 'Approve with modifications', 'Request more info']
    };

  3. Approval Policies

    type ApproverRole = 'user' | 'admin';

    interface ApprovalPolicy {
      // Who can approve what, e.g. { high: ['user', 'admin'], critical: ['admin'] }
      approvals: {
        high: ApproverRole[];
        critical: ApproverRole[];  // e.g. only admin can approve critical
      };
      
      // Timeout behavior
      timeout: {
        duration: Duration;
        defaultAction: 'reject' | 'escalate';
      };
      
      // Audit requirements
      auditLog: boolean;
    }
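
The interaction between approver roles and the timeout default could look roughly like the sketch below. `resolveApproval`, the simplified `PendingApproval` and `ApprovalResponse` shapes, passing `now` explicitly instead of using timers, and escalating when an unauthorized role responds are all assumptions made for the example.

```typescript
// Hypothetical sketch: simplified shapes, not the proposed ApprovalGate API.
type Decision = 'approved' | 'rejected' | 'escalated' | 'pending';
type TimeoutAction = 'reject' | 'escalate';
type Role = 'user' | 'admin';
type GatedLevel = 'high' | 'critical';

interface PendingApproval {
  id: string;
  riskLevel: GatedLevel;
  expiresAt: Date;
}

interface ApprovalResponse {
  approverRole: Role;
  approve: boolean;
}

// Mirrors the example policy: users may approve 'high', only admins 'critical'.
const ALLOWED_APPROVERS: Record<GatedLevel, Role[]> = {
  high: ['user', 'admin'],
  critical: ['admin'],
};

function resolveApproval(
  request: PendingApproval,
  response: ApprovalResponse | undefined,
  now: Date,
  onTimeout: TimeoutAction,
): Decision {
  // An expired request never executes; it falls back to the policy default.
  if (now.getTime() >= request.expiresAt.getTime()) {
    return onTimeout === 'reject' ? 'rejected' : 'escalated';
  }
  if (response === undefined) {
    return 'pending';
  }
  // Assumption: a response from a role without authority is escalated.
  if (!ALLOWED_APPROVERS[request.riskLevel].includes(response.approverRole)) {
    return 'escalated';
  }
  return response.approve ? 'approved' : 'rejected';
}
```

Passing `now` in as a parameter keeps the decision logic deterministic and easy to test. A real gate would still need a scheduler or timer to trigger the timeout path.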

Phase 4: Rollback Planning

  1. Rollback Requirements

    interface RollbackPlanner {
      // Create rollback plan for action
      createPlan(action: Action): Promise<RollbackPlan>;
      
      // Verify rollback is possible
      verifyPossible(action: Action): Promise<boolean>;
      
      // Execute rollback
      execute(plan: RollbackPlan): Promise<RollbackResult>;
    }
    
    interface RollbackPlan {
      action: Action;
      rollbackSteps: RollbackStep[];
      verificationSteps: VerificationStep[];
      estimatedTime: Duration;
      successProbability: number;
    }

  2. Rollback Requirement Rules

    const rollbackRequirements = {
      // Require rollback plan for:
      requireFor: [
        'database_migrations',
        'production_deployments',
        'breaking_api_changes',
        'mass_file_operations'
      ],
      
      // Skip rollback plan for:
      skipFor: [
        'read_only_operations',
        'non_production_environments',
        'fully_reversible_changes'
      ]
    };
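
The `requireFor` and `skipFor` lists can overlap (for example, a database migration in a non-production environment), and the rules above do not say which list wins. The sketch below assumes the skip rules take precedence and that unlisted categories need no plan. Both choices, along with the `PlannedAction` shape and the `requiresRollbackPlan` name, are assumptions for illustration.

```typescript
// Hypothetical sketch: field names are illustrative only.
type Reversibility = 'fully_reversible' | 'partially_reversible' | 'irreversible';

interface PlannedAction {
  category: string;
  environment: 'production' | 'staging' | 'development';
  readOnly: boolean;
  reversibility: Reversibility;
}

// Categories from the requireFor list above.
const REQUIRE_FOR: string[] = [
  'database_migrations',
  'production_deployments',
  'breaking_api_changes',
  'mass_file_operations',
];

function requiresRollbackPlan(action: PlannedAction): boolean {
  // Assumption: skip rules are evaluated first and take precedence.
  if (action.readOnly) return false;
  if (action.environment !== 'production') return false;
  if (action.reversibility === 'fully_reversible') return false;
  // Assumption: categories outside requireFor do not need a plan.
  return REQUIRE_FOR.includes(action.category);
}
```

If the opposite precedence is preferred (for instance, always requiring a plan for migrations, even in staging), only the order of the checks changes.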

Phase 5: Audit Logging

  1. Audit Events

    interface AuditEvent {
      id: string;
      timestamp: Date;
      
      // Actor
      actor: 'neoKai' | 'user';
      sessionId: string;
      
      // Action
      action: Action;
      classification: ActionClassification;
      
      // Decision
      decision: 'approved' | 'rejected' | 'modified' | 'escalated';
      approver?: string;
      
      // Outcome
      outcome: 'success' | 'failed' | 'rolled_back';
      result?: unknown;  // Narrow before use; avoids untyped access via `any`
      
      // Context
      constraints: Constraint[];
      rollbackPlan?: RollbackPlan;
    }

  2. Audit Logger

    interface AuditLogger {
      // Log audit event
      log(event: AuditEvent): void;
      
      // Query audit log
      query(filters: AuditFilter): Promise<AuditEvent[]>;
      
      // Generate audit report
      report(options: ReportOptions): Promise<AuditReport>;
    }

  3. Audit Retention

    interface AuditRetention {
      defaultRetention: Duration;  // e.g., 90 days
      criticalRetention: Duration; // e.g., 1 year
      exportFormats: Array<'json' | 'csv'>;
    }
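
A minimal in-memory sketch of logging, querying, and retention is shown below. The `InMemoryAuditLog` class, the trimmed-down `AuditRecord` shape, the `AuditFilter` fields, and expressing retention in days are assumptions for illustration. A real logger would presumably write to durable, append-only storage.

```typescript
// Hypothetical sketch: a trimmed-down record, not the full AuditEvent above.
type RiskLevel = 'safe' | 'low' | 'medium' | 'high' | 'critical';

interface AuditRecord {
  id: string;
  timestamp: Date;
  actor: 'neoKai' | 'user';
  actionName: string;
  riskLevel: RiskLevel;
  decision: 'approved' | 'rejected' | 'modified' | 'escalated';
}

interface AuditFilter {
  actor?: 'neoKai' | 'user';
  riskLevel?: RiskLevel;
  since?: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

class InMemoryAuditLog {
  private events: AuditRecord[] = [];

  // Store a shallow copy so later changes to the caller's object do not alter the log.
  log(event: AuditRecord): void {
    this.events.push({ ...event });
  }

  // Every filter field is optional; an empty filter returns all records.
  query(filter: AuditFilter): AuditRecord[] {
    return this.events.filter((e) => {
      if (filter.actor !== undefined && e.actor !== filter.actor) return false;
      if (filter.riskLevel !== undefined && e.riskLevel !== filter.riskLevel) return false;
      if (filter.since !== undefined && e.timestamp.getTime() < filter.since.getTime()) return false;
      return true;
    });
  }

  // Drop records older than their retention window; returns how many were removed.
  prune(now: Date, defaultDays: number, criticalDays: number): number {
    const before = this.events.length;
    this.events = this.events.filter((e) => {
      const days = e.riskLevel === 'critical' ? criticalDays : defaultDays;
      return now.getTime() - e.timestamp.getTime() < days * DAY_MS;
    });
    return before - this.events.length;
  }
}
```

With the example retention values from the comments above (90 days by default, one year for critical events), a 100-day-old `critical` record would survive pruning while a 100-day-old `low` record would not.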

Phase 6: Value Alignment Verification

  1. Alignment Checks

    interface AlignmentChecker {
      // Check if action aligns with user/project values
      check(action: Action): Promise<AlignmentResult>;
    }
    
    interface AlignmentResult {
      aligned: boolean;
      conflicts: AlignmentConflict[];
      recommendations: string[];
    }
    
    interface AlignmentConflict {
      value: string;  // e.g., "security", "privacy", "user_experience"
      conflict: string;
      severity: 'minor' | 'moderate' | 'major';
    }

  2. Value Specification

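    type ValuePriority = 'high_priority' | 'medium_priority';
    type ValueExpectation = 'required' | 'encouraged' | 'avoid';

    interface ValueSpecification {
      // User-defined values, e.g.
      //   security: 'high_priority', performance: 'medium_priority',
      //   backward_compatibility: 'high_priority', code_cleanliness: 'medium_priority'
      values: Record<string, ValuePriority>;
      
      // Derived from project context, e.g.
      //   test_coverage: 'required', documentation: 'encouraged', breaking_changes: 'avoid'
      inferredValues: Record<string, ValueExpectation>;
    }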
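
One possible way to turn a value specification into an `AlignmentResult` is sketched below. It assumes some upstream analysis has already tagged which values a change may harm (the hypothetical `mayHarm` field), maps `high_priority` values to `major` conflicts and `medium_priority` values to `moderate` ones, and treats only `major` conflicts as misalignment. None of these choices come from the proposal itself.

```typescript
// Hypothetical sketch: the mapping from priority to severity is an assumption.
type Priority = 'high_priority' | 'medium_priority';
type ConflictSeverity = 'minor' | 'moderate' | 'major';

interface AlignmentConflict {
  value: string;
  conflict: string;
  severity: ConflictSeverity;
}

interface AlignmentResult {
  aligned: boolean;
  conflicts: AlignmentConflict[];
  recommendations: string[];
}

// Hypothetical input: values this change may harm, tagged by an upstream analyzer.
interface ProposedChange {
  description: string;
  mayHarm: string[];
}

// The user-defined example values from the specification above.
const projectValues = new Map<string, Priority>([
  ['security', 'high_priority'],
  ['performance', 'medium_priority'],
  ['backward_compatibility', 'high_priority'],
  ['code_cleanliness', 'medium_priority'],
]);

function checkAlignment(change: ProposedChange, values: Map<string, Priority>): AlignmentResult {
  const conflicts: AlignmentConflict[] = [];
  for (const value of change.mayHarm) {
    const priority = values.get(value);
    if (priority === undefined) continue; // not a declared value, nothing to conflict with
    conflicts.push({
      value,
      conflict: `"${change.description}" may reduce ${value}`,
      severity: priority === 'high_priority' ? 'major' : 'moderate',
    });
  }
  // Assumption: only 'major' conflicts make an action misaligned.
  const aligned = !conflicts.some((c) => c.severity === 'major');
  const recommendations = conflicts.map((c) => `Review impact on ${c.value} before proceeding`);
  return { aligned, conflicts, recommendations };
}
```

The hard part, deciding which values a change actually affects, is outside this sketch and is likely where most of the design effort for this phase would go.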

Technical Considerations

Performance Impact

  • Minimizing overhead of safety checks
  • Caching classification results
  • Parallel constraint checking
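
For the caching point, a small memoizing wrapper is one option. The sketch below assumes classification can be keyed by a single string. `memoizeClassifier` and its `invalidate` method are hypothetical names for illustration.

```typescript
// Hypothetical sketch: caches classification results by a string key.
type RiskLevel = 'safe' | 'low' | 'medium' | 'high' | 'critical';

interface CachedClassifier {
  classify(key: string): RiskLevel;
  invalidate(): void;
}

// Wrap a (possibly expensive) classifier so each key is computed at most once
// until the cache is invalidated.
function memoizeClassifier(classifyFn: (key: string) => RiskLevel): CachedClassifier {
  const cache = new Map<string, RiskLevel>();
  return {
    classify(key: string): RiskLevel {
      const hit = cache.get(key);
      if (hit !== undefined) return hit;
      const level = classifyFn(key);
      cache.set(key, level);
      return level;
    },
    // Call when constraints or rules change so stale results are not reused.
    invalidate(): void {
      cache.clear();
    },
  };
}
```

The cache key would need to capture everything that affects risk (for example the action name, target path, and branch). Otherwise a cached `safe` result could be reused for a riskier variant of the same action.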

User Experience

  • Not creating too much friction
  • Clear communication about why actions are blocked
  • Easy override process for legitimate cases

Completeness

  • Covering all action types
  • Handling edge cases
  • Keeping constraints up to date

Audit Scalability

  • Handling large audit logs
  • Efficient querying
  • Archival strategies

Success Metrics

  1. Safety Incidents: Number of harmful actions blocked before execution, tracked alongside any that were not caught
  2. False Positive Rate: % of blocked actions that were actually safe
  3. Approval Latency: Time from request to decision
  4. Audit Completeness: % of actions that are logged
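
The two percentage metrics can be computed directly from reviewed action records, as in the sketch below. The `ReviewedAction` shape and the existence of a ground-truth `actuallyHarmful` label from later human review are assumptions. Returning 0 when there is nothing to measure is also just a convention chosen for the example.

```typescript
// Hypothetical sketch: assumes each action has been labeled after the fact.
interface ReviewedAction {
  blocked: boolean;
  actuallyHarmful: boolean; // ground truth from later review (assumed to exist)
  logged: boolean;
}

// Share of blocked actions that were actually safe.
function falsePositiveRate(actions: ReviewedAction[]): number {
  const blocked = actions.filter((a) => a.blocked);
  if (blocked.length === 0) return 0;
  const wronglyBlocked = blocked.filter((a) => !a.actuallyHarmful);
  return wronglyBlocked.length / blocked.length;
}

// Share of all actions that appear in the audit log.
function auditCompleteness(actions: ReviewedAction[]): number {
  if (actions.length === 0) return 0;
  return actions.filter((a) => a.logged).length / actions.length;
}
```

For example, if two actions were blocked and one of them turned out to be safe, the false positive rate is 0.5.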

Implementation Roadmap

  1. Phase 1: Action classification system
  2. Phase 2: Basic constraint system
  3. Phase 3: Approval gates
  4. Phase 4: Rollback planning
  5. Phase 5: Audit logging
  6. Phase 6: Value alignment verification

Questions for Discussion

  1. What should the default constraint set be?
  2. How should safety be balanced against productivity?
  3. Should all actions be logged, or only high-risk ones?
  4. How should emergencies that require bypassing normal safety checks be handled?

Part of the AGI-Level Autonomy initiative

Labels: agi-foundation (Core components for AGI-level autonomy), enhancement (New feature or request)
