Quick reference guide for troubleshooting and managing the XMRT Ecosystem in production
- Database Issues: Supabase Support (support@supabase.com)
- Frontend Issues: Vercel Support (support@vercel.com)
- Security Incidents: [ESCALATE IMMEDIATELY]
- On-Call Engineer: [Define rotation]
The Eliza Gatekeeper is the central security and routing layer for all inter-Eliza communication in the XMRT-DAO ecosystem.
Check call statistics:
SELECT * FROM eliza_gatekeeper_stats
ORDER BY total_calls DESC;
Check recent errors:
SELECT * FROM eliza_activity_log
WHERE activity_type IN ('gatekeeper_error', 'schema_protection')
ORDER BY created_at DESC
LIMIT 20;
Check rate limit violations:
SELECT
identifier,
endpoint,
SUM(request_count) as total_requests,
MAX(window_start) as latest_window
FROM rate_limits
WHERE window_start > NOW() - INTERVAL '1 hour'
GROUP BY identifier, endpoint
HAVING SUM(request_count) > 100
ORDER BY total_requests DESC;
Issue: 401 Unauthorized
- Cause: Invalid or missing x-eliza-key or x-eliza-source header
- Solution:
- Verify INTERNAL_ELIZA_KEY secret is set in Supabase
- Check calling function includes both headers
- Verify source is in TRUSTED_SOURCES whitelist
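The three checks in the 401 list can be sketched as a single function, which makes it easier to see which one a failing caller trips. This is a minimal illustration, not the gatekeeper's actual code: the function name, the result shape, and the example source names are hypothetical, and only the header names, INTERNAL_ELIZA_KEY, and TRUSTED_SOURCES come from this runbook.

```typescript
// Hypothetical whitelist entries; the real TRUSTED_SOURCES list lives in the gatekeeper.
const TRUSTED_SOURCES_EXAMPLE: ReadonlySet<string> = new Set([
  "example-agent",
  "example-scheduler",
]);

type GatekeeperAuthResult =
  | { ok: true }
  | { ok: false; httpStatus: 401; reason: string };

// Returns ok, or a 401 with the first check that failed.
function checkGatekeeperHeaders(
  headers: Record<string, string | undefined>,
  expectedKey: string, // value of the INTERNAL_ELIZA_KEY secret
  trustedSources: ReadonlySet<string>
): GatekeeperAuthResult {
  const key = headers["x-eliza-key"];
  const source = headers["x-eliza-source"];
  if (!key || !source) {
    return { ok: false, httpStatus: 401, reason: "missing x-eliza-key or x-eliza-source header" };
  }
  // Plain comparison for brevity; a production check may prefer a constant-time compare.
  if (key !== expectedKey) {
    return { ok: false, httpStatus: 401, reason: "x-eliza-key does not match INTERNAL_ELIZA_KEY" };
  }
  if (!trustedSources.has(source)) {
    return { ok: false, httpStatus: 401, reason: `source "${source}" is not in TRUSTED_SOURCES` };
  }
  return { ok: true };
}
```

Logging the `reason` field alongside the 401 would tell the on-call engineer which of the three solution steps to start with.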
Issue: 429 Rate Limit Exceeded
- Cause: Too many requests from a single source within a 1-minute window
- Solution:
- Check if legitimate traffic spike or runaway loop
- Review rate_limits table for offending source
- If autonomous system, check circuit breaker logic
- Consider increasing rate limits if legitimate
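To reason about when a 429 fires, it can help to picture the limiter as a fixed-window counter keyed by source and endpoint. The sketch below is an assumption about the mechanism, not the deployed implementation: it uses an in-memory Map as a stand-in for the rate_limits table, and the default of 100 requests per window is taken from the edge function snippet later in this runbook.

```typescript
const RATE_WINDOW_MS = 60000;          // 1-minute window
const MAX_REQUESTS_PER_WINDOW = 100;   // assumed default limit

interface WindowCounter {
  windowStart: number;   // start of the current window, ms since epoch
  requestCount: number;
}

// In-memory stand-in for the rate_limits table (identifier, endpoint, request_count, window_start).
const rateCounters = new Map<string, WindowCounter>();

function recordRequest(
  identifier: string,
  endpoint: string,
  nowMs: number
): { allowed: boolean; requestCount: number } {
  const bucketKey = `${identifier}|${endpoint}`;
  // Align the timestamp to the start of its window.
  const windowStart = Math.floor(nowMs / RATE_WINDOW_MS) * RATE_WINDOW_MS;
  const existing = rateCounters.get(bucketKey);
  // Reuse the counter only if it belongs to the same window; otherwise start over.
  const counter: WindowCounter =
    existing !== undefined && existing.windowStart === windowStart
      ? existing
      : { windowStart, requestCount: 0 };
  counter.requestCount += 1;
  rateCounters.set(bucketKey, counter);
  return {
    allowed: counter.requestCount <= MAX_REQUESTS_PER_WINDOW,
    requestCount: counter.requestCount,
  };
}
```

Under this model a runaway loop shows up as one identifier hitting the cap in every consecutive window, whereas a legitimate spike tends to be spread across many identifiers.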
Issue: 403 Dangerous Operation Blocked
- Cause: Schema protection detected dangerous SQL pattern
- Solution:
- Review blocked operation in eliza_activity_log
- Determine if operation was malicious or legitimate
- If legitimate, refactor to safer approach (e.g., DELETE with WHERE clause)
- Update DANGEROUS_PATTERNS if pattern is overly restrictive
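When deciding whether a pattern is overly restrictive, it helps to see how pattern-based schema protection typically behaves. The following is a sketch under stated assumptions: the three patterns and the function name are illustrative only, and the gatekeeper's real DANGEROUS_PATTERNS list is not shown in this runbook and may differ.

```typescript
// Illustrative patterns only; not the gatekeeper's actual DANGEROUS_PATTERNS.
const EXAMPLE_DANGEROUS_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "DROP TABLE", pattern: /\bdrop\s+table\b/i },
  { label: "TRUNCATE", pattern: /\btruncate\b/i },
  // DELETE FROM <table> followed by nothing but an optional semicolon, i.e. no WHERE clause.
  { label: "DELETE without WHERE", pattern: /\bdelete\s+from\s+[\w."]+\s*;?\s*$/i },
];

// Returns the label of the first matching pattern, or null if the statement looks safe.
function findDangerousPattern(sql: string): string | null {
  const normalized = sql.trim();
  for (const entry of EXAMPLE_DANGEROUS_PATTERNS) {
    if (entry.pattern.test(normalized)) {
      return entry.label;
    }
  }
  return null;
}
```

With these example patterns, `DELETE FROM webhook_logs;` is flagged while the same statement with a WHERE clause passes, which matches the refactoring advice in the list above.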
Issue: 404 Unknown Target
- Cause: Target function not recognized by gatekeeper routing
- Solution:
- Verify target function exists and is deployed
- Add target to gatekeeper routing switch statement
- Redeploy gatekeeper function
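The routing switch mentioned above might look roughly like the sketch below. The function name and return shape are hypothetical. schema-manager and autonomous-code-fixer are function names that appear elsewhere in this runbook, but treating them as routable targets here is an assumption made for illustration.

```typescript
type RouteResult =
  | { httpStatus: 200; functionName: string }
  | { httpStatus: 404; error: string };

// Maps a requested target to a deployed function, or reports it as unknown.
function resolveTarget(target: string): RouteResult {
  switch (target) {
    case "schema-manager":
      return { httpStatus: 200, functionName: "schema-manager" };
    case "autonomous-code-fixer":
      return { httpStatus: 200, functionName: "autonomous-code-fixer" };
    // A newly deployed function needs its own case here, followed by a gatekeeper redeploy.
    default:
      return { httpStatus: 404, error: `Unknown target: ${target}` };
  }
}
```

A target that is deployed but missing from the switch falls through to the default branch, which is why a 404 can appear even though the function itself is healthy.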
Issue: Schema Validation Failed
- Cause: schema-manager rejected operation
- Solution:
- Check eliza_activity_log for validation failure reason
- Review auto-fix status (gatekeeper triggers autonomous-code-fixer)
- If auto-fix failed, manual intervention required
- Apply corrected schema operation
Weekly Tasks:
- Review gatekeeper statistics for anomalies
- Check for blocked operations (could indicate legitimate need)
- Monitor auto-correction success rate
- Review rate limit violations
Monthly Tasks:
- Audit all schema changes via gatekeeper logs
- Review and update TRUSTED_SOURCES whitelist if needed
- Analyze performance metrics (avg_duration_ms)
- Update documentation with lessons learned
Security Rotation (Optional):
-- Rotate INTERNAL_ELIZA_KEY every 90 days
-- 1. Generate new UUID
-- 2. Update INTERNAL_ELIZA_KEY secret in Supabase
-- 3. Update all calling functions to use new key
-- 4. Monitor for authentication failures
Slow routing (avg_duration_ms > 200ms):
- Check target function performance
- Review payload size (large payloads increase latency)
- Consider caching frequently accessed data
- Check database query performance in rate limit checks
High error rate:
- Review error patterns in eliza_activity_log
- Check if target functions are healthy
- Verify network connectivity between functions
- Review authentication issues
Disable Gatekeeper Enforcement: If gatekeeper is causing system-wide issues:
- Add GATEKEEPER_ENFORCE=false to Supabase secrets
- Update gatekeeper to check this flag and allow all traffic
- Monitor for continued issues
- Fix underlying problem
- Re-enable enforcement
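The flag check in the second step could be sketched as below. This is a hypothetical helper, not existing gatekeeper code; it takes the environment as a plain object so it can be exercised outside an edge function. Inside a Supabase edge function the value would typically be read with `Deno.env.get("GATEKEEPER_ENFORCE")`.

```typescript
// Enforcement stays on unless the secret is explicitly set to "false".
// A missing or mistyped value therefore fails safe (enforcement remains enabled).
function isEnforcementEnabled(env: Record<string, string | undefined>): boolean {
  const raw = env["GATEKEEPER_ENFORCE"];
  if (raw === undefined) {
    return true;
  }
  return raw.trim().toLowerCase() !== "false";
}
```

Treating only the literal value "false" as the off switch is a deliberate choice in this sketch: an accidental empty string or typo should not silently open the gatekeeper.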
Bypass Gatekeeper: For emergency operations:
- Use service role key directly in Authorization header
- Call target function directly (bypasses gatekeeper)
- Log manual bypass in eliza_activity_log
- Document reason for bypass
- Review and fix after emergency
Symptoms:
- Users cannot access the website
- Vercel returns 502 error
- Health check shows "offline"
Diagnosis:
# Check health endpoint
curl https://v0-git-hub-sync-website.vercel.app/api/health
# Check Vercel deployment status
vercel list --prod
# Check database connection
# (via Supabase dashboard - Database > Logs)
Solutions:
- If Vercel deployment failed:
  - Check build logs in Vercel dashboard
  - Rollback to previous deployment: vercel rollback
  - Redeploy: vercel --prod
- If database connection issue:
  - Check Supabase project status
  - Verify connection pooling settings
  - Check if RLS policies are blocking queries
- If edge function timeout:
  - Review edge function logs in Supabase
  - Check for long-running queries
  - Increase function timeout if needed
Symptoms:
- Users getting "Unauthorized" errors
- Edge functions returning 401
- JWT validation failing
Diagnosis:
# Check edge function logs
# Supabase Dashboard > Edge Functions > [function-name] > Logs
# Verify JWT secret is set
# Supabase Dashboard > Settings > API > JWT Secret
Solutions:
- If JWT expired:
  - Normal behavior - user needs to re-authenticate
  - Verify jwt_expiry in config.toml (currently 3600s = 1 hour)
- If JWT secret mismatch:
  - Verify SUPABASE_ANON_KEY matches in frontend and backend
  - Check environment variables in Vercel
  - Redeploy if secrets were rotated
- If RLS policy blocking:
  - Check user has proper role/permissions
  - Review RLS policies for the affected table
  - Test with service_role key to isolate issue
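To tell an expired token apart from a secret mismatch, one quick check is to read the `exp` claim from the token a user reports. The sketch below only decodes the payload for inspection and does not verify the signature, so it must not be used for authorization. The function names are hypothetical, and the decoding assumes an ASCII-only payload.

```typescript
// Decodes the middle (payload) segment of a JWT without verifying it.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  const payloadPart = parts[1];
  if (parts.length !== 3 || payloadPart === undefined) {
    throw new Error("not a three-part JWT");
  }
  // JWTs use base64url; convert to standard base64 and restore padding before decoding.
  const base64 = payloadPart.replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  return JSON.parse(atob(padded)) as Record<string, unknown>;
}

// Positive result: seconds of validity left. Negative: seconds since expiry.
function secondsUntilExpiry(token: string, nowSeconds: number): number {
  const exp = decodeJwtPayload(token)["exp"];
  if (typeof exp !== "number") {
    throw new Error("token has no numeric exp claim");
  }
  return exp - nowSeconds;
}
```

A negative result points to the normal re-authentication path. A token with plenty of time left that still returns 401 suggests a secret mismatch or an RLS problem instead.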
Symptoms:
- Queries taking >1 second
- Edge functions timing out
- High database CPU usage
Diagnosis:
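-- Check for slow queries (requires the pg_stat_statements extension)
-- On PostgreSQL 13 and later the timing columns are named *_exec_time;
-- on PostgreSQL 12 and earlier use total_time, mean_time, max_time instead.
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- List existing indexes (compare against the columns used by slow queries to spot missing ones)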
SELECT schemaname, tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename;
-- Check table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Solutions:
- Add missing indexes:
  -- Example: Index on frequently queried columns
  CREATE INDEX idx_table_column ON public.table_name(column_name);
- Optimize queries:
  - Use EXPLAIN ANALYZE to identify bottlenecks
  - Add appropriate WHERE clauses
  - Limit result sets with pagination
- Clean up old data:
  -- Delete old logs (>30 days)
  DELETE FROM webhook_logs WHERE created_at < now() - interval '30 days';
  -- Delete inactive sessions (>90 days)
  DELETE FROM conversation_sessions WHERE is_active = false AND updated_at < now() - interval '90 days';
- Upgrade database plan if consistently hitting limits
Symptoms:
- Users getting "Too Many Requests" (429) errors
- Legitimate traffic being blocked
- Complaints about access denial
Diagnosis:
-- Check rate limit records
SELECT
identifier,
endpoint,
request_count,
window_start
FROM rate_limits
WHERE window_start > now() - interval '1 hour'
ORDER BY request_count DESC;
-- Check if a specific IP is rate limited (203.0.113.10 is a placeholder address)
SELECT * FROM rate_limits
WHERE identifier = '203.0.113.10'
ORDER BY window_start DESC;
Solutions:
- Increase rate limit for specific endpoint:
  // In edge function
  const maxRequests = endpoint === '/api/mining-proxy' ? 1000 : 100;
- Unblock a trusted IP (this clears its current counters; it is not a permanent whitelist):
  -- Delete rate limit records for trusted IP
  DELETE FROM rate_limits WHERE identifier = 'trusted-ip-address';
- Adjust rate limit window:
  - Currently 1 minute window
  - Can increase to 5 or 10 minutes for less aggressive limiting
Symptoms:
- User conversations not persisting
- Memory context empty
- RLS policy errors in logs
Diagnosis:
-- Check if data is being inserted
SELECT COUNT(*), MAX(timestamp)
FROM conversation_messages
WHERE timestamp > now() - interval '1 hour';
-- Check RLS policies on memory_contexts
SELECT policyname, permissive, roles, cmd, qual
FROM pg_policies
WHERE tablename = 'memory_contexts';
-- Test with service role to bypass RLS
-- (via Supabase SQL Editor with service_role key)
Solutions:
- If RLS blocking inserts:
  - Verify user is authenticated
  - Check user_id matches JWT claim
  - Temporarily test with service_role to confirm
- If validation errors:
  - Check UUID format is valid
  - Ensure JSON metadata is valid
  - Review edge function logs for error details
- If session not found:
  - Check session creation logic
  - Verify session_key format
  - Ensure session hasn't expired
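The two validation checks above (UUID format and JSON metadata) can be reproduced locally against a failing payload before digging through edge function logs. This is a minimal sketch: the function names and the `sessionId`/`metadata` field names are hypothetical, and the edge functions' actual validation rules may be stricter.

```typescript
// Canonical 8-4-4-4-12 hexadecimal form, case-insensitive.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isUuid(value: string): boolean {
  return UUID_PATTERN.test(value);
}

// True only if the text parses as JSON and the top-level value is an object.
function isJsonObject(text: string): boolean {
  try {
    const parsed: unknown = JSON.parse(text);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch {
    return false;
  }
}

// Collects every problem instead of stopping at the first, which is handier when debugging.
function validateMessageInput(input: { sessionId: string; metadata: string }): string[] {
  const problems: string[] = [];
  if (!isUuid(input.sessionId)) {
    problems.push("sessionId is not a valid UUID");
  }
  if (!isJsonObject(input.metadata)) {
    problems.push("metadata is not a valid JSON object");
  }
  return problems;
}
```

An empty result means the payload passes these two checks, so the cause is more likely RLS or session handling.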
Symptoms:
- Edge functions not deploying
- Build errors in Supabase dashboard
- Functions not responding
Diagnosis:
# Check edge function logs
# Supabase Dashboard > Edge Functions > Logs
# Common errors:
# - TypeScript compilation errors
# - Import path issues
# - Missing dependencies
Solutions:
- TypeScript errors:
  - Check types are correct
  - Verify imports exist
  - Run deno check locally if available
- Import errors:
  - Use full URLs for Deno imports
  - Verify versions are compatible
  - Check network access to imported modules
- Missing environment variables:
  - Check secrets are set in Supabase
  - Verify secret names match in code
  - Redeploy after adding secrets
System Health Summary:
-- Run this query to get current system health
SELECT * FROM system_health_summary;
Expected Values:
- frontend_uptime_checks > 0 (checks in last hour)
- recent_function_errors < 10 (errors in last hour)
- messages_last_hour varies by traffic
- active_sessions varies by traffic
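If the health summary is polled from a script, the expected values can be checked mechanically. The sketch below assumes the view returns one row with the four columns named above as plain numbers (some database clients return counts as strings, in which case they would need converting first). The interface and function names are hypothetical.

```typescript
interface SystemHealthSummary {
  frontend_uptime_checks: number;
  recent_function_errors: number;
  messages_last_hour: number;
  active_sessions: number;
}

// Returns a human-readable list of violated expectations; an empty list means healthy.
function findHealthProblems(row: SystemHealthSummary): string[] {
  const problems: string[] = [];
  if (row.frontend_uptime_checks <= 0) {
    problems.push("no frontend uptime checks recorded in the last hour");
  }
  if (row.recent_function_errors >= 10) {
    problems.push(`${row.recent_function_errors} function errors in the last hour (expected < 10)`);
  }
  // messages_last_hour and active_sessions vary with traffic, so they are not checked here.
  return problems;
}
```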
- Frontend Uptime
  - Target: >99.9%
  - Alert if: 3 consecutive failures
  - Check: Every 5 minutes
- Database Performance
  - Target: <100ms average query time
  - Alert if: >500ms for 5 minutes
  - Check: Real-time via Supabase metrics
- Edge Function Errors
  - Target: <1% error rate
  - Alert if: >5% for 10 minutes
  - Check: Via edge function logs
- API Response Times
  - Target: <200ms p95
  - Alert if: >1s p95 for 5 minutes
  - Check: Via Vercel analytics
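Two of the alert conditions above, "3 consecutive failures" and "p95 above 1s", are easy to get subtly wrong, so a small reference sketch may help when configuring alert rules. The function names are hypothetical, the percentile uses the nearest-rank definition (monitoring tools may interpolate differently), and the "for 5 minutes" duration condition is left out.

```typescript
// Counts failed checks at the end of the history (oldest first, true = check passed).
function trailingFailures(results: boolean[]): number {
  let count = 0;
  for (let i = results.length - 1; i >= 0 && results[i] === false; i--) {
    count++;
  }
  return count;
}

// Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("no samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100)); // 1-based rank
  return sorted[rank - 1] as number;
}

function shouldAlertUptime(results: boolean[]): boolean {
  return trailingFailures(results) >= 3;
}

function shouldAlertLatency(samplesMs: number[]): boolean {
  return percentile(samplesMs, 95) > 1000;
}
```

Counting only trailing failures means an old outage that has since recovered does not keep the uptime alert firing.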
- Frontend Logs: Vercel Dashboard > Deployment > Functions
- Edge Function Logs: Supabase Dashboard > Edge Functions > [function]
- Database Logs: Supabase Dashboard > Database > Logs
- Error Tracking: [Configure Sentry or similar]
- Check error count < threshold
- Verify health checks passing
- Review critical alerts
- Review slow query log
- Check database size growth
- Update dependencies if needed
- Review and clear rate limit blocks
- Clean up old logs (>30 days)
- Clean up inactive sessions (>90 days)
- Review and rotate API keys
- Update production documentation
- Run full backup
- Security audit
- Load testing
- Disaster recovery drill
- Review and update runbook
- Review monitoring/alerting rules
- Performance optimization review
- IMMEDIATE ACTIONS (within 5 minutes)
  - Enable read-only mode on database
  - Disable affected user accounts
  - Capture logs and evidence
  - Notify security team
- INVESTIGATION (within 30 minutes)
  - Identify scope of breach
  - Review access logs
  - Check for data exfiltration
  - Determine attack vector
- REMEDIATION (within 2 hours)
  - Patch vulnerability
  - Rotate all API keys
  - Force password resets if needed
  - Deploy security fixes
- COMMUNICATION (within 4 hours)
  - Notify affected users
  - Update status page
  - Prepare incident report
  - Coordinate with legal/compliance
- POST-MORTEM (within 1 week)
  - Document timeline
  - Identify root cause
  - Implement preventive measures
  - Update security policies
-- Overall system stats
SELECT
(SELECT COUNT(*) FROM conversation_sessions WHERE is_active = true) as active_sessions,
(SELECT COUNT(*) FROM conversation_messages WHERE timestamp > now() - interval '1 hour') as messages_last_hour,
(SELECT COUNT(*) FROM eliza_activity_log WHERE created_at > now() - interval '1 hour') as activities_last_hour,
(SELECT pg_size_pretty(pg_database_size(current_database()))) as database_size;
-- Most common errors in last 24 hours
SELECT
function_name,
COUNT(*) as error_count,
error_message
FROM api_call_logs
WHERE status = 'error'
AND created_at > now() - interval '24 hours'
GROUP BY function_name, error_message
ORDER BY error_count DESC
LIMIT 10;
-- Active sessions by hour
SELECT
date_trunc('hour', updated_at) as hour,
COUNT(*) as active_sessions
FROM conversation_sessions
WHERE is_active = true
AND updated_at > now() - interval '24 hours'
GROUP BY hour
ORDER BY hour DESC;
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as size,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
# View migration history
supabase migration list
# Reset the database and re-apply all migrations
# (destructive: the database is recreated, so this does not undo only the last migration)
supabase db reset
# Or reset up to a specific migration version
supabase db reset --version <migration_timestamp>
# Via Vercel CLI
vercel rollback
# Or via Vercel Dashboard:
# Deployments > [previous deployment] > Promote to Production
# Redeploy previous version from git
git checkout <previous_commit>
supabase functions deploy <function_name>
git checkout main
- Supabase Documentation
- Vercel Documentation
- Production Deployment Checklist
- Architecture Diagram
- API Documentation
Last Updated: 2025-10-12
Maintained By: DevOps Team
Review Frequency: Monthly