
Performance Troubleshooting

Eric Fitzgerald edited this page Nov 12, 2025 · 1 revision

This guide helps identify and resolve performance bottlenecks in TMI deployments, covering server, database, network, and client-side performance issues.

Identifying Bottlenecks

Performance Monitoring Overview

Key metrics to monitor:

  1. Response Time: API endpoint latency
  2. Throughput: Requests per second
  3. Error Rate: Failed requests percentage
  4. Resource Usage: CPU, memory, disk, network
  5. Database Performance: Query times, connection pool
  6. Cache Hit Rate: Redis effectiveness

Quick Performance Check

# Check server resource usage
top -b -n 1 | grep tmiserver

# Check response time
time curl http://localhost:8080/api/v1/threat-models

# Check database connections
psql -d tmi -c "SELECT count(*) FROM pg_stat_activity WHERE datname='tmi'"

# Check Redis memory
redis-cli INFO memory | grep used_memory_human

# Check disk usage
df -h

# Check network connections
netstat -an | grep ESTABLISHED | wc -l

Performance Profiling

Enable pprof endpoints (development only):

# CPU profile
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof

# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof mem.prof

# Goroutine profile (check for leaks)
curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof

# Blocking profile (empty unless the server calls runtime.SetBlockProfileRate)
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof block.prof

Analyze profiles:

# Interactive analysis
go tool pprof -http=:8081 cpu.prof

# Top functions by CPU
go tool pprof -top cpu.prof

# Call graph
go tool pprof -pdf cpu.prof > cpu_profile.pdf

Server Performance Issues

High CPU Usage

Symptoms:

  • CPU at 80-100% consistently
  • Slow response times
  • Request timeouts

Diagnosis:

# Check CPU usage by process
top -o %CPU

# Profile CPU usage
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof -top cpu.prof

# Find slow requests (>= 1000ms) that may indicate CPU-intensive handlers
grep -E '"duration":[0-9]{4,}' logs/tmi.log | head -20

Common causes and solutions:

  1. Inefficient algorithms:

    • Profile hot code paths
    • Optimize loops and data structures
    • Use concurrent processing where appropriate
  2. Too many goroutines:

    # Check goroutine count
    curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
    go tool pprof -top goroutine.prof
  3. JSON serialization overhead:

    • Cache serialized responses
    • Use streaming for large responses
    • Reduce response payload size
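A common fix for unbounded goroutine growth is a fixed-size worker pool: concurrency stays capped no matter how many jobs arrive. A minimal stdlib sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs jobs through a fixed number of workers instead of
// spawning one goroutine per job.
func processAll(jobs []int, workers int, handle func(int) int) []int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	// Fixed pool of workers: goroutine count is bounded by `workers`.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- handle(j)
			}
		}()
	}
	// Feed jobs, then signal completion.
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()
	// Close out once all workers have drained in.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []int
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	doubled := processAll([]int{1, 2, 3, 4}, 2, func(n int) int { return n * 2 })
	fmt.Println(len(doubled)) // 4
}
```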

Solutions:

# Increase CPU resources
# - Add more CPU cores
# - Use faster CPU

# Optimize code
# - Add caching for expensive operations
# - Use connection pooling
# - Implement rate limiting

# Scale horizontally
# - Deploy multiple instances behind load balancer

High Memory Usage

Symptoms:

  • Memory usage grows over time
  • Out of memory errors
  • Frequent garbage collection

Diagnosis:

# Check memory usage
free -h
ps aux --sort=-%mem | head -10

# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof -top mem.prof

# Check for memory leaks
curl http://localhost:8080/debug/pprof/heap > mem1.prof
# Wait 10 minutes
curl http://localhost:8080/debug/pprof/heap > mem2.prof
go tool pprof -base=mem1.prof mem2.prof

Common causes:

  1. Memory leaks:

    • Unclosed database connections
    • Goroutine leaks
    • Large objects not garbage collected
    • WebSocket connections not cleaned up
  2. Large responses:

    • Returning too much data in API responses
    • Not paginating results
    • Loading entire datasets into memory
  3. Caching too much data:

    • Redis consuming excessive memory
    • In-memory caches too large

Solutions:

  1. Fix connection leaks:

    // Always close database connections
    defer rows.Close()
    
    // Use context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
  2. Implement pagination:

    # Use limit and offset
    GET /api/v1/threat-models?limit=50&offset=0
  3. Optimize caching:

    # Set Redis maxmemory
    redis-cli CONFIG SET maxmemory 2gb
    redis-cli CONFIG SET maxmemory-policy allkeys-lru
    
    # Monitor cache size
    redis-cli INFO memory
  4. Increase server memory:

    # Allocate more RAM to server
    # Adjust container memory limits if using Docker
    docker run -m 4g tmi-server

Slow Response Times

Symptoms:

  • API requests take >1 second
  • User interface feels sluggish
  • Timeouts

Diagnosis:

# Measure endpoint response times
time curl http://localhost:8080/api/v1/threat-models

# Find slow requests (>= 500ms) in logs
grep -E '"duration":([5-9][0-9][0-9]|[0-9]{4,})' logs/tmi.log | head -20

# Profile request handling
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof

Common causes:

  1. Slow database queries (see Database Performance section)
  2. Network latency (see Network Performance section)
  3. Synchronous processing of slow operations
  4. No caching of frequently accessed data

Solutions:

  1. Add caching:

    // Cache frequently accessed data in Redis
    // Set appropriate TTL (time-to-live)
  2. Optimize queries:

    • Add database indexes
    • Reduce joins
    • Paginate results
    • Use query caching
  3. Async processing:

    // Process slow operations asynchronously
    go processSlowOperation()
    // Return immediately to client
  4. Add CDN:

    • Cache static assets
    • Reduce server load
    • Improve client load times

Database Performance Issues

Slow Queries

Symptoms:

  • API endpoints slow
  • Database CPU high
  • Query timeouts

Diagnosis:

-- Find currently running slow queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 second'
ORDER BY duration DESC;

-- Query statistics (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are total_exec_time, mean_exec_time, max_exec_time)
SELECT query, calls, total_time, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;

-- Analyze specific query
EXPLAIN ANALYZE
SELECT * FROM threat_models WHERE owner = 'user123';

Common causes:

  1. Missing indexes:

    -- Find tables with high sequential scans
    SELECT schemaname, tablename, seq_scan, idx_scan,
           seq_scan - idx_scan AS too_much_seq
    FROM pg_stat_user_tables
    WHERE seq_scan - idx_scan > 0
    ORDER BY too_much_seq DESC;
  2. Inefficient queries:

    • Too many JOINs
    • N+1 query problem
    • Fetching unnecessary columns
    • Not using WHERE clauses effectively
  3. Large result sets:

    • Missing LIMIT clauses
    • Fetching all rows instead of paginating

Solutions:

  1. Add indexes:

    -- Create index on frequently queried columns
    CREATE INDEX idx_threat_models_owner ON threat_models(owner);
    CREATE INDEX idx_threat_models_created_at ON threat_models(created_at);
    
    -- Composite index for common query patterns
    CREATE INDEX idx_threat_models_owner_created ON threat_models(owner, created_at);
    
    -- Verify index is used
    EXPLAIN SELECT * FROM threat_models WHERE owner = 'user123';
  2. Optimize queries:

    -- Bad: Fetching all columns
    SELECT * FROM threat_models;
    
    -- Good: Fetch only needed columns
    SELECT id, name, owner FROM threat_models;
    
    -- Bad: N+1 queries
    -- (Multiple queries in application code)
    
    -- Good: Use JOINs
    SELECT tm.*, u.email
    FROM threat_models tm
    JOIN users u ON u.id = tm.owner;
  3. Implement pagination:

    -- Add LIMIT and OFFSET
    SELECT id, name, owner
    FROM threat_models
    ORDER BY created_at DESC
    LIMIT 50 OFFSET 0;
  4. Use connection pooling:

    # Configure connection pool size
    export DB_MAX_CONNECTIONS=25
    export DB_MAX_IDLE_CONNECTIONS=5

High Database CPU

Symptoms:

  • Database server CPU at 80-100%
  • Queries queueing
  • Connection timeouts

Diagnosis:

-- Find expensive queries (columns renamed to *_exec_time in PostgreSQL 13+)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Check for long-running queries
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;

Solutions:

  1. Optimize queries (see Slow Queries section)

  2. Add read replicas:

    • Offload read queries to replicas
    • Keep writes on primary
    • Use connection pooling (PgBouncer)
  3. Upgrade database hardware:

    • More CPU cores
    • Faster storage (SSD)
    • More RAM for caching
  4. Tune PostgreSQL:

    -- Increase shared buffers (25% of RAM)
    ALTER SYSTEM SET shared_buffers = '4GB';
    
    -- Increase work memory for sorts
    ALTER SYSTEM SET work_mem = '50MB';
    
    -- Increase maintenance work memory
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    
    -- Reload configuration (note: shared_buffers only takes effect
    -- after a full server restart, not a reload)
    SELECT pg_reload_conf();

Connection Pool Exhaustion

Symptoms:

  • "Too many connections" errors
  • Connection timeouts
  • Queries that execute quickly but are slow to obtain a connection

Diagnosis:

-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'tmi'
GROUP BY state;

-- Check max connections limit
SHOW max_connections;

-- Check connection pool usage
SELECT count(*) as used_connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections
FROM pg_stat_activity
WHERE datname = 'tmi';

Solutions:

  1. Increase connection pool size (carefully):

    # Application side
    export DB_MAX_CONNECTIONS=50
    
    # Database side
    # Edit postgresql.conf
    max_connections = 200
  2. Fix connection leaks:

    // Close result sets when done (do not close the shared *sql.DB
    // per request; it is a long-lived pool)
    defer rows.Close()
    
    // Use context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
  3. Use connection pooler (PgBouncer):

    # Install PgBouncer
    apt-get install pgbouncer
    
    # Configure connection pooling
    # Edit /etc/pgbouncer/pgbouncer.ini
    [databases]
    tmi = host=localhost port=5432 dbname=tmi
    
    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
  4. Reduce connection lifetime:

    export DB_MAX_CONNECTION_LIFETIME=5m

Database Disk I/O

Symptoms:

  • Slow query performance
  • High disk utilization
  • Increased query latency

Diagnosis:

# Check disk I/O
iostat -x 1

# Check database disk usage
du -sh /var/lib/postgresql/data

# Find large tables
psql -d tmi -c "
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Solutions:

  1. Use SSD storage:

    • Migrate to SSD volumes
    • Significantly faster than HDD
  2. Optimize tables:

    -- Vacuum to reclaim space
    VACUUM ANALYZE threat_models;
    
    -- Reindex to rebuild indexes
    REINDEX TABLE threat_models;
    
    -- Auto-vacuum settings
    ALTER TABLE threat_models SET (autovacuum_enabled = on);
  3. Archive old data:

    -- Move old records to archive table
    CREATE TABLE threat_models_archive AS
    SELECT * FROM threat_models
    WHERE created_at < NOW() - INTERVAL '1 year';
    
    DELETE FROM threat_models
    WHERE created_at < NOW() - INTERVAL '1 year';
  4. Tune PostgreSQL I/O settings:

    -- Spread checkpoint writes over a larger fraction of the interval
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    -- wal_buffers requires a server restart to take effect
    ALTER SYSTEM SET wal_buffers = '16MB';
    -- effective_io_concurrency = 200 suits SSD storage
    ALTER SYSTEM SET effective_io_concurrency = 200;

Network Performance Issues

High Latency

Symptoms:

  • Slow page loads
  • API timeouts
  • Poor user experience

Diagnosis:

# Measure latency
ping api.example.com

# Trace route
traceroute api.example.com

# Test HTTP latency with a timing breakdown
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/health

Common causes:

  1. Geographic distance between client and server
  2. Network congestion or poor routing
  3. DNS resolution slow
  4. TLS handshake slow

Solutions:

  1. Use CDN:

    • Cache static assets closer to users
    • Reduce latency for static content
    • Options: CloudFlare, Fastly, AWS CloudFront
  2. Enable HTTP/2:

    • Multiplexing reduces latency
    • Header compression
    • Server push
  3. Optimize DNS:

    • Use fast DNS provider
    • Enable DNS caching
    • Reduce DNS lookups
  4. Enable compression:

    # Enable gzip compression
    export COMPRESSION_ENABLED=true
  5. Use connection pooling:

    • Reuse connections
    • Reduce TLS handshake overhead

Bandwidth Limitations

Symptoms:

  • Slow downloads
  • Timeouts on large responses
  • High network utilization

Diagnosis:

# Monitor network usage
iftop

# Check bandwidth
speedtest-cli

# Monitor specific process
nethogs

# Check API response sizes
curl -w "Size: %{size_download} bytes\n" -o /dev/null -s http://api.example.com/api/v1/threat-models

Solutions:

  1. Enable compression:

    • gzip/brotli compression
    • Typically shrinks text payloads by 60-80%
  2. Optimize response payloads:

    # Return only necessary fields
    GET /api/v1/threat-models?fields=id,name,owner
    
    # Use pagination
    GET /api/v1/threat-models?limit=20&offset=0
  3. Implement caching:

    • Browser caching (Cache-Control headers)
    • Proxy caching
    • CDN caching
  4. Optimize images/assets:

    • Compress images
    • Use appropriate formats (WebP, SVG)
    • Lazy load images
  5. Upgrade bandwidth:

    • Increase server bandwidth
    • Use better network tier

Redis Performance Issues

High Memory Usage

Symptoms:

  • Redis using excessive RAM
  • Memory warnings
  • Evictions occurring

Diagnosis:

# Check memory usage
redis-cli INFO memory

# Find large keys
redis-cli --bigkeys

# Check key count
redis-cli DBSIZE

# Sample keys by memory usage
redis-cli --memkeys

Solutions:

  1. Set maxmemory limit:

    redis-cli CONFIG SET maxmemory 2gb
    redis-cli CONFIG SET maxmemory-policy allkeys-lru
  2. Clean up old keys:

    # Find keys with no TTL (use --scan rather than KEYS to avoid blocking)
    redis-cli --scan | while read key; do
        ttl=$(redis-cli TTL "$key")
        if [ "$ttl" = "-1" ]; then
            echo "$key has no TTL"
        fi
    done
    
    # Set TTL on keys
    redis-cli EXPIRE "key:name" 3600
  3. Optimize data structures:

    • Use hashes instead of strings for objects
    • Use sets for unique values
    • Use sorted sets for ranked data
  4. Enable RDB compression:

    # Redis applies LZF compression to string values in snapshots
    redis-cli CONFIG SET rdbcompression yes

Slow Redis Operations

Symptoms:

  • Redis commands taking >10ms
  • Increased latency
  • Timeouts

Diagnosis:

# Check slow log
redis-cli SLOWLOG GET 10

# Monitor commands in real-time (high overhead; avoid on busy production instances)
redis-cli MONITOR

# Check latency
redis-cli --latency

# Check for blocking operations
redis-cli INFO commandstats

Solutions:

  1. Avoid blocking commands:

    • Use SCAN instead of KEYS
    • Use non-blocking alternatives
    • Paginate large results
  2. Optimize operations:

    # Bad: KEYS * (blocks Redis)
    redis-cli KEYS "*"
    
    # Good: SCAN (non-blocking)
    redis-cli SCAN 0 MATCH "session:*" COUNT 100
  3. Enable persistence efficiently:

    # Use AOF with fsync every second
    redis-cli CONFIG SET appendonly yes
    redis-cli CONFIG SET appendfsync everysec
    
    # Or use RDB snapshots
    redis-cli CONFIG SET save "900 1 300 10 60 10000"
  4. Use Redis cluster for horizontal scaling

Client-Side Performance

Slow Page Load

Common causes:

  1. Too many HTTP requests
  2. Large JavaScript bundles
  3. Unoptimized images
  4. Blocking scripts
  5. No caching

Solutions:

  1. Optimize JavaScript:

    • Code splitting
    • Lazy loading
    • Minification
    • Tree shaking
  2. Optimize images:

    • Compress images
    • Use responsive images
    • Lazy load off-screen images
    • Use modern formats (WebP)
  3. Reduce HTTP requests:

    • Bundle CSS/JS files
    • Use sprites for icons
    • Inline critical CSS
  4. Enable caching:

    # Set caching headers on HTTP responses; <meta http-equiv="Cache-Control">
    # is ignored by modern browsers
    Cache-Control: max-age=31536000
  5. Use service workers:

    • Cache API responses
    • Offline support
    • Background sync

Performance Best Practices

General Guidelines

  1. Monitor continuously:

    • Set up metrics and alerts
    • Track trends over time
    • Be proactive, not reactive
  2. Optimize early:

    • Profile during development
    • Load test before production
    • Identify bottlenecks early
  3. Cache aggressively:

    • Cache at multiple layers
    • Use appropriate TTLs
    • Invalidate stale cache
  4. Scale horizontally:

    • Use load balancers
    • Deploy multiple instances
    • Use database replicas
  5. Implement rate limiting:

    • Protect against abuse
    • Ensure fair resource usage
    • Prevent cascade failures

Performance Checklist

Server:

  • pprof profiling enabled (development)
  • Structured logging with performance metrics
  • Connection pooling configured
  • Graceful shutdown implemented
  • Resource limits set (memory, CPU)

Database:

  • Indexes on frequently queried columns
  • Query performance analyzed (EXPLAIN)
  • Connection pooling configured
  • Auto-vacuum enabled
  • Slow query logging enabled

Redis:

  • maxmemory limit set
  • Eviction policy configured
  • Persistence configured appropriately
  • Key TTLs set
  • Monitoring enabled

Network:

  • Compression enabled
  • HTTP/2 enabled
  • CDN configured for static assets
  • Connection keep-alive enabled
  • Appropriate timeouts set

Client:

  • Code splitting implemented
  • Images optimized
  • Lazy loading used
  • Caching headers set
  • Service worker configured
