
Performance Troubleshooting

Eric Fitzgerald edited this page Nov 12, 2025 · 1 revision

This guide helps identify and resolve performance bottlenecks in TMI deployments, covering server, database, network, and client-side performance issues.

Identifying Bottlenecks

Performance Monitoring Overview

Key metrics to monitor:

  1. Response Time: API endpoint latency
  2. Throughput: Requests per second
  3. Error Rate: Failed requests percentage
  4. Resource Usage: CPU, memory, disk, network
  5. Database Performance: Query times, connection pool
  6. Cache Hit Rate: Redis effectiveness

Quick Performance Check

# Check server resource usage
top -b -n 1 | grep tmiserver

# Check response time
time curl http://localhost:8080/api/v1/threat-models

# Check database connections
psql -d tmi -c "SELECT count(*) FROM pg_stat_activity WHERE datname='tmi'"

# Check Redis memory
redis-cli INFO memory | grep used_memory_human

# Check disk usage
df -h

# Check network connections
netstat -an | grep ESTABLISHED | wc -l

Performance Profiling

Enable pprof endpoints (development only):

# CPU profile
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof

# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof mem.prof

# Goroutine profile (check for leaks)
curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof

# Blocking profile (empty unless the server calls runtime.SetBlockProfileRate)
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof block.prof

Analyze profiles:

# Interactive analysis
go tool pprof -http=:8081 cpu.prof

# Top functions by CPU
go tool pprof -top cpu.prof

# Call graph
go tool pprof -pdf cpu.prof > cpu_profile.pdf

Server Performance Issues

High CPU Usage

Symptoms:

  • CPU at 80-100% consistently
  • Slow response times
  • Request timeouts

Diagnosis:

# Check CPU usage by process
top -o %CPU

# Profile CPU usage
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof -top cpu.prof

# Find slow requests (>= 1000ms) that may indicate CPU-intensive handlers
grep -E '"duration":[0-9]{4,}' logs/tmi.log | head -20

Common causes and solutions:

  1. Inefficient algorithms:

    • Profile hot code paths
    • Optimize loops and data structures
    • Use concurrent processing where appropriate
  2. Too many goroutines:

    # Check goroutine count
    curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
    go tool pprof -top goroutine.prof
  3. JSON serialization overhead:

    • Cache serialized responses
    • Use streaming for large responses
    • Reduce response payload size
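A common fix for unbounded goroutine growth is a fixed-size worker pool: concurrency stays capped no matter how many jobs arrive. A minimal stdlib sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs jobs through a fixed number of workers instead of
// spawning one goroutine per job.
func processAll(jobs []int, workers int, handle func(int) int) []int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	// Fixed pool of workers: goroutine count is bounded by `workers`.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- handle(j)
			}
		}()
	}
	// Feed jobs, then signal completion.
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()
	// Close out once all workers have drained in.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []int
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	doubled := processAll([]int{1, 2, 3, 4}, 2, func(n int) int { return n * 2 })
	fmt.Println(len(doubled)) // 4
}
```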

Solutions:

# Increase CPU resources
# - Add more CPU cores
# - Use faster CPU

# Optimize code
# - Add caching for expensive operations
# - Use connection pooling
# - Implement rate limiting

# Scale horizontally
# - Deploy multiple instances behind load balancer

High Memory Usage

Symptoms:

  • Memory usage grows over time
  • Out of memory errors
  • Frequent garbage collection

Diagnosis:

# Check memory usage
free -h
ps aux --sort=-%mem | head -10

# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof -top mem.prof

# Check for memory leaks
curl http://localhost:8080/debug/pprof/heap > mem1.prof
# Wait 10 minutes
curl http://localhost:8080/debug/pprof/heap > mem2.prof
go tool pprof -base=mem1.prof mem2.prof

Common causes:

  1. Memory leaks:

    • Unclosed database connections
    • Goroutine leaks
    • Large objects not garbage collected
    • WebSocket connections not cleaned up
  2. Large responses:

    • Returning too much data in API responses
    • Not paginating results
    • Loading entire datasets into memory
  3. Caching too much data:

    • Redis consuming excessive memory
    • In-memory caches too large

Solutions:

  1. Fix connection leaks:

    // Always close database connections
    defer rows.Close()
    
    // Use context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
  2. Implement pagination:

    # Use limit and offset
    GET /api/v1/threat-models?limit=50&offset=0
  3. Optimize caching:

    # Set Redis maxmemory
    redis-cli CONFIG SET maxmemory 2gb
    redis-cli CONFIG SET maxmemory-policy allkeys-lru
    
    # Monitor cache size
    redis-cli INFO memory
  4. Increase server memory:

    # Allocate more RAM to server
    # Adjust container memory limits if using Docker
    docker run -m 4g tmi-server

Slow Response Times

Symptoms:

  • API requests take >1 second
  • User interface feels sluggish
  • Timeouts

Diagnosis:

# Measure endpoint response times
time curl http://localhost:8080/api/v1/threat-models

# Find slow requests (>= 500ms) in logs
grep -E '"duration":([5-9][0-9][0-9]|[0-9]{4,})' logs/tmi.log | head -20

# Profile request handling
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof

Common causes:

  1. Slow database queries (see Database Performance section)
  2. Network latency (see Network Performance section)
  3. Synchronous processing of slow operations
  4. No caching of frequently accessed data

Solutions:

  1. Add caching:

    // Cache frequently accessed data in Redis
    // Set appropriate TTL (time-to-live)
  2. Optimize queries:

    • Add database indexes
    • Reduce joins
    • Paginate results
    • Use query caching
  3. Async processing:

    // Process slow operations asynchronously
    go processSlowOperation()
    // Return immediately to client
  4. Add CDN:

    • Cache static assets
    • Reduce server load
    • Improve client load times

Database Performance Issues

Slow Queries

Symptoms:

  • API endpoints slow
  • Database CPU high
  • Query timeouts

Diagnosis:

-- Find currently running slow queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 second'
ORDER BY duration DESC;

-- Query statistics (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are total_exec_time, mean_exec_time, max_exec_time)
SELECT query, calls, total_time, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;

-- Analyze specific query
EXPLAIN ANALYZE
SELECT * FROM threat_models WHERE owner = 'user123';

Common causes:

  1. Missing indexes:

    -- Find tables with high sequential scans
    SELECT schemaname, tablename, seq_scan, idx_scan,
           seq_scan - idx_scan AS too_much_seq
    FROM pg_stat_user_tables
    WHERE seq_scan - idx_scan > 0
    ORDER BY too_much_seq DESC;
  2. Inefficient queries:

    • Too many JOINs
    • N+1 query problem
    • Fetching unnecessary columns
    • Not using WHERE clauses effectively
  3. Large result sets:

    • Missing LIMIT clauses
    • Fetching all rows instead of paginating

Solutions:

  1. Add indexes:

    -- Create index on frequently queried columns
    CREATE INDEX idx_threat_models_owner ON threat_models(owner);
    CREATE INDEX idx_threat_models_created_at ON threat_models(created_at);
    
    -- Composite index for common query patterns
    CREATE INDEX idx_threat_models_owner_created ON threat_models(owner, created_at);
    
    -- Verify index is used
    EXPLAIN SELECT * FROM threat_models WHERE owner = 'user123';
  2. Optimize queries:

    -- Bad: Fetching all columns
    SELECT * FROM threat_models;
    
    -- Good: Fetch only needed columns
    SELECT id, name, owner FROM threat_models;
    
    -- Bad: N+1 queries
    -- (Multiple queries in application code)
    
    -- Good: Use JOINs
    SELECT tm.*, u.email
    FROM threat_models tm
    JOIN users u ON u.id = tm.owner;
  3. Implement pagination:

    -- Add LIMIT and OFFSET
    SELECT id, name, owner
    FROM threat_models
    ORDER BY created_at DESC
    LIMIT 50 OFFSET 0;
  4. Use connection pooling:

    # Configure connection pool size
    export DB_MAX_CONNECTIONS=25
    export DB_MAX_IDLE_CONNECTIONS=5

High Database CPU

Symptoms:

  • Database server CPU at 80-100%
  • Queries queueing
  • Connection timeouts

Diagnosis:

-- Find expensive queries (columns renamed to *_exec_time in PostgreSQL 13+)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Check for long-running queries
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;

Solutions:

  1. Optimize queries (see Slow Queries section)

  2. Add read replicas:

    • Offload read queries to replicas
    • Keep writes on primary
    • Use connection pooling (PgBouncer)
  3. Upgrade database hardware:

    • More CPU cores
    • Faster storage (SSD)
    • More RAM for caching
  4. Tune PostgreSQL:

    -- Increase shared buffers (25% of RAM)
    ALTER SYSTEM SET shared_buffers = '4GB';
    
    -- Increase work memory for sorts
    ALTER SYSTEM SET work_mem = '50MB';
    
    -- Increase maintenance work memory
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    
    -- Reload configuration (note: shared_buffers only takes effect
    -- after a full server restart, not a reload)
    SELECT pg_reload_conf();

Connection Pool Exhaustion

Symptoms:

  • "Too many connections" errors
  • Connection timeouts
  • Queries that execute quickly but are slow to obtain a connection

Diagnosis:

-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'tmi'
GROUP BY state;

-- Check max connections limit
SHOW max_connections;

-- Check connection pool usage
SELECT count(*) as used_connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections
FROM pg_stat_activity
WHERE datname = 'tmi';

Solutions:

  1. Increase connection pool size (carefully):

    # Application side
    export DB_MAX_CONNECTIONS=50
    
    # Database side
    # Edit postgresql.conf
    max_connections = 200
  2. Fix connection leaks:

    // Close result sets when done (do not close the shared *sql.DB
    // per request; it is a long-lived pool)
    defer rows.Close()
    
    // Use context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
  3. Use connection pooler (PgBouncer):

    # Install PgBouncer
    apt-get install pgbouncer
    
    # Configure connection pooling
    # Edit /etc/pgbouncer/pgbouncer.ini
    [databases]
    tmi = host=localhost port=5432 dbname=tmi
    
    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
  4. Reduce connection lifetime:

    export DB_MAX_CONNECTION_LIFETIME=5m

Database Disk I/O

Symptoms:

  • Slow query performance
  • High disk utilization
  • Increased query latency

Diagnosis:

# Check disk I/O
iostat -x 1

# Check database disk usage
du -sh /var/lib/postgresql/data

# Find large tables
psql -d tmi -c "
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Solutions:

  1. Use SSD storage:

    • Migrate to SSD volumes
    • Significantly faster than HDD
  2. Optimize tables:

    -- Vacuum to reclaim space
    VACUUM ANALYZE threat_models;
    
    -- Reindex to rebuild indexes
    REINDEX TABLE threat_models;
    
    -- Auto-vacuum settings
    ALTER TABLE threat_models SET (autovacuum_enabled = on);
  3. Archive old data:

    -- Move old records to archive table
    CREATE TABLE threat_models_archive AS
    SELECT * FROM threat_models
    WHERE created_at < NOW() - INTERVAL '1 year';
    
    DELETE FROM threat_models
    WHERE created_at < NOW() - INTERVAL '1 year';
  4. Tune PostgreSQL I/O settings:

    -- Spread checkpoint writes over a larger fraction of the interval
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    -- wal_buffers requires a server restart to take effect
    ALTER SYSTEM SET wal_buffers = '16MB';
    -- effective_io_concurrency = 200 suits SSD storage
    ALTER SYSTEM SET effective_io_concurrency = 200;

Network Performance Issues

High Latency

Symptoms:

  • Slow page loads
  • API timeouts
  • Poor user experience

Diagnosis:

# Measure latency
ping api.example.com

# Trace route
traceroute api.example.com

# Test HTTP latency with a timing breakdown
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/health

Common causes:

  1. Geographic distance between client and server
  2. Network congestion or poor routing
  3. DNS resolution slow
  4. TLS handshake slow

Solutions:

  1. Use CDN:

    • Cache static assets closer to users
    • Reduce latency for static content
    • Options: CloudFlare, Fastly, AWS CloudFront
  2. Enable HTTP/2:

    • Multiplexing reduces latency
    • Header compression
    • Server push
  3. Optimize DNS:

    • Use fast DNS provider
    • Enable DNS caching
    • Reduce DNS lookups
  4. Enable compression:

    # Enable gzip compression
    export COMPRESSION_ENABLED=true
  5. Use connection pooling:

    • Reuse connections
    • Reduce TLS handshake overhead

Bandwidth Limitations

Symptoms:

  • Slow downloads
  • Timeouts on large responses
  • High network utilization

Diagnosis:

# Monitor network usage
iftop

# Check bandwidth
speedtest-cli

# Monitor specific process
nethogs

# Check API response sizes
curl -w "Size: %{size_download} bytes\n" -o /dev/null -s http://api.example.com/api/v1/threat-models

Solutions:

  1. Enable compression:

    • gzip/brotli compression
    • Typically shrinks text payloads by 60-80%
  2. Optimize response payloads:

    # Return only necessary fields
    GET /api/v1/threat-models?fields=id,name,owner
    
    # Use pagination
    GET /api/v1/threat-models?limit=20&offset=0
  3. Implement caching:

    • Browser caching (Cache-Control headers)
    • Proxy caching
    • CDN caching
  4. Optimize images/assets:

    • Compress images
    • Use appropriate formats (WebP, SVG)
    • Lazy load images
  5. Upgrade bandwidth:

    • Increase server bandwidth
    • Use better network tier

Redis Performance Issues

High Memory Usage

Symptoms:

  • Redis using excessive RAM
  • Memory warnings
  • Evictions occurring

Diagnosis:

# Check memory usage
redis-cli INFO memory

# Find large keys
redis-cli --bigkeys

# Check key count
redis-cli DBSIZE

# Sample keys by memory usage
redis-cli --memkeys

Solutions:

  1. Set maxmemory limit:

    redis-cli CONFIG SET maxmemory 2gb
    redis-cli CONFIG SET maxmemory-policy allkeys-lru
  2. Clean up old keys:

    # Find keys with no TTL (use --scan rather than KEYS to avoid blocking)
    redis-cli --scan | while read key; do
        ttl=$(redis-cli TTL "$key")
        if [ "$ttl" = "-1" ]; then
            echo "$key has no TTL"
        fi
    done
    
    # Set TTL on keys
    redis-cli EXPIRE "key:name" 3600
  3. Optimize data structures:

    • Use hashes instead of strings for objects
    • Use sets for unique values
    • Use sorted sets for ranked data
  4. Enable RDB compression:

    # Redis applies LZF compression to string values in snapshots
    redis-cli CONFIG SET rdbcompression yes

Slow Redis Operations

Symptoms:

  • Redis commands taking >10ms
  • Increased latency
  • Timeouts

Diagnosis:

# Check slow log
redis-cli SLOWLOG GET 10

# Monitor commands in real-time (high overhead; avoid on busy production instances)
redis-cli MONITOR

# Check latency
redis-cli --latency

# Check for blocking operations
redis-cli INFO commandstats

Solutions:

  1. Avoid blocking commands:

    • Use SCAN instead of KEYS
    • Use non-blocking alternatives
    • Paginate large results
  2. Optimize operations:

    # Bad: KEYS * (blocks Redis)
    redis-cli KEYS "*"
    
    # Good: SCAN (non-blocking)
    redis-cli SCAN 0 MATCH "session:*" COUNT 100
  3. Enable persistence efficiently:

    # Use AOF with fsync every second
    redis-cli CONFIG SET appendonly yes
    redis-cli CONFIG SET appendfsync everysec
    
    # Or use RDB snapshots
    redis-cli CONFIG SET save "900 1 300 10 60 10000"
  4. Use Redis cluster for horizontal scaling

Client-Side Performance

Slow Page Load

Common causes:

  1. Too many HTTP requests
  2. Large JavaScript bundles
  3. Unoptimized images
  4. Blocking scripts
  5. No caching

Solutions:

  1. Optimize JavaScript:

    • Code splitting
    • Lazy loading
    • Minification
    • Tree shaking
  2. Optimize images:

    • Compress images
    • Use responsive images
    • Lazy load off-screen images
    • Use modern formats (WebP)
  3. Reduce HTTP requests:

    • Bundle CSS/JS files
    • Use sprites for icons
    • Inline critical CSS
  4. Enable caching:

    # Set caching headers on HTTP responses; <meta http-equiv="Cache-Control">
    # is ignored by modern browsers
    Cache-Control: max-age=31536000
  5. Use service workers:

    • Cache API responses
    • Offline support
    • Background sync

Performance Best Practices

General Guidelines

  1. Monitor continuously:

    • Set up metrics and alerts
    • Track trends over time
    • Be proactive, not reactive
  2. Optimize early:

    • Profile during development
    • Load test before production
    • Identify bottlenecks early
  3. Cache aggressively:

    • Cache at multiple layers
    • Use appropriate TTLs
    • Invalidate stale cache
  4. Scale horizontally:

    • Use load balancers
    • Deploy multiple instances
    • Use database replicas
  5. Implement rate limiting:

    • Protect against abuse
    • Ensure fair resource usage
    • Prevent cascade failures

Performance Checklist

Server:

  • pprof profiling enabled (development)
  • Structured logging with performance metrics
  • Connection pooling configured
  • Graceful shutdown implemented
  • Resource limits set (memory, CPU)

Database:

  • Indexes on frequently queried columns
  • Query performance analyzed (EXPLAIN)
  • Connection pooling configured
  • Auto-vacuum enabled
  • Slow query logging enabled

Redis:

  • maxmemory limit set
  • Eviction policy configured
  • Persistence configured appropriately
  • Key TTLs set
  • Monitoring enabled

Network:

  • Compression enabled
  • HTTP/2 enabled
  • CDN configured for static assets
  • Connection keep-alive enabled
  • Appropriate timeouts set

Client:

  • Code splitting implemented
  • Images optimized
  • Lazy loading used
  • Caching headers set
  • Service worker configured
