Performance Troubleshooting
This guide helps identify and resolve performance bottlenecks in TMI deployments, covering server, database, network, and client-side performance issues.
Key metrics to monitor:
- Response Time: API endpoint latency
- Throughput: Requests per second
- Error Rate: Failed requests percentage
- Resource Usage: CPU, memory, disk, network
- Database Performance: Query times, connection pool
- Cache Hit Rate: Redis effectiveness
# Check server resource usage
top -b -n 1 | grep tmiserver
# Check response time
time curl http://localhost:8080/api/v1/threat-models
# Check database connections
psql -d tmi -c "SELECT count(*) FROM pg_stat_activity WHERE datname='tmi'"
# Check Redis memory
redis-cli INFO memory | grep used_memory_human
# Check disk usage
df -h
# Check network connections
netstat -an | grep ESTABLISHED | wc -l
Enable pprof endpoints (development only):
# CPU profile
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof
# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof mem.prof
# Goroutine profile (check for leaks)
curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
go tool pprof goroutine.prof
# Blocking profile
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof block.prof
Analyze profiles:
# Interactive analysis
go tool pprof -http=:8081 cpu.prof
# Top functions by CPU
go tool pprof -top cpu.prof
# Call graph
go tool pprof -pdf cpu.prof > cpu_profile.pdf
High CPU Usage
Symptoms:
- CPU consistently at 80-100%
- Slow response times
- Request timeouts
Diagnosis:
# Check CPU usage by process
top -o %CPU
# Profile CPU usage
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof -top cpu.prof
# Check for CPU-intensive queries
grep '"duration":[0-9][0-9][0-9][0-9]' logs/tmi.log | head -20
Common causes and solutions:
- Inefficient algorithms:
  - Profile hot code paths
  - Optimize loops and data structures
  - Use concurrent processing where appropriate
- Too many goroutines:
  # Check goroutine count
  curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof
  go tool pprof -top goroutine.prof
- JSON serialization overhead:
  - Cache serialized responses
  - Use streaming for large responses
  - Reduce response payload size
Solutions:
# Increase CPU resources
# - Add more CPU cores
# - Use faster CPU
# Optimize code
# - Add caching for expensive operations
# - Use connection pooling
# - Implement rate limiting
# Scale horizontally
# - Deploy multiple instances behind load balancer
High Memory Usage
Symptoms:
- Memory usage grows over time
- Out-of-memory errors
- Frequent garbage collection
Diagnosis:
# Check memory usage
free -h
ps aux --sort=-%mem | head -10
# Memory profile
curl http://localhost:8080/debug/pprof/heap > mem.prof
go tool pprof -top mem.prof
# Check for memory leaks
curl http://localhost:8080/debug/pprof/heap > mem1.prof
# Wait 10 minutes
curl http://localhost:8080/debug/pprof/heap > mem2.prof
go tool pprof -base=mem1.prof mem2.prof
Common causes:
- Memory leaks:
  - Unclosed database connections
  - Goroutine leaks
  - Large objects not garbage collected
  - WebSocket connections not cleaned up
- Large responses:
  - Returning too much data in API responses
  - Not paginating results
  - Loading entire datasets into memory
- Caching too much data:
  - Redis consuming excessive memory
  - In-memory caches too large
Solutions:
- Fix connection leaks:
  // Always close database result sets
  defer rows.Close()
  // Use a context with timeout
  ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
  defer cancel()
- Implement pagination:
  # Use limit and offset
  GET /api/v1/threat-models?limit=50&offset=0
- Optimize caching:
  # Set Redis maxmemory and eviction policy
  redis-cli CONFIG SET maxmemory 2gb
  redis-cli CONFIG SET maxmemory-policy allkeys-lru
  # Monitor cache size
  redis-cli INFO memory
- Increase server memory:
  # Allocate more RAM, or adjust container memory limits if using Docker
  docker run -m 4g tmi-server
Slow API Responses
Symptoms:
- API requests take more than 1 second
- User interface feels sluggish
- Timeouts
Diagnosis:
# Measure endpoint response times
time curl http://localhost:8080/api/v1/threat-models
# Find slow requests in logs
grep '"duration":[5-9][0-9][0-9]' logs/tmi.log | head -20
# Profile request handling
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
Common causes:
- Slow database queries (see Database Performance section)
- Network latency (see Network Performance section)
- Synchronous processing of slow operations
- No caching of frequently accessed data
Solutions:
- Add caching:
  // Cache frequently accessed data in Redis
  // Set an appropriate TTL (time-to-live)
- Optimize queries:
  - Add database indexes
  - Reduce joins
  - Paginate results
  - Use query caching
- Async processing:
  // Process slow operations asynchronously
  go processSlowOperation()
  // Return immediately to the client
- Add a CDN:
  - Cache static assets
  - Reduce server load
  - Improve client load times
Slow Database Queries
Symptoms:
- API endpoints slow
- Database CPU high
- Query timeouts
Diagnosis:
-- Find currently running slow queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 second'
ORDER BY duration DESC;
-- Query statistics (requires pg_stat_statements extension)
SELECT query, calls, total_time, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;
-- Analyze specific query
EXPLAIN ANALYZE
SELECT * FROM threat_models WHERE owner = 'user123';
Common causes:
- Missing indexes:
  -- Find tables with high sequential scans
  SELECT schemaname, tablename, seq_scan, idx_scan,
         seq_scan - idx_scan AS too_much_seq
  FROM pg_stat_user_tables
  WHERE seq_scan - idx_scan > 0
  ORDER BY too_much_seq DESC;
- Inefficient queries:
  - Too many JOINs
  - N+1 query problem
  - Fetching unnecessary columns
  - Not using WHERE clauses effectively
- Large result sets:
  - Missing LIMIT clauses
  - Fetching all rows instead of paginating
Solutions:
- Add indexes:
  -- Create indexes on frequently queried columns
  CREATE INDEX idx_threat_models_owner ON threat_models(owner);
  CREATE INDEX idx_threat_models_created_at ON threat_models(created_at);
  -- Composite index for common query patterns
  CREATE INDEX idx_threat_models_owner_created ON threat_models(owner, created_at);
  -- Verify the index is used
  EXPLAIN SELECT * FROM threat_models WHERE owner = 'user123';
- Optimize queries:
  -- Bad: fetching all columns
  SELECT * FROM threat_models;
  -- Good: fetch only needed columns
  SELECT id, name, owner FROM threat_models;
  -- Bad: N+1 queries (multiple queries in application code)
  -- Good: use JOINs
  SELECT tm.*, u.email FROM threat_models tm JOIN users u ON u.id = tm.owner;
- Implement pagination:
  -- Add LIMIT and OFFSET
  SELECT id, name, owner FROM threat_models
  ORDER BY created_at DESC
  LIMIT 50 OFFSET 0;
- Use connection pooling:
  # Configure connection pool size
  export DB_MAX_CONNECTIONS=25
  export DB_MAX_IDLE_CONNECTIONS=5
High Database CPU
Symptoms:
- Database server CPU at 80-100%
- Queries queueing
- Connection timeouts
Diagnosis:
-- Find expensive queries
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- Check for long-running queries
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;
Solutions:
- Optimize queries (see Slow Queries section)
- Add read replicas:
  - Offload read queries to replicas
  - Keep writes on the primary
  - Use connection pooling (PgBouncer)
- Upgrade database hardware:
  - More CPU cores
  - Faster storage (SSD)
  - More RAM for caching
- Tune PostgreSQL:
  -- Increase shared buffers (roughly 25% of RAM)
  ALTER SYSTEM SET shared_buffers = '4GB';
  -- Increase work memory for sorts
  ALTER SYSTEM SET work_mem = '50MB';
  -- Increase maintenance work memory
  ALTER SYSTEM SET maintenance_work_mem = '512MB';
  -- Reload configuration (note: shared_buffers only takes effect after a restart)
  SELECT pg_reload_conf();
Connection Pool Exhaustion
Symptoms:
- "Too many connections" errors
- Connection timeouts
- Slow execution even for otherwise fast queries
Diagnosis:
-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'tmi'
GROUP BY state;
-- Check max connections limit
SHOW max_connections;
-- Check connection pool usage
SELECT count(*) as used_connections,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections
FROM pg_stat_activity
WHERE datname = 'tmi';
Solutions:
- Increase connection pool size (carefully):
  # Application side
  export DB_MAX_CONNECTIONS=50
  # Database side: edit postgresql.conf
  max_connections = 200
- Fix connection leaks:
  // Always close connections and result sets
  defer db.Close()
  defer rows.Close()
  // Use a context with timeout
  ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
  defer cancel()
- Use a connection pooler (PgBouncer):
  # Install PgBouncer
  apt-get install pgbouncer
  # Configure pooling in /etc/pgbouncer/pgbouncer.ini
  [databases]
  tmi = host=localhost port=5432 dbname=tmi
  [pgbouncer]
  pool_mode = transaction
  max_client_conn = 1000
  default_pool_size = 25
- Reduce connection lifetime:
  export DB_MAX_CONNECTION_LIFETIME=5m
Disk I/O Bottlenecks
Symptoms:
- Slow query performance
- High disk utilization
- Increased query latency
Diagnosis:
# Check disk I/O
iostat -x 1
# Check database disk usage
du -sh /var/lib/postgresql/data
# Find large tables
psql -d tmi -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
Solutions:
- Use SSD storage:
  - Migrate to SSD volumes
  - Significantly faster than HDD
- Optimize tables:
  -- Vacuum to reclaim space
  VACUUM ANALYZE threat_models;
  -- Reindex to rebuild indexes
  REINDEX TABLE threat_models;
  -- Ensure auto-vacuum is enabled
  ALTER TABLE threat_models SET (autovacuum_enabled = on);
- Archive old data:
  -- Move old records to an archive table
  CREATE TABLE threat_models_archive AS
  SELECT * FROM threat_models WHERE created_at < NOW() - INTERVAL '1 year';
  DELETE FROM threat_models WHERE created_at < NOW() - INTERVAL '1 year';
- Tune PostgreSQL I/O settings:
  -- Spread checkpoint writes and tune I/O concurrency
  ALTER SYSTEM SET checkpoint_completion_target = 0.9;
  ALTER SYSTEM SET wal_buffers = '16MB';
  ALTER SYSTEM SET effective_io_concurrency = 200;
High Network Latency
Symptoms:
- Slow page loads
- API timeouts
- Poor user experience
Diagnosis:
# Measure latency
ping api.example.com
# Trace route
traceroute api.example.com
# Test HTTP latency
time curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/health
# curl-format.txt:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
Common causes:
- Geographic distance between client and server
- Network congestion or poor routing
- Slow DNS resolution
- Slow TLS handshakes
Solutions:
- Use a CDN:
  - Cache static assets closer to users
  - Reduce latency for static content
  - Options: Cloudflare, Fastly, AWS CloudFront
- Enable HTTP/2:
  - Multiplexing reduces latency
  - Header compression
  - Server push
- Optimize DNS:
  - Use a fast DNS provider
  - Enable DNS caching
  - Reduce DNS lookups
- Enable compression:
  # Enable gzip compression
  export COMPRESSION_ENABLED=true
- Use connection pooling:
  - Reuse connections
  - Reduce TLS handshake overhead
High Bandwidth Usage
Symptoms:
- Slow downloads
- Timeouts on large responses
- High network utilization
Diagnosis:
# Monitor network usage
iftop
# Check bandwidth
speedtest-cli
# Monitor specific process
nethogs
# Check API response sizes
curl -w "Size: %{size_download} bytes\n" -o /dev/null -s http://api.example.com/api/v1/threat-models
Solutions:
- Enable compression:
  - gzip/brotli compression
  - Can reduce payload size by 60-80% for text content
- Optimize response payloads:
  # Return only necessary fields
  GET /api/v1/threat-models?fields=id,name,owner
  # Use pagination
  GET /api/v1/threat-models?limit=20&offset=0
- Implement caching:
  - Browser caching (Cache-Control headers)
  - Proxy caching
  - CDN caching
- Optimize images/assets:
  - Compress images
  - Use appropriate formats (WebP, SVG)
  - Lazy-load images
- Upgrade bandwidth:
  - Increase server bandwidth
  - Use a better network tier
Redis High Memory Usage
Symptoms:
- Redis using excessive RAM
- Memory warnings
- Evictions occurring
Diagnosis:
# Check memory usage
redis-cli INFO memory
# Find large keys
redis-cli --bigkeys
# Check key count
redis-cli DBSIZE
# Sample keys
redis-cli --scan | head -20
Solutions:
- Set a maxmemory limit:
  redis-cli CONFIG SET maxmemory 2gb
  redis-cli CONFIG SET maxmemory-policy allkeys-lru
- Clean up old keys:
  # Find keys with no TTL (use SCAN rather than KEYS on production instances)
  redis-cli --scan | while read key; do
    ttl=$(redis-cli TTL "$key")
    if [ "$ttl" = "-1" ]; then
      echo "$key has no TTL"
    fi
  done
  # Set TTL on keys
  redis-cli EXPIRE "key:name" 3600
- Optimize data structures:
  - Use hashes instead of strings for objects
  - Use sets for unique values
  - Use sorted sets for ranked data
- Enable RDB compression:
  # Redis compresses string values in RDB snapshots when enabled
  redis-cli CONFIG SET rdbcompression yes
Slow Redis Commands
Symptoms:
- Redis commands taking more than 10ms
- Increased latency
- Timeouts
Diagnosis:
# Check slow log
redis-cli SLOWLOG GET 10
# Monitor commands in real-time
redis-cli MONITOR
# Check latency
redis-cli --latency
# Check for blocking operations
redis-cli INFO commandstats
Solutions:
- Avoid blocking commands:
  - Use SCAN instead of KEYS
  - Use non-blocking alternatives
  - Paginate large results
- Optimize operations:
  # Bad: KEYS * blocks Redis while it scans the whole keyspace
  redis-cli KEYS "*"
  # Good: SCAN iterates incrementally without blocking
  redis-cli SCAN 0 MATCH "session:*" COUNT 100
- Configure persistence efficiently:
  # Use AOF with fsync every second
  redis-cli CONFIG SET appendonly yes
  redis-cli CONFIG SET appendfsync everysec
  # Or use RDB snapshots
  redis-cli CONFIG SET save "900 1 300 10 60 10000"
- Use Redis Cluster for horizontal scaling
Slow Client-Side Performance
Common causes:
- Too many HTTP requests
- Large JavaScript bundles
- Unoptimized images
- Blocking scripts
- No caching
- No caching
Solutions:
- Optimize JavaScript:
  - Code splitting
  - Lazy loading
  - Minification
  - Tree shaking
- Optimize images:
  - Compress images
  - Use responsive images
  - Lazy-load off-screen images
  - Use modern formats (WebP)
- Reduce HTTP requests:
  - Bundle CSS/JS files
  - Use sprites for icons
  - Inline critical CSS
- Enable caching:
  # Set Cache-Control response headers on the server (browsers ignore
  # http-equiv="Cache-Control" meta tags for caching purposes)
  Cache-Control: max-age=31536000
- Use service workers:
  - Cache API responses
  - Offline support
  - Background sync
Best Practices
- Monitor continuously:
  - Set up metrics and alerts
  - Track trends over time
  - Be proactive, not reactive
- Optimize early:
  - Profile during development
  - Load test before production
  - Identify bottlenecks early
- Cache aggressively:
  - Cache at multiple layers
  - Use appropriate TTLs
  - Invalidate stale cache entries
- Scale horizontally:
  - Use load balancers
  - Deploy multiple instances
  - Use database replicas
- Implement rate limiting:
  - Protect against abuse
  - Ensure fair resource usage
  - Prevent cascade failures
Performance Checklist
Server:
- pprof profiling enabled (development)
- Structured logging with performance metrics
- Connection pooling configured
- Graceful shutdown implemented
- Resource limits set (memory, CPU)
Database:
- Indexes on frequently queried columns
- Query performance analyzed (EXPLAIN)
- Connection pooling configured
- Auto-vacuum enabled
- Slow query logging enabled
Redis:
- maxmemory limit set
- Eviction policy configured
- Persistence configured appropriately
- Key TTLs set
- Monitoring enabled
Network:
- Compression enabled
- HTTP/2 enabled
- CDN configured for static assets
- Connection keep-alive enabled
- Appropriate timeouts set
Client:
- Code splitting implemented
- Images optimized
- Lazy loading used
- Caching headers set
- Service worker configured
Related Pages
- Debugging-Guide - Debugging procedures
- Common-Issues - Frequent problems and solutions
- Monitoring-and-Health - System monitoring
- Performance-and-Scaling - Scaling strategies
- Database-Operations - Database optimization