Performance and Scaling

This guide covers performance tuning, scaling strategies, capacity planning, and optimization techniques for TMI deployments.

Overview

TMI performance optimization and scaling involves:

  • Application performance tuning
  • Database optimization and scaling
  • Cache performance optimization
  • Horizontal and vertical scaling strategies
  • Load balancing and high availability
  • Capacity planning and monitoring

Quick Performance Checks

Application Performance

# Check response times
curl -w "@curl-format.txt" -o /dev/null -s https://api.tmi.example.com/version

# curl-format.txt:
time_total:        %{time_total}\n
time_connect:      %{time_connect}\n
time_starttransfer:%{time_starttransfer}\n
size_download:     %{size_download}\n

# Load test with Apache Bench
ab -n 1000 -c 10 https://api.tmi.example.com/version

# WebSocket connection test
wscat -c "wss://api.tmi.example.com/ws/diagrams/{id}" \
  -H "Authorization: Bearer $TOKEN"

Database Performance

-- Check slow queries (requires the pg_stat_statements extension)
-- Note: on PostgreSQL 13+ these columns are named mean_exec_time and total_exec_time
SELECT
  query,
  mean_time,
  calls,
  total_time
FROM pg_stat_statements
WHERE mean_time > 100   -- queries averaging more than 100 ms
ORDER BY mean_time DESC
LIMIT 10;

-- Check database size and growth
SELECT
  pg_database.datname,
  pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;

-- Check connection count
SELECT count(*) FROM pg_stat_activity;

Cache Performance

# Redis hit rate
redis-cli -h redis-host -a password info stats | \
  awk '/keyspace_hits|keyspace_misses/ {
    split($0,a,":");
    if ($1 ~ /hits/) hits=a[2];
    if ($1 ~ /misses/) misses=a[2]
  }
  END {
    total=hits+misses;
    rate=(hits/total)*100;
    printf "Hit Rate: %.2f%%\n", rate
  }'

# Memory usage
redis-cli -h redis-host -a password info memory | grep used_memory_human

Application Performance Tuning

Server Configuration

Timeout Settings

Optimize HTTP timeouts for your workload:

# config-production.yml
server:
  read_timeout: 5s       # Time to read request
  write_timeout: 10s     # Time to write response
  idle_timeout: 60s      # Idle connection timeout

For high-latency clients or large payloads:

server:
  read_timeout: 15s
  write_timeout: 30s
  idle_timeout: 120s

Via environment:

SERVER_READ_TIMEOUT=15s
SERVER_WRITE_TIMEOUT=30s
SERVER_IDLE_TIMEOUT=120s
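
Under the hood these settings correspond to Go's net/http server timeouts. A minimal sketch, assuming TMI uses the standard net/http server (the address and values are illustrative):

// Sketch: how the timeout settings above map onto Go's net/http server.
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":8080",
		ReadTimeout:  15 * time.Second,  // server.read_timeout
		WriteTimeout: 30 * time.Second,  // server.write_timeout
		IdleTimeout:  120 * time.Second, // server.idle_timeout
	}
	_ = srv.ListenAndServe()
}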

WebSocket Configuration

# WebSocket inactivity timeout
WEBSOCKET_INACTIVITY_TIMEOUT_SECONDS=300  # 5 minutes

# For high-activity collaboration
WEBSOCKET_INACTIVITY_TIMEOUT_SECONDS=600  # 10 minutes

Resource Limits

Go Runtime Tuning

# Set maximum Go processes (default: number of CPU cores)
GOMAXPROCS=8

# Garbage collection tuning
GOGC=100  # Default - adjust based on memory patterns

# For memory-constrained environments
GOGC=80   # More frequent GC, lower memory usage

# For CPU-constrained environments
GOGC=200  # Less frequent GC, higher memory usage

System Resource Limits

For systemd service:

# /etc/systemd/system/tmi.service
[Service]
# Maximum processes
LimitNPROC=512

# Maximum open files
LimitNOFILE=65536

# Memory limit (MemoryMax= is the current directive; MemoryLimit= is the legacy cgroup v1 name)
MemoryMax=1G

# CPU limit (100% of one core)
CPUQuota=100%

For Docker:

docker run -d \
  --name tmi-server \
  --memory="1g" \
  --cpus="2.0" \
  --ulimit nofile=65536:65536 \
  tmi/tmi-server:latest

For Kubernetes:

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "2000m"

Logging Configuration

Optimize logging for performance:

logging:
  level: "info"                    # Use 'warn' or 'error' for production
  log_api_requests: false          # Disable in high-traffic production
  log_api_responses: false         # Disable to reduce I/O
  log_websocket_messages: false    # Disable for performance
  redact_auth_tokens: true         # Security
  suppress_unauth_logs: true       # Reduce noise

For high-performance production:

LOGGING_LEVEL=warn
LOGGING_LOG_API_REQUESTS=false
LOGGING_LOG_API_RESPONSES=false
LOGGING_LOG_WEBSOCKET_MESSAGES=false

Database Performance Tuning

PostgreSQL Configuration

Connection Pool Optimization

Configure connection pooling:

database:
  postgres:
    max_open_conns: 25      # Max concurrent connections
    max_idle_conns: 5       # Idle connections to maintain
    conn_max_lifetime: 5m   # Connection lifetime

Sizing guidelines:

  • Small deployment (< 100 users): max_open_conns: 10, max_idle_conns: 2
  • Medium deployment (100-1000 users): max_open_conns: 25, max_idle_conns: 5
  • Large deployment (1000+ users): max_open_conns: 50, max_idle_conns: 10
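
For reference, these settings map onto Go's standard database/sql pool. A minimal sketch, assuming TMI uses database/sql (the package name and driver import are illustrative, not TMI's actual code):

// Hypothetical sketch: applying the pool settings above via database/sql.
package dbpool

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // illustrative driver choice
)

// OpenPool opens a PostgreSQL connection pool sized for a medium deployment.
func OpenPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(25)                 // max_open_conns
	db.SetMaxIdleConns(5)                  // max_idle_conns
	db.SetConnMaxLifetime(5 * time.Minute) // conn_max_lifetime
	return db, nil
}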

PostgreSQL Server Settings

Edit /etc/postgresql/*/main/postgresql.conf:

# Memory Settings
shared_buffers = 256MB           # 25% of RAM (for dedicated server)
effective_cache_size = 1GB       # 50-75% of RAM
work_mem = 16MB                  # Per-operation memory
maintenance_work_mem = 64MB      # For VACUUM, CREATE INDEX

# Connection Settings
max_connections = 100            # Adjust based on connection pool

# Query Planner
random_page_cost = 1.1           # For SSD (default 4.0 for HDD)
effective_io_concurrency = 200   # For SSD (default 1)

# Write Performance
wal_buffers = 16MB
checkpoint_completion_target = 0.9

For production with 4GB RAM:

shared_buffers = 1GB
effective_cache_size = 3GB
work_mem = 32MB
maintenance_work_mem = 256MB

Restart PostgreSQL after changes:

sudo systemctl restart postgresql

Index Optimization

Check for missing indexes:

-- Tables with high sequential scan counts
SELECT
  schemaname,
  tablename,
  seq_scan,
  idx_scan,
  seq_tup_read,
  CASE
    WHEN seq_scan > 0 THEN seq_tup_read / seq_scan
    ELSE 0
  END AS avg_seq_tup_per_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND schemaname = 'public'
ORDER BY seq_tup_read DESC
LIMIT 20;

TMI's key indexes (already created by migrations):

-- Primary key indexes (automatic)
-- Foreign key indexes
CREATE INDEX idx_threats_threat_model_id ON threats(threat_model_id);
CREATE INDEX idx_diagrams_threat_model_id ON diagrams(threat_model_id);

-- Query optimization indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_threats_threat_model_id_created_at ON threats(threat_model_id, created_at);

Check index usage:

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

-- Find unused indexes
SELECT
  schemaname,
  tablename,
  indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexname NOT LIKE '%_pkey'
  AND schemaname = 'public';

Query Optimization

Analyze slow queries:

-- Enable query timing
\timing on

-- Example query analysis
EXPLAIN ANALYZE
SELECT * FROM threats
WHERE threat_model_id = 'uuid-here'
ORDER BY created_at DESC
LIMIT 50;

Optimize query patterns:

-- Use LIMIT for large result sets
SELECT * FROM threats LIMIT 50;

-- Use appropriate indexes
-- Good: Uses index
SELECT * FROM threats WHERE threat_model_id = 'uuid';

-- Bad: leading-wildcard search cannot use a B-tree index (full table scan)
SELECT * FROM threats WHERE lower(title) LIKE '%search%';

-- Better: a functional index covers equality and prefix matches on lower(title)
CREATE INDEX idx_threats_title_lower ON threats(lower(title));
-- For '%substring%' searches, consider a pg_trgm GIN index or full-text search

Vacuum and Analyze

Regular maintenance:

# Manual vacuum and analyze
psql -h postgres-host -U tmi_user -d tmi -c "VACUUM ANALYZE;"

# Check last vacuum/analyze
psql -h postgres-host -U tmi_user -d tmi -c "
  SELECT
    schemaname,
    tablename,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    n_dead_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC"

Configure autovacuum in postgresql.conf:

autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50

PostgreSQL Scaling

Read Replicas

For read-heavy workloads, add read replicas:

# Configure read replica
database:
  postgres:
    primary:
      host: "postgres-primary"
      port: 5432
    replicas:
      - host: "postgres-replica-1"
        port: 5432
      - host: "postgres-replica-2"
        port: 5432

Replication setup:

# On primary server (postgresql.conf)
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# Create replication user
psql -U postgres -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'password';"

# On replica server
# Use pg_basebackup to initialize replica
pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P -v

Connection Pooling (PgBouncer)

For high-connection environments:

# Install PgBouncer
sudo apt-get install pgbouncer

# Configure /etc/pgbouncer/pgbouncer.ini
[databases]
tmi = host=postgres-host port=5432 dbname=tmi

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

# Start PgBouncer
systemctl start pgbouncer

# Update TMI to use PgBouncer
POSTGRES_HOST=localhost
POSTGRES_PORT=6432

Redis Performance Tuning

Memory Configuration

# Edit /etc/redis/redis.conf

# Set memory limit
maxmemory 1gb

# Eviction policy
maxmemory-policy allkeys-lru  # Evict least recently used keys
# Or: volatile-lru (only evict keys with TTL)

# Memory optimization
hash-max-ziplist-entries 512
hash-max-ziplist-value 64

Persistence Configuration

Balance performance vs durability:

# For performance (may lose data on crash)
appendonly no
save ""

# Balanced (recommended)
appendonly yes
appendfsync everysec
save 900 1
save 300 10

# For durability (slower writes)
appendonly yes
appendfsync always

Redis Optimization

# Disable slow commands
rename-command KEYS ""
rename-command FLUSHALL ""

# TCP backlog
tcp-backlog 511

# TCP keepalive
tcp-keepalive 300

# Lazy freeing
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes

Cache TTL Strategy

TMI's cache TTL configuration:

Cache Type      TTL          Justification
Threat Models   10 minutes   Core entities, moderate updates
Diagrams        2 minutes    High collaboration, real-time
Sub-resources   5 minutes    Threats, documents, sources
Authorization   15 minutes   Security-critical, infrequent changes
Metadata        7 minutes    Flexible data, moderate updates
Lists           5 minutes    Paginated results

Adjust based on your usage patterns:

// For high-collaboration environments (reduce TTL)
cache.Set("threat_model:"+id, data, 5*time.Minute)

// For read-heavy environments (increase TTL)
cache.Set("threat_model:"+id, data, 15*time.Minute)

Scaling Strategies

Horizontal Scaling

Load Balancing

Nginx load balancer:

# /etc/nginx/conf.d/tmi-upstream.conf
upstream tmi_backend {
    least_conn;  # Or: ip_hash for sticky sessions
    server tmi-server-1:8080 max_fails=3 fail_timeout=30s;
    server tmi-server-2:8080 max_fails=3 fail_timeout=30s;
    server tmi-server-3:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 443 ssl http2;
    server_name api.tmi.example.com;

    location / {
        proxy_pass http://tmi_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

HAProxy load balancer:

# /etc/haproxy/haproxy.cfg
frontend tmi_front
    bind *:443 ssl crt /etc/ssl/certs/tmi.pem
    default_backend tmi_back

backend tmi_back
    balance leastconn
    option httpchk GET /version
    http-check expect status 200
    server tmi1 tmi-server-1:8080 check
    server tmi2 tmi-server-2:8080 check
    server tmi3 tmi-server-3:8080 check

Docker Compose Scaling

# Scale to 3 instances
docker-compose up -d --scale tmi-server=3

# With explicit configuration
docker-compose -f docker-compose.yml -f docker-compose.scale.yml up -d

# docker-compose.scale.yml
version: "3.8"
services:
  tmi-server:
    deploy:
      replicas: 3

Kubernetes Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tmi-server-hpa
  namespace: tmi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tmi-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Vertical Scaling

Server Resources

Increase Docker container resources:

docker update tmi-server --memory="2g" --cpus="4.0"

Kubernetes resource increase:

resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "2Gi"
    cpu: "4000m"

Heroku dyno scaling:

# Scale to larger dyno type
heroku ps:resize web=standard-2x --app tmi-server

# Or Performance tier
heroku ps:resize web=performance-m --app tmi-server

Database Scaling

PostgreSQL vertical scaling:

-- Increase shared_buffers (requires a full restart; pg_reload_conf() is not sufficient)
ALTER SYSTEM SET shared_buffers = '2GB';
-- Then: sudo systemctl restart postgresql

-- Increase work_mem (a configuration reload is sufficient)
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();

Redis vertical scaling:

# Increase memory limit
redis-cli CONFIG SET maxmemory 2gb

# Make permanent in redis.conf
echo "maxmemory 2gb" >> /etc/redis/redis.conf

Geographic Distribution

For global deployments:

┌──────────────────────┐
│ Global Load Balancer │
└──────────┬───────────┘
           │
    ┌──────┴──────┐
    │             │
┌───▼────┐    ┌───▼────┐
│ US-East│    │ EU-West│
│ Region │    │ Region │
└────────┘    └────────┘
    │             │
 TMI+DB+Cache  TMI+DB+Cache

Consider:

  • Regional deployments
  • Database replication across regions
  • CDN for static assets
  • DNS-based routing

Capacity Planning

Resource Monitoring

Track key metrics for capacity planning:

-- Database growth rate
SELECT
  date_trunc('month', created_at) AS month,
  count(*) AS records
FROM threat_models
GROUP BY month
ORDER BY month;

-- User growth
SELECT
  date_trunc('week', created_at) AS week,
  count(*) AS new_users
FROM users
GROUP BY week
ORDER BY week;

Capacity Thresholds

Set alerts for capacity thresholds:

  • CPU: Alert at 70%, critical at 85%
  • Memory: Alert at 75%, critical at 90%
  • Disk: Alert at 75%, critical at 90%
  • Database connections: Alert at 70% of max
  • Redis memory: Alert at 80%, critical at 95%

Growth Projections

Calculate growth rates:

# Database size growth
# Current size: 5GB
# Growth: 100MB/month
# Projected size in 12 months: 5GB + (100MB * 12) = 6.2GB

# User growth
# Current: 100 users
# Growth: 20% month-over-month
# Projected in 12 months: 100 * (1.2^12) = ~900 users
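
The same arithmetic as a tiny Go program, handy for sanity-checking projections (the figures are the example values above, not measurements):

// Back-of-the-envelope growth projection matching the example above.
package main

import (
	"fmt"
	"math"
)

func main() {
	// Linear storage growth: current size plus monthly growth over 12 months.
	currentGB, monthlyGrowthGB, months := 5.0, 0.1, 12.0
	fmt.Printf("Projected DB size: %.1f GB\n", currentGB+monthlyGrowthGB*months) // 6.2 GB

	// Compound user growth: 20% month-over-month.
	currentUsers, monthlyRate := 100.0, 0.20
	fmt.Printf("Projected users: %.0f\n", currentUsers*math.Pow(1+monthlyRate, months)) // ~892
}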

Capacity Planning Checklist

  • Monitor resource utilization trends
  • Project growth rates (users, data, traffic)
  • Calculate resource needs for 6-12 months
  • Plan scaling activities before reaching thresholds
  • Budget for infrastructure growth
  • Test scaling procedures in staging
  • Document capacity baselines

Performance Benchmarking

Application Benchmarks

# HTTP endpoint benchmarking with Apache Bench
ab -n 10000 -c 100 -H "Authorization: Bearer $TOKEN" \
  https://api.tmi.example.com/api/threat-models

# WebSocket benchmarking
# Install: npm install -g websocket-bench
wsbench -c 100 -n 1000 wss://api.tmi.example.com/ws/diagrams/{id} \
  -H "Authorization: Bearer $TOKEN"

# Full load testing with k6
k6 run load-test.js

Example k6 script (load-test.js):

import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
};

export default function() {
  let response = http.get('https://api.tmi.example.com/api/threat-models', {
    headers: { 'Authorization': `Bearer ${__ENV.TOKEN}` },
  });

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);
}

Run benchmark:

TOKEN=$YOUR_TOKEN k6 run load-test.js

Database Benchmarks

# PostgreSQL benchmarking with pgbench
createdb pgbench_test
pgbench -i -s 10 pgbench_test  # Initialize
pgbench -c 10 -j 2 -t 1000 pgbench_test  # Run benchmark

# Results show:
# - Transactions per second (TPS)
# - Average latency
# - Connection overhead

Performance Monitoring Dashboards

Key Performance Indicators (KPIs)

Application KPIs:

  • Request throughput (requests/second)
  • Response time percentiles (P50, P95, P99)
  • Error rate (percentage of 5xx responses)
  • WebSocket connection count
  • Active user sessions

Database KPIs:

  • Query response time
  • Connection count
  • Cache hit ratio
  • Replication lag
  • Table sizes

Infrastructure KPIs:

  • CPU utilization
  • Memory utilization
  • Disk I/O
  • Network throughput
  • Container restarts

Grafana Dashboard Examples

Create dashboards tracking:

System Overview:

  • Service uptime (%)
  • Request rate (req/s)
  • Error rate (%)
  • Active users
  • Response time (P95)

Database Performance:

  • Query duration (ms)
  • Connection count
  • Slow queries
  • Cache hit rate
  • Database size

Resource Utilization:

  • CPU usage (%)
  • Memory usage (%)
  • Disk usage (%)
  • Network I/O (MB/s)

Troubleshooting Performance Issues

High Response Times

Check:

  1. Database query performance
  2. Cache hit rates
  3. Network latency
  4. Application logs for errors
  5. Resource utilization (CPU, memory)

Solutions:

  • Optimize slow queries
  • Add missing indexes
  • Increase cache TTL
  • Scale horizontally
  • Optimize code

High CPU Usage

Check:

# Process CPU usage
top -p $(pgrep tmi-server)

# System CPU by process
ps aux --sort=-%cpu | head

Solutions:

  • Profile the application with Go pprof (see the sketch below)
  • Optimize hot code paths
  • Reduce logging
  • Scale horizontally
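
If the pprof endpoints are not already exposed (the heap example in the next subsection suggests they are, on port 8080), the standard way a Go service enables them is shown below. The separate internal port is an assumption for illustration, not TMI's actual configuration:

// Minimal sketch: expose Go pprof handlers on an internal-only port.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Bind to localhost so profiling data is never reachable publicly.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

Once exposed, a 30-second CPU profile can be collected and inspected with: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30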

Memory Leaks

Check:

# Memory usage over time
docker stats tmi-server --no-stream

# Go heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

Solutions:

  • Analyze heap dump
  • Fix memory leaks in code
  • Increase garbage collection frequency
  • Restart services periodically

Database Connection Exhaustion

Check:

SELECT count(*) FROM pg_stat_activity;

Solutions:

  • Increase connection pool size
  • Use connection pooler (PgBouncer)
  • Fix connection leaks in the application (see the sketch below)
  • Optimize query execution time
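
On the application side, Go's database/sql exposes pool statistics that make connection leaks visible. A minimal sketch, assuming TMI uses database/sql (package and function names are illustrative):

// Hypothetical sketch: periodically log connection pool statistics.
package monitoring

import (
	"database/sql"
	"log"
	"time"
)

// LogPoolStats prints pool usage every 30 seconds; a steadily growing InUse
// count under a stable workload usually indicates a connection leak.
func LogPoolStats(db *sql.DB) {
	for range time.Tick(30 * time.Second) {
		s := db.Stats()
		log.Printf("pool: open=%d in_use=%d idle=%d wait_count=%d wait_total=%s",
			s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
	}
}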
