Last Updated: 2025-12-15
Environment: Production (pulse.rectorspace.com)
VPS: 176.222.53.185 (pnodepulse user)
## Contents

- Quick Reference
- Deployment
- Rollback Procedures
- Database Operations
- Troubleshooting
- Monitoring & Health
- Emergency Procedures
- Maintenance Tasks
## Quick Reference

### Essential Commands

```bash
# SSH to VPS
ssh pnodepulse

# Check service status
docker compose ps

# View logs
docker compose logs -f blue --tail 100

# Health check
curl http://localhost:7000/api/health

# Restart services
docker compose restart blue

# Database backup
./scripts/backup-db.sh

# Database restore
./scripts/restore-db.sh /backups/pnode-pulse/pnode-pulse_YYYYMMDD_HHMMSS.dump
```

### Service Ports

| Service | Port | URL |
|---|---|---|
| Blue (Production) | 7000 | http://pulse.rectorspace.com |
| Green (Production) | 7001 | (inactive, for blue/green) |
| Staging | 7002 | http://staging.pulse.rectorspace.com |
| PostgreSQL | 5434 | localhost only |
| Redis | 6381 | localhost only |
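The HTTP services in the table can be smoke-tested in one pass from the VPS. A minimal sketch; the `/api/health` path comes from the commands above, the port list from the table, and the `ports_check` helper name is an assumption:

```bash
# Probe each HTTP service port and report up/down
ports_check() {
  local port
  for port in 7000 7001 7002; do
    if curl -fsS --max-time 3 "http://localhost:$port/api/health" > /dev/null 2>&1; then
      echo "port $port: up"
    else
      echo "port $port: down"
    fi
  done
}

ports_check
```

Note that green (7001) will normally report down while it is the inactive color.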
### Emergency Contacts

- On-Call: [Set up PagerDuty/OpsGenie]
- Team Slack: #pnode-pulse-ops
- Incident Log: GitHub Issues (label: incident)
## Deployment

### Automated Deployment (Recommended)

Deployment is fully automated via GitHub Actions on merge to the `main` branch.

Process:

- Create PR: from `dev` or a feature branch to `main`
- CI Checks: Wait for lint, typecheck, and build to pass
- Code Review: Get approval from a team member
- Merge PR: Triggers the GitHub Actions workflow
- Monitor: Watch the deployment in the Actions tab
- Verify: Check the health endpoint and monitor logs
Deployment Steps (automated):

- Build Docker image from the `main` branch
- Push image to GHCR with the `:latest` tag
- SSH to VPS
- Pull new image
- Execute blue/green deployment (zero downtime)
- Run health checks
- Notify in Slack (if configured)

Timeline: ~10-15 minutes from merge to live
### Manual Deployment

Use manual deployment when:

- Automated deployment fails
- An emergency hotfix is needed
- Testing the deployment process
```bash
# SSH to VPS
ssh pnodepulse

# Navigate to project directory
cd ~/pnode-pulse

# Pull latest images
docker compose pull blue

# Deploy blue (production)
docker compose up -d blue

# Wait 10 seconds for health check
sleep 10

# Verify health
curl -f http://localhost:7000/api/health || echo "⚠️ Health check failed"

# Check logs
docker compose logs -f blue --tail 50
```

### Blue/Green Deployment

Zero-downtime deployment using the blue/green strategy:
```bash
# SSH to VPS
ssh pnodepulse
cd ~/pnode-pulse

# Determine active environment
ACTIVE=$(docker compose ps --filter "status=running" | grep "7000" | grep -q "blue" && echo "blue" || echo "green")
INACTIVE=$([ "$ACTIVE" = "blue" ] && echo "green" || echo "blue")
echo "Active: $ACTIVE, Deploying to: $INACTIVE"

# Pull latest image
docker compose pull $INACTIVE

# Start inactive environment
docker compose up -d $INACTIVE

# Wait for health check (blue listens on 7000, green on 7001)
PORT=$([ "$INACTIVE" = "blue" ] && echo 7000 || echo 7001)
sleep 15
curl -f http://localhost:$PORT/api/health || exit 1

# Switch nginx upstream (manual step - update nginx config)
# sudo nano /etc/nginx/sites-available/pulse.rectorspace.com
# Change upstream from 7000 to 7001 (or vice versa)
# sudo nginx -t && sudo systemctl reload nginx

# Stop old environment
docker compose stop $ACTIVE

echo "✓ Deployment complete: $INACTIVE is now active"
```

### Database Migrations

IMPORTANT: Always run migrations before deploying code that depends on schema changes.
```bash
# SSH to VPS
ssh pnodepulse
cd ~/pnode-pulse

# View pending migrations
docker compose exec blue npx prisma migrate status

# Apply migrations (non-interactive)
docker compose exec blue npx prisma migrate deploy

# Verify
docker compose exec blue npx prisma migrate status
# Should show: "Database is up to date"
```

## Rollback Procedures

### Application Rollback

Scenario: A new deployment causes errors or unexpected behavior
Steps:

```bash
# SSH to VPS
ssh pnodepulse
cd ~/pnode-pulse

# Option 1: Quick switch to the inactive environment (if still running)
docker compose start green  # or blue
# Update nginx upstream back to the previous port

# Option 2: Rollback to a specific Docker image
# List recent image tags
docker images ghcr.io/rector-labs/pnode-pulse --format "table {{.Tag}}\t{{.CreatedAt}}"

# Pull specific version (use git SHA from GitHub)
docker pull ghcr.io/rector-labs/pnode-pulse:abc1234567890def

# Tag as latest locally
docker tag ghcr.io/rector-labs/pnode-pulse:abc1234567890def ghcr.io/rector-labs/pnode-pulse:latest

# Restart with old image
docker compose up -d blue

# Verify
curl -f http://localhost:7000/api/health
```

Timeline: 2-5 minutes
### Database Rollback

Safe Rollback (migration hasn't run long):

```bash
# If the migration just ran and caused immediate issues
docker compose exec blue npx prisma migrate resolve --rolled-back 20251209_migration_name

# Then restore the application to the previous version (without the migration)
```

Full Restore (data corruption or critical failure):

```bash
# See docs/DATABASE_BACKUP.md for the full restore procedure
./scripts/restore-db.sh /backups/pnode-pulse/pnode-pulse_YYYYMMDD_HHMMSS.dump
```

Timeline: 15-30 minutes (depending on database size)
## Database Operations

### Backups

Automated: Daily at 2:00 AM UTC (cron job)
Retention: 30 days
Location: /backups/pnode-pulse/
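The automated schedule above corresponds to a crontab entry along these lines. The script path and log file location are assumptions; check `crontab -l` on the VPS for the real entry:

```bash
# crontab -e (pnodepulse user): daily backup at 02:00 UTC
# (paths below are assumed, verify against the actual crontab)
0 2 * * * /home/pnodepulse/pnode-pulse/scripts/backup-db.sh >> /var/log/pnode-pulse-backup.log 2>&1
```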
Manual Backup:

```bash
ssh pnodepulse
export POSTGRES_PASSWORD=<password>
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5434
./scripts/backup-db.sh
```

Verify Latest Backup:

```bash
ls -lh /backups/pnode-pulse/ | tail -1
```

See full documentation: docs/DATABASE_BACKUP.md
### Restore

```bash
# Quick restore
./scripts/restore-db.sh /backups/pnode-pulse/pnode-pulse_YYYYMMDD_HHMMSS.dump
```

### Database Maintenance

Vacuum (monthly):

```bash
docker compose exec postgres psql -U pnodepulse -c "VACUUM ANALYZE;"
```

Check Database Size:

```bash
docker compose exec postgres psql -U pnodepulse -c "
SELECT pg_size_pretty(pg_database_size('pnodepulse')) AS size;
"
```

Check Table Sizes:

```bash
docker compose exec postgres psql -U pnodepulse -c "
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
"
```

## Troubleshooting

### Service Won't Start

```bash
# Check container status
docker compose ps

# Check logs for errors
docker compose logs blue --tail 100

# Check disk space
df -h

# Check memory
free -h

# Restart service
docker compose restart blue

# Full restart (if needed)
docker compose down
docker compose up -d
```

### Database Connection Issues

Symptoms: "Connection refused", "Connection timeout", "Too many connections"
```bash
# Test database connectivity
docker compose exec postgres pg_isready -U pnodepulse

# Check active connections
docker compose exec postgres psql -U pnodepulse -c "
SELECT COUNT(*) AS connections FROM pg_stat_activity;
"

# Check for long-running queries
docker compose exec postgres psql -U pnodepulse -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;
"

# Kill a stuck connection (if needed)
docker compose exec postgres psql -U pnodepulse -c "
SELECT pg_terminate_backend(12345); -- use a PID from above
"

# Restart PostgreSQL
docker compose restart postgres
```

### Redis Issues

```bash
# Test connectivity
docker compose exec redis redis-cli ping
# Should return: PONG

# Check memory usage
docker compose exec redis redis-cli info memory | grep used_memory_human

# Flush cache (if needed - DESTRUCTIVE)
docker compose exec redis redis-cli FLUSHALL

# Restart Redis
docker compose restart redis
```

### Application Errors (500s)

```bash
# Check application logs
docker compose logs blue --tail 200 | grep ERROR

# Check nginx logs (if applicable)
sudo tail -f /var/log/nginx/error.log

# Check resource usage
docker stats

# Check health endpoint
curl -v http://localhost:7000/api/health

# Restart application
docker compose restart blue
```

### High CPU/Memory Usage

```bash
# Check container resource usage
docker stats

# Check system resources
htop  # or top

# Check specific processes
docker compose exec blue ps aux | head -20

# Restart high-usage container
docker compose restart blue

# Check for memory leaks (if persistent)
docker compose logs blue | grep "out of memory"
```

### Disk Full

```bash
# Check disk usage
df -h

# Find large files
du -h / | sort -rh | head -20

# Clean up Docker images/containers
# Caution: --volumes also removes any volume not attached to a running container
docker system prune -a --volumes

# Clean up old backups (if needed)
find /backups/pnode-pulse/ -name "*.dump" -mtime +30 -delete

# Clean up logs
sudo journalctl --vacuum-time=7d
```

## Monitoring & Health

### Health Checks

Application Health:

```bash
curl http://localhost:7000/api/health
# Expected: {"status":"ok","timestamp":"..."}
```

Database Health:

```bash
docker compose exec postgres pg_isready -U pnodepulse
# Expected: ... - accepting connections
```

Redis Health:

```bash
docker compose exec redis redis-cli ping
# Expected: PONG
```

### Logs & Metrics

View Live Logs:

```bash
# Application
docker compose logs -f blue --tail 100

# All services
docker compose logs -f --tail 50

# Specific time range
docker compose logs --since 30m blue
```

System Metrics:

```bash
# Container stats
docker stats

# Disk I/O
iostat -x 1

# Network
iftop

# Memory
free -h && cat /proc/meminfo | grep -i available
```

### External Uptime Monitoring

External monitoring checks your site from outside, detecting when it's completely unreachable.
Service: UptimeRobot (recommended) - Free tier with 50 monitors
Monitors to Configure:
| Monitor | URL | Check |
|---|---|---|
| Homepage | https://pulse.rectorspace.com | HTTP 200 |
| Health | https://pulse.rectorspace.com/api/health | Keyword: "status":"healthy" |
| API | https://pulse.rectorspace.com/api/v1/leaderboard | Keyword: nodes |
| Staging | https://staging.pulse.rectorspace.com/api/health | HTTP 200 |
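Each row above is essentially a fetch plus a keyword match, which is handy for reproducing an alert by hand. A sketch; the `keyword_check` helper name is an assumption:

```bash
# Reproduce an UptimeRobot-style keyword monitor: fetch URL, check for keyword
keyword_check() {  # usage: keyword_check <url> <keyword>
  if curl -fsS --max-time 10 "$1" 2>/dev/null | grep -q "$2"; then
    echo "OK: found $2"
  else
    echo "FAIL: missing $2"
    return 1
  fi
}

# Example:
#   keyword_check https://pulse.rectorspace.com/api/health '"status"'
```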
Full Setup Guide: docs/UPTIME_MONITORING.md
When Alert Fires:

1. Check health endpoint: `curl -s https://pulse.rectorspace.com/api/health`
2. SSH and check logs: `ssh pnodepulse && docker compose logs blue --tail 100`
3. Restart if needed: `docker compose restart blue`
4. Check Sentry for related errors
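The first response step can be wrapped in a small retry helper so a single transient failure doesn't trigger a full investigation. A sketch; the retry count, timeout, and `check_health` helper name are assumptions:

```bash
# Retry the health endpoint before treating an alert as a real outage
check_health() {  # usage: check_health <url> [attempts]
  local url=$1 attempts=${2:-3} i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      echo "healthy (attempt $i)"
      return 0
    fi
    sleep 2
  done
  echo "unhealthy after $attempts attempt(s)"
  return 1
}

# Example:
#   check_health https://pulse.rectorspace.com/api/health || <escalate>
```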
### Application Performance Monitoring (APM)

Application Performance Monitoring tracks errors from inside the application.
Service: Sentry (recommended) - Free tier with 5K errors/month
Configuration: Set SENTRY_DSN in .env to activate
Full Setup Guide: docs/APM_SETUP.md
What Sentry Provides:
- Real-time error notifications
- Full stack traces with source maps
- Performance monitoring (slow API calls)
- User context and session replay
### Recommended Alerts
- UptimeRobot: Site unreachable (2+ failed checks)
- Sentry: New error type (first occurrence)
- Sentry: Error spike (>10 in 5 minutes)
- Application health check fails (3 consecutive failures)
- Database connection pool exhaustion (>80%)
- Disk space < 20%
- Memory usage > 90%
- High error rate (>1% of requests)
- Backup failure (no backup in 26 hours)
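Several of the infrastructure thresholds above can be checked by a small script run from cron. A minimal sketch; the `alert` stub and the helper names are assumptions, the thresholds mirror the list (disk < 20% free, no backup in 26 hours = 1560 minutes):

```bash
# Threshold checks mirroring the alert list
alert() { printf 'ALERT: %s\n' "$1"; }  # stub: wire to Slack/PagerDuty

check_disk() {  # usage: check_disk <used_percent>
  # less than 20% free means more than 80% used
  if [ "$1" -gt 80 ]; then
    alert "disk usage at $1% (less than 20% free)"
  fi
}

check_backup_age() {  # usage: check_backup_age <dir>
  # any .dump modified within the last 1560 minutes (26 h) counts as fresh
  if ! find "$1" -name '*.dump' -mmin -1560 2>/dev/null | grep -q .; then
    alert "no backup newer than 26 hours in $1"
  fi
}

# On the VPS these would run from cron as, e.g.:
#   check_disk "$(df --output=pcent / | tail -1 | tr -dc '0-9')"
#   check_backup_age /backups/pnode-pulse
```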
## Emergency Procedures

### Complete Outage

Incident: All services down, site unreachable

1. Assess:

```bash
ssh pnodepulse
docker compose ps
systemctl status docker
df -h
```

2. Quick Recovery:

```bash
# Restart all services
docker compose restart

# If Docker is down
sudo systemctl restart docker
docker compose up -d
```

3. If still failing:

- Check VPS dashboard for alerts
- Check disk space (`df -h`)
- Check system logs (`sudo journalctl -xe`)
- Restore from backup if data corruption is suspected

4. Document: Create an incident report in GitHub Issues
### Database Corruption

Incident: Database errors, inconsistent data

1. Stop writes immediately:

```bash
docker compose stop blue green staging
```

2. Assess damage:

```bash
docker compose exec postgres psql -U pnodepulse -c "\dt"
# Check table counts, verify critical tables exist
```

3. Restore from backup:

```bash
./scripts/restore-db.sh /backups/pnode-pulse/pnode-pulse_YYYYMMDD_HHMMSS.dump
```

4. Verify restoration:

```bash
curl http://localhost:7000/api/health
# Check critical data in the UI
```
### Security Incident

Incident: Suspected breach, unauthorized access

1. Isolate: Block access, change credentials
2. Assess: Check logs, identify scope
3. Contain: Rotate API keys, database passwords
4. Recover: Restore from a known-good backup if needed
5. Document: Full incident report, timeline
6. Post-mortem: Review access controls, update security
## Maintenance Tasks

### Weekly

- Review error logs for patterns
- Check disk usage trends
- Verify backup integrity (spot check)
- Review monitoring alerts
### Monthly

- Database VACUUM ANALYZE
- Review and rotate application logs
- Update dependencies (security patches)
- Test restore procedure (staging)
### Quarterly

- Full restore test (production backup → staging)
- Disaster recovery drill
- Review and update runbook
- Performance optimization review
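The weekly items above can be batched into a single report run from the VPS. A sketch; the `weekly_report` helper name is an assumption, and each step degrades gracefully when a tool is unavailable on the host:

```bash
# Weekly maintenance report: recent errors, disk usage, newest backup
weekly_report() {
  echo "== Application errors (last 7 days) =="
  if command -v docker > /dev/null 2>&1; then
    docker compose logs blue --since 168h 2>/dev/null | grep -c ERROR
  else
    echo "(docker unavailable)"
  fi

  echo "== Disk usage =="
  df -h / | tail -1

  echo "== Newest backup =="
  local latest
  latest=$(ls -t /backups/pnode-pulse/*.dump 2>/dev/null | head -1)
  echo "${latest:-(no backups found)}"
}

weekly_report
```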
## Related Documentation

- GitHub Repository: https://github.com/RECTOR-LABS/pnode-pulse
- Database Backup: `docs/DATABASE_BACKUP.md`
- Uptime Monitoring: `docs/UPTIME_MONITORING.md`
- APM Setup (Sentry): `docs/APM_SETUP.md`
- CI/CD Workflows: `.github/workflows/`
- Docker Compose: `docker-compose.yml`
- Environment Config: `.env.example`
Document Owner: DevOps Team
Review Schedule: Quarterly
Incident Reports: GitHub Issues (label: incident)