Operational Runbook
This runbook provides guidance for operating Fluxbase in production, including troubleshooting common issues, handling incidents, and maintaining system health.
Overview
This document covers:
- Common failure scenarios and remediation
- Database troubleshooting
- Performance debugging
- Security incident response
- Monitoring alert responses
Quick Reference
Health Check Endpoints
| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /health | Basic health check | {"status":"healthy"} |
| GET /api/v1/monitoring/health | Detailed component health | JSON with component status |
| GET /metrics | Prometheus metrics | Prometheus text format |
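For a quick scripted check (assuming the default localhost:8080 bind used elsewhere in this runbook):

```bash
# Fails with a non-zero exit code on any non-2xx response
curl -fsS http://localhost:8080/health
```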
Common Issues Quick Fix
| Symptom | First Check | Quick Fix |
|---|---|---|
| 502 Bad Gateway | Database connection | Restart PostgreSQL connection pool |
| High latency | Connection pool exhaustion | Increase database.max_connections |
| Memory growth | RLS cache size | Reduce realtime.rls_cache_size |
| WebSocket drops | Slow clients | Check realtime.slow_client_threshold |
Database Troubleshooting
Connection Pool Exhaustion
Symptoms:
- Requests timing out
- “too many connections” errors in logs
- High `fluxbase_db_connections_waiting` metric
Diagnosis:
```sql
-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'fluxbase'
GROUP BY state;
```
```sql
-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;
```
Remediation:
- Identify and terminate long-running queries:
```sql
-- pg_stat_activity has no duration column; compute it from query_start
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes'
  AND state != 'idle';
```
- Increase pool size if consistently at limit:
```yaml
database:
  max_connections: 100  # Up from default 50
```
- Review application for connection leaks (queries not being closed).
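One quick leak signal, using the same pg_stat_activity view as above: connections stuck "idle in transaction" usually point to application code that opened a transaction and never committed, rolled back, or closed it.

```sql
-- Count suspect connections, grouped by who opened them
SELECT usename, application_name, state, count(*)
FROM pg_stat_activity
WHERE datname = 'fluxbase'
  AND state LIKE 'idle in transaction%'
GROUP BY usename, application_name, state
ORDER BY count(*) DESC;
```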
Slow Queries
Symptoms:
- High p99 latency
- Elevated `fluxbase_db_query_duration_seconds` histogram
Diagnosis:
```sql
-- Enable slow query logging (temporarily)
ALTER SYSTEM SET log_min_duration_statement = '100ms';
SELECT pg_reload_conf();
```
```sql
-- Check for missing indexes
SELECT schemaname, tablename, attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'your_table';
```
```sql
-- Analyze query plan
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM your_query;
```
Remediation:
- Add missing indexes:
```sql
CREATE INDEX CONCURRENTLY idx_users_email ON auth.users(email);
```
- Update table statistics:
```sql
ANALYZE your_table;
```
- Consider query optimization or pagination.
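As a pagination sketch, keyset (seek) pagination avoids the growing cost of large OFFSETs; the table, column, and `:last_seen_id` placeholder below are illustrative:

```sql
-- Page through rows by remembering the last id seen, not by OFFSET;
-- assumes an indexed, monotonically increasing id column
SELECT *
FROM your_table
WHERE id > :last_seen_id
ORDER BY id
LIMIT 100;
```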
Replication Lag (If Using Replicas)
Symptoms:
- Stale data in read queries
- `pg_stat_replication.replay_lag` increasing
Diagnosis:
```sql
-- Check replication status
SELECT client_addr, state, sent_lsn, replay_lsn, sent_lsn - replay_lsn AS lag_bytes
FROM pg_stat_replication;
```
Remediation:
- Check replica disk I/O and network
- Increase `max_wal_senders` if needed
- Consider reducing write-heavy workloads temporarily
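The view above is only populated on the primary; on the replica itself, replay delay can be approximated in wall-clock time:

```sql
-- Run on the replica: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```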
Performance Debugging
High CPU Usage
Diagnosis:
- Check process-level CPU:
```bash
top -c -p $(pgrep -f fluxbase)
```
- Enable CPU profiling (if compiled with pprof):
```bash
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof
```
- Check for hot paths in metrics:
```bash
curl -s http://localhost:8080/metrics | grep -E 'fluxbase_(http|db)_requests_total'
```
Common Causes:
- JSON serialization of large responses
- Complex RLS policies
- Inefficient filter queries
Remediation:
- Add pagination to large result sets
- Simplify RLS policies (see the sketch after this list)
- Add database indexes
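A hedged sketch of one common RLS simplification: policies that call a function per row can often be rewritten so PostgreSQL evaluates the expression once as an InitPlan. The policy, table, and setting names here are illustrative, not Fluxbase-specific:

```sql
-- Before (illustrative): current_setting() is evaluated for every row
--   USING (tenant_id = current_setting('app.tenant_id', true)::uuid)
-- After: wrapping the call in a scalar subquery lets the planner
-- evaluate it once per query instead of once per row
ALTER POLICY tenant_isolation ON public.documents
  USING (tenant_id = (SELECT current_setting('app.tenant_id', true)::uuid));
```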
High Memory Usage
Diagnosis:
```bash
# Check process memory
ps aux | grep fluxbase
```
```bash
# If pprof enabled
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```
Common Causes:
- Large RLS cache
- Many concurrent WebSocket connections
- Large response bodies in memory
Remediation:
- Reduce cache sizes:
```yaml
realtime:
  rls_cache_size: 50000  # Down from 100000
  rls_cache_ttl: 15s     # Down from 30s
```
- Limit concurrent connections:
```yaml
realtime:
  max_connections_per_user: 5
  max_connections_per_ip: 10
```
- Add response size limits
High Latency
Diagnosis:
- Check component latencies in metrics:
```bash
curl -s http://localhost:8080/metrics | grep -E 'duration.*bucket'
```
- Check database query times:
```sql
-- Requires the pg_stat_statements extension
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```
- Check network latency to database:
```bash
ping -c 10 your-database-host
```
Remediation:
- Add caching (Redis/Dragonfly) for frequent queries
- Optimize slow queries (indexes, query rewrite)
- Scale horizontally if single-instance bottleneck
Realtime/WebSocket Issues
Connections Dropping
Symptoms:
- Clients frequently reconnecting
- “connection closed” errors in client logs
Diagnosis:
```bash
# Check connection metrics
curl -s http://localhost:8080/metrics | grep -E 'fluxbase_realtime'
```
```bash
# Check for slow clients
curl -s http://localhost:8080/api/v1/monitoring/metrics | jq '.realtime'
```
Common Causes:
- Slow clients not consuming messages fast enough
- Network issues between client and server
- Server resource exhaustion
Remediation:
- Increase slow client timeout:
```yaml
realtime:
  slow_client_timeout: 60s    # Up from 30s
  slow_client_threshold: 200  # Up from 100
```
- Reduce message frequency if broadcasting too much
- Check load balancer/proxy timeouts
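For the last point, a hedged nginx sketch (the /realtime path, upstream name, and timeout value are assumptions; adapt to your deployment):

```nginx
# Keep long-lived WebSocket upgrades from being cut off by proxy defaults
location /realtime {
    proxy_pass http://fluxbase_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;  # default 60s is often too short for idle sockets
}
```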
Messages Not Delivered
Symptoms:
- Clients subscribed but not receiving updates
- Database changes not triggering notifications
Diagnosis:
```sql
-- Check if pg_notify is working
NOTIFY test_channel, 'test message';
```
```sql
-- Check if triggers exist
SELECT * FROM pg_trigger
WHERE tgname LIKE 'fluxbase%';
```
Remediation:
- Verify realtime is enabled on the table:
```sql
-- Check for realtime trigger
SELECT * FROM pg_trigger
WHERE tgrelid = 'your_table'::regclass;
```
- If the listener is stuck, restart the Fluxbase service.
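Before restarting, it can help to confirm a listener session is attached at all by looking for a backend whose last statement was a LISTEN (hedged: this only surfaces sessions where LISTEN was the most recent query):

```sql
-- A LISTEN-ing backend should show up in pg_stat_activity
SELECT pid, application_name, state, query
FROM pg_stat_activity
WHERE query ILIKE 'listen%';
```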
Security Incident Response
Suspected Account Compromise
Immediate Actions:
- Revoke all sessions for affected user:
```sql
DELETE FROM auth.sessions WHERE user_id = 'affected-user-uuid';
```
- Reset user password:
```sql
UPDATE auth.users
SET encrypted_password = crypt('temporary-password', gen_salt('bf'))
WHERE id = 'affected-user-uuid';
```
- Revoke OAuth tokens:
```sql
DELETE FROM auth.identities WHERE user_id = 'affected-user-uuid';
```
- Check audit logs for suspicious activity:
```sql
SELECT * FROM auth.audit_log_entries
WHERE actor_id = 'affected-user-uuid'
ORDER BY created_at DESC
LIMIT 100;
```
Service Key Compromise
Immediate Actions:
- Revoke the compromised key:
```bash
curl -X POST http://localhost:8080/api/v1/admin/service-keys/{key-id}/revoke \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"reason": "Security incident - suspected compromise"}'
```
- Review access logs:
grep "X-Service-Key" /var/log/fluxbase/access.log | tail -1000- Rotate all service keys as precaution:
```bash
# List all active keys
curl http://localhost:8080/api/v1/admin/service-keys
```
```bash
# Rotate each key
curl -X POST http://localhost:8080/api/v1/admin/service-keys/{key-id}/rotate
```
DDoS or Rate Limit Abuse
Symptoms:
- Spike in request volume
- Many 429 Too Many Requests responses
- Single IP making excessive requests
Immediate Actions:
- Check top requesters:
```bash
# If using nginx
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
```
- Block abusive IPs at firewall level:
```bash
# Using iptables
iptables -A INPUT -s 1.2.3.4 -j DROP
```
```bash
# Using UFW
ufw deny from 1.2.3.4
```
- Enable stricter rate limiting temporarily:
```yaml
server:
  rate_limit:
    requests_per_second: 10  # Down from default
    burst: 20
```
Storage Issues
Disk Space Running Low
Diagnosis:
```bash
# Check disk usage
df -h
```
```bash
# Find large files
du -sh /var/fluxbase/storage/* | sort -rh | head -20
```
```bash
# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('fluxbase'));"
```
Remediation:
- Clean up old storage objects:
```sql
-- Find orphaned objects
SELECT * FROM storage.objects o
LEFT JOIN storage.buckets b ON o.bucket_id = b.id
WHERE b.id IS NULL;
```
```sql
-- Delete old objects (be careful!)
DELETE FROM storage.objects
WHERE created_at < NOW() - INTERVAL '90 days'
  AND bucket_id = 'temp-uploads';
```
- Vacuum database:
```sql
-- VACUUM FULL rewrites tables and takes an ACCESS EXCLUSIVE lock;
-- run it during a maintenance window
VACUUM FULL VERBOSE;
```
- Archive old data
Storage Upload Failures
Symptoms:
- 500 errors on file uploads
- “no space left on device” in logs
Diagnosis:
```bash
# Check storage backend
ls -la /var/fluxbase/storage/
```
```bash
# Check S3 connectivity (if using S3)
aws s3 ls s3://your-bucket --region your-region
```
Remediation:
- For local storage: expand disk or clean up
- For S3: check IAM permissions and bucket policy
- Verify storage configuration in fluxbase.yaml
Background Jobs Issues
Jobs Stuck in Pending
Symptoms:
- Jobs not executing
- `fluxbase_jobs_pending` metric growing
Diagnosis:
```sql
-- Check pending jobs
SELECT id, name, status, scheduled_at, attempts
FROM jobs.jobs
WHERE status = 'pending'
ORDER BY scheduled_at
LIMIT 20;
```
```sql
-- Check worker status
SELECT * FROM jobs.workers WHERE status = 'active';
```
Remediation:
- Check if workers are running:
```bash
curl http://localhost:8080/api/v1/monitoring/metrics | jq '.jobs.workers'
```
- Restart workers if stuck
- Increase worker concurrency:
```yaml
jobs:
  max_concurrent_per_worker: 10  # Up from 5
```
Jobs Failing Repeatedly
Diagnosis:
```sql
-- Check failed jobs with errors
SELECT id, name, attempts, error, updated_at
FROM jobs.jobs
WHERE status = 'failed'
ORDER BY updated_at DESC
LIMIT 20;
```
Remediation:
- Fix underlying error (check job code)
- Retry failed jobs:
```sql
UPDATE jobs.jobs
SET status = 'pending', attempts = 0, error = NULL
WHERE id = 'job-uuid';
```
- Adjust retry policy if needed
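To requeue many failures of one job type at once, a bulk variant of the same update works; the job name and attempt cap below are illustrative:

```sql
-- Requeue all failed jobs of a given type, skipping ones that already
-- exhausted a reasonable number of attempts
UPDATE jobs.jobs
SET status = 'pending', error = NULL
WHERE status = 'failed'
  AND name = 'send-email'
  AND attempts < 5;
```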
Alerting Response Guide
Alert: High Error Rate
Threshold: Error rate > 1% of requests
Response:
- Check recent deployments for regressions
- Review error logs for common patterns
- Check downstream dependencies (database, storage)
- Consider rollback if recently deployed
Alert: Database Connection Saturation
Threshold: Available connections < 10%
Response:
- Terminate idle long-held connections (see the example after this list)
- Review application for connection leaks
- Scale database or increase connection limits
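A hedged example for the first step, reusing the pg_stat_activity pattern from earlier in this runbook (the 10-minute threshold is an assumption; tune it to your workload):

```sql
-- Terminate connections that have sat idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'fluxbase'
  AND state = 'idle'
  AND (now() - state_change) > interval '10 minutes';
```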
Alert: Disk Space Critical
Threshold: < 10% disk space remaining
Response:
- Identify largest space consumers
- Clean up logs, temp files
- Expand storage if needed
- Set up automated cleanup
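A minimal automated-cleanup sketch (the log path matches the one used earlier in this runbook; the 14-day retention window is an assumption):

```bash
# Cron-able: delete rotated Fluxbase logs older than 14 days
find /var/log/fluxbase -name '*.log.*' -mtime +14 -delete
```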
Alert: Memory Usage High
Threshold: > 90% memory usage
Response:
- Check for memory leaks (heap profile)
- Reduce cache sizes
- Restart service if immediate relief needed
- Scale vertically if persistent
Maintenance Procedures
Rolling Restart
For zero-downtime restarts across multiple instances:
```bash
# If using Kubernetes
kubectl rollout restart deployment/fluxbase
```
```bash
# If using systemd (per-instance)
for host in host1 host2 host3; do
  ssh $host "sudo systemctl restart fluxbase"
  sleep 30  # Wait for health check
done
```
Database Maintenance
Weekly:
```sql
-- Update statistics
ANALYZE;
```
```sql
-- Check for bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```
Monthly:
```sql
-- Reindex if needed; REINDEX DATABASE locks tables while it runs,
-- so prefer REINDEX ... CONCURRENTLY (PostgreSQL 12+) or a maintenance window
REINDEX DATABASE fluxbase;
```
```sql
-- Check for unused indexes
-- (pg_stat_user_indexes exposes relname/indexrelname, not tablename/indexname)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```
Escalation Matrix
| Severity | Response Time | Escalation |
|---|---|---|
| P1 - Service Down | 15 min | On-call → Team Lead → Engineering Manager |
| P2 - Major Degradation | 1 hour | On-call → Team Lead |
| P3 - Minor Issue | 4 hours | On-call |
| P4 - Non-urgent | Next business day | Ticket |
Learn More
- Backup & Restore - Disaster recovery procedures
- Monitoring & Observability - Setting up monitoring
- Production Checklist - Pre-production checklist
- Scaling - Scaling strategies