Operational Runbook
This runbook provides guidance for operating Fluxbase in production, including troubleshooting common issues, handling incidents, and maintaining system health.
Overview
This document covers:
- Common failure scenarios and remediation
- Database troubleshooting
- Performance debugging
- Security incident response
- Monitoring alert responses
Quick Reference
Health Check Endpoints
| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /health | Basic health check | {"status":"healthy"} |
| GET /api/v1/monitoring/health | Detailed component health | JSON with component status |
| GET /metrics | Prometheus metrics | Prometheus text format |
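For a quick scripted check (assuming the default localhost:8080 bind used elsewhere in this runbook):

```bash
# Fails with a non-zero exit code on any non-2xx response
curl -fsS http://localhost:8080/health
```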
Common Issues Quick Fix
| Symptom | First Check | Quick Fix |
|---|---|---|
| 502 Bad Gateway | Database connection | Restart PostgreSQL connection pool |
| High latency | Connection pool exhaustion | Increase database.max_connections |
| Memory growth | RLS cache size | Reduce realtime.rls_cache_size |
| WebSocket drops | Slow clients | Check realtime.slow_client_threshold |
Database Troubleshooting
Connection Pool Exhaustion
Symptoms:
- Requests timing out
- “too many connections” errors in logs
- High `fluxbase_db_connections_waiting` metric
Diagnosis:
```sql
-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'fluxbase'
GROUP BY state;
```
```sql
-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;
```
Remediation:
- Identify and terminate long-running queries:
```sql
-- pg_stat_activity has no duration column; compute it from query_start
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes'
  AND state != 'idle';
```
- Increase pool size if consistently at limit:
```yaml
database:
  max_connections: 100  # Up from default 50
```
- Review application for connection leaks (queries not being closed).
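One quick leak signal, using the same pg_stat_activity view as above: connections stuck "idle in transaction" usually point to application code that opened a transaction and never committed, rolled back, or closed it.

```sql
-- Count suspect connections, grouped by who opened them
SELECT usename, application_name, state, count(*)
FROM pg_stat_activity
WHERE datname = 'fluxbase'
  AND state LIKE 'idle in transaction%'
GROUP BY usename, application_name, state
ORDER BY count(*) DESC;
```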
Slow Queries
Symptoms:
- High p99 latency
- Elevated `fluxbase_db_query_duration_seconds` histogram
Diagnosis:
```sql
-- Enable slow query logging (temporarily)
ALTER SYSTEM SET log_min_duration_statement = '100ms';
SELECT pg_reload_conf();
```
```sql
-- Check for missing indexes
SELECT schemaname, tablename, attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'your_table';
```
```sql
-- Analyze query plan
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM your_query;
```
Remediation:
- Add missing indexes:
```sql
CREATE INDEX CONCURRENTLY idx_users_email ON auth.users(email);
```
- Update table statistics:
```sql
ANALYZE your_table;
```
- Consider query optimization or pagination.
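As a pagination sketch, keyset (seek) pagination avoids the growing cost of large OFFSETs; the table, column, and `:last_seen_id` placeholder below are illustrative:

```sql
-- Page through rows by remembering the last id seen, not by OFFSET;
-- assumes an indexed, monotonically increasing id column
SELECT *
FROM your_table
WHERE id > :last_seen_id
ORDER BY id
LIMIT 100;
```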
Replication Lag (If Using Replicas)
Symptoms:
- Stale data in read queries
- `pg_stat_replication.replay_lag` increasing
Diagnosis:
```sql
-- Check replication status
SELECT client_addr, state, sent_lsn, replay_lsn, sent_lsn - replay_lsn AS lag_bytes
FROM pg_stat_replication;
```
Remediation:
- Check replica disk I/O and network
- Increase `max_wal_senders` if needed
- Consider reducing write-heavy workloads temporarily
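The view above is only populated on the primary; on the replica itself, replay delay can be approximated in wall-clock time:

```sql
-- Run on the replica: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```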
Performance Debugging
High CPU Usage
Diagnosis:
- Check process-level CPU:
```bash
top -c -p $(pgrep -f fluxbase)
```
- Enable CPU profiling (if compiled with pprof):
```bash
curl http://localhost:8080/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof
```
- Check for hot paths in metrics:
```bash
curl -s http://localhost:8080/metrics | grep -E 'fluxbase_(http|db)_requests_total'
```
Common Causes:
- JSON serialization of large responses
- Complex RLS policies
- Inefficient filter queries
Remediation:
- Add pagination to large result sets
- Simplify RLS policies (see the sketch after this list)
- Add database indexes
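A hedged sketch of one common RLS simplification: policies that call a function per row can often be rewritten so PostgreSQL evaluates the expression once as an InitPlan. The policy, table, and setting names here are illustrative, not Fluxbase-specific:

```sql
-- Before (illustrative): current_setting() is evaluated for every row
--   USING (tenant_id = current_setting('app.tenant_id', true)::uuid)
-- After: wrapping the call in a scalar subquery lets the planner
-- evaluate it once per query instead of once per row
ALTER POLICY tenant_isolation ON public.documents
  USING (tenant_id = (SELECT current_setting('app.tenant_id', true)::uuid));
```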
High Memory Usage
Diagnosis:
```bash
# Check process memory
ps aux | grep fluxbase
```
```bash
# If pprof enabled
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```
Common Causes:
- Large RLS cache
- Many concurrent WebSocket connections
- Large response bodies in memory
Remediation:
- Reduce cache sizes:
```yaml
realtime:
  rls_cache_size: 50000  # Down from 100000
  rls_cache_ttl: 15s     # Down from 30s
```
- Limit concurrent connections:
```yaml
realtime:
  max_connections_per_user: 5
  max_connections_per_ip: 10
```
- Add response size limits
High Latency
Diagnosis:
- Check component latencies in metrics:
```bash
curl -s http://localhost:8080/metrics | grep -E 'duration.*bucket'
```
- Check database query times:
```sql
-- Requires the pg_stat_statements extension
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```
- Check network latency to database:
```bash
ping -c 10 your-database-host
```
Remediation:
- Add caching (Redis/Dragonfly) for frequent queries
- Optimize slow queries (indexes, query rewrite)
- Scale horizontally if single-instance bottleneck
Realtime/WebSocket Issues
Connections Dropping
Symptoms:
- Clients frequently reconnecting
- “connection closed” errors in client logs
Diagnosis:
```bash
# Check connection metrics
curl -s http://localhost:8080/metrics | grep -E 'fluxbase_realtime'
```
```bash
# Check for slow clients
curl -s http://localhost:8080/api/v1/monitoring/metrics | jq '.realtime'
```
Common Causes:
- Slow clients not consuming messages fast enough
- Network issues between client and server
- Server resource exhaustion
Remediation:
- Increase slow client timeout:
```yaml
realtime:
  slow_client_timeout: 60s    # Up from 30s
  slow_client_threshold: 200  # Up from 100
```
- Reduce message frequency if broadcasting too much
- Check load balancer/proxy timeouts
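For the last point, a hedged nginx sketch (the /realtime path, upstream name, and timeout value are assumptions; adapt to your deployment):

```nginx
# Keep long-lived WebSocket upgrades from being cut off by proxy defaults
location /realtime {
    proxy_pass http://fluxbase_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;  # default 60s is often too short for idle sockets
}
```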
Messages Not Delivered
Symptoms:
- Clients subscribed but not receiving updates
- Database changes not triggering notifications
Diagnosis:
```sql
-- Check if pg_notify is working
NOTIFY test_channel, 'test message';
```
```sql
-- Check if triggers exist
SELECT * FROM pg_trigger
WHERE tgname LIKE 'fluxbase%';
```
Remediation:
- Verify realtime is enabled on the table:
```sql
-- Check for realtime trigger
SELECT * FROM pg_trigger
WHERE tgrelid = 'your_table'::regclass;
```
- If the listener is stuck, restart the Fluxbase service.
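Before restarting, it can help to confirm a listener session is attached at all by looking for a backend whose last statement was a LISTEN (hedged: this only surfaces sessions where LISTEN was the most recent query):

```sql
-- A LISTEN-ing backend should show up in pg_stat_activity
SELECT pid, application_name, state, query
FROM pg_stat_activity
WHERE query ILIKE 'listen%';
```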
Security Incident Response
Suspected Account Compromise
Immediate Actions:
- Revoke all sessions for affected user:
```sql
DELETE FROM auth.sessions WHERE user_id = 'affected-user-uuid';
```
- Reset user password:
```sql
UPDATE auth.users
SET encrypted_password = crypt('temporary-password', gen_salt('bf'))
WHERE id = 'affected-user-uuid';
```
- Revoke OAuth tokens:
```sql
DELETE FROM auth.identities WHERE user_id = 'affected-user-uuid';
```
- Check audit logs for suspicious activity:
```sql
SELECT * FROM auth.audit_log_entries
WHERE actor_id = 'affected-user-uuid'
ORDER BY created_at DESC
LIMIT 100;
```
Service Key Compromise
Immediate Actions:
- Revoke the compromised key:
```bash
curl -X POST http://localhost:8080/api/v1/admin/service-keys/{key-id}/revoke \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"reason": "Security incident - suspected compromise"}'
```
- Review access logs:
grep "X-Service-Key" /var/log/fluxbase/access.log | tail -1000- Rotate all service keys as precaution:
```bash
# List all active keys
curl http://localhost:8080/api/v1/admin/service-keys
```
```bash
# Rotate each key
curl -X POST http://localhost:8080/api/v1/admin/service-keys/{key-id}/rotate
```
DDoS or Rate Limit Abuse
Symptoms:
- Spike in request volume
- Many 429 Too Many Requests responses
- Single IP making excessive requests
Immediate Actions:
- Check top requesters:
```bash
# If using nginx
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
```
- Block abusive IPs at firewall level:
```bash
# Using iptables
iptables -A INPUT -s 1.2.3.4 -j DROP
```
```bash
# Using UFW
ufw deny from 1.2.3.4
```
- Enable stricter rate limiting temporarily:
```yaml
server:
  rate_limit:
    requests_per_second: 10  # Down from default
    burst: 20
```
Storage Issues
Disk Space Running Low
Diagnosis:
```bash
# Check disk usage
df -h
```
```bash
# Find large files
du -sh /var/fluxbase/storage/* | sort -rh | head -20
```
```bash
# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('fluxbase'));"
```
Remediation:
- Clean up old storage objects:
```sql
-- Find orphaned objects
SELECT * FROM storage.objects o
LEFT JOIN storage.buckets b ON o.bucket_id = b.id
WHERE b.id IS NULL;
```
```sql
-- Delete old objects (be careful!)
DELETE FROM storage.objects
WHERE created_at < NOW() - INTERVAL '90 days'
  AND bucket_id = 'temp-uploads';
```
- Vacuum database:
```sql
-- VACUUM FULL rewrites tables and takes an ACCESS EXCLUSIVE lock;
-- run it during a maintenance window
VACUUM FULL VERBOSE;
```
- Archive old data
Storage Upload Failures
Symptoms:
- 500 errors on file uploads
- “no space left on device” in logs
Diagnosis:
```bash
# Check storage backend
ls -la /var/fluxbase/storage/
```
```bash
# Check S3 connectivity (if using S3)
aws s3 ls s3://your-bucket --region your-region
```
Remediation:
- For local storage: expand disk or clean up
- For S3: check IAM permissions and bucket policy
- Verify storage configuration in fluxbase.yaml
Background Jobs Issues
Jobs Stuck in Pending
Symptoms:
- Jobs not executing
- `fluxbase_jobs_pending` metric growing
Diagnosis:
```sql
-- Check pending jobs
SELECT id, name, status, scheduled_at, attempts
FROM jobs.jobs
WHERE status = 'pending'
ORDER BY scheduled_at
LIMIT 20;
```
```sql
-- Check worker status
SELECT * FROM jobs.workers WHERE status = 'active';
```
Remediation:
- Check if workers are running:
```bash
curl http://localhost:8080/api/v1/monitoring/metrics | jq '.jobs.workers'
```
- Restart workers if stuck
- Increase worker concurrency:
```yaml
jobs:
  max_concurrent_per_worker: 10  # Up from 5
```
Jobs Failing Repeatedly
Diagnosis:
```sql
-- Check failed jobs with errors
SELECT id, name, attempts, error, updated_at
FROM jobs.jobs
WHERE status = 'failed'
ORDER BY updated_at DESC
LIMIT 20;
```
Remediation:
- Fix underlying error (check job code)
- Retry failed jobs:
```sql
UPDATE jobs.jobs
SET status = 'pending', attempts = 0, error = NULL
WHERE id = 'job-uuid';
```
- Adjust retry policy if needed
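To requeue many failures of one job type at once, a bulk variant of the same update works; the job name and attempt cap below are illustrative:

```sql
-- Requeue all failed jobs of a given type, skipping ones that already
-- exhausted a reasonable number of attempts
UPDATE jobs.jobs
SET status = 'pending', error = NULL
WHERE status = 'failed'
  AND name = 'send-email'
  AND attempts < 5;
```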
Alerting Response Guide
Alert: High Error Rate
Threshold: Error rate > 1% of requests
Response:
- Check recent deployments for regressions
- Review error logs for common patterns
- Check downstream dependencies (database, storage)
- Consider rollback if recently deployed
Alert: Database Connection Saturation
Threshold: Available connections < 10%
Response:
- Terminate idle long-held connections (see the example after this list)
- Review application for connection leaks
- Scale database or increase connection limits
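A hedged example for the first step, reusing the pg_stat_activity pattern from earlier in this runbook (the 10-minute threshold is an assumption; tune it to your workload):

```sql
-- Terminate connections that have sat idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'fluxbase'
  AND state = 'idle'
  AND (now() - state_change) > interval '10 minutes';
```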
Alert: Disk Space Critical
Threshold: < 10% disk space remaining
Response:
- Identify largest space consumers
- Clean up logs, temp files
- Expand storage if needed
- Set up automated cleanup
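A minimal automated-cleanup sketch (the log path matches the one used earlier in this runbook; the 14-day retention window is an assumption):

```bash
# Cron-able: delete rotated Fluxbase logs older than 14 days
find /var/log/fluxbase -name '*.log.*' -mtime +14 -delete
```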
Alert: Memory Usage High
Threshold: > 90% memory usage
Response:
- Check for memory leaks (heap profile)
- Reduce cache sizes
- Restart service if immediate relief needed
- Scale vertically if persistent
Maintenance Procedures
Rolling Restart
For zero-downtime restarts across multiple instances:
```bash
# If using Kubernetes
kubectl rollout restart deployment/fluxbase
```
```bash
# If using systemd (per-instance)
for host in host1 host2 host3; do
  ssh $host "sudo systemctl restart fluxbase"
  sleep 30  # Wait for health check
done
```
Database Maintenance
Weekly:
```sql
-- Update statistics
ANALYZE;
```
```sql
-- Check for bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```
Monthly:
```sql
-- Reindex if needed; REINDEX DATABASE locks tables while it runs,
-- so prefer REINDEX ... CONCURRENTLY (PostgreSQL 12+) or a maintenance window
REINDEX DATABASE fluxbase;
```
```sql
-- Check for unused indexes
-- (pg_stat_user_indexes exposes relname/indexrelname, not tablename/indexname)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```
Escalation Matrix
| Severity | Response Time | Escalation |
|---|---|---|
| P1 - Service Down | 15 min | On-call → Team Lead → Engineering Manager |
| P2 - Major Degradation | 1 hour | On-call → Team Lead |
| P3 - Minor Issue | 4 hours | On-call |
| P4 - Non-urgent | Next business day | Ticket |
Learn More
- Backup & Restore - Disaster recovery procedures
- Monitoring & Observability - Setting up monitoring
- Production Checklist - Pre-production checklist
- Scaling - Scaling strategies