# Distributed Tracing
Distributed tracing provides end-to-end visibility into requests as they travel through your Fluxbase instance. Fluxbase uses OpenTelemetry for standardized, vendor-agnostic tracing.
## What is Distributed Tracing?

Distributed tracing tracks a request as it moves through different services and components. Each unit of work is called a "span", and spans are combined into a "trace" that shows the complete journey of a request.
Key Benefits:
- Performance Analysis: Identify slow database queries, API calls, and function executions
- Error Debugging: Trace errors across service boundaries
- Architecture Understanding: Visualize service dependencies and data flow
- Capacity Planning: Make data-driven decisions about scaling
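The span/trace relationship can be illustrated with a minimal, self-contained Go sketch. The `Span` type and field names here are illustrative only, not part of the Fluxbase API:

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a single unit of work. Spans sharing a TraceID belong to the
// same trace; ParentID links a child span to its parent.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Ms       int // duration in milliseconds
}

// formatTrace renders a trace as an indented timeline: the root span is
// the full request, child spans are the operations inside it.
func formatTrace(spans []Span) string {
	var b strings.Builder
	for _, s := range spans {
		if s.ParentID != "" {
			b.WriteString("  ")
		}
		fmt.Fprintf(&b, "%s (%dms)\n", s.Name, s.Ms)
	}
	return b.String()
}

func main() {
	trace := []Span{
		{TraceID: "abc123", SpanID: "1", Name: "POST /api/orders", Ms: 120},
		{TraceID: "abc123", SpanID: "2", ParentID: "1", Name: "auth.verify_token", Ms: 5},
		{TraceID: "abc123", SpanID: "3", ParentID: "1", Name: "db.INSERT orders", Ms: 90},
	}
	fmt.Print(formatTrace(trace))
}
```

This is the shape a tracing UI reconstructs for you: one root span per request, with child spans revealing where the time went.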
## Configuration

Enable OpenTelemetry tracing in your `fluxbase.yaml`:
```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"  # OTLP collector endpoint
    service_name: "fluxbase"    # Service name for traces
    environment: "production"   # Environment (development, staging, production)
    sample_rate: 1.0            # Sample rate (0.0-1.0, 1.0 = 100%)
    insecure: false             # Keep false in production so TLS is used
```

Environment Variables:
```bash
export FLUXBASE_OBSERVABILITY_TRACING_ENABLED=true
export FLUXBASE_OBSERVABILITY_TRACING_ENDPOINT="collector.example.com:4317"
export FLUXBASE_OBSERVABILITY_TRACING_SERVICE_NAME="fluxbase"
export FLUXBASE_OBSERVABILITY_TRACING_ENVIRONMENT="production"
export FLUXBASE_OBSERVABILITY_TRACING_SAMPLE_RATE=0.1  # Sample 10% of traces
```

## Setting Up Trace Backends

Fluxbase uses the OTLP (OpenTelemetry Protocol) format, which is compatible with many backends:
### 1. Jaeger

Jaeger is a popular open-source tracing backend.
Run Jaeger with Docker:
```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```

Configure Fluxbase:

```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
    insecure: true  # For local development
```

Access Jaeger UI:
- Navigate to `http://localhost:16686`
- Browse traces by service, operation, and tags
### 2. Grafana Tempo

Grafana Tempo is a scalable, high-performance distributed tracing backend.
Run Tempo with Docker:
```bash
docker run -d --name tempo \
  -p 4317:4317 \
  -p 3200:3200 \
  grafana/tempo:latest \
  -server.http-listen-port=3200 \
  -storage.trace.backend=local \
  -storage.trace.local.path=/tmp/tempo
```

Configure Fluxbase:

```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

Access Tempo UI:
- Use Grafana with a Tempo data source
- Navigate to Grafana → Explore → select the Tempo data source
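If you provision Grafana from files, the Tempo data source can be declared with an entry along these lines (the file path and the Tempo URL are assumptions for the local Docker setup above):

```yaml
# /etc/grafana/provisioning/datasources/tempo.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://localhost:3200
```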
### 3. OpenTelemetry Collector

For production deployments, use the OpenTelemetry Collector as a central processing pipeline:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
```

Run Collector:

```bash
docker run -d --name otel-collector \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml \
  -p 4317:4317 \
  otel/opentelemetry-collector:latest
```

### 4. Cloud Providers

AWS X-Ray:
```yaml
# Use AWS Distro for OpenTelemetry (ADOT) Collector
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"  # ADOT Collector endpoint
```

Google Cloud Trace:

```yaml
# Use OpenTelemetry Collector with Google Cloud Trace exporter
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

Azure Monitor:

```yaml
# Use Azure Monitor Application Insights exporter
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

## Automatic Instrumentation

Fluxbase automatically creates spans for:
### Database Operations

All PostgreSQL queries are automatically traced:
```go
// Automatic span created for this query
ctx, span := observability.StartDBSpan(ctx, "SELECT", "users")

// Wrap in a closure so the deferred call sees the final err value
var err error
defer func() { observability.EndDBSpan(span, err) }()

rows, err := db.Query(ctx, "SELECT * FROM users WHERE id = $1", userID)
```

Span Attributes:

- `db.system`: "postgresql"
- `db.operation`: "SELECT", "INSERT", "UPDATE", or "DELETE"
- `db.table`: Table name
- Error status if the query fails
### Authentication Operations

All auth operations create spans:

- `auth.signup`
- `auth.signin`
- `auth.signout`
- `auth.oauth`
- `auth.magic_link`
### Storage Operations

File storage operations are traced:

- `storage.upload`
- `storage.download`
- `storage.delete`
## Custom Spans

Create custom spans for additional visibility:
### Basic Custom Span

```go
import "github.com/nimbleflux/fluxbase/internal/observability"

// Start a custom span
ctx, span := observability.StartSpan(ctx, "my-custom-operation")
defer span.End()

// Your code here
result, err := doSomething(ctx)

// Record the error if the operation failed
if err != nil {
    observability.RecordError(ctx, err)
}
```

### Span with Attributes
```go
import (
    "go.opentelemetry.io/otel/attribute"

    "github.com/nimbleflux/fluxbase/internal/observability"
)

ctx, span := observability.StartSpan(ctx, "process-data")
defer span.End()

// Add custom attributes
observability.SetSpanAttributes(ctx,
    attribute.String("user.id", userID),
    attribute.Int("record.count", len(records)),
    attribute.String("processing.type", "batch"),
)
```

### Span with Events
```go
// Add events to track progress
observability.AddSpanEvent(ctx, "validation.started",
    attribute.Int("record.count", len(records)),
)

// ... validation code ...

observability.AddSpanEvent(ctx, "validation.completed",
    attribute.Int("valid.records", validCount),
    attribute.Int("invalid.records", invalidCount),
)
```

## Edge Function Tracing

Fluxbase automatically traces edge function execution:
```typescript
// Your Deno function
import { tracer } from "https://deno.land/x/otel@v0.1.0/api.ts";

// Span context is automatically available via environment variables
const traceParent = Deno.env.get("TRACEPARENT");
const traceId = Deno.env.get("OTEL_TRACE_ID");
const spanId = Deno.env.get("OTEL_SPAN_ID");

// Fluxbase automatically creates function spans with attributes:
// - function.execution_id
// - function.name
// - function.namespace
// - user.id (if authenticated)
// - http.method
// - http.url
```

Function Span Events:
```typescript
// Add custom events to function spans
await fetch("https://api.example.com/data", {
  headers: {
    "traceparent": traceParent, // Propagate trace context
  },
});
```

## Background Job Tracing

Jobs are automatically traced with progress tracking:
```go
// A job span is created when the job starts
ctx, span := observability.StartJobSpan(ctx, observability.JobSpanConfig{
    JobID:       jobID,
    JobName:     "send-email",
    Namespace:   "notifications",
    Priority:    5,
    ScheduledAt: scheduledAt,
    WorkerID:    workerID,
    WorkerName:  "worker-1",
    UserID:      userID,
})
defer span.End()

// Track job progress
observability.SetJobProgress(ctx, 25, "Email queued")
// ... send email ...
observability.SetJobProgress(ctx, 50, "Email sent")
// ... update database ...
observability.SetJobProgress(ctx, 100, "Completed")

// Set the final result
observability.SetJobResult(ctx, "completed", duration, nil)
```

## Trace Context Propagation

Trace context automatically propagates to:
- Database Queries: All queries carry trace context
- HTTP Clients: Use the `traceparent` header
- Background Jobs: Jobs inherit the parent trace
- Edge Functions: Trace context is passed as environment variables
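Propagation across HTTP hops relies on the W3C Trace Context `traceparent` header, formatted as `version-traceid-parentid-flags`. A small sketch of building and parsing it — the helper names are illustrative, not Fluxbase APIs:

```go
package main

import (
	"fmt"
	"strings"
)

// buildTraceparent assembles a W3C traceparent header: a 2-char version
// ("00"), a 32-hex-char trace ID, a 16-hex-char parent span ID, and
// 2-char flags ("01" means the trace is sampled).
func buildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// parseTraceparent splits the header back into its fields, checking
// the expected field widths.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	h := buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(h) // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

	traceID, spanID, sampled, _ := parseTraceparent(h)
	fmt.Println(traceID, spanID, sampled)
}
```

Because every hop forwards this one header, the backend can stitch spans from different services into a single trace.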
Manual Propagation:
```go
import (
    "os"
    "os/exec"
)

// Get trace context for subprocesses
env := observability.GetTraceContextEnv(ctx)

// Pass to subprocess
cmd := exec.CommandContext(ctx, "my-subprocess")
cmd.Env = append(os.Environ(), flattenEnv(env)...)
```

## Sampling Strategies

Reduce tracing overhead with smart sampling:
```yaml
# Sample all traces in development
observability:
  tracing:
    sample_rate: 1.0  # 100% sampling
```

```yaml
# Sample 10% of traces in production
observability:
  tracing:
    sample_rate: 0.1  # 10% sampling
```

```yaml
# Dynamic sampling based on route
observability:
  tracing:
    sample_rate: 0.01  # 1% baseline
```

Head-Based Sampling:
```go
// Always trace slow operations
if duration > time.Second {
    observability.SetSpanAttributes(ctx,
        attribute.Bool("slow.request", true),
    )
}

// Always trace errors
if err != nil {
    observability.RecordError(ctx, err)
}
```

## Analyzing Traces
Section titled “Analyzing Traces”Identify Slow Queries
Section titled “Identify Slow Queries”Look for database spans with high duration:
- Open the Jaeger UI or Grafana Tempo
- Filter by operation `db.query` or `db.SELECT`
- Sort by duration
- Click on slow spans to see the SQL query
### Trace Errors Across Services

Follow an error through the system:
- Find traces with error status
- Expand the trace timeline
- Look for red error spans
- Click on error spans to see stack traces
### Performance Optimization

Identify optimization opportunities:
- Look for spans with high duration
- Check if spans run sequentially (could be parallelized)
- Identify N+1 query patterns
- Find slow external API calls
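When a trace shows independent spans running back-to-back, fanning them out concurrently is often the fix. A minimal sketch of the pattern — `fetchUser` and `fetchOrders` are placeholders for independent downstream calls:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchUser and fetchOrders stand in for two independent downstream
// calls that a trace might show running sequentially.
func fetchUser() string   { time.Sleep(50 * time.Millisecond); return "user" }
func fetchOrders() string { time.Sleep(50 * time.Millisecond); return "orders" }

// fetchParallel runs both calls concurrently: total latency becomes
// the slowest call rather than the sum of both.
func fetchParallel() (string, string) {
	var user, orders string
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser() }()
	go func() { defer wg.Done(); orders = fetchOrders() }()
	wg.Wait()
	return user, orders
}

func main() {
	start := time.Now()
	user, orders := fetchParallel()
	fmt.Printf("%s %s in %v\n", user, orders, time.Since(start).Round(10*time.Millisecond))
}
```

After a change like this, the two spans appear side by side in the trace timeline instead of stacked end to end.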
## Best Practices

### 1. Use Semantic Attributes

Follow OpenTelemetry semantic conventions:
```go
import semconv "go.opentelemetry.io/otel/semconv/v1.24.0"

observability.SetSpanAttributes(ctx,
    semconv.HTTPMethodKey.String("GET"),
    semconv.HTTPStatusCodeKey.Int(200),
    semconv.EnduserIDKey.String(userID),
)
```

### 2. Add Contextual Events

Track important events in spans:
```go
observability.AddSpanEvent(ctx, "cache.miss",
    attribute.String("cache.key", cacheKey),
)

observability.AddSpanEvent(ctx, "db.query.started")
// ... run query ...
observability.AddSpanEvent(ctx, "db.query.completed",
    attribute.Int("db.row_count", rowCount),
)
```

### 3. Set Appropriate Span Status

```go
import "go.opentelemetry.io/otel/codes"

if err != nil {
    observability.RecordError(ctx, err)
    span.SetStatus(codes.Error, err.Error())
} else {
    span.SetStatus(codes.Ok, "")
}
```

### 4. Use Span Links

Connect related spans:
```go
import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// Link to the background job span
span.AddLink(trace.Link{
    SpanContext: jobSpan.SpanContext(),
    Attributes: []attribute.KeyValue{
        attribute.String("job.id", jobID),
    },
})
```

### 5. Configure Resource Attributes

Identify the service generating traces:
```yaml
observability:
  tracing:
    service_name: "fluxbase"
    environment: "production"
```

Resource attributes added automatically:

- `service.name`: Service name
- `service.version`: Fluxbase version
- `deployment.environment`: Environment name
- `service.namespace`: "fluxbase"
## Troubleshooting

### No Traces Appearing

Check 1: Verify tracing is enabled
```bash
# Check logs for initialization message
grep "OpenTelemetry tracing initialized" /var/log/fluxbase/fluxbase.log
```

Check 2: Verify endpoint connectivity

```bash
# Test connection to collector
telnet localhost 4317
```

Check 3: Check sample rate

```yaml
# Ensure sample_rate > 0
observability:
  tracing:
    sample_rate: 1.0  # Try 100% sampling for testing
```

Check 4: Verify collector configuration

```bash
# Check collector logs
docker logs otel-collector
```

### Spans Not Connecting

Issue: Spans appear but don't form a complete trace.
Solution: Ensure trace context propagation is working:
- Check that requests include the `traceparent` header
- Verify that context is passed through function calls
- Check that spans use `defer span.End()`
### High Memory Usage

Issue: Tracing causes high memory usage.

Solutions:
1. Reduce the sample rate:

   ```yaml
   observability:
     tracing:
       sample_rate: 0.1  # Sample only 10%
   ```

2. Use batch processing:

   ```yaml
   # Collector configuration
   processors:
     batch:
       send_batch_size: 1000
       timeout: 10s
   ```

3. Limit span attributes:

   ```go
   // Avoid adding large attributes
   observability.SetSpanAttributes(ctx,
       attribute.String("huge.data", hugeDataString),     // Bad: bloats every span
       attribute.String("data.hash", hashData(hugeData)), // Good: small, still identifying
   )
   ```
## Performance Impact

Tracing overhead is minimal with proper configuration:
| Configuration | Overhead | Use Case |
|---|---|---|
| Sampling: 100% | ~5-10% | Development, critical paths |
| Sampling: 10% | ~1-2% | Production general |
| Sampling: 1% | <1% | High-traffic production |
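The rates in the table map to a per-trace keep/drop decision made deterministically from the trace ID, so every service in a trace agrees on whether to sample it. A simplified sketch of how a ratio-based head sampler can work (the real OpenTelemetry `TraceIDRatioBased` sampler differs in detail):

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample keeps a trace when the numeric value of the first
// 8 bytes of its trace ID falls below rate * 2^64. Since the decision
// depends only on the trace ID, every service computes the same answer
// for the same trace.
func shouldSample(traceIDHex string, rate float64) bool {
	b, err := hex.DecodeString(traceIDHex)
	if err != nil || len(b) < 8 {
		return false
	}
	v := binary.BigEndian.Uint64(b[:8])
	return float64(v) < rate*float64(^uint64(0))
}

func main() {
	fmt.Println(shouldSample("00000000000000010000000000000000", 0.1)) // tiny value: kept
	fmt.Println(shouldSample("ffffffffffffffff0000000000000000", 0.1)) // huge value: dropped
}
```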
Optimization Tips:
- Use sampling in production
- Disable tracing for health checks
- Use batch exporters
- Filter sensitive data from spans
- Set appropriate span timeout
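"Disable tracing for health checks" usually means short-circuiting the sampling decision for known noisy routes before any span is created. A sketch of such a path filter — the route list and helper name are assumptions, not a Fluxbase API:

```go
package main

import (
	"fmt"
	"strings"
)

// untracedPaths lists routes whose traffic is high-volume and low-value
// for tracing; requests matching them never start a span.
var untracedPaths = []string{"/health", "/healthz", "/metrics"}

// shouldTrace reports whether a request path is worth tracing.
func shouldTrace(path string) bool {
	for _, p := range untracedPaths {
		if path == p || strings.HasPrefix(path, p+"/") {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(shouldTrace("/health"))     // false
	fmt.Println(shouldTrace("/api/orders")) // true
}
```

A check like this typically runs in HTTP middleware, ahead of the tracer, so filtered routes pay no tracing cost at all.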