# Distributed Tracing
Distributed tracing provides end-to-end visibility into requests as they travel through your Fluxbase instance. Fluxbase uses OpenTelemetry for standardized, vendor-agnostic tracing.
## What is Distributed Tracing?

Distributed tracing tracks a request as it moves through different services and components. Each unit of work is called a "span", and spans are combined into a "trace" that shows the complete journey of a request.
Key Benefits:
- Performance Analysis: Identify slow database queries, API calls, and function executions
- Error Debugging: Trace errors across service boundaries
- Architecture Understanding: Visualize service dependencies and data flow
- Capacity Planning: Make data-driven decisions about scaling
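The span/trace relationship can be illustrated with a minimal, self-contained Go sketch. The `Span` type and field names here are illustrative only, not part of the Fluxbase API:

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a single unit of work. Spans sharing a TraceID belong to the
// same trace; ParentID links a child span to its parent.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Ms       int // duration in milliseconds
}

// formatTrace renders a trace as an indented timeline: the root span is
// the full request, child spans are the operations inside it.
func formatTrace(spans []Span) string {
	var b strings.Builder
	for _, s := range spans {
		if s.ParentID != "" {
			b.WriteString("  ")
		}
		fmt.Fprintf(&b, "%s (%dms)\n", s.Name, s.Ms)
	}
	return b.String()
}

func main() {
	trace := []Span{
		{TraceID: "abc123", SpanID: "1", Name: "POST /api/orders", Ms: 120},
		{TraceID: "abc123", SpanID: "2", ParentID: "1", Name: "auth.verify_token", Ms: 5},
		{TraceID: "abc123", SpanID: "3", ParentID: "1", Name: "db.INSERT orders", Ms: 90},
	}
	fmt.Print(formatTrace(trace))
}
```

This is the shape a tracing UI reconstructs for you: one root span per request, with child spans revealing where the time went.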
## Configuration

Enable OpenTelemetry tracing in your `fluxbase.yaml`:
```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"  # OTLP collector endpoint
    service_name: "fluxbase"    # Service name for traces
    environment: "production"   # Environment (development, staging, production)
    sample_rate: 1.0            # Sample rate (0.0-1.0, 1.0 = 100%)
    insecure: false             # Keep false in production so TLS is used
```

Environment Variables:
```bash
export FLUXBASE_OBSERVABILITY_TRACING_ENABLED=true
export FLUXBASE_OBSERVABILITY_TRACING_ENDPOINT="collector.example.com:4317"
export FLUXBASE_OBSERVABILITY_TRACING_SERVICE_NAME="fluxbase"
export FLUXBASE_OBSERVABILITY_TRACING_ENVIRONMENT="production"
export FLUXBASE_OBSERVABILITY_TRACING_SAMPLE_RATE=0.1  # Sample 10% of traces
```

## Setting Up Trace Backends

Fluxbase uses the OTLP (OpenTelemetry Protocol) format, which is compatible with many backends:
### 1. Jaeger

Jaeger is a popular open-source tracing backend.
Run Jaeger with Docker:
```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```

Configure Fluxbase:

```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
    insecure: true  # For local development
```

Access Jaeger UI:
- Navigate to `http://localhost:16686`
- Browse traces by service, operation, and tags
### 2. Grafana Tempo

Grafana Tempo is a scalable, high-performance distributed tracing backend.
Run Tempo with Docker:
```bash
docker run -d --name tempo \
  -p 4317:4317 \
  -p 3200:3200 \
  grafana/tempo:latest \
  -server.http-listen-port=3200 \
  -storage.trace.backend=local \
  -storage.trace.local.path=/tmp/tempo
```

Configure Fluxbase:

```yaml
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

Access Tempo UI:
- Use Grafana with a Tempo data source
- Navigate to Grafana → Explore → select the Tempo data source
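If you provision Grafana from files, the Tempo data source can be declared with an entry along these lines (the file path and the Tempo URL are assumptions for the local Docker setup above):

```yaml
# /etc/grafana/provisioning/datasources/tempo.yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://localhost:3200
```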
### 3. OpenTelemetry Collector

For production deployments, use the OpenTelemetry Collector as a central processing pipeline:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
```

Run Collector:

```bash
docker run -d --name otel-collector \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml \
  -p 4317:4317 \
  otel/opentelemetry-collector:latest
```

### 4. Cloud Providers

AWS X-Ray:
```yaml
# Use AWS Distro for OpenTelemetry (ADOT) Collector
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"  # ADOT Collector endpoint
```

Google Cloud Trace:

```yaml
# Use OpenTelemetry Collector with Google Cloud Trace exporter
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

Azure Monitor:

```yaml
# Use Azure Monitor Application Insights exporter
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"
```

## Automatic Instrumentation

Fluxbase automatically creates spans for:
### Database Operations

All PostgreSQL queries are automatically traced:
```go
// Automatic span created for this query
ctx, span := observability.StartDBSpan(ctx, "SELECT", "users")

// Wrap in a closure so the deferred call sees the final err value
var err error
defer func() { observability.EndDBSpan(span, err) }()

rows, err := db.Query(ctx, "SELECT * FROM users WHERE id = $1", userID)
```

Span Attributes:

- `db.system`: "postgresql"
- `db.operation`: "SELECT", "INSERT", "UPDATE", or "DELETE"
- `db.table`: Table name
- Error status if the query fails
### Authentication Operations

All auth operations create spans:

- `auth.signup`
- `auth.signin`
- `auth.signout`
- `auth.oauth`
- `auth.magic_link`
### Storage Operations

File storage operations are traced:

- `storage.upload`
- `storage.download`
- `storage.delete`
## Custom Spans

Create custom spans for additional visibility:
### Basic Custom Span

```go
import "github.com/nimbleflux/fluxbase/internal/observability"

// Start a custom span
ctx, span := observability.StartSpan(ctx, "my-custom-operation")
defer span.End()

// Your code here
result, err := doSomething(ctx)

// Record the error if the operation failed
if err != nil {
    observability.RecordError(ctx, err)
}
```

### Span with Attributes
```go
import (
    "go.opentelemetry.io/otel/attribute"

    "github.com/nimbleflux/fluxbase/internal/observability"
)

ctx, span := observability.StartSpan(ctx, "process-data")
defer span.End()

// Add custom attributes
observability.SetSpanAttributes(ctx,
    attribute.String("user.id", userID),
    attribute.Int("record.count", len(records)),
    attribute.String("processing.type", "batch"),
)
```

### Span with Events
```go
// Add events to track progress
observability.AddSpanEvent(ctx, "validation.started",
    attribute.Int("record.count", len(records)),
)

// ... validation code ...

observability.AddSpanEvent(ctx, "validation.completed",
    attribute.Int("valid.records", validCount),
    attribute.Int("invalid.records", invalidCount),
)
```

## Edge Function Tracing

Fluxbase automatically traces edge function execution:
```typescript
// Your Deno function
import { tracer } from "https://deno.land/x/otel@v0.1.0/api.ts";

// Span context is automatically available via environment variables
const traceParent = Deno.env.get("TRACEPARENT");
const traceId = Deno.env.get("OTEL_TRACE_ID");
const spanId = Deno.env.get("OTEL_SPAN_ID");

// Fluxbase automatically creates function spans with attributes:
// - function.execution_id
// - function.name
// - function.namespace
// - user.id (if authenticated)
// - http.method
// - http.url
```

Function Span Events:
```typescript
// Add custom events to function spans
await fetch("https://api.example.com/data", {
  headers: {
    "traceparent": traceParent, // Propagate trace context
  },
});
```

## Background Job Tracing

Jobs are automatically traced with progress tracking:
```go
// A job span is created when the job starts
ctx, span := observability.StartJobSpan(ctx, observability.JobSpanConfig{
    JobID:       jobID,
    JobName:     "send-email",
    Namespace:   "notifications",
    Priority:    5,
    ScheduledAt: scheduledAt,
    WorkerID:    workerID,
    WorkerName:  "worker-1",
    UserID:      userID,
})
defer span.End()

// Track job progress
observability.SetJobProgress(ctx, 25, "Email queued")
// ... send email ...
observability.SetJobProgress(ctx, 50, "Email sent")
// ... update database ...
observability.SetJobProgress(ctx, 100, "Completed")

// Set the final result
observability.SetJobResult(ctx, "completed", duration, nil)
```

## Trace Context Propagation

Trace context automatically propagates to:
- Database Queries: All queries carry trace context
- HTTP Clients: Use the `traceparent` header
- Background Jobs: Jobs inherit the parent trace
- Edge Functions: Trace context is passed as environment variables
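Propagation across HTTP hops relies on the W3C Trace Context `traceparent` header, formatted as `version-traceid-parentid-flags`. A small sketch of building and parsing it — the helper names are illustrative, not Fluxbase APIs:

```go
package main

import (
	"fmt"
	"strings"
)

// buildTraceparent assembles a W3C traceparent header: a 2-char version
// ("00"), a 32-hex-char trace ID, a 16-hex-char parent span ID, and
// 2-char flags ("01" means the trace is sampled).
func buildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// parseTraceparent splits the header back into its fields, checking
// the expected field widths.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	h := buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(h) // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

	traceID, spanID, sampled, _ := parseTraceparent(h)
	fmt.Println(traceID, spanID, sampled)
}
```

Because every hop forwards this one header, the backend can stitch spans from different services into a single trace.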
Manual Propagation:
```go
import (
    "os"
    "os/exec"
)

// Get trace context for subprocesses
env := observability.GetTraceContextEnv(ctx)

// Pass to subprocess
cmd := exec.CommandContext(ctx, "my-subprocess")
cmd.Env = append(os.Environ(), flattenEnv(env)...)
```

## Sampling Strategies

Reduce tracing overhead with smart sampling:
```yaml
# Sample all traces in development
observability:
  tracing:
    sample_rate: 1.0  # 100% sampling
```

```yaml
# Sample 10% of traces in production
observability:
  tracing:
    sample_rate: 0.1  # 10% sampling
```

```yaml
# Dynamic sampling based on route
observability:
  tracing:
    sample_rate: 0.01  # 1% baseline
```

Head-Based Sampling:
```go
// Always trace slow operations
if duration > time.Second {
    observability.SetSpanAttributes(ctx,
        attribute.Bool("slow.request", true),
    )
}

// Always trace errors
if err != nil {
    observability.RecordError(ctx, err)
}
```

## Analyzing Traces
Section titled “Analyzing Traces”Identify Slow Queries
Section titled “Identify Slow Queries”Look for database spans with high duration:
- Open the Jaeger UI or Grafana Tempo
- Filter by operation `db.query` or `db.SELECT`
- Sort by duration
- Click on slow spans to see the SQL query
### Trace Errors Across Services

Follow an error through the system:
- Find traces with error status
- Expand the trace timeline
- Look for red error spans
- Click on error spans to see stack traces
### Performance Optimization

Identify optimization opportunities:
- Look for spans with high duration
- Check if spans run sequentially (could be parallelized)
- Identify N+1 query patterns
- Find slow external API calls
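When a trace shows independent spans running back-to-back, fanning them out concurrently is often the fix. A minimal sketch of the pattern — `fetchUser` and `fetchOrders` are placeholders for independent downstream calls:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchUser and fetchOrders stand in for two independent downstream
// calls that a trace might show running sequentially.
func fetchUser() string   { time.Sleep(50 * time.Millisecond); return "user" }
func fetchOrders() string { time.Sleep(50 * time.Millisecond); return "orders" }

// fetchParallel runs both calls concurrently: total latency becomes
// the slowest call rather than the sum of both.
func fetchParallel() (string, string) {
	var user, orders string
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser() }()
	go func() { defer wg.Done(); orders = fetchOrders() }()
	wg.Wait()
	return user, orders
}

func main() {
	start := time.Now()
	user, orders := fetchParallel()
	fmt.Printf("%s %s in %v\n", user, orders, time.Since(start).Round(10*time.Millisecond))
}
```

After a change like this, the two spans appear side by side in the trace timeline instead of stacked end to end.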
## Best Practices

### 1. Use Semantic Attributes

Follow OpenTelemetry semantic conventions:
```go
import semconv "go.opentelemetry.io/otel/semconv/v1.24.0"

observability.SetSpanAttributes(ctx,
    semconv.HTTPMethodKey.String("GET"),
    semconv.HTTPStatusCodeKey.Int(200),
    semconv.EnduserIDKey.String(userID),
)
```

### 2. Add Contextual Events

Track important events in spans:
```go
observability.AddSpanEvent(ctx, "cache.miss",
    attribute.String("cache.key", cacheKey),
)

observability.AddSpanEvent(ctx, "db.query.started")
// ... run query ...
observability.AddSpanEvent(ctx, "db.query.completed",
    attribute.Int("db.row_count", rowCount),
)
```

### 3. Set Appropriate Span Status

```go
import "go.opentelemetry.io/otel/codes"

if err != nil {
    observability.RecordError(ctx, err)
    span.SetStatus(codes.Error, err.Error())
} else {
    span.SetStatus(codes.Ok, "")
}
```

### 4. Use Span Links

Connect related spans:
```go
import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// Link to the background job span
span.AddLink(trace.Link{
    SpanContext: jobSpan.SpanContext(),
    Attributes: []attribute.KeyValue{
        attribute.String("job.id", jobID),
    },
})
```

### 5. Configure Resource Attributes

Identify the service generating traces:
```yaml
observability:
  tracing:
    service_name: "fluxbase"
    environment: "production"
```

Resource attributes added automatically:

- `service.name`: Service name
- `service.version`: Fluxbase version
- `deployment.environment`: Environment name
- `service.namespace`: "fluxbase"
## Troubleshooting

### No Traces Appearing

Check 1: Verify tracing is enabled
```bash
# Check logs for initialization message
grep "OpenTelemetry tracing initialized" /var/log/fluxbase/fluxbase.log
```

Check 2: Verify endpoint connectivity

```bash
# Test connection to collector
telnet localhost 4317
```

Check 3: Check sample rate

```yaml
# Ensure sample_rate > 0
observability:
  tracing:
    sample_rate: 1.0  # Try 100% sampling for testing
```

Check 4: Verify collector configuration

```bash
# Check collector logs
docker logs otel-collector
```

### Spans Not Connecting

Issue: Spans appear but don't form a complete trace.
Solution: Ensure trace context propagation is working:
- Check that requests include the `traceparent` header
- Verify that context is passed through function calls
- Check that spans use `defer span.End()`
### High Memory Usage

Issue: Tracing causes high memory usage.

Solutions:
1. Reduce the sample rate:

   ```yaml
   observability:
     tracing:
       sample_rate: 0.1  # Sample only 10%
   ```

2. Use batch processing:

   ```yaml
   # Collector configuration
   processors:
     batch:
       send_batch_size: 1000
       timeout: 10s
   ```

3. Limit span attributes:

   ```go
   // Avoid adding large attributes
   observability.SetSpanAttributes(ctx,
       attribute.String("huge.data", hugeDataString),     // Bad: bloats every span
       attribute.String("data.hash", hashData(hugeData)), // Good: small, still identifying
   )
   ```
## Performance Impact

Tracing overhead is minimal with proper configuration:
| Configuration | Overhead | Use Case |
|---|---|---|
| Sampling: 100% | ~5-10% | Development, critical paths |
| Sampling: 10% | ~1-2% | Production general |
| Sampling: 1% | <1% | High-traffic production |
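The rates in the table map to a per-trace keep/drop decision made deterministically from the trace ID, so every service in a trace agrees on whether to sample it. A simplified sketch of how a ratio-based head sampler can work (the real OpenTelemetry `TraceIDRatioBased` sampler differs in detail):

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample keeps a trace when the numeric value of the first
// 8 bytes of its trace ID falls below rate * 2^64. Since the decision
// depends only on the trace ID, every service computes the same answer
// for the same trace.
func shouldSample(traceIDHex string, rate float64) bool {
	b, err := hex.DecodeString(traceIDHex)
	if err != nil || len(b) < 8 {
		return false
	}
	v := binary.BigEndian.Uint64(b[:8])
	return float64(v) < rate*float64(^uint64(0))
}

func main() {
	fmt.Println(shouldSample("00000000000000010000000000000000", 0.1)) // tiny value: kept
	fmt.Println(shouldSample("ffffffffffffffff0000000000000000", 0.1)) // huge value: dropped
}
```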
Optimization Tips:
- Use sampling in production
- Disable tracing for health checks
- Use batch exporters
- Filter sensitive data from spans
- Set appropriate span timeout
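"Disable tracing for health checks" usually means short-circuiting the sampling decision for known noisy routes before any span is created. A sketch of such a path filter — the route list and helper name are assumptions, not a Fluxbase API:

```go
package main

import (
	"fmt"
	"strings"
)

// untracedPaths lists routes whose traffic is high-volume and low-value
// for tracing; requests matching them never start a span.
var untracedPaths = []string{"/health", "/healthz", "/metrics"}

// shouldTrace reports whether a request path is worth tracing.
func shouldTrace(path string) bool {
	for _, p := range untracedPaths {
		if path == p || strings.HasPrefix(path, p+"/") {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(shouldTrace("/health"))     // false
	fmt.Println(shouldTrace("/api/orders")) // true
}
```

A check like this typically runs in HTTP middleware, ahead of the tracer, so filtered routes pay no tracing cost at all.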