The three pillars are not enough. Here is how to build a full observability stack with OpenTelemetry, structured logging, distributed tracing, and alerting that does not wake you up for nothing — across languages and infrastructure.
I have been paged at 3 AM more times than I want to count. Some of those incidents were resolved in four minutes because the telemetry told me exactly what was wrong. Others took four hours because I was staring at a wall of unstructured logs trying to correlate timestamps by hand across six different services. The difference was never talent or experience. It was always the quality of the observability stack.
This post is not about observability in a single language or framework. I wrote a separate post about Node.js observability specifically. This one is about the full picture: the architecture decisions, the storage backends, the alerting philosophy, the cost traps, and the debugging workflow that ties it all together. This is what I wish someone had handed me before I spent two years learning these lessons through production incidents.
Every observability vendor presentation starts the same way: "There are three pillars of observability: logs, metrics, and traces." They draw three columns on a slide, put a checkmark next to each one, and move on to pricing.
Here is the problem: three isolated pillars are just three isolated data silos. Having logs in Elasticsearch, metrics in Prometheus, and traces in Jaeger does not give you observability. It gives you three different tools you have to manually cross-reference while your pager is screaming.
The actual value of observability comes from correlation. When a metric shows a latency spike at 03:14, you need to click on that spike and see the traces that contributed to it. When you find a slow trace, you need to see the logs emitted during that trace. When a log shows a database error, you need to see the metric that tells you how many other requests hit the same error.
Without correlation, you have three separate haystack-searching tools. With correlation, you have a unified debugging experience.
The missing piece that makes correlation work is context propagation. Every request entering your system gets a trace ID. That trace ID flows through every service, every log line, every metric exemplar. When you query any single pillar, you can pivot to the other two because they all share that trace ID.
# What an uncorrelated log looks like — useless at 3 AM
2026-03-15 03:14:07 ERROR Database connection timeout
# What a correlated log looks like — you can actually find the trace
{
"timestamp": "2026-03-15T03:14:07.234Z",
"level": "error",
"message": "Database connection timeout",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"service": "order-service",
"db.system": "postgresql",
"db.statement": "SELECT * FROM orders WHERE user_id = $1",
"db.connection_pool.active": 48,
"db.connection_pool.max": 50
}That second log line contains everything you need to begin debugging. The trace ID lets you pull the full distributed trace. The connection pool stats tell you the pool is nearly exhausted. The service name tells you which deployment to investigate. The span ID links this log to a specific operation within the trace.
This is why I say the three pillars are a lie. The pillar metaphor suggests you build them independently and you are done. The reality is that the connections between the pillars are where the debugging value lives.
For years, the observability ecosystem was fragmented. You had OpenTracing for traces, OpenCensus for metrics, vendor-specific agents for everything else, and none of them talked to each other. If you picked Datadog, you were locked into Datadog. If you picked Jaeger, you had a different instrumentation library than if you picked Zipkin.
OpenTelemetry merged OpenTracing and OpenCensus, and it genuinely won. It is now the second most active CNCF project after Kubernetes. Every major vendor supports it. Every major language has an SDK. You instrument your code once and send telemetry to whatever backend you want.
The architecture has three layers, and understanding all three is critical.
The SDK is what you import into your application. It provides APIs for creating spans, recording metrics, and emitting structured logs. The critical insight is that most of the SDK work is automatic. You do not need to manually create spans for every HTTP request or database query. Auto-instrumentation does that for you.
# Python: auto-instrumentation with zero code changes
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# opentelemetry-bootstrap -a install
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
# Set up the tracer provider
provider = TracerProvider()
provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
# Auto-instrument everything — Flask, SQLAlchemy, Redis
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RedisInstrumentor().instrument()
# That's it. Every incoming HTTP request, every DB query, every Redis call
# now generates spans with timing, status codes, and error details.

// Go: auto-instrumentation for net/http and database/sql
package main
import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/contrib/instrumentation/database/sql/otelsql"
)
func main() {
    ctx := context.Background()
    exporter, _ := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    tp := trace.NewTracerProvider(trace.WithBatcher(exporter))
    otel.SetTracerProvider(tp)
    // Wrap your HTTP handler — all incoming requests get traced
    mux := http.NewServeMux()
    handler := otelhttp.NewHandler(mux, "server")
    // Wrap your database driver — all queries get traced
    // (dsn is your connection string, defined elsewhere)
    db, _ := otelsql.Open("postgres", dsn)
}

The beauty of auto-instrumentation is that you get 80% of the value with 5% of the effort. Every HTTP call, database query, cache lookup, and external API call gets a span automatically. You only write manual instrumentation for business-logic-specific operations.
The OpenTelemetry Collector is a standalone binary that receives, processes, and exports telemetry data. It sits between your applications and your storage backends. This is the most underappreciated component in the entire stack.
Why not just export directly from your application to Prometheus or Jaeger? Three reasons.
First, decoupling. If you decide to switch from Jaeger to Tempo, you change the Collector config, not every application. Second, processing. The Collector can filter, sample, enrich, and transform telemetry before it hits storage. Third, reliability. The Collector buffers data, so if your storage backend has a brief outage, you do not lose telemetry.
# otel-collector-config.yaml — a production-ready pipeline
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 8192
memory_limiter:
check_interval: 1s
limit_mib: 2048
spike_limit_mib: 512
attributes:
actions:
- key: environment
value: production
action: upsert
- key: db.statement
action: hash # Don't store raw SQL — PII risk
tail_sampling:
decision_wait: 10s
policies:
- name: error-traces
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-traces
type: latency
latency: { threshold_ms: 2000 }
- name: probabilistic
type: probabilistic
probabilistic: { sampling_percentage: 10 }
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://mimir:9009/api/v1/push
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, attributes, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [memory_limiter, attributes, batch]
exporters: [loki]

That tail sampling config is critical. It keeps 100% of error traces and slow traces, but only 10% of successful fast traces. This alone can cut your trace storage costs by 80% while keeping every trace you actually care about.
This is where your telemetry lands for querying. I will cover the specific backend choices later, but the key architectural point is that OpenTelemetry does not care which backends you use. You can run Grafana's stack (Loki, Mimir, Tempo), the Elastic stack, Datadog, or any combination. The Collector abstracts all of that away.
Every production codebase I have inherited has the same problem: a mix of console.log, logger.info("Processing request for user " + userId), and the occasional System.out.println left over from debugging. This is a debugging tax you pay on every single incident.
Unstructured logs require parsing. Parsing requires regex. Regex breaks when someone changes the log format. You end up writing log parsing rules that are more complex than the application code that generated the logs.
Structured logging means every log entry is a machine-parseable object with consistent fields. You never concatenate strings into log messages. Instead, you pass data as separate fields.
// Bad: string concatenation — impossible to query efficiently
logger.info("Order " + orderId + " placed by user " + userId
+ " for $" + amount + " with " + items.size() + " items");
// Good: structured fields — every field is independently queryable
logger.info("Order placed",
kv("order_id", orderId),
kv("user_id", userId),
kv("amount", amount),
kv("item_count", items.size()),
kv("payment_method", paymentMethod),
kv("trace_id", Span.current().getSpanContext().getTraceId())
);

// Node.js with Pino — structured by default
const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
mixin() {
const span = trace.getActiveSpan();
if (span) {
const ctx = span.spanContext();
return {
trace_id: ctx.traceId,
span_id: ctx.spanId,
};
}
return {};
},
});
// Every log line automatically gets trace_id and span_id
logger.info({ orderId, userId, amount, itemCount: items.length },
'Order placed');
// Output: {"level":"info","trace_id":"abc123","span_id":"def456",
// "orderId":"ord-789","userId":"usr-012","amount":149.99,
// "itemCount":3,"msg":"Order placed"}

The key principle: the msg field is a human-readable summary that never contains variable data. All variable data goes into separate fields. This means you can query msg = "Order placed" to find all order events, then filter by amount > 1000 or payment_method = "crypto" without writing a regex.
Most teams either log everything at INFO or use log levels inconsistently. Here is the framework I use:
ERROR: Something is broken and a human needs to investigate. A request failed, data might be corrupted, an external dependency is down. Every ERROR log should be actionable. If you see an ERROR and shrug, it should not be an ERROR.
WARN: Something is degraded but still functional. Connection pool is at 80% capacity. A retry succeeded but the first attempt failed. Response time exceeded the SLO threshold but the request still completed. WARNs are leading indicators of future ERRORs.
INFO: Significant business events. Order placed, user registered, payment processed, deployment started. These are the events you look at when you are investigating what happened, not when you are fighting a fire.
DEBUG: Implementation details useful during development. SQL queries, cache hit/miss, serialization timing. This level should be OFF in production by default but switchable at runtime without a redeploy.
The runtime-switchable part is important. When you are debugging a production issue, you need to be able to turn on DEBUG logging for a specific service without redeploying. This means your logging configuration should be driven by an environment variable or a config endpoint, not a build-time constant.
// Go: runtime-switchable log level via HTTP endpoint
import (
    "log/slog"
    "net/http"
    "os"
    "strings"
)

var logLevel = new(slog.LevelVar)
func init() {
logLevel.Set(slog.LevelInfo)
handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: logLevel,
})
slog.SetDefault(slog.New(handler))
}
// Expose an endpoint to change log level at runtime
// POST /admin/log-level?level=debug
func handleLogLevel(w http.ResponseWriter, r *http.Request) {
level := r.URL.Query().Get("level")
switch strings.ToLower(level) {
case "debug":
logLevel.Set(slog.LevelDebug)
case "info":
logLevel.Set(slog.LevelInfo)
case "warn":
logLevel.Set(slog.LevelWarn)
case "error":
logLevel.Set(slog.LevelError)
}
w.WriteHeader(http.StatusOK)
}

There are two canonical frameworks for choosing which metrics to track, and using the wrong one for a given component will leave you blind.
RED stands for Rate, Errors, and Duration. It applies to anything that serves requests: APIs, web servers, RPC services, GraphQL endpoints.
Rate: Requests per second. This tells you traffic volume. A sudden drop in rate is often more alarming than a spike, because it means clients cannot reach you or have given up.
Errors: Failed requests per second (or error percentage). Distinguish between client errors (4xx) and server errors (5xx). A spike in 400s means someone is sending bad requests, possibly an API integration issue. A spike in 500s means your code is broken.
Duration: Response time distribution. Never track just the average. Averages hide outliers. Track p50, p95, and p99. The p50 tells you what the typical user experiences. The p99 tells you what your most unlucky users experience. When p50 is fine but p99 is terrible, you have a tail latency problem, often caused by garbage collection, connection pool exhaustion, or a hot partition.
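To see why averages lie, here is a tiny self-contained sketch with made-up numbers. The `percentile` helper is a nearest-rank illustration, not a library function:

```python
import statistics

# Illustrative distribution: 980 fast requests plus 20 that hit a slow path
# (GC pause, pool exhaustion, hot partition). All numbers are made up.
latencies_ms = [50] * 980 + [5000] * 20

mean_ms = statistics.mean(latencies_ms)   # 149 ms: looks almost healthy
ranked = sorted(latencies_ms)

def percentile(values, p):
    # Nearest-rank percentile on a sorted list (illustrative helper)
    return values[min(len(values) - 1, int(p * len(values)))]

p50 = percentile(ranked, 0.50)   # 50 ms: the typical user is fine
p99 = percentile(ranked, 0.99)   # 5000 ms: the unlucky tail waits five seconds
```

The average sits at 149 ms and raises no eyebrows, while two percent of your users are waiting five full seconds. Only the percentiles expose that.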
# Prometheus recording rules for RED metrics
groups:
- name: red_metrics
interval: 30s
rules:
# Rate
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service, method, path)
# Errors
- record: service:http_errors:rate5m
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Duration (p50, p95, p99)
- record: service:http_duration:p99_5m
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m]))
by (service, le)
)
- record: service:http_duration:p95_5m
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m]))
by (service, le)
)

USE stands for Utilization, Saturation, and Errors. It applies to infrastructure components: CPUs, memory, disks, network interfaces, connection pools, thread pools, queue depths.
Utilization: How busy is the resource? CPU at 85%, disk at 70% capacity, connection pool with 45 of 50 connections in use.
Saturation: How much work is queued waiting for the resource? This is the metric most people miss. A CPU at 85% utilization with zero saturation is fine. A CPU at 85% utilization with a load average of 12 on a 4-core machine is in trouble. Queue depth for message consumers, wait time for connection pool checkout, and thread pool queue size are all saturation metrics.
Errors: Hardware or resource-level errors. Disk I/O errors, network packet drops, connection pool timeout errors.
The key insight: use RED for your services and USE for the resources those services depend on. When RED metrics show a problem (latency spike), USE metrics tell you which resource is the bottleneck (database connection pool saturated). They are complementary, not alternatives.
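A toy sketch of the utilization-versus-saturation distinction, using a hypothetical `PoolStats` type (the field names are illustrative, not from any real driver):

```python
from dataclasses import dataclass

# Hypothetical connection pool snapshot: utilization alone does not tell you
# whether you are in trouble; saturation (queued waiters) is the leading signal.
@dataclass
class PoolStats:
    in_use: int
    max_size: int
    waiters: int  # checkouts currently blocked waiting for a connection

    @property
    def utilization(self) -> float:
        return self.in_use / self.max_size

    @property
    def saturated(self) -> bool:
        # Any queued waiter means demand exceeds capacity right now
        return self.waiters > 0

busy_but_fine = PoolStats(in_use=42, max_size=50, waiters=0)   # 84%, no queue
in_trouble = PoolStats(in_use=50, max_size=50, waiters=12)     # full and queueing
```

Both pools look "busy" on a utilization graph; only the second one is hurting requests.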
The concept of distributed tracing is simple: follow a request as it travels through multiple services, recording the time spent in each one. The implementation details are where teams struggle.
A trace is a tree of spans. The root span represents the initial request (e.g., an HTTP request hitting your API gateway). Child spans represent operations within that request: database queries, downstream HTTP calls, cache lookups, message queue publishes.
# Manual span creation for business logic
from opentelemetry import trace
tracer = trace.get_tracer("order-service")
def process_order(order_data):
with tracer.start_as_current_span("process_order",
attributes={
"order.id": order_data["id"],
"order.item_count": len(order_data["items"]),
}
) as span:
# Validate inventory — child span created automatically by
# the HTTP client instrumentation
inventory = check_inventory(order_data["items"])
if not inventory.available:
span.set_status(trace.StatusCode.ERROR, "Inventory unavailable")
span.set_attribute("order.failure_reason", "out_of_stock")
raise OutOfStockError(inventory.missing_items)
# Process payment — another child span
with tracer.start_as_current_span("process_payment",
attributes={"payment.method": order_data["payment_method"]}
):
payment = charge_customer(order_data)
span.set_attribute("payment.transaction_id", payment.tx_id)
# Publish event — the trace context propagates into the message
with tracer.start_as_current_span("publish_order_event"):
publish_to_queue("order.completed", {
"order_id": order_data["id"],
"amount": order_data["total"],
})

The hardest part of distributed tracing is propagating context across service boundaries. For HTTP, this is handled by the W3C Trace Context standard, which uses two headers: traceparent and tracestate. Auto-instrumentation handles this for you in most HTTP clients.
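For intuition, here is roughly what injecting and extracting a traceparent header looks like, stripped down to pure Python. Real SDKs do this through propagator APIs; `inject` and `extract` here are illustrative helpers:

```python
import re

# W3C Trace Context format: version "00", 32-hex trace-id, 16-hex span-id,
# 2-hex flags, joined by dashes.
def inject(headers, trace_id, span_id, sampled=True):
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def extract(headers):
    m = re.fullmatch(
        r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
        r"-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})",
        headers.get("traceparent", ""),
    )
    return m.groupdict() if m else None

headers = {}
inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = extract(headers)   # the downstream service rejoins the same trace
```

The downstream service extracts the trace ID, creates its spans as children of the caller's span, and the trace tree stays connected across the network hop.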
But HTTP is the easy case. The tricky boundaries are:
Message queues: When you publish a message to Kafka, RabbitMQ, or SQS, the trace context needs to travel with the message. OpenTelemetry instrumentation libraries inject trace context into message headers. When the consumer picks up the message, it extracts the context and creates a new span that is a child of the producer's span. This means you can trace a request from the initial HTTP call, through a message queue, to the consumer processing it minutes later.
// Kafka producer — context injected into message headers
try (Scope scope = span.makeCurrent()) {
ProducerRecord<String, String> record =
new ProducerRecord<>("orders", orderId, payload);
// OpenTelemetry Kafka instrumentation injects traceparent
// into Kafka record headers automatically
producer.send(record);
}
// Kafka consumer — context extracted from message headers
// Again, auto-instrumentation handles this
@KafkaListener(topics = "orders")
public void processOrder(ConsumerRecord<String, String> record) {
// The span created here is a child of the producer's span
// You see the full trace: HTTP -> Kafka publish -> Kafka consume
orderService.fulfill(record.value());
}

gRPC: Context propagation works through gRPC metadata. The OpenTelemetry gRPC interceptors handle this automatically for both unary and streaming calls.
Async workflows: When a request triggers an async job (e.g., writing to a job queue that a worker picks up later), you need to decide: should the job be part of the original trace or a new trace that links back to the original? I use the "link" approach. The worker creates a new trace with a link to the original trace. This keeps traces manageable in length while preserving the causal relationship.
The OpenTelemetry Collector is the default choice, but it is not the only option. Here is when to use what.
OpenTelemetry Collector is the right choice if you are starting fresh or your primary telemetry is traces and metrics from OpenTelemetry-instrumented applications. It natively understands OTLP (the OpenTelemetry protocol), has processors for sampling, filtering, and enrichment, and exports to almost every backend. It is also the only option that properly handles tail-based sampling for traces.
Vector (from Datadog, but open source) is excellent for log-heavy workloads. It has a powerful transformation language (VRL), better performance than Fluentd for high-volume log processing, and a sophisticated topology model. If you have a lot of legacy logs that need parsing and transformation before they hit your storage backend, Vector is worth evaluating.
Fluentd/Fluent Bit is the incumbent in the Kubernetes ecosystem. Fluent Bit is the lightweight version that runs as a DaemonSet, collecting container logs and forwarding them. If your primary concern is collecting container stdout/stderr logs and shipping them somewhere, Fluent Bit is simple and battle-tested.
My recommendation: run the OpenTelemetry Collector for traces and metrics, and either the Collector's log pipeline or Fluent Bit for logs, depending on your log volume and transformation needs. Do not run three different collectors if you can avoid it. The operational burden of maintaining multiple data pipelines is real.
This section could be its own book, so I will focus on the trade-offs that matter for each pillar.
Prometheus is the default choice for metrics in the Kubernetes ecosystem. It uses a pull model (it scrapes your applications), has a powerful query language (PromQL), and is incredibly efficient for time-series data. Its limitation is that it is fundamentally a single-node system. For multi-cluster or long-term storage, you need Thanos or Mimir.
Grafana Mimir (formerly Cortex) is horizontally scalable Prometheus. It accepts Prometheus remote write, stores data in object storage (S3, GCS), and supports multi-tenancy. If you have more than one Prometheus server or need more than two weeks of metric retention, Mimir is the upgrade path.
InfluxDB is the right choice if you have IoT-style metrics (high cardinality, irregular intervals) or your team is not comfortable with PromQL. Its query language (Flux) is more accessible. But in the cloud-native ecosystem, Prometheus compatibility is king, and most dashboards and alerting rules assume PromQL.
Elasticsearch gives you full-text search on your logs. You can run arbitrary queries, aggregate across fields, and build complex dashboards. The cost is operational complexity and resource consumption. An Elasticsearch cluster for production logs at scale requires dedicated care and feeding. You need to manage index lifecycles, shard allocation, and cluster capacity planning.
Grafana Loki takes a fundamentally different approach. It indexes only labels (service name, log level, namespace), not the log content itself. Queries filter by labels first, then grep through the matching log streams. This makes it dramatically cheaper to run at scale, but it means free-text search is slow on large time ranges. The trade-off is worth it for most teams, because when you are debugging an incident, you almost always know which service you are looking at. You are filtering by service name and time range, then scanning the results. Loki is fast at that pattern.
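A toy model of that design decision. Nothing here is Loki's actual API, just the shape of the idea: labels are indexed and cheap to filter on, content is only scanned after the label filter has narrowed the search:

```python
from collections import defaultdict

streams = defaultdict(list)  # frozen label-set -> raw log lines

def push(labels, line):
    streams[frozenset(labels.items())].append(line)

def query(label_filter, needle):
    wanted = frozenset(label_filter.items())
    # Step 1 (indexed, fast): select streams whose labels match the filter
    selected = [lines for key, lines in streams.items() if wanted <= key]
    # Step 2 (unindexed, linear): grep only the selected streams for the text
    return [line for lines in selected for line in lines if needle in line]

push({"service": "order-service", "level": "error"}, "db timeout trace_id=abc123")
push({"service": "cart-service", "level": "info"}, "cart updated user=42")
hits = query({"service": "order-service"}, "timeout")
```

The expensive step 2 only ever touches the streams step 1 selected, which is why narrowing by service and time range first makes Loki queries fast.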
My recommendation: start with Loki unless you have a specific need for full-text search across all services simultaneously. You can always add Elasticsearch later for specific use cases.
Jaeger is the original open-source distributed tracing backend. It is mature, well-documented, and supports multiple storage backends (Elasticsearch, Cassandra, Kafka). Its UI is functional but dated.
Grafana Tempo is a trace backend that uses object storage (S3, GCS) instead of a database. This makes it incredibly cheap to operate. It integrates with Grafana's explore view, and its trace-to-logs and trace-to-metrics features make cross-pillar correlation seamless. The trade-off is that Tempo does not index traces, so you cannot search by arbitrary attributes. You need the trace ID to retrieve a trace. This is fine if your metrics have exemplars (trace IDs attached to metric data points) and your logs have trace IDs.
My recommendation: Tempo if you are using the Grafana stack, Jaeger if you need attribute-based trace search (e.g., "find me all traces where user_id = 12345").
Alert fatigue is the number one reason observability stacks fail. The team sets up monitoring, creates fifty alert rules in the first week, gets paged constantly for non-issues, and starts ignoring alerts. Six months later, a real incident gets missed because everyone has learned to dismiss pages.
The fix is SLO-based alerting with error budgets and burn rate.
An SLO (Service Level Objective) is a target for your service's reliability. "99.9% of requests will return a successful response within 500ms over a 30-day window." This gives you an error budget: 0.1% of requests can fail or be slow. Over 30 days with 10 million requests per day, that is 10,000 failed requests per day before you breach your SLO.
The error budget reframes reliability from "never fail" (impossible) to "fail less than X" (actionable). It also gives you a tool for prioritizing: if your error budget is 80% consumed with two weeks left in the window, you should probably stop shipping features and fix reliability. If your error budget is 5% consumed, you have room to take risks with deployments.
Instead of alerting on "error rate > 1%", alert on the rate at which you are consuming your error budget.
A burn rate of 1x means you will exactly exhaust your error budget by the end of the window. A burn rate of 14.4x means you will exhaust your error budget in two days. A burn rate of 36x means you will exhaust it in 20 hours.
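The arithmetic is worth making concrete. A minimal sketch using the numbers from the text; `hours_to_exhaustion` is an illustrative helper, not part of any monitoring API:

```python
# Error budget and burn rate arithmetic for a 30-day, 99.9% SLO.
WINDOW_DAYS = 30
SLO = 0.999                                   # 99.9% success over 30 days
budget_fraction = 1 - SLO                     # 0.1% of requests may fail

requests_per_day = 10_000_000
failures_allowed_per_day = requests_per_day * budget_fraction   # 10,000/day

def hours_to_exhaustion(burn_rate, window_days=WINDOW_DAYS):
    # A 1x burn rate exhausts the budget exactly at the end of the window
    return window_days * 24 / burn_rate
```

At 1x the budget lasts the full 720 hours; at 14.4x it is gone in 50 hours; at 36x, in 20 hours. Those are the thresholds the alerting rules below are built on.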
# Prometheus alerting rules using multi-window burn rate
groups:
- name: slo_alerts
rules:
# Fast burn: 2% of error budget consumed in 1 hour
# 14.4x burn rate, checked over 5m and 1h windows
- alert: HighErrorBudgetBurn_Critical
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > (14.4 * 0.001)
and
(
sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (service)
/
sum(rate(http_requests_total[1h])) by (service)
) > (14.4 * 0.001)
labels:
severity: critical
annotations:
summary: "{{ $labels.service }} burning error budget 14.4x"
description: >
At current error rate, 30-day error budget will be
exhausted in ~50 hours. Immediate investigation required.
# Slow burn: 10% of error budget consumed in 3 days
# 1x burn rate, checked over 30m and 6h windows
- alert: HighErrorBudgetBurn_Warning
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[30m])) by (service)
/
sum(rate(http_requests_total[30m])) by (service)
) > (1.0 * 0.001)
and
(
sum(rate(http_requests_total{status_code=~"5.."}[6h])) by (service)
/
sum(rate(http_requests_total[6h])) by (service)
) > (1.0 * 0.001)
labels:
severity: warning
annotations:
summary: "{{ $labels.service }} slowly burning error budget"

The multi-window approach (checking both a short window and a long window) prevents false positives from brief spikes. A one-second error spike would trigger a 5-minute window alert but not a 1-hour window alert, so the alert does not fire. But a sustained error rate triggers both windows, so the alert fires and you can trust it.
This approach typically reduces alert volume by 90% compared to threshold-based alerting while catching real incidents faster. The alerts you do get are meaningful, so people actually respond to them.
I have seen hundreds of Grafana dashboards. Most of them are useless. They show metrics that look impressive in a demo but do not help anyone debug anything. Here is what actually works.
Dashboard 1: Service Overview. One dashboard per service with RED metrics (rate, errors, duration), the top 5 slowest endpoints, and the top 5 error-producing endpoints. This is what you look at first during an incident. It answers: "Is this service healthy, and if not, which endpoints are affected?"
Dashboard 2: Infrastructure. USE metrics for the resources your service depends on. Database connection pool utilization, CPU and memory, disk I/O, network throughput. This answers: "Is the problem in my code or in the infrastructure underneath it?"
Dashboard 3: Dependencies. Response time and error rate for every downstream service and database. This answers: "Is the problem in my service or in something my service calls?"
Dashboard 4: Business Metrics. Orders per minute, payment success rate, user registrations, whatever your business cares about. This answers: "Is the technical problem actually affecting users?"
Notice what is not here: a "system overview" dashboard with 47 panels showing every metric in the system. That dashboard helps no one. It takes 30 seconds to load and another 30 seconds to find the panel you care about. By the time you have oriented yourself, you could have already diagnosed the issue with a focused dashboard.
Three anti-patterns account for most bad dashboards. Vanity metrics: showing total request count (ever increasing, tells you nothing) instead of request rate (tells you about current traffic).
Missing time context: Dashboards without a comparison to the previous week. A p99 of 800ms means nothing without context. Is that normal? Is it 2x higher than last Tuesday? The "compare to previous period" feature in Grafana is the most underused feature I know.
No drill-down path: Every dashboard panel should be clickable. Click on a metric and you should land in a trace view showing specific requests that contributed to that metric. This requires metric exemplars, which I will cover in the debugging workflow section.
Observability is expensive. Datadog bills have bankrupted startup budgets. Even self-hosted solutions eat compute and storage. The number one cost driver is cardinality: the number of unique time series your metrics system tracks.
Every unique combination of metric name and label values creates a new time series. If you have a metric http_requests_total with labels {service, method, path, status_code}, and you have 10 services, 4 methods, 200 paths, and 50 status codes, you have 10 x 4 x 200 x 50 = 400,000 time series. That is manageable.
Now someone adds a user_id label. With 100,000 users, you have 40 billion time series. Your Prometheus server falls over, your Mimir cluster costs more than your application infrastructure, and the person who added that label does not understand why you are upset.
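The blow-up is just multiplication: the series count is the product of the distinct values of every label. A quick sketch with the label counts from the example above:

```python
import math

# Series count = product of distinct values across all labels.
def series_count(label_cardinalities):
    return math.prod(label_cardinalities.values())

bounded = series_count({
    "service": 10, "method": 4, "path": 200, "status_code": 50,
})   # 400,000 series: manageable

exploded = series_count({
    "service": 10, "method": 4, "path": 200, "status_code": 50,
    "user_id": 100_000,   # the unbounded label from the story above
})   # 40,000,000,000 series: game over
```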
Never use unbounded values as labels. User IDs, request IDs, email addresses, IP addresses -- these are log fields, not metric labels. If you need per-user metrics, use a different approach (log aggregation or a dedicated analytics system).
Use histograms instead of per-value tracking. Instead of tracking every unique response time, use histogram buckets. This gives you percentile calculations with bounded cardinality.
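A sketch of how bucketed histograms keep cardinality bounded while still answering percentile queries. The interpolation mirrors the idea behind PromQL's histogram_quantile, though real Prometheus histograms differ in detail:

```python
import bisect

# Fixed "le" bucket bounds: cardinality stays constant no matter how many
# distinct latency values you observe.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]

def observe(counts, value):
    # Cumulative buckets: bump every bucket whose upper bound covers the value
    for i in range(bisect.bisect_left(BOUNDS, value), len(BOUNDS)):
        counts[i] += 1

def quantile(counts, q):
    rank = q * counts[-1]                 # counts[-1] == total observations
    i = next(j for j, c in enumerate(counts) if c >= rank)
    lo = BOUNDS[i - 1] if i > 0 else 0.0
    below = counts[i - 1] if i > 0 else 0
    if BOUNDS[i] == float("inf"):
        return lo                         # cannot interpolate in the +Inf bucket
    # Linear interpolation inside the bucket, like histogram_quantile
    return lo + (BOUNDS[i] - lo) * (rank - below) / (counts[i] - below)

counts = [0] * len(BOUNDS)
for _ in range(100):
    observe(counts, 0.030)                # 100 observations of 30 ms
p50 = quantile(counts, 0.5)               # estimated inside the (0.025, 0.05] bucket
```

The estimate is approximate (the true p50 is 30 ms, the bucketed answer is 37.5 ms), which is the price of bounded cardinality.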
Drop unnecessary labels at the collector. The OpenTelemetry Collector can strip labels before they hit storage. This is your safety net for catching unbounded labels before they explode your cardinality.
# OpenTelemetry Collector: drop high-cardinality attributes
processors:
attributes:
actions:
# Remove user_id from metrics — use logs for per-user data
- key: user_id
action: delete
# Remove full URL path, keep only the route template
- key: http.target
action: delete
# Keep http.route which has bounded cardinality
# /users/:id instead of /users/12345

Storing every trace is expensive and unnecessary. Most requests are successful, fast, and boring. You want to keep the interesting ones.
Head-based sampling decides at the start of a trace whether to sample it. It is simple (flip a coin, keep 10%) but blind. You might drop the one trace that shows an error.
Tail-based sampling decides after the trace is complete. The Collector sees the entire trace, including whether it errored or was slow, and then decides whether to keep it. This is what you want. Keep 100% of error traces, 100% of slow traces, and a small percentage of everything else.
The trade-off is that tail-based sampling requires the Collector to buffer complete traces before making a decision, which means higher memory usage and the need for all spans of a trace to arrive at the same Collector instance (use a load balancer with trace-ID-based routing).
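Here is the decision logic in miniature. This is not the Collector's implementation, and the span dicts and field names are illustrative rather than the OTel data model; it just shows the three policies (error, latency, probabilistic) composed the way the sampling section describes:

```python
import random

# Keep every error trace, every trace slower than 2 s, and ~10% of the rest.
def keep_trace(spans, slow_ms=2000, sample_rate=0.10, rng=random.random):
    if any(s.get("status") == "ERROR" for s in spans):
        return True                        # policy 1: all error traces
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= slow_ms:
        return True                        # policy 2: all slow traces
    return rng() < sample_rate             # policy 3: 10% of the boring ones
```

Note that the decision needs the whole trace (every span's status and timing), which is exactly why tail sampling forces the buffering and trace-ID-routed load balancing described above.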
Not all telemetry ages equally. Metrics stay useful for months of capacity planning; raw logs and full traces lose most of their value within days. Set retention per signal instead of one blanket policy.
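As an illustration of per-signal retention (the numbers are placeholders, not a recommendation), these are the relevant knobs in a Prometheus/Loki/Tempo stack:

```yaml
# Prometheus (metrics, longest-lived): set via a flag, not the config file
#   --storage.tsdb.retention.time=90d

# Loki (logs): limits_config, with retention enabled on the compactor
limits_config:
  retention_period: 720h   # 30 days

# Tempo (traces, shortest-lived): compactor block retention
compactor:
  compaction:
    block_retention: 168h  # 7 days
```

Note these live in three different config files; they are shown together only to make the asymmetry visible.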
Here is the workflow that makes all of this machinery worth the investment. This is the sequence I follow at 3 AM when my pager goes off.
Step 1: Check the alert. The burn-rate alert tells me which service is burning error budget, at what rate, and when it started. The alert annotation tells me the current error rate and the expected impact.
Step 2: Open the service overview dashboard. I look at the RED metrics. Which endpoints are affected? Is it all traffic or just specific paths? When exactly did the degradation start? Was it gradual or sudden? (Gradual usually means resource exhaustion. Sudden usually means a bad deploy or an upstream failure.)
Step 3: Check the dependencies dashboard. Is a downstream service or database also degraded? If the latency spike started 30 seconds after a downstream service's latency spiked, I have my root cause direction.
Step 4: Click an exemplar. On the latency graph, I click a data point in the degraded period. The exemplar gives me a trace ID. I open that trace.
Step 5: Read the trace. The trace shows me the full request lifecycle. I can see that 94% of the request time was spent waiting for a PostgreSQL query that normally takes 5ms but now takes 2.3 seconds. The trace includes the query (hashed for PII safety), the database host, and the connection pool wait time.
Step 6: Pivot to logs. From the trace, I click "View logs for this span." The logs show me that the database connection pool has been at capacity for the last 8 minutes and connections are timing out. The logs also show that a background job (a nightly data aggregation) started 10 minutes ago and is holding 40 of the 50 pool connections.
Step 7: Fix it. I kill the background job, the pool drains, latency returns to normal. Then I file a ticket to move the background job to a separate connection pool. Total time from page to resolution: seven minutes.
That workflow is only possible because every component is correlated. The metric has an exemplar that links to a trace. The trace has span IDs that link to logs. The logs have connection pool metrics that link back to infrastructure dashboards. Without those links, each step becomes "search through a different tool and try to match timestamps by hand."
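For example, the trace-to-logs pivot in that workflow is nothing more than a field filter on the trace ID. In Loki's LogQL (the service name and trace ID here are made up), the query behind the "View logs for this span" button looks roughly like:

```
{service="order-service"} | json | trace_id="a1b2c3d4e5f6"
```

This only works if every log line actually carries the `trace_id` field, which is why structured logging with correlation IDs is the foundation everything else sits on.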
Let me walk through an actual incident to show how this works end-to-end.
Tuesday, 03:14 AM. PagerDuty fires: "api-gateway burning error budget at 6x. P99 latency: 4.2s (SLO: 1s). Affected for 12 minutes."
I open the api-gateway dashboard on my phone. The p99 latency graph shows a cliff: latency went from 200ms to 4 seconds at exactly 03:02. Not gradual. Something changed.
I check recent deployments. Nothing deployed since 18:00 yesterday. So this is not a bad deploy.
I open the dependencies dashboard. The api-gateway calls five downstream services: user-service, order-service, product-service, search-service, and notification-service. Four of them show normal latency. The order-service shows p99 latency of 3.8 seconds, starting at 03:02.
I drill into order-service. Its dashboard shows the latency spike is concentrated on the GET /orders/:id endpoint. Other endpoints are fine. The error rate is zero, meaning requests are completing, just slowly.
I click an exemplar on the order-service latency graph. The trace shows:
```
api-gateway (4.1s total)
└── order-service GET /orders/ord-88421 (3.8s)
    ├── redis.get orders:ord-88421 (0.3ms) — cache miss
    ├── postgresql SELECT * FROM orders WHERE id = $1 (3ms)
    ├── postgresql SELECT * FROM order_items WHERE order_id = $1 (2.2s) ← HERE
    └── product-service GET /products/batch (47ms)
```
The order_items query normally takes 5ms but is taking 2.2 seconds. I click on that span and pivot to logs. The logs show:
```json
{
  "level": "warn",
  "msg": "Slow query detected",
  "trace_id": "a1b2c3d4e5f6...",
  "db.statement_hash": "sel_order_items_by_oid",
  "db.duration_ms": 2247,
  "db.rows_returned": 3,
  "db.plan": "Seq Scan on order_items (rows=3 actual, rows=2400000 est)"
}
```

Sequential scan. The query planner is choosing a sequential scan on a table with 2.4 million rows instead of using the index. I check the database infrastructure dashboard. CPU is at 35%, I/O wait is at 60%. The disk is the bottleneck.
I SSH into the database and check: pg_stat_user_indexes shows the index on order_items.order_id exists but has zero scans. Then I check pg_stat_activity and find a long-running VACUUM FULL process that started at 02:58. The VACUUM FULL acquired an exclusive lock on the table, and while it was running, the query planner's statistics became stale (the autovacuum worker reset them), so it fell back to a sequential scan.
I cancel the VACUUM FULL (it was triggered by an overly aggressive autovacuum configuration we changed last week). Run ANALYZE order_items to refresh statistics. Within 30 seconds, the query planner switches back to the index scan, latency drops to normal, and the burn rate alert auto-resolves.
Follow-ups filed after the incident:

- Move VACUUM FULL to a maintenance window with reduced traffic.
- Alert on VACUUM FULL operations lasting more than 5 minutes on high-traffic tables.
- Block VACUUM FULL during peak hours.

Total time from alert to resolution: eleven minutes. Total time that would have taken without correlated telemetry: I genuinely do not know, but based on past incidents with poor observability, I would estimate 1-2 hours of grepping through logs trying to figure out which database table was involved.
Building an observability stack is not a weekend project. It is an ongoing investment that pays dividends every time something goes wrong. Here is the order I recommend for teams starting from scratch.
Phase 1: Structured logging with correlation IDs. This is the highest-value, lowest-effort change. Switch to structured logging, add trace IDs to every log line, ship logs to Loki or Elasticsearch. You can do this in a week and it immediately improves your debugging experience.
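A minimal stdlib-only sketch of what Phase 1 means in practice: every log line is JSON and carries a `trace_id` field. The UUID here is a stand-in; in a real service the ID comes from your tracing context (e.g. OpenTelemetry) or an incoming request header.

```python
import json
import logging
import uuid

# Sketch: structured logging with a correlation ID, stdlib only.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex  # stand-in for the active trace context
logger.info("order lookup failed", extra={"trace_id": trace_id})
# emits one JSON line, e.g. {"level": "info", "msg": "order lookup failed", ...}
```

Once every line looks like this, "find all logs for this request" becomes a single filter on `trace_id` instead of a timestamp hunt.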
Phase 2: RED metrics with SLO-based alerting. Instrument your services with OpenTelemetry, export metrics to Prometheus or Mimir, define SLOs, set up burn-rate alerts. This replaces your existing threshold-based alerts and dramatically reduces noise.
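As a sketch of what a burn-rate alert looks like as a Prometheus rule (the metric name `http_requests_total`, the job label, and the 99.9% SLO are assumptions; adapt them to your instrumentation):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # 14.4x burn against a 99.9% SLO (0.1% error budget).
        # Requiring both the 1h and 5m windows to exceed the threshold
        # pages on sustained burn, not momentary blips.
        expr: |
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "api is burning error budget at 14.4x (99.9% SLO)"
```

This is the multiwindow pattern: slower burns get their own rules with longer windows and lower thresholds that notify instead of page.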
Phase 3: Distributed tracing. Deploy the OpenTelemetry Collector, enable auto-instrumentation, send traces to Tempo or Jaeger. Connect traces to logs and metrics using trace IDs and exemplars. This is the step that makes the debugging workflow I described above possible.
Phase 4: Cost optimization. Implement tail-based sampling, set retention policies, audit metric cardinality, and set up dashboards to monitor the cost of your observability stack itself. Yes, you should observe your observability.
The most important thing is to start. An imperfect observability stack that covers your critical services is infinitely better than a perfect plan that exists only in a design document. Ship something, get paged, learn what data you were missing, add it, and iterate. That is how every good observability stack was built: one incident at a time.