Observability in Node.js: Logs, Metrics, and Traces Without the Complexity
Structured logging with Pino, metrics with Prometheus, distributed tracing with OpenTelemetry. The observability stack I use for Node.js production apps, from zero to full visibility.
I used to think observability meant "add some console.logs and check them when something breaks." That worked until it didn't. The breaking point was a production incident where our API was returning 200s but the data was stale. No errors in the logs. No exceptions. Just silently wrong responses because a downstream cache had gone stale and nobody noticed for four hours.
That's when I learned the difference between monitoring and observability. Monitoring tells you something is wrong. Observability tells you why it's wrong. And the gap between those two is where production incidents live.
This is the observability stack I've settled on for Node.js services after trying most of the alternatives. It's not the most sophisticated setup in the world, but it catches problems before users notice them, and when something does slip through, I can diagnose it in minutes instead of hours.
The Three Pillars, and Why You Need All of Them#
Everyone talks about the "three pillars of observability" — logs, metrics, and traces. What nobody tells you is that each pillar answers a fundamentally different question, and none of them can stand in for the others.
Logs answer: What happened?
A log line says "at 14:23:07, user 4821 requested /api/orders and got a 500 because the database connection timed out." It's a narrative. It tells you the story of one specific event.
Metrics answer: How much is happening?
A metric says "over the last 5 minutes, the p99 response time was 2.3 seconds and the error rate was 4.7%." It's aggregate data. It tells you about the health of the system as a whole, not about any individual request.
Traces answer: Where did time go?
A trace says "this request spent 12ms in the Express middleware, 3ms parsing the body, 847ms waiting for PostgreSQL, and 2ms serializing the response." It's a waterfall. It tells you exactly where the bottleneck is, across service boundaries.
Here's the practical implication: when your pager goes off at 3 AM, the sequence is almost always the same.
- Metrics tell you something is wrong (error rate spike, latency increase)
- Logs tell you what's happening (specific error messages, affected endpoints)
- Traces tell you why (which downstream service or database query is the bottleneck)
If you only have logs, you'll know what broke but not how bad it is. If you only have metrics, you'll know how bad but not what's causing it. If you only have traces, you'll have beautiful waterfalls but no way to know when to look at them.
Let's build each one.
Structured Logging with Pino#
Why console.log Is Not Enough#
I know. You've been using console.log in production and it's "fine." Let me show you why it's not.
// What you write
console.log("User login failed", email, error.message);
// What ends up in your log file
// User login failed john@example.com ECONNREFUSED
// Now try to:
// 1. Search for all login failures in the last hour
// 2. Count failures per user
// 3. Filter out just the ECONNREFUSED errors
// 4. Correlate this with the request that triggered it
// Good luck. It's an unstructured string. You're grepping through text.
Structured logging means every log entry is a JSON object with consistent fields. Instead of a human-readable string that's machine-hostile, you get a machine-readable object that's also human-readable (with the right tools).
// What structured logging looks like
{
"level": 50,
"time": 1709312587000,
"msg": "User login failed",
"email": "john@example.com",
"error": "ECONNREFUSED",
"requestId": "req-abc-123",
"route": "POST /api/auth/login",
"responseTime": 1247,
"pid": 12345
}
Now you can query this. level >= 50 AND msg = "User login failed" AND time > now() - 1h gives you exactly what you need.
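To see what that buys you in practice, here's a small sketch in plain TypeScript (the `queryLogs` helper is hypothetical, not part of any logging library) that runs that exact query over NDJSON output:

```typescript
// Hypothetical helper: query NDJSON log output. Each line is an
// independent JSON object, so "querying logs" is just
// parse-per-line plus a predicate.
interface LogEntry {
  level: number;
  time: number;
  msg: string;
  [key: string]: unknown;
}

function queryLogs(
  ndjson: string,
  predicate: (entry: LogEntry) => boolean
): LogEntry[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry)
    .filter(predicate);
}

// Two sample Pino-style log lines
const logText = [
  `{"level":50,"time":${Date.now()},"msg":"User login failed","email":"john@example.com"}`,
  `{"level":30,"time":${Date.now()},"msg":"User authenticated","userId":"usr_4821"}`,
].join("\n");

// level >= 50 AND msg = "User login failed" AND time > now() - 1h
const oneHourAgo = Date.now() - 60 * 60 * 1000;
const failures = queryLogs(
  logText,
  (e) => e.level >= 50 && e.msg === "User login failed" && e.time > oneHourAgo
);
```

In production you'd run this kind of query in Loki or Elasticsearch rather than in application code, but the mechanics are the same: structured fields make filtering trivial.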
Pino vs Winston#
I've used both extensively. Here's the short version:
Winston is more popular, more flexible, has more transports, and is significantly slower. It also encourages bad patterns — the "format" system makes it too easy to create unstructured, pretty-printed logs that look nice in development but are unparseable in production.
Pino is faster (5-10x in benchmarks), opinionated about JSON output, and follows the Unix philosophy: do one thing well (write JSON to stdout) and let other tools handle the rest (transport, formatting, aggregation).
I use Pino. The performance difference matters when you're logging thousands of requests per second, and the opinionated approach means every developer on the team produces consistent logs.
Basic Pino Setup#
// src/lib/logger.ts
import pino from "pino";
const isProduction = process.env.NODE_ENV === "production";
export const logger = pino({
level: process.env.LOG_LEVEL || (isProduction ? "info" : "debug"),
// In production, just JSON to stdout. PM2/container runtime handles the rest.
// In development, use pino-pretty for human-readable output.
...(isProduction
? {}
: {
transport: {
target: "pino-pretty",
options: {
colorize: true,
translateTime: "HH:MM:ss",
ignore: "pid,hostname",
},
},
}),
// Standard fields on every log line
base: {
service: process.env.SERVICE_NAME || "api",
version: process.env.APP_VERSION || "unknown",
},
// Serialize Error objects properly
serializers: {
err: pino.stdSerializers.err,
error: pino.stdSerializers.err,
req: pino.stdSerializers.req,
res: pino.stdSerializers.res,
},
// Redact sensitive fields
redact: {
paths: [
"req.headers.authorization",
"req.headers.cookie",
"password",
"creditCard",
"ssn",
],
censor: "[REDACTED]",
},
});
The redact option is critical. Without it, you will eventually log a password or API key. It's not a matter of if, it's when. Some developer will add logger.info({ body: req.body }, "incoming request") and suddenly you're logging credit card numbers. Redaction is your safety net.
Log Levels: Use Them Properly#
// FATAL (60) - The process is about to crash. Wake someone up.
logger.fatal({ err }, "Unrecoverable database connection failure");
// ERROR (50) - Something failed that shouldn't have. Investigate soon.
logger.error({ err, userId, orderId }, "Payment processing failed");
// WARN (40) - Something unexpected but handled. Keep an eye on it.
logger.warn({ retryCount: 3, service: "email" }, "Retry limit approaching");
// INFO (30) - Normal operations worth recording. The "what happened" log.
logger.info({ userId, action: "login" }, "User authenticated");
// DEBUG (20) - Detailed information for debugging. Never in production.
logger.debug({ query, params }, "Database query executing");
// TRACE (10) - Extremely detailed. Only when you're desperate.
logger.trace({ headers: req.headers }, "Incoming request headers");
The rule: if you're debating between INFO and DEBUG, it's DEBUG. If you're debating between WARN and ERROR, ask yourself: "Would I want to be alerted about this at 3 AM?" If yes, ERROR. If no, WARN.
Child Loggers and Request Context#
This is where Pino really shines. A child logger inherits all the parent's configuration but adds extra context fields.
// Every log from this child logger will include userId and sessionId
const userLogger = logger.child({ userId: "usr_4821", sessionId: "ses_xyz" });
userLogger.info("User viewed dashboard");
// Output includes userId and sessionId automatically
userLogger.info({ page: "/settings" }, "User navigated");
// Output includes userId, sessionId, AND page
For HTTP servers, you want a child logger per request so every log line in that request's lifecycle includes the request ID:
// src/middleware/request-logger.ts
import { randomUUID } from "node:crypto";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
export function requestLogger(req: Request, res: Response, next: NextFunction) {
const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
const startTime = performance.now();
// Attach a child logger to the request
req.log = logger.child({
requestId,
method: req.method,
url: req.originalUrl,
userAgent: req.headers["user-agent"],
ip: req.headers["x-forwarded-for"]?.toString().split(",").pop()?.trim()
|| req.socket.remoteAddress,
});
// Set the request ID header on the response for correlation
res.setHeader("x-request-id", requestId);
req.log.info("Request received");
res.on("finish", () => {
const duration = Math.round(performance.now() - startTime);
const logMethod = res.statusCode >= 500 ? "error"
: res.statusCode >= 400 ? "warn"
: "info";
req.log[logMethod]({
statusCode: res.statusCode,
duration,
contentLength: res.getHeader("content-length"),
}, "Request completed");
});
next();
}
AsyncLocalStorage for Automatic Context Propagation#
The child logger approach works, but it requires you to pass req.log through every function call. That gets tedious. AsyncLocalStorage solves this — it provides a context store that follows the async execution flow without explicit passing.
// src/lib/async-context.ts
import { AsyncLocalStorage } from "node:async_hooks";
import { logger } from "./logger";
import type { Logger } from "pino";
interface RequestContext {
requestId: string;
logger: Logger;
userId?: string;
startTime: number;
}
export const asyncContext = new AsyncLocalStorage<RequestContext>();
// Get the contextual logger from anywhere in the call stack
export function getLogger(): Logger {
const store = asyncContext.getStore();
return store?.logger || logger;
}
export function getRequestId(): string | undefined {
return asyncContext.getStore()?.requestId;
}
// src/middleware/async-context-middleware.ts
import { randomUUID } from "node:crypto";
import { asyncContext } from "../lib/async-context";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
export function asyncContextMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
const requestLogger = logger.child({ requestId });
const context = {
requestId,
logger: requestLogger,
startTime: performance.now(),
};
asyncContext.run(context, () => {
res.setHeader("x-request-id", requestId);
next();
});
}
Now any function, anywhere in the call stack, can get the request-scoped logger:
// src/services/order-service.ts
import { getLogger } from "../lib/async-context";
export async function processOrder(orderId: string) {
const log = getLogger(); // Automatically has requestId attached!
log.info({ orderId }, "Processing order");
const items = await fetchOrderItems(orderId);
log.debug({ itemCount: items.length }, "Order items fetched");
const total = calculateTotal(items);
log.info({ orderId, total }, "Order processed successfully");
return { orderId, total, items };
}
// No need to pass logger as parameter. It just works.
Log Aggregation: Where Do Logs Go?#
In development, logs go to stdout and pino-pretty makes them readable. In production, it's more nuanced.
The PM2 Path#
If you're running on a VPS with PM2 (which I covered in my VPS setup post), PM2 captures stdout automatically:
# View logs in real-time
pm2 logs api --lines 100
# Logs are stored at ~/.pm2/logs/
# api-out.log — stdout (your JSON logs)
# api-error.log — stderr (uncaught exceptions, stack traces)
PM2's built-in log rotation prevents disk space issues:
pm2 install pm2-logrotate
pm2 set pm2-logrotate:max_size 50M
pm2 set pm2-logrotate:retain 14
pm2 set pm2-logrotate:compress true
Shipping Logs to Loki or Elasticsearch#
For anything beyond a single server, you need centralized log aggregation. The two main options:
Grafana Loki — The "Prometheus for logs." Lightweight, indexes only labels (not full text), works beautifully with Grafana. My recommendation for most teams.
Elasticsearch — Full-text search on logs. More powerful, more resource-hungry, more operational overhead. Use this if you genuinely need full-text search across millions of log lines.
For Loki, the simplest setup uses Promtail to ship logs:
# /etc/promtail/config.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: node-api
static_configs:
- targets:
- localhost
labels:
job: node-api
environment: production
__path__: /home/deploy/.pm2/logs/api-out.log
pipeline_stages:
- json:
expressions:
level: level
msg: msg
service: service
- labels:
level:
service:
- timestamp:
source: time
format: UnixMs
The NDJSON Format#
Pino outputs Newline Delimited JSON (NDJSON) by default — one JSON object per line, separated by \n. This is important because:
- Every log aggregation tool understands it
- It's streamable (you can process logs line by line without buffering the entire file)
- Standard Unix tools work on it:
cat api-out.log | jq '.msg' | sort | uniq -c | sort -rn
Never configure Pino to output pretty-printed, multi-line JSON in production. You'll break every tool in the pipeline.
// WRONG in production — multi-line JSON breaks line-based processing
{
"level": 30,
"time": 1709312587000,
"msg": "Request completed"
}
// RIGHT in production — NDJSON, one object per line
{"level":30,"time":1709312587000,"msg":"Request completed"}
Metrics with Prometheus#
Logs tell you what happened. Metrics tell you how the system is performing. The difference is like the difference between reading every transaction on your bank statement versus looking at your account balance.
The Four Metric Types#
Prometheus has four metric types. Understanding which to use when will save you from the most common mistakes.
Counter — A value that only goes up. Request count, error count, bytes processed. Resets to zero on restart.
// "How many requests have we served?"
const httpRequestsTotal = new Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: ["method", "route", "status_code"],
});
Gauge — A value that can go up or down. Current connections, queue size, temperature, heap usage.
// "How many connections are active right now?"
const activeConnections = new Gauge({
name: "active_connections",
help: "Number of currently active connections",
});
Histogram — Observes values and counts them in configurable buckets. Request duration, response size. This is how you get percentiles (p50, p95, p99).
// "How long do requests take?" with buckets at 10ms, 50ms, 100ms, etc.
const httpRequestDuration = new Histogram({
name: "http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
Summary — Similar to Histogram but calculates quantiles on the client side. Use Histogram instead unless you have a specific reason not to. Summaries can't be aggregated across instances.
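Histograms give you percentiles because cumulative bucket counts are enough to estimate any quantile after the fact. Here's a sketch of the interpolation idea (my own hypothetical helper, mirroring what PromQL's histogram_quantile does, and ignoring empty-bucket edge cases):

```typescript
// Buckets are cumulative, mirroring Prometheus's `le` convention:
// cumulativeCount = number of observations <= le (upper bound, seconds)
interface Bucket {
  le: number;
  cumulativeCount: number;
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].cumulativeCount >= rank) {
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const countBelow = i === 0 ? 0 : buckets[i - 1].cumulativeCount;
      const countInBucket = buckets[i].cumulativeCount - countBelow;
      // Linear interpolation within the bucket that contains the rank
      return lower + ((buckets[i].le - lower) * (rank - countBelow)) / countInBucket;
    }
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 60 under 100ms, 30 more under 250ms, 10 more under 1s
const buckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 60 },
  { le: 0.25, cumulativeCount: 90 },
  { le: 1, cumulativeCount: 100 },
];
// p95 lands in the 0.25s–1s bucket, interpolated to 0.625s
const p95 = estimateQuantile(0.95, buckets);
```

This is also why bucket boundaries matter: the estimate is only as precise as the bucket that contains the quantile, so put your finest-grained buckets where your latency targets live.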
Full prom-client Setup#
// src/lib/metrics.ts
import {
Registry,
Counter,
Histogram,
Gauge,
collectDefaultMetrics,
} from "prom-client";
// Create a custom registry to avoid polluting the global one
export const metricsRegistry = new Registry();
// Collect default Node.js metrics:
// - process_cpu_seconds_total
// - process_resident_memory_bytes
// - nodejs_heap_size_total_bytes
// - nodejs_active_handles_total
// - nodejs_eventloop_lag_seconds
// - nodejs_gc_duration_seconds
collectDefaultMetrics({
register: metricsRegistry,
prefix: "nodeapp_",
// Buckets for the GC pause duration histogram, in seconds
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
// --- HTTP Metrics ---
export const httpRequestsTotal = new Counter({
name: "nodeapp_http_requests_total",
help: "Total number of HTTP requests received",
labelNames: ["method", "route", "status_code"] as const,
registers: [metricsRegistry],
});
export const httpRequestDuration = new Histogram({
name: "nodeapp_http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"] as const,
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [metricsRegistry],
});
export const httpRequestSizeBytes = new Histogram({
name: "nodeapp_http_request_size_bytes",
help: "Size of HTTP request bodies in bytes",
labelNames: ["method", "route"] as const,
buckets: [100, 1000, 10000, 100000, 1000000],
registers: [metricsRegistry],
});
// --- Business Metrics ---
export const ordersProcessed = new Counter({
name: "nodeapp_orders_processed_total",
help: "Total number of orders processed",
labelNames: ["status"] as const, // "success", "failed", "refunded"
registers: [metricsRegistry],
});
export const activeWebSocketConnections = new Gauge({
name: "nodeapp_active_websocket_connections",
help: "Number of active WebSocket connections",
registers: [metricsRegistry],
});
export const externalApiDuration = new Histogram({
name: "nodeapp_external_api_duration_seconds",
help: "Duration of external API calls",
labelNames: ["service", "endpoint", "status"] as const,
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
registers: [metricsRegistry],
});
export const dbQueryDuration = new Histogram({
name: "nodeapp_db_query_duration_seconds",
help: "Duration of database queries",
labelNames: ["operation", "table"] as const,
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
registers: [metricsRegistry],
});
The Metrics Middleware#
// src/middleware/metrics-middleware.ts
import { httpRequestsTotal, httpRequestDuration } from "../lib/metrics";
import type { Request, Response, NextFunction } from "express";
// Normalize routes to avoid cardinality explosion
// /api/users/123 → /api/users/:id
// Without this, Prometheus will create a new time series for every user ID
function normalizeRoute(req: Request): string {
const route = req.route?.path || req.path;
// Replace common dynamic segments
return route
.replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/g, "/:uuid")
.replace(/\/\d+/g, "/:id")
.replace(/\/[a-f0-9]{24}/g, "/:objectId");
}
export function metricsMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
// Don't track metrics for the metrics endpoint itself
if (req.path === "/metrics") {
return next();
}
const end = httpRequestDuration.startTimer();
res.on("finish", () => {
const route = normalizeRoute(req);
const labels = {
method: req.method,
route,
status_code: res.statusCode.toString(),
};
httpRequestsTotal.inc(labels);
end(labels);
});
next();
}
The /metrics Endpoint#
// src/routes/metrics.ts
import { Router } from "express";
import { metricsRegistry } from "../lib/metrics";
const router = Router();
router.get("/metrics", async (req, res) => {
// Basic auth protection — don't expose metrics publicly
const authHeader = req.headers.authorization;
const expected = `Basic ${Buffer.from(
`${process.env.METRICS_USER}:${process.env.METRICS_PASSWORD}`
).toString("base64")}`;
if (!authHeader || authHeader !== expected) {
res.status(401).set("WWW-Authenticate", "Basic").send("Unauthorized");
return;
}
try {
const metrics = await metricsRegistry.metrics();
res.set("Content-Type", metricsRegistry.contentType);
res.send(metrics);
} catch (err) {
res.status(500).send("Error collecting metrics");
}
});
export default router;
Custom Business Metrics Are the Real Power#
The default Node.js metrics (heap size, event loop lag, GC duration) are table stakes. They tell you about the runtime health. But business metrics tell you about the application health.
// In your order service
import { ordersProcessed, externalApiDuration } from "../lib/metrics";
export async function processOrder(order: Order) {
// Start the timer before the try block so failed calls record
// their real duration instead of ~0 seconds
const paymentTimer = externalApiDuration.startTimer({
service: "stripe",
endpoint: "charges.create",
});
try {
const charge = await stripe.charges.create({
amount: order.total,
currency: "usd",
source: order.paymentToken,
});
paymentTimer({ status: "success" });
ordersProcessed.inc({ status: "success" });
return charge;
} catch (err) {
paymentTimer({ status: "error" });
ordersProcessed.inc({ status: "failed" });
throw err;
}
}
A spike in ordersProcessed{status="failed"} tells you something that no amount of CPU metrics ever will.
Label Cardinality: The Silent Killer#
A word of caution. Every unique combination of label values creates a new time series. If you add a userId label to your HTTP request counter, and you have 100,000 users, you just created 100,000+ time series. Prometheus will grind to a halt.
Rules for labels:
- Low cardinality only: HTTP method (7 values), status code (5 categories), route (tens, not thousands)
- Never use user IDs, request IDs, IP addresses, or timestamps as label values
- If you're not sure, don't add the label. You can always add it later, but removing it requires changing dashboards and alerts
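These rules can be enforced in code rather than by code review. A small sketch (hypothetical helpers of my own, not part of prom-client) that clamps label values so cardinality stays bounded no matter what the input looks like:

```typescript
// Hypothetical guard: only values from a known-safe set become labels;
// anything else collapses into "other", so a malicious or buggy client
// can never mint unbounded new time series.
function safeLabel(
  value: string,
  allowed: Set<string>,
  fallback = "other"
): string {
  return allowed.has(value) ? value : fallback;
}

// Collapse ~60 possible status codes into five classes
function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}

// Usage when recording a request metric:
const method = safeLabel("PROPFIND", new Set(["GET", "POST", "PUT", "DELETE", "PATCH"]));
const status = statusClass(404);
// method === "other", status === "4xx" — both bounded sets
```

Whether you label status codes individually or by class is a trade-off: per-code labels let you distinguish 401 from 403, but classes keep the series count at five forever.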
Grafana Dashboards#
Prometheus stores the data. Grafana visualizes it. Here are the panels I put on every Node.js service dashboard.
The Essential Dashboard#
1. Request Rate (requests/second)
rate(nodeapp_http_requests_total[5m])
Shows the traffic pattern. Useful for spotting sudden spikes or drops.
2. Error Rate (%)
100 * (
sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(nodeapp_http_requests_total[5m]))
)
The single most important number. If this goes above 1%, something is wrong.
3. p50 / p95 / p99 Latency
histogram_quantile(0.99,
sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
)
p50 tells you the typical experience. p99 tells you the worst experience. If p99 is 10x the p50, you have a tail latency problem.
4. Event Loop Lag
nodeapp_nodejs_eventloop_lag_p99_seconds
If this goes above 100ms, your event loop is blocked. Probably a synchronous operation in an async path.
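If you want intuition for what that number means: event loop lag is just timer drift. A minimal sketch of the measurement (a hypothetical helper, not how prom-client implements it internally):

```typescript
// Event loop lag = how late a timer fires relative to when it was due.
// If the loop is busy with synchronous work, the timer can't fire on
// time, and the drift is the lag.
function measureEventLoopLag(intervalMs = 50): Promise<number> {
  const due = performance.now() + intervalMs;
  return new Promise((resolve) => {
    setTimeout(() => resolve(Math.max(0, performance.now() - due)), intervalMs);
  });
}

// Demonstration: a blocked loop shows up immediately
const lagPromise = measureEventLoopLag(10);
const blockUntil = Date.now() + 100;
while (Date.now() < blockUntil) {
  // Synchronous busy-wait: nothing else can run, including the timer
}
```

The timer was due in 10ms, but the busy-wait holds the loop for ~100ms, so the measured lag comes out around 90ms. This is exactly why "a synchronous operation in an async path" is the usual culprit.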
5. Heap Usage
nodeapp_nodejs_heap_size_used_bytes / nodeapp_nodejs_heap_size_total_bytes * 100
Watch for a steady upward trend — that's a memory leak. Spikes during GC are normal.
6. Active Handles
nodeapp_nodejs_active_handles_total
Open file descriptors, sockets, timers. A continuously growing number means you're leaking handles — probably not closing database connections or HTTP responses.
Grafana Dashboard as Code#
You can version-control your dashboards using Grafana's provisioning feature:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: "Node.js Services"
orgId: 1
folder: "Services"
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Export your dashboard JSON from Grafana, commit it to your repo, and your dashboard survives Grafana reinstalls. This isn't optional for production — it's the same principle as infrastructure as code.
Distributed Tracing with OpenTelemetry#
Tracing is the pillar that most teams adopt last, and the one they wish they'd adopted first. When you have multiple services talking to each other (even if it's just "API server + database + Redis + external API"), tracing shows you the full picture of a request's journey.
What Is a Trace?#
A trace is a tree of spans. Each span represents a unit of work — an HTTP request, a database query, a function call. Spans have a start time, end time, status, and attributes. They're linked together by a trace ID that's propagated across service boundaries.
Trace: abc-123
├── [API Gateway] POST /api/orders (250ms)
│ ├── [Auth Service] validate-token (12ms)
│ ├── [Order Service] create-order (230ms)
│ │ ├── [PostgreSQL] INSERT INTO orders (15ms)
│ │ ├── [Redis] SET order:cache (2ms)
│ │ └── [Payment Service] charge (200ms)
│ │ ├── [Stripe API] POST /v1/charges (180ms)
│ │ └── [PostgreSQL] UPDATE orders SET status (8ms)
│ └── [Email Service] send-confirmation (async, 45ms)
One glance tells you: the 250ms request spent 180ms waiting for Stripe. That's where you optimize.
OpenTelemetry Setup#
OpenTelemetry (OTel) is the standard. It replaced the fragmented landscape of Jaeger clients, Zipkin clients, and vendor-specific SDKs with a single, vendor-neutral API.
// src/instrumentation.ts
// This file MUST be loaded before any other imports.
// In Node.js, use --require or --import flag.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "node-api",
[ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.0.0",
"deployment.environment": process.env.NODE_ENV || "development",
}),
// Send traces to your collector (Jaeger, Tempo, etc.)
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318/v1/traces",
}),
// Optionally send metrics through OTel too
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318/v1/metrics",
}),
exportIntervalMillis: 15000,
}),
// Auto-instrumentation: automatically creates spans for
// HTTP requests, Express routes, PostgreSQL queries, Redis commands,
// DNS lookups, and more
instrumentations: [
getNodeAutoInstrumentations({
// Disable noisy instrumentations
"@opentelemetry/instrumentation-fs": { enabled: false },
"@opentelemetry/instrumentation-dns": { enabled: false },
// Configure specific ones
"@opentelemetry/instrumentation-http": {
ignoreIncomingPaths: ["/health", "/ready", "/metrics"],
},
"@opentelemetry/instrumentation-express": {
ignoreLayersType: ["middleware"],
},
}),
],
});
sdk.start();
// Graceful shutdown
process.on("SIGTERM", () => {
sdk.shutdown().then(
() => console.log("OTel SDK shut down successfully"),
(err) => console.error("Error shutting down OTel SDK", err)
);
});
Start your app with:
node --import ./src/instrumentation.ts ./src/server.ts
That's it. With zero changes to your application code, you now get traces for every HTTP request, every database query, every Redis command.
Manual Span Creation#
Auto-instrumentation covers infrastructure calls, but sometimes you want to trace business logic:
// src/services/order-service.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
export async function processOrder(orderId: string): Promise<Order> {
return tracer.startActiveSpan("processOrder", async (span) => {
try {
span.setAttribute("order.id", orderId);
// This span becomes the parent of any auto-instrumented
// DB queries or HTTP calls inside these functions
const order = await fetchOrder(orderId);
span.setAttribute("order.total", order.total);
span.setAttribute("order.item_count", order.items.length);
const validationResult = await tracer.startActiveSpan(
"validateOrder",
async (validationSpan) => {
const result = await validateInventory(order);
validationSpan.setAttribute("validation.passed", result.valid);
if (!result.valid) {
validationSpan.setStatus({
code: SpanStatusCode.ERROR,
message: `Validation failed: ${result.reason}`,
});
}
validationSpan.end();
return result;
}
);
if (!validationResult.valid) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: "Order validation failed",
});
throw new Error(validationResult.reason);
}
const payment = await processPayment(order);
span.setAttribute("payment.id", payment.id);
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (err) {
span.recordException(err as Error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: (err as Error).message,
});
throw err;
} finally {
span.end();
}
});
}
Trace Context Propagation#
The magic of distributed tracing is that the trace ID follows the request across services. When Service A calls Service B, the trace context is automatically injected into the HTTP headers (traceparent header per W3C Trace Context standard).
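The header format itself is simple: four dash-separated hex fields, version-traceid-spanid-flags. A hedged sketch of parsing it (the `parseTraceparent` helper is mine for illustration, not part of the OTel API — the SDK's propagators do this for you):

```typescript
// W3C `traceparent` header: version (2 hex) - trace-id (32 hex)
// - span-id (16 hex) - flags (2 hex), e.g.
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
interface TraceParent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // All-zero trace-id or span-id is invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Lowest flag bit = "sampled"
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Knowing the shape of this header makes trace debugging much less magical: if a downstream service isn't joining your traces, the first thing to check is whether `traceparent` survived the hop.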
The auto-instrumentation handles this for outgoing HTTP calls. But if you're using a message queue, you need to propagate manually:
import { context, propagation } from "@opentelemetry/api";
// When publishing a message
function publishEvent(queue: string, payload: object) {
const carrier: Record<string, string> = {};
// Inject current trace context into the carrier
propagation.inject(context.active(), carrier);
// Send both the payload and the trace context
messageQueue.publish(queue, {
payload,
traceContext: carrier,
});
}
// When consuming a message
function consumeEvent(message: QueueMessage) {
// Extract the trace context from the message
const parentContext = propagation.extract(
context.active(),
message.traceContext
);
// Run the handler within the extracted context
// Now any spans created here will be children of the original trace
context.with(parentContext, () => {
tracer.startActiveSpan("processEvent", (span) => {
span.setAttribute("queue.message_id", message.id);
handleEvent(message.payload);
span.end();
});
});
}
Where to Send Traces#
Jaeger — The classic open-source option. Good UI, easy to run locally with Docker. Limited long-term storage.
Grafana Tempo — If you're already using Grafana and Loki, Tempo is the natural choice for traces. Uses object storage (S3/GCS) for cost-effective long-term retention.
Grafana Cloud / Datadog / Honeycomb — If you don't want to run infrastructure. More expensive, less operational overhead.
For local development, Jaeger in Docker is perfect:
# docker-compose.otel.yml
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
environment:
- COLLECTOR_OTLP_ENABLED=true
Health Check Endpoints#
Health checks are the simplest form of observability and the first thing you should implement. They answer one question: "Is this service capable of serving requests right now?"
Three Types of Health Checks#
/health — General health. Is the process running and responsive?
/ready — Readiness. Can this service handle traffic? (Has it connected to the database? Has it loaded its configuration? Has it warmed its cache?)
/live — Liveness. Is the process alive and not deadlocked? (Can it respond to a simple request within a timeout?)
The distinction matters for Kubernetes, where liveness probes restart stuck containers and readiness probes remove containers from the load balancer during startup or dependency failures.
// src/routes/health.ts
import { Router } from "express";
import { Pool } from "pg";
import Redis from "ioredis";
const router = Router();
interface HealthCheckResult {
status: "ok" | "degraded" | "error";
checks: Record<
string,
{
status: "ok" | "error";
latency?: number;
message?: string;
}
>;
uptime: number;
timestamp: string;
version: string;
}
async function checkDatabase(pool: Pool): Promise<{ ok: boolean; latency: number }> {
const start = performance.now();
try {
await pool.query("SELECT 1");
return { ok: true, latency: Math.round(performance.now() - start) };
} catch {
return { ok: false, latency: Math.round(performance.now() - start) };
}
}
async function checkRedis(redis: Redis): Promise<{ ok: boolean; latency: number }> {
const start = performance.now();
try {
await redis.ping();
return { ok: true, latency: Math.round(performance.now() - start) };
} catch {
return { ok: false, latency: Math.round(performance.now() - start) };
}
}
export function createHealthRoutes(pool: Pool, redis: Redis) {
// Liveness — just check if the process can respond
router.get("/live", (_req, res) => {
res.status(200).json({ status: "ok" });
});
// Readiness — check all dependencies
router.get("/ready", async (_req, res) => {
const [db, cache] = await Promise.all([
checkDatabase(pool),
checkRedis(redis),
]);
const allOk = db.ok && cache.ok;
res.status(allOk ? 200 : 503).json({
status: allOk ? "ok" : "not_ready",
checks: {
database: db,
redis: cache,
},
});
});
// Full health — detailed status for dashboards and debugging
router.get("/health", async (_req, res) => {
const [db, cache] = await Promise.all([
checkDatabase(pool),
checkRedis(redis),
]);
const anyError = !db.ok || !cache.ok;
const allError = !db.ok && !cache.ok;
const result: HealthCheckResult = {
status: allError ? "error" : anyError ? "degraded" : "ok",
checks: {
database: {
status: db.ok ? "ok" : "error",
latency: db.latency,
...(!db.ok && { message: "Connection failed" }),
},
redis: {
status: cache.ok ? "ok" : "error",
latency: cache.latency,
...(!cache.ok && { message: "Connection failed" }),
},
},
uptime: process.uptime(),
timestamp: new Date().toISOString(),
version: process.env.APP_VERSION || "unknown",
};
// Return 200 for ok/degraded (service can still handle some traffic)
// Return 503 for error (service should be removed from rotation)
res.status(result.status === "error" ? 503 : 200).json(result);
});
return router;
}
Kubernetes Probe Configuration#
# k8s/deployment.yml
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /live
          port: 3000
        initialDelaySeconds: 10
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3 # Restart after 3 consecutive failures (45s)
      readinessProbe:
        httpGet:
          path: /ready
          port: 3000
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 2 # Remove from LB after 2 failures (20s)
      startupProbe:
        httpGet:
          path: /ready
          port: 3000
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30 # Give up to 150s for startup
A common mistake: making the liveness probe too aggressive. If your liveness probe checks the database, and the database is temporarily down, Kubernetes will restart your container. But restarting won't fix the database. Now you have a crash loop on top of a database outage. Keep liveness probes simple — they should only detect deadlocked or stuck processes.
Error Tracking with Sentry#
Logs catch errors you expected. Sentry catches the ones you didn't.
The difference is important. You add try/catch blocks around code you know might fail. But the bugs that matter most are the ones in code you thought was safe. Unhandled promise rejections, type errors from unexpected API responses, null pointer access on optional chains that weren't optional enough.
Sentry Setup for Node.js#
// src/lib/sentry.ts
import * as Sentry from "@sentry/node";
import { nodeProfilingIntegration } from "@sentry/profiling-node";
export function initSentry() {
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV || "development",
release: process.env.APP_VERSION || "unknown",
// Sample 10% of transactions for performance monitoring
// (100% in development)
tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,
// Profile 100% of sampled transactions
profilesSampleRate: 1.0,
integrations: [
nodeProfilingIntegration(),
// Normalize stack-frame paths relative to the project root
Sentry.rewriteFramesIntegration({
root: process.cwd(),
}),
],
// Don't send errors from development
enabled: process.env.NODE_ENV === "production",
// Filter out known non-issues
ignoreErrors: [
// Client disconnects aren't bugs
"ECONNRESET",
"ECONNABORTED",
"EPIPE",
// Bots sending garbage
"SyntaxError: Unexpected token",
],
// Strip PII before sending
beforeSend(event) {
// Remove IP addresses
if (event.request) {
delete event.request.headers?.["x-forwarded-for"];
delete event.request.headers?.["x-real-ip"];
delete event.request.cookies;
}
// Remove sensitive query params
if (event.request?.query_string) {
const params = new URLSearchParams(event.request.query_string);
params.delete("token");
params.delete("api_key");
event.request.query_string = params.toString();
}
return event;
},
});
}
Express Error Handler with Sentry#
// src/middleware/error-handler.ts
import * as Sentry from "@sentry/node";
import { getLogger } from "../lib/async-context";
import type { Request, Response, NextFunction } from "express";
// Sentry request handler must come first
export const sentryRequestHandler = Sentry.Handlers.requestHandler();
// Sentry tracing handler
export const sentryTracingHandler = Sentry.Handlers.tracingHandler();
// Your custom error handler comes last
export function errorHandler(
err: Error,
req: Request,
res: Response,
_next: NextFunction
) {
const log = getLogger();
// Add custom context to the Sentry event
Sentry.withScope((scope) => {
scope.setTag("route", req.route?.path || req.path);
scope.setTag("method", req.method);
if (req.user) {
scope.setUser({
id: req.user.id,
// Don't send email or username to Sentry
});
}
// Add breadcrumbs for debugging
scope.addBreadcrumb({
category: "request",
message: `${req.method} ${req.path}`,
level: "info",
data: {
query: req.query,
statusCode: res.statusCode,
},
});
Sentry.captureException(err);
});
// Log the error with full context
log.error(
{
err,
statusCode: 500,
route: req.route?.path || req.path,
method: req.method,
},
"Unhandled error in request handler"
);
// Send a generic error response
// Never expose error details to the client in production
res.status(500).json({
error: "Internal Server Error",
...(process.env.NODE_ENV !== "production" && {
message: err.message,
stack: err.stack,
}),
});
}
Source Maps#
Without source maps, Sentry shows you minified/transpiled stack traces. Useless. Upload source maps during your build:
# In your CI/CD pipeline
npx @sentry/cli sourcemaps upload \
--org your-org \
--project your-project \
--release $APP_VERSION \
./dist
Or configure it in your bundler:
// vite.config.ts (or equivalent)
import { defineConfig } from "vite";
import { sentryVitePlugin } from "@sentry/vite-plugin";
export default defineConfig({
build: {
sourcemap: true, // Required for Sentry
},
plugins: [
sentryVitePlugin({
org: process.env.SENTRY_ORG,
project: process.env.SENTRY_PROJECT,
authToken: process.env.SENTRY_AUTH_TOKEN,
}),
],
});
The Cost of Unhandled Promise Rejections#
Since Node.js 15, unhandled promise rejections crash the process by default. This is good — it forces you to handle errors. But you need a safety net:
// src/server.ts — near the top of your entry point
process.on("unhandledRejection", (reason, promise) => {
logger.fatal({ reason, promise }, "Unhandled promise rejection — crashing");
Sentry.captureException(reason);
// Flush Sentry events before crashing
Sentry.flush(2000).finally(() => {
process.exit(1);
});
});
process.on("uncaughtException", (error) => {
logger.fatal({ err: error }, "Uncaught exception — crashing");
Sentry.captureException(error);
Sentry.flush(2000).finally(() => {
process.exit(1);
});
});
The important part: Sentry.flush() before process.exit(). Without it, the error event might not make it to Sentry before the process dies.
Alerting: The Alerts That Actually Matter#
Having 200 Prometheus metrics and zero alerts is just vanity monitoring. Having 50 alerts that fire every day is alert fatigue — you'll start ignoring them, and then you'll miss the one that matters.
The goal is a small number of high-signal alerts that mean "something is genuinely wrong and a human needs to look at it."
Prometheus AlertManager Configuration#
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: $SLACK_WEBHOOK_URL
route:
  receiver: "slack-warnings"
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: "slack-warnings"
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: $PAGERDUTY_ROUTING_KEY
        severity: critical
  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts"
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
The Alerts That Actually Wake Me Up#
# prometheus/rules/node-api.yml
groups:
  - name: node-api-critical
    rules:
      # High error rate — something is broken
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(nodeapp_http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "{{ $value | humanizePercentage }} of requests are returning 5xx"
      # Slow responses — users are suffering
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 1 second for 5 minutes"
          description: "p99 latency is {{ $value | humanizeDuration }}"
      # Memory leak — will OOM soon
      - alert: HighHeapUsage
        expr: |
          (
            nodeapp_nodejs_heap_size_used_bytes
            /
            nodeapp_nodejs_heap_size_total_bytes
          ) > 0.80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Heap usage above 80% for 10 minutes"
          description: "Heap usage is at {{ $value | humanizePercentage }}"
      # Process is down
      - alert: ServiceDown
        expr: up{job="node-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node.js API is down"
  - name: node-api-warnings
    rules:
      # Event loop is getting slow
      - alert: HighEventLoopLag
        expr: |
          nodeapp_nodejs_eventloop_lag_seconds{quantile="0.99"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Event loop lag above 100ms"
          description: "p99 event loop lag is {{ $value | humanizeDuration }}"
      # Traffic dropped significantly — possible routing issue
      - alert: TrafficDrop
        expr: |
          sum(rate(nodeapp_http_requests_total[5m]))
          < (sum(rate(nodeapp_http_requests_total[5m] offset 1h)) * 0.5)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped by more than 50% compared to 1 hour ago"
      # Database queries getting slow
      - alert: SlowDatabaseQueries
        expr: |
          histogram_quantile(0.99,
            sum(rate(nodeapp_db_query_duration_seconds_bucket[5m])) by (le, operation)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 database query time above 500ms"
          description: "Slow {{ $labels.operation }} queries: {{ $value | humanizeDuration }}"
      # External API is failing
      - alert: ExternalAPIFailures
        expr: |
          (
            sum(rate(nodeapp_external_api_duration_seconds_count{status="error"}[5m])) by (service)
            /
            sum(rate(nodeapp_external_api_duration_seconds_count[5m])) by (service)
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "External API {{ $labels.service }} failing >10%"
Notice the for clause on every alert. Without it, a single spike triggers an alert. A 5-minute for means the condition must be true for 5 continuous minutes. This eliminates noise from momentary blips.
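The mechanics of for fit in a few lines: the alert goes "pending" when the expression first becomes true, and only "fires" once it has stayed true for the whole hold period. This is an illustrative model, not Prometheus internals — the names are made up:

```typescript
// Illustrative model of Prometheus's `for:` clause. Each evaluation passes
// in whether the alert expression is currently true and the current time.
export function makeForClause(holdMs: number) {
  let pendingSince: number | null = null;
  return function evaluate(exprTrue: boolean, nowMs: number): boolean {
    if (!exprTrue) {
      pendingSince = null; // one healthy evaluation resets the pending timer
      return false;
    }
    if (pendingSince === null) pendingSince = nowMs; // enter "pending"
    return nowMs - pendingSince >= holdMs; // "firing" only after the full hold
  };
}
```

The reset on any healthy evaluation is why a 30-second spike can never fire a 5-minute alert.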
Alert Hygiene#
Every alert must pass this test:
- Is it actionable? If nobody can do anything about it, don't alert. Log it, dashboard it, but don't wake someone up.
- Does it require human intervention? If it self-heals (like a brief network blip), the for clause should filter it out.
- Has it fired in the last 30 days? If not, it might be misconfigured or the threshold is wrong. Review it.
- When it fires, do people care? If the team regularly dismisses it, remove it or fix the threshold.
I audit my alerts quarterly. Every alert gets one of three outcomes: keep, adjust threshold, or delete.
Putting It All Together: The Express Application#
Here's how all the pieces fit together in a real application:
// src/server.ts
import { initSentry } from "./lib/sentry";
// Initialize Sentry first — before other imports
initSentry();
import express from "express";
import * as Sentry from "@sentry/node";
import { Pool } from "pg";
import Redis from "ioredis";
import { logger } from "./lib/logger";
import { asyncContextMiddleware } from "./middleware/async-context-middleware";
import { metricsMiddleware } from "./middleware/metrics-middleware";
import { requestLogger } from "./middleware/request-logger";
import {
sentryRequestHandler,
sentryTracingHandler,
errorHandler,
} from "./middleware/error-handler";
import { createHealthRoutes } from "./routes/health";
import metricsRouter from "./routes/metrics";
import apiRouter from "./routes/api";
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);
// --- Middleware Order Matters ---
// 1. Sentry request handler (must be first)
app.use(sentryRequestHandler);
app.use(sentryTracingHandler);
// 2. Async context (creates request-scoped context)
app.use(asyncContextMiddleware);
// 3. Request logging
app.use(requestLogger);
// 4. Metrics collection
app.use(metricsMiddleware);
// 5. Body parsing
app.use(express.json({ limit: "1mb" }));
// --- Routes ---
// Health checks (no auth required)
app.use(createHealthRoutes(pool, redis));
// Metrics (basic auth protected)
app.use(metricsRouter);
// API routes
app.use("/api", apiRouter);
// --- Error Handling ---
// Sentry error handler (must be before custom error handler)
app.use(Sentry.Handlers.errorHandler());
// Custom error handler (must be last)
app.use(errorHandler);
// --- Start ---
const port = parseInt(process.env.PORT || "3000", 10);
const server = app.listen(port, () => {
logger.info(
{
port,
nodeEnv: process.env.NODE_ENV,
version: process.env.APP_VERSION,
},
"Server started"
);
});
// Graceful shutdown
async function shutdown(signal: string) {
logger.info({ signal }, "Shutdown signal received");
// Stop accepting new connections; in-flight requests are allowed to finish
await new Promise<void>((resolve) => server.close(() => resolve()));
// Close database pool
await pool.end().catch((err) => {
logger.error({ err }, "Error closing database pool");
});
// Close Redis connection
await redis.quit().catch((err) => {
logger.error({ err }, "Error closing Redis connection");
});
// Flush Sentry
await Sentry.close(2000);
logger.info("Shutdown complete");
process.exit(0);
}
process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));
The Minimal Viable Stack#
Everything above is the "full" stack. You don't need all of it on day one. Here's how to scale your observability as your project grows.
Stage 1: Side Project / Solo Developer#
You need three things:
- Structured console logs — Use Pino, output JSON to stdout. Even if you're just reading them with pm2 logs, JSON logs are searchable and parseable.
- A /health endpoint — Takes 5 minutes to implement, saves you when you're debugging "is it even running?"
- Sentry free tier — Catches the errors you didn't anticipate. The free tier gives you 5,000 events/month, which is plenty for a side project.
// This is the minimal setup. Under 50 lines. No excuses.
import pino from "pino";
import express from "express";
import * as Sentry from "@sentry/node";
const logger = pino({ level: "info" });
const app = express();
Sentry.init({ dsn: process.env.SENTRY_DSN });
app.use(Sentry.Handlers.requestHandler());
app.get("/health", (_req, res) => {
res.json({ status: "ok", uptime: process.uptime() });
});
app.use("/api", apiRoutes); // your application routes, defined elsewhere
app.use(Sentry.Handlers.errorHandler());
app.use((err: Error, _req: express.Request, res: express.Response, _next: express.NextFunction) => {
logger.error({ err }, "Unhandled error");
res.status(500).json({ error: "Internal Server Error" });
});
app.listen(3000, () => logger.info("Server started on port 3000"));
Stage 2: Growing Project / Small Team#
Add:
- Prometheus metrics + Grafana — When "it feels slow" isn't enough and you need data. Start with request rate, error rate, and p99 latency.
- Log aggregation — When sshing into the server and grepping through files stops scaling. Loki + Promtail if you're already using Grafana.
- Basic alerts — Error rate > 1%, p99 > 1s, service down. Three alerts. That's it.
Stage 3: Production Service / Multiple Services#
Add:
- Distributed tracing with OpenTelemetry — When "the API is slow" becomes "which of the 5 services it calls is slow?" OTel auto-instrumentation gets you 80% of the value with zero code changes.
- Dashboard as code — Version-control your Grafana dashboards. You'll thank yourself when you need to recreate them.
- Structured alerting — AlertManager with proper routing, escalation, and silence rules.
- Business metrics — Orders/second, conversion rate, queue depth. The metrics your product team cares about.
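For the tracing piece, a minimal OTel bootstrap looks roughly like this. It assumes the @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and OTLP exporter packages, plus a collector listening on the default localhost endpoint — treat it as a starting sketch, not the one true config:

```typescript
// src/tracing.ts — must be imported before express, pg, ioredis, etc.,
// so auto-instrumentation can patch them.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || "node-api",
  // Default constructor reads OTEL_EXPORTER_OTLP_* env vars and falls
  // back to http://localhost:4318
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown
process.on("SIGTERM", () => {
  sdk.shutdown().catch(() => {});
});
```

Importing this file first is the whole trick: the instrumentations wrap the modules at require time, which is how you get spans with zero changes to application code.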
What to Skip#
- APM vendors with per-host pricing — At scale, the cost is absurd. Open source (Prometheus + Grafana + Tempo + Loki) gives you 95% of the functionality.
- Log levels below INFO in production — You'll generate terabytes of DEBUG logs and pay for the storage. Use DEBUG only when actively investigating issues, then turn it off.
- Custom metrics for everything — Start with the RED method (Rate, Errors, Duration) for each service. Add custom metrics only when you have a specific question to answer.
- Complex trace sampling — Start with a simple sample rate (10% in production). Adaptive sampling is a premature optimization for most teams.
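Even a "simple sample rate" should be deterministic, so every service in a request path keeps or drops the same trace. Hashing the trace ID gets you that; this is a sketch of the idea (OTel's TraceIdRatioBasedSampler does it properly), and FNV-1a is just an arbitrary stable hash:

```typescript
// Head-based sampling sketch: hash the trace ID so the keep/drop decision
// is the same in every service that sees the trace.
export function shouldSample(traceId: string, ratePercent: number): boolean {
  let hash = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept in uint32
  }
  return hash % 100 < ratePercent;
}
```

Sampling on a per-request coin flip instead would give you traces with missing spans whenever two services disagree.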
Final Thoughts#
Observability isn't a product you buy or a tool you install. It's a practice. It's the difference between operating your service and hoping your service operates itself.
The stack I've described here — Pino for logs, Prometheus for metrics, OpenTelemetry for traces, Sentry for errors, Grafana for visualization, AlertManager for alerts — is not the simplest possible setup. But every piece earns its place by answering a question that the other pieces can't.
Start with structured logs and a health endpoint. Add metrics when you need to know "how bad is it." Add traces when you need to know "where is the time going." Each layer builds on the last, and none of them require you to rewrite your application.
The best time to add observability was before your last production incident. The second best time is now.