Node.js 可观测性:不复杂的日志、指标和链路追踪
使用 Pino 做结构化日志、Prometheus 做指标、OpenTelemetry 做分布式追踪。我用于 Node.js 生产应用的可观测性技术栈,从零到全面可视。
我曾经以为可观测性就是"加几个 console.log,出问题的时候看一看"。这招管用——直到不管用的那天。转折点是一次生产事故:我们的 API 返回的都是 200,但数据是过期的。日志里没有报错,没有异常。只是默默地返回错误的响应,因为一个下游缓存过期了,四个小时都没人发现。
那次之后我才搞清楚监控和可观测性的区别。监控告诉你有问题了。可观测性告诉你为什么有问题。而生产事故就活在这两者之间的缝隙里。
这是我在试过大部分替代方案之后,为 Node.js 服务选定的可观测性技术栈。它不是世上最复杂的方案,但它能在用户注意到之前发现问题,即使有漏网之鱼,我也能在几分钟内定位原因,而不是几个小时。
三大支柱,以及为什么缺一不可#
每个人都在谈论"可观测性的三大支柱"——日志、指标和链路追踪。但没人告诉你的是,每个支柱回答的是本质上不同的问题,你需要三个都有,因为没有任何一个能回答所有问题。
日志回答的是:发生了什么?
一条日志说的是"14:23:07,用户 4821 请求了 /api/orders,因为数据库连接超时返回了 500。"它是一段叙事,讲述一个具体事件的故事。
指标回答的是:发生了多少?
一个指标说的是"过去 5 分钟内,p99 响应时间是 2.3 秒,错误率是 4.7%。"它是聚合数据,反映系统整体的健康状况,而不是某个单独的请求。
链路追踪回答的是:时间花在哪了?
一条追踪说的是"这个请求在 Express 中间件上花了 12ms,解析请求体 3ms,等 PostgreSQL 花了 847ms,序列化响应 2ms。"它是一个瀑布图,精确地告诉你瓶颈在哪里,跨服务边界也没问题。
实际意义是这样的:凌晨 3 点你的告警响了,排查的顺序几乎总是一样的。
- 指标告诉你有问题了(错误率飙升,延迟增加)
- 日志告诉你发生了什么(具体的错误信息,受影响的接口)
- 追踪告诉你为什么(哪个下游服务或数据库查询是瓶颈)
如果你只有日志,你知道什么坏了,但不知道有多严重。如果你只有指标,你知道多严重,但不知道原因是什么。如果你只有追踪,你有漂亮的瀑布图,但不知道什么时候该去看它们。
让我们逐个搭建。
使用 Pino 做结构化日志#
为什么 console.log 不够用#
我知道。你在生产环境用 console.log,觉得"挺好的"。让我告诉你为什么不行。
// What you write
console.log("User login failed", email, error.message);
// What ends up in your log file
// User login failed john@example.com ECONNREFUSED
// Now try to:
// 1. Search for all login failures in the last hour
// 2. Count failures per user
// 3. Filter out just the ECONNREFUSED errors
// 4. Correlate this with the request that triggered it
// Good luck. It's an unstructured string. You're grepping through text.
结构化日志意味着每条日志都是一个带一致字段的 JSON 对象。不再是一个对机器不友好的人类可读字符串,而是一个对机器友好(配合合适的工具也对人类友好)的对象。
// What structured logging looks like
{
"level": 50,
"time": 1709312587000,
"msg": "User login failed",
"email": "john@example.com",
"error": "ECONNREFUSED",
"requestId": "req-abc-123",
"route": "POST /api/auth/login",
"responseTime": 1247,
"pid": 12345
}
现在你可以查询了。level >= 50 AND msg = "User login failed" AND time > now() - 1h 精确给出你需要的结果。
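这种查询的思路可以用一小段 TypeScript 示意(演示用的草图,字段名沿用上面的示例,并非任何日志系统的真实查询接口):

```typescript
// 对 NDJSON 日志逐行解析后做结构化过滤——这正是日志聚合工具
// 在背后替你做的事
interface LogLine {
  level: number;
  time: number;
  msg: string;
  [key: string]: unknown;
}

export function queryLogs(
  ndjson: string,
  predicate: (line: LogLine) => boolean
): LogLine[] {
  return ndjson
    .split("\n")
    .filter((l) => l.trim().length > 0)
    .map((l) => JSON.parse(l) as LogLine)
    .filter(predicate);
}

// 等价于 level >= 50 AND msg = "User login failed" AND time > now() - 1h
const oneHourAgo = Date.now() - 3600_000;
export const loginFailures = (line: LogLine) =>
  line.level >= 50 &&
  line.msg === "User login failed" &&
  line.time > oneHourAgo;
```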
Pino vs Winston#
两个我都长期用过,简单对比一下。
Winston 更流行、更灵活、有更多 transport,但也慢得多。它还容易诱导坏习惯——"format"系统让你太容易创建非结构化的、漂亮打印的日志,开发时好看但生产环境无法解析。
Pino 更快(基准测试中快 5-10 倍),对 JSON 输出有自己的坚持,遵循 Unix 哲学:只做好一件事(把 JSON 写到 stdout),其他的交给别的工具(传输、格式化、聚合)。
我用 Pino。当你每秒要记录几千个请求时,性能差异很重要,而且这种固执己见的方式确保团队里每个开发者产出的日志都是一致的。
基础 Pino 配置#
// src/lib/logger.ts
import pino from "pino";
const isProduction = process.env.NODE_ENV === "production";
export const logger = pino({
level: process.env.LOG_LEVEL || (isProduction ? "info" : "debug"),
// In production, just JSON to stdout. PM2/container runtime handles the rest.
// In development, use pino-pretty for human-readable output.
...(isProduction
? {}
: {
transport: {
target: "pino-pretty",
options: {
colorize: true,
translateTime: "HH:MM:ss",
ignore: "pid,hostname",
},
},
}),
// Standard fields on every log line
base: {
service: process.env.SERVICE_NAME || "api",
version: process.env.APP_VERSION || "unknown",
},
// Serialize Error objects properly
serializers: {
err: pino.stdSerializers.err,
error: pino.stdSerializers.err,
req: pino.stdSerializers.req,
res: pino.stdSerializers.res,
},
// Redact sensitive fields
redact: {
paths: [
"req.headers.authorization",
"req.headers.cookie",
"password",
"creditCard",
"ssn",
],
censor: "[REDACTED]",
},
});
redact 选项至关重要。没有它,你迟早会把密码或 API key 写进日志。这不是"会不会"的问题,而是"什么时候"。某个开发者总会写出 logger.info({ body: req.body }, "incoming request"),然后你就在记录信用卡号了。脱敏是你的安全网。
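redact 的工作方式大致可以这样示意(纯演示草图,生产中直接用 Pino 内置的 redact,路径语法以 Pino 文档为准):

```typescript
// 按路径把敏感字段替换为占位符——Pino 的 redact 做的就是这类事,
// 只是实现上为性能做了编译优化
export function redactPaths(
  obj: Record<string, unknown>,
  paths: string[],
  censor = "[REDACTED]"
): Record<string, unknown> {
  const clone = structuredClone(obj);
  for (const path of paths) {
    const keys = path.split(".");
    let cur: any = clone;
    // 沿路径走到倒数第二层
    for (let i = 0; i < keys.length - 1 && cur != null; i++) {
      cur = cur[keys[i]];
    }
    const last = keys[keys.length - 1];
    if (cur != null && typeof cur === "object" && last in cur) {
      cur[last] = censor;
    }
  }
  return clone;
}
```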
日志级别:正确使用#
// FATAL (60) - The process is about to crash. Wake someone up.
logger.fatal({ err }, "Unrecoverable database connection failure");
// ERROR (50) - Something failed that shouldn't have. Investigate soon.
logger.error({ err, userId, orderId }, "Payment processing failed");
// WARN (40) - Something unexpected but handled. Keep an eye on it.
logger.warn({ retryCount: 3, service: "email" }, "Retry limit approaching");
// INFO (30) - Normal operations worth recording. The "what happened" log.
logger.info({ userId, action: "login" }, "User authenticated");
// DEBUG (20) - Detailed information for debugging. Never in production.
logger.debug({ query, params }, "Database query executing");
// TRACE (10) - Extremely detailed. Only when you're desperate.
logger.trace({ headers: req.headers }, "Incoming request headers");
规则:如果你在 INFO 和 DEBUG 之间纠结,那就是 DEBUG。如果你在 WARN 和 ERROR 之间纠结,问自己:"凌晨 3 点我想被这个叫醒吗?"想的话就是 ERROR,不想就是 WARN。
子 Logger 和请求上下文#
这是 Pino 真正出彩的地方。子 logger 继承父级的所有配置,但添加额外的上下文字段。
// Every log from this child logger will include userId and sessionId
const userLogger = logger.child({ userId: "usr_4821", sessionId: "ses_xyz" });
userLogger.info("User viewed dashboard");
// Output includes userId and sessionId automatically
userLogger.info({ page: "/settings" }, "User navigated");
// Output includes userId, sessionId, AND page
对于 HTTP 服务器,你希望每个请求有一个子 logger,这样该请求生命周期内的每条日志都包含请求 ID:
// src/middleware/request-logger.ts
import { randomUUID } from "node:crypto";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
export function requestLogger(req: Request, res: Response, next: NextFunction) {
const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
const startTime = performance.now();
// Attach a child logger to the request
req.log = logger.child({
requestId,
method: req.method,
url: req.originalUrl,
userAgent: req.headers["user-agent"],
ip: req.headers["x-forwarded-for"]?.toString().split(",").pop()?.trim()
|| req.socket.remoteAddress,
});
// Set the request ID header on the response for correlation
res.setHeader("x-request-id", requestId);
req.log.info("Request received");
res.on("finish", () => {
const duration = Math.round(performance.now() - startTime);
const logMethod = res.statusCode >= 500 ? "error"
: res.statusCode >= 400 ? "warn"
: "info";
req.log[logMethod]({
statusCode: res.statusCode,
duration,
contentLength: res.getHeader("content-length"),
}, "Request completed");
});
next();
}
AsyncLocalStorage 实现自动上下文传播#
子 logger 方式可以工作,但它需要你在每个函数调用中传递 req.log,很繁琐。AsyncLocalStorage 解决了这个问题——它提供一个上下文存储,跟随异步执行流传播,无需显式传递。
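AsyncLocalStorage 的传播行为可以先用一个纯 stdlib 的最小例子感受一下(示意用,函数名是随手起的):

```typescript
// run() 里启动的整条异步调用链——包括跨越 await 的部分——
// 随时都能取回同一份上下文,且并发请求之间互不串扰
import { AsyncLocalStorage } from "node:async_hooks";

const als = new AsyncLocalStorage<{ requestId: string }>();

async function deepInTheCallStack(): Promise<string | undefined> {
  await new Promise((r) => setTimeout(r, 10)); // 跨越 await 边界
  return als.getStore()?.requestId; // 无需参数传递,仍能拿到
}

export async function handleRequest(requestId: string): Promise<string | undefined> {
  return als.run({ requestId }, () => deepInTheCallStack());
}
```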
// src/lib/async-context.ts
import { AsyncLocalStorage } from "node:async_hooks";
import { logger } from "./logger";
import type { Logger } from "pino";
interface RequestContext {
requestId: string;
logger: Logger;
userId?: string;
startTime: number;
}
export const asyncContext = new AsyncLocalStorage<RequestContext>();
// Get the contextual logger from anywhere in the call stack
export function getLogger(): Logger {
const store = asyncContext.getStore();
return store?.logger || logger;
}
export function getRequestId(): string | undefined {
return asyncContext.getStore()?.requestId;
}
// src/middleware/async-context-middleware.ts
import { randomUUID } from "node:crypto";
import { asyncContext } from "../lib/async-context";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
export function asyncContextMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
const requestLogger = logger.child({ requestId });
const context = {
requestId,
logger: requestLogger,
startTime: performance.now(),
};
asyncContext.run(context, () => {
res.setHeader("x-request-id", requestId);
next();
});
}
现在调用栈中的任何函数、任何位置,都能获取到请求作用域的 logger:
// src/services/order-service.ts
import { getLogger } from "../lib/async-context";
export async function processOrder(orderId: string) {
const log = getLogger(); // Automatically has requestId attached!
log.info({ orderId }, "Processing order");
const items = await fetchOrderItems(orderId);
log.debug({ itemCount: items.length }, "Order items fetched");
const total = calculateTotal(items);
log.info({ orderId, total }, "Order processed successfully");
return { orderId, total, items };
}
// No need to pass logger as parameter. It just works.
日志聚合:日志去哪了?#
在开发环境,日志输出到 stdout,pino-pretty 让它们可读。在生产环境,情况更微妙。
PM2 方案#
如果你在 VPS 上用 PM2 跑服务(我在 VPS 部署那篇文章中介绍过),PM2 自动捕获 stdout:
# View logs in real-time
pm2 logs api --lines 100
# Logs are stored at ~/.pm2/logs/
# api-out.log — stdout (your JSON logs)
# api-error.log — stderr (uncaught exceptions, stack traces)
PM2 内置的日志轮转防止磁盘空间被撑爆:
pm2 install pm2-logrotate
pm2 set pm2-logrotate:max_size 50M
pm2 set pm2-logrotate:retain 14
pm2 set pm2-logrotate:compress true
将日志发送到 Loki 或 Elasticsearch#
一旦超出单台服务器的规模,你就需要集中式日志聚合。两个主要选项:
Grafana Loki — 被称为"日志界的 Prometheus"。轻量级,只索引标签(不全文索引),和 Grafana 配合得天衣无缝。对大多数团队来说是我的推荐。
Elasticsearch — 日志全文搜索。更强大,更吃资源,运维开销更大。如果你确实需要在上百万条日志中做全文搜索,用它。
对 Loki 来说,最简单的方案是用 Promtail 传输日志:
# /etc/promtail/config.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: node-api
static_configs:
- targets:
- localhost
labels:
job: node-api
environment: production
__path__: /home/deploy/.pm2/logs/api-out.log
pipeline_stages:
- json:
expressions:
level: level
msg: msg
service: service
- labels:
level:
service:
- timestamp:
source: time
format: UnixMs
NDJSON 格式#
Pino 默认输出换行分隔的 JSON(NDJSON)——每行一个 JSON 对象,用 \n 分隔。这很重要,因为:
- 所有日志聚合工具都理解它
- 它是可流式处理的(你可以逐行处理日志而不用缓冲整个文件)
- 标准 Unix 工具可以直接用:
cat api-out.log | jq '.msg' | sort | uniq -c | sort -rn
永远不要在生产环境配置 Pino 输出美化的多行 JSON。你会搞坏整条日志处理管线。
// WRONG in production — multi-line JSON breaks line-based processing
{
"level": 30,
"time": 1709312587000,
"msg": "Request completed"
}
// RIGHT in production — NDJSON, one object per line
{"level":30,"time":1709312587000,"msg":"Request completed"}
使用 Prometheus 做指标#
日志告诉你发生了什么。指标告诉你系统表现如何。两者的区别就像逐条查看银行流水和看账户余额的区别。
四种指标类型#
Prometheus 有四种指标类型。理解什么时候用哪种,能帮你避开最常见的错误。
Counter — 只增不减的值。请求总数、错误总数、处理的字节数。重启时归零。
// "How many requests have we served?"
const httpRequestsTotal = new Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: ["method", "route", "status_code"],
});
Gauge — 可升可降的值。当前连接数、队列长度、温度、堆内存使用量。
// "How many connections are active right now?"
const activeConnections = new Gauge({
name: "active_connections",
help: "Number of currently active connections",
});
Histogram — 观测值并按可配置的桶分类计数。请求耗时、响应大小。这就是你获取百分位数(p50、p95、p99)的方式。
// "How long do requests take?" with buckets at 10ms, 50ms, 100ms, etc.
const httpRequestDuration = new Histogram({
name: "http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
Summary — 类似 Histogram 但在客户端计算分位数。除非有特殊原因,否则用 Histogram。Summary 无法跨实例聚合。
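Histogram 的桶是累积的:一次观测会落入所有上界不小于该值的桶,再加一个兜底的 +Inf 桶。这正是服务端之后能估算分位数、能跨实例聚合的原因。下面是一个极简示意(并非 prom-client 的实现):

```typescript
// 累积桶计数的最小模型:observe() 把值计入每个上界 >= 该值的桶,
// +Inf 桶永远等于总观测数
export class TinyHistogram {
  private counts: number[];

  constructor(private buckets: number[]) {
    // 最后一个位置留给 +Inf 桶
    this.counts = new Array(buckets.length + 1).fill(0);
  }

  observe(value: number) {
    for (let i = 0; i < this.buckets.length; i++) {
      if (value <= this.buckets[i]) this.counts[i]++;
    }
    this.counts[this.counts.length - 1]++; // +Inf
  }

  cumulativeCounts(): number[] {
    return [...this.counts];
  }
}
```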
完整 prom-client 配置#
// src/lib/metrics.ts
import {
Registry,
Counter,
Histogram,
Gauge,
collectDefaultMetrics,
} from "prom-client";
// Create a custom registry to avoid polluting the global one
export const metricsRegistry = new Registry();
// Collect default Node.js metrics:
// - process_cpu_seconds_total
// - process_resident_memory_bytes
// - nodejs_heap_size_total_bytes
// - nodejs_active_handles_total
// - nodejs_eventloop_lag_seconds
// - nodejs_gc_duration_seconds
collectDefaultMetrics({
register: metricsRegistry,
prefix: "nodeapp_",
// Collect every 10 seconds
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
// --- HTTP Metrics ---
export const httpRequestsTotal = new Counter({
name: "nodeapp_http_requests_total",
help: "Total number of HTTP requests received",
labelNames: ["method", "route", "status_code"] as const,
registers: [metricsRegistry],
});
export const httpRequestDuration = new Histogram({
name: "nodeapp_http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"] as const,
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [metricsRegistry],
});
export const httpRequestSizeBytes = new Histogram({
name: "nodeapp_http_request_size_bytes",
help: "Size of HTTP request bodies in bytes",
labelNames: ["method", "route"] as const,
buckets: [100, 1000, 10000, 100000, 1000000],
registers: [metricsRegistry],
});
// --- Business Metrics ---
export const ordersProcessed = new Counter({
name: "nodeapp_orders_processed_total",
help: "Total number of orders processed",
labelNames: ["status"] as const, // "success", "failed", "refunded"
registers: [metricsRegistry],
});
export const activeWebSocketConnections = new Gauge({
name: "nodeapp_active_websocket_connections",
help: "Number of active WebSocket connections",
registers: [metricsRegistry],
});
export const externalApiDuration = new Histogram({
name: "nodeapp_external_api_duration_seconds",
help: "Duration of external API calls",
labelNames: ["service", "endpoint", "status"] as const,
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
registers: [metricsRegistry],
});
export const dbQueryDuration = new Histogram({
name: "nodeapp_db_query_duration_seconds",
help: "Duration of database queries",
labelNames: ["operation", "table"] as const,
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
registers: [metricsRegistry],
});
指标中间件#
// src/middleware/metrics-middleware.ts
import { httpRequestsTotal, httpRequestDuration } from "../lib/metrics";
import type { Request, Response, NextFunction } from "express";
// Normalize routes to avoid cardinality explosion
// /api/users/123 → /api/users/:id
// Without this, Prometheus will create a new time series for every user ID
function normalizeRoute(req: Request): string {
const route = req.route?.path || req.path;
// Replace common dynamic segments
return route
.replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/g, "/:uuid")
.replace(/\/\d+/g, "/:id")
.replace(/\/[a-f0-9]{24}/g, "/:objectId");
}
export function metricsMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
// Don't track metrics for the metrics endpoint itself
if (req.path === "/metrics") {
return next();
}
const end = httpRequestDuration.startTimer();
res.on("finish", () => {
const route = normalizeRoute(req);
const labels = {
method: req.method,
route,
status_code: res.statusCode.toString(),
};
httpRequestsTotal.inc(labels);
end(labels);
});
next();
}
/metrics 端点#
// src/routes/metrics.ts
import { Router } from "express";
import { metricsRegistry } from "../lib/metrics";
const router = Router();
router.get("/metrics", async (req, res) => {
// Basic auth protection — don't expose metrics publicly
const authHeader = req.headers.authorization;
const expected = `Basic ${Buffer.from(
`${process.env.METRICS_USER}:${process.env.METRICS_PASSWORD}`
).toString("base64")}`;
if (!authHeader || authHeader !== expected) {
res.status(401).set("WWW-Authenticate", "Basic").send("Unauthorized");
return;
}
try {
const metrics = await metricsRegistry.metrics();
res.set("Content-Type", metricsRegistry.contentType);
res.send(metrics);
} catch (err) {
res.status(500).send("Error collecting metrics");
}
});
export default router;
自定义业务指标才是真正的力量#
默认的 Node.js 指标(堆内存大小、事件循环延迟、GC 耗时)只是基本功。它们反映的是运行时的健康状况。但业务指标反映的是应用的健康状况。
// In your order service
import { ordersProcessed, externalApiDuration } from "../lib/metrics";
export async function processOrder(order: Order) {
  // Start the timer before the try block so both the success
  // and error paths record the real call duration
  const paymentTimer = externalApiDuration.startTimer({
    service: "stripe",
    endpoint: "charges.create",
  });
  try {
    const charge = await stripe.charges.create({
      amount: order.total,
      currency: "usd",
      source: order.paymentToken,
    });
    paymentTimer({ status: "success" });
    ordersProcessed.inc({ status: "success" });
    return charge;
  } catch (err) {
    paymentTimer({ status: "error" });
    ordersProcessed.inc({ status: "failed" });
    throw err;
  }
}
ordersProcessed{status="failed"} 飙升能告诉你的信息,是任何 CPU 指标都无法提供的。
标签基数:无声的杀手#
提个醒。每个唯一的标签值组合都会创建一个新的时间序列。如果你给 HTTP 请求计数器加了 userId 标签,而你有 10 万用户,你刚刚创建了 10 万+ 个时间序列。Prometheus 会被拖垮。
标签规则:
- 只用低基数:HTTP 方法(7 个值)、状态码(5 个类别)、路由(几十个,不是几千个)
- 永远不要用用户 ID、请求 ID、IP 地址或时间戳作为标签值
- 拿不准就别加。以后随时可以加新标签,但删除一个标签需要同时修改仪表盘和告警
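基数爆炸可以自己粗算:时间序列数大约是各标签取值数的乘积(最坏情况,假设所有组合都出现)。一个小草图:

```typescript
// 粗算一组标签最多会产生多少时间序列:各标签取值数的乘积
export function estimateSeries(
  labelCardinalities: Record<string, number>
): number {
  return Object.values(labelCardinalities).reduce((acc, n) => acc * n, 1);
}

// 安全的组合:7 种方法 × 30 条路由 × 5 类状态码 = 1050 个序列
const safe = estimateSeries({ method: 7, route: 30, status_code: 5 });

// 加上 userId 标签(10 万用户)之后,直接乘以十万倍
const dangerous = estimateSeries({
  method: 7,
  route: 30,
  status_code: 5,
  userId: 100_000,
});
```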
Grafana 仪表盘#
Prometheus 存储数据,Grafana 可视化数据。这些是我在每个 Node.js 服务仪表盘上放的面板。
必备仪表盘#
1. 请求速率(requests/second)
rate(nodeapp_http_requests_total[5m])
显示流量模式。用于发现突发的峰值或下降。
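rate() 对 Counter 做的事,本质上是两次采样的增量除以时间差,并容忍进程重启导致的计数归零。下面是一个简化示意(真实的 rate() 还会在窗口内做外推):

```typescript
// 从两次 Counter 采样计算每秒速率的最小模型
interface Sample {
  value: number;
  timestampMs: number;
}

export function counterRatePerSecond(prev: Sample, cur: Sample): number {
  // 计数下降说明进程重启过、Counter 归零了,
  // 增量按从 0 重新累计算
  const delta = cur.value >= prev.value ? cur.value - prev.value : cur.value;
  return delta / ((cur.timestampMs - prev.timestampMs) / 1000);
}
```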
2. 错误率(%)
100 * (
sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(nodeapp_http_requests_total[5m]))
)
最重要的一个数字。如果超过 1%,说明有问题。
3. p50 / p95 / p99 延迟
histogram_quantile(0.99,
sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
)
p50 告诉你典型体验,p99 告诉你最差体验。如果 p99 是 p50 的 10 倍,你有尾延迟问题。
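p50 和 p99 的差距可以用最近邻取分位数的小例子直观感受(示意用;Prometheus 实际是从直方图桶里插值估算,不会保留原始样本):

```typescript
// 最近邻法取分位数:排序后取第 ceil(p% * n) 个样本
export function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1
  );
  return sorted[Math.max(0, idx)];
}
```

98 个 50ms 的请求加上 2 个 5s 的长尾:p50 完全看不出异常,p99 直接暴露问题。
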
4. 事件循环延迟
nodeapp_nodejs_eventloop_lag_p99_seconds
如果超过 100ms,你的事件循环被阻塞了。很可能是在异步路径中有同步操作。
5. 堆内存使用
nodeapp_nodejs_heap_size_used_bytes / nodeapp_nodejs_heap_size_total_bytes * 100
注意持续上升的趋势——那是内存泄漏。GC 期间的峰值是正常的。
6. 活跃句柄
nodeapp_nodejs_active_handles_total
打开的文件描述符、套接字、定时器。持续增长意味着你在泄漏句柄——很可能没有关闭数据库连接或 HTTP 响应。
Grafana 仪表盘即代码#
你可以使用 Grafana 的 provisioning 功能将仪表盘纳入版本控制:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: "Node.js Services"
orgId: 1
folder: "Services"
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
从 Grafana 导出仪表盘 JSON,提交到仓库,你的仪表盘就能在 Grafana 重装后幸存。这在生产环境不是可选项——和基础设施即代码是同一个道理。
使用 OpenTelemetry 做分布式追踪#
追踪是大多数团队最后采纳的支柱,也是他们最希望早点采纳的。当你有多个服务相互通信时(即使只是"API 服务器 + 数据库 + Redis + 外部 API"),追踪能展示一个请求完整旅程的全貌。
什么是追踪?#
追踪是一棵 span 的树。每个 span 代表一个工作单元——一次 HTTP 请求、一次数据库查询、一次函数调用。Span 有开始时间、结束时间、状态和属性。它们通过一个跨服务边界传播的 trace ID 连接在一起。
Trace: abc-123
├── [API Gateway] POST /api/orders (250ms)
│ ├── [Auth Service] validate-token (12ms)
│ ├── [Order Service] create-order (230ms)
│ │ ├── [PostgreSQL] INSERT INTO orders (15ms)
│ │ ├── [Redis] SET order:cache (2ms)
│ │ └── [Payment Service] charge (200ms)
│ │ ├── [Stripe API] POST /v1/charges (180ms)
│ │ └── [PostgreSQL] UPDATE orders SET status (8ms)
│ └── [Email Service] send-confirmation (async, 45ms)
一眼就能看出:250ms 的请求里有 180ms 花在等 Stripe 上。优化方向明确了。
OpenTelemetry 配置#
OpenTelemetry(OTel)是行业标准。它用一个统一的、厂商中立的 API 替代了碎片化的 Jaeger 客户端、Zipkin 客户端和各种厂商专属 SDK。
// src/instrumentation.ts
// This file MUST be loaded before any other imports.
// In Node.js, use --require or --import flag.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "node-api",
[ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.0.0",
"deployment.environment": process.env.NODE_ENV || "development",
}),
// Send traces to your collector (Jaeger, Tempo, etc.)
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318/v1/traces",
}),
// Optionally send metrics through OTel too
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318/v1/metrics",
}),
exportIntervalMillis: 15000,
}),
// Auto-instrumentation: automatically creates spans for
// HTTP requests, Express routes, PostgreSQL queries, Redis commands,
// DNS lookups, and more
instrumentations: [
getNodeAutoInstrumentations({
// Disable noisy instrumentations
"@opentelemetry/instrumentation-fs": { enabled: false },
"@opentelemetry/instrumentation-dns": { enabled: false },
// Configure specific ones
"@opentelemetry/instrumentation-http": {
ignoreIncomingPaths: ["/health", "/ready", "/metrics"],
},
"@opentelemetry/instrumentation-express": {
ignoreLayersType: ["middleware"],
},
}),
],
});
sdk.start();
// Graceful shutdown
process.on("SIGTERM", () => {
sdk.shutdown().then(
() => console.log("OTel SDK shut down successfully"),
(err) => console.error("Error shutting down OTel SDK", err)
);
});
启动应用时这样跑:
node --import ./src/instrumentation.ts ./src/server.ts
就这样。你的应用代码零改动,现在每个 HTTP 请求、每个数据库查询、每个 Redis 命令都有追踪了。(注意:直接运行 .ts 文件需要 tsx 这类 TypeScript 加载器;编译部署时把路径换成编译产物的 .js。)
手动创建 Span#
自动探针覆盖了基础设施调用,但有时你想追踪业务逻辑:
// src/services/order-service.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
export async function processOrder(orderId: string): Promise<Order> {
return tracer.startActiveSpan("processOrder", async (span) => {
try {
span.setAttribute("order.id", orderId);
// This span becomes the parent of any auto-instrumented
// DB queries or HTTP calls inside these functions
const order = await fetchOrder(orderId);
span.setAttribute("order.total", order.total);
span.setAttribute("order.item_count", order.items.length);
const validationResult = await tracer.startActiveSpan(
"validateOrder",
async (validationSpan) => {
const result = await validateInventory(order);
validationSpan.setAttribute("validation.passed", result.valid);
if (!result.valid) {
validationSpan.setStatus({
code: SpanStatusCode.ERROR,
message: `Validation failed: ${result.reason}`,
});
}
validationSpan.end();
return result;
}
);
if (!validationResult.valid) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: "Order validation failed",
});
throw new Error(validationResult.reason);
}
const payment = await processPayment(order);
span.setAttribute("payment.id", payment.id);
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (err) {
span.recordException(err as Error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: (err as Error).message,
});
throw err;
} finally {
span.end();
}
});
}
追踪上下文传播#
分布式追踪的魔法在于 trace ID 跟随请求跨服务传播。当服务 A 调用服务 B 时,追踪上下文自动注入到 HTTP 头中(按照 W3C Trace Context 标准使用 traceparent 头)。
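traceparent 头的格式很简单:版本-trace-id-parent-id-标志位。可以手搓一个构造与解析的小草图,看看跨服务传播的到底是什么(示意用,生产中由 OTel 的 propagator 处理):

```typescript
// W3C Trace Context 的 traceparent 头:
// 00-{32 位十六进制 trace-id}-{16 位十六进制 parent-id}-{2 位十六进制标志}
import { randomBytes } from "node:crypto";

export function makeTraceparent(traceId?: string, spanId?: string): string {
  const tid = traceId ?? randomBytes(16).toString("hex"); // 32 个十六进制字符
  const sid = spanId ?? randomBytes(8).toString("hex"); // 16 个十六进制字符
  return `00-${tid}-${sid}-01`; // 版本 00,标志位 01 = 已采样
}

export function parseTraceparent(header: string) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    traceId: m[1],
    parentSpanId: m[2],
    sampled: (parseInt(m[3], 16) & 1) === 1,
  };
}
```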
自动探针为出站 HTTP 调用处理了这个。但如果你用的是消息队列,需要手动传播:
import { context, propagation } from "@opentelemetry/api";
// When publishing a message
function publishEvent(queue: string, payload: object) {
const carrier: Record<string, string> = {};
// Inject current trace context into the carrier
propagation.inject(context.active(), carrier);
// Send both the payload and the trace context
messageQueue.publish(queue, {
payload,
traceContext: carrier,
});
}
// When consuming a message
function consumeEvent(message: QueueMessage) {
// Extract the trace context from the message
const parentContext = propagation.extract(
context.active(),
message.traceContext
);
// Run the handler within the extracted context
// Now any spans created here will be children of the original trace
context.with(parentContext, () => {
tracer.startActiveSpan("processEvent", (span) => {
span.setAttribute("queue.message_id", message.id);
handleEvent(message.payload);
span.end();
});
});
}
追踪数据发到哪里#
Jaeger — 经典的开源选项。UI 不错,用 Docker 本地跑很方便。长期存储能力有限。
Grafana Tempo — 如果你已经在用 Grafana 和 Loki,Tempo 是追踪的自然选择。使用对象存储(S3/GCS)实现低成本的长期保留。
Grafana Cloud / Datadog / Honeycomb — 如果你不想自己运维基础设施。更贵,但运维开销更低。
本地开发用 Docker 跑 Jaeger 非常合适:
# docker-compose.otel.yml
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
environment:
- COLLECTOR_OTLP_ENABLED=true
健康检查端点#
健康检查是最简单的可观测性形式,也应该是你最先实现的。它回答一个问题:"这个服务现在能不能正常处理请求?"
三种健康检查#
/health — 综合健康状况。进程在跑且有响应吗?
/ready — 就绪检查。这个服务能处理流量吗?(数据库连上了吗?配置加载了吗?缓存预热了吗?)
/live — 存活检查。进程是否存活且没有死锁?(能在超时时间内响应简单请求吗?)
在 Kubernetes 中这个区分很重要:存活探针重启卡死的容器,就绪探针在启动期间或依赖故障时从负载均衡器中移除容器。
// src/routes/health.ts
import { Router } from "express";
import { Pool } from "pg";
import Redis from "ioredis";
const router = Router();
interface HealthCheckResult {
status: "ok" | "degraded" | "error";
checks: Record<
string,
{
status: "ok" | "error";
latency?: number;
message?: string;
}
>;
uptime: number;
timestamp: string;
version: string;
}
async function checkDatabase(pool: Pool): Promise<{ ok: boolean; latency: number }> {
const start = performance.now();
try {
await pool.query("SELECT 1");
return { ok: true, latency: Math.round(performance.now() - start) };
} catch {
return { ok: false, latency: Math.round(performance.now() - start) };
}
}
async function checkRedis(redis: Redis): Promise<{ ok: boolean; latency: number }> {
const start = performance.now();
try {
await redis.ping();
return { ok: true, latency: Math.round(performance.now() - start) };
} catch {
return { ok: false, latency: Math.round(performance.now() - start) };
}
}
export function createHealthRoutes(pool: Pool, redis: Redis) {
// Liveness — just check if the process can respond
router.get("/live", (_req, res) => {
res.status(200).json({ status: "ok" });
});
// Readiness — check all dependencies
router.get("/ready", async (_req, res) => {
const [db, cache] = await Promise.all([
checkDatabase(pool),
checkRedis(redis),
]);
const allOk = db.ok && cache.ok;
res.status(allOk ? 200 : 503).json({
status: allOk ? "ok" : "not_ready",
checks: {
database: db,
redis: cache,
},
});
});
// Full health — detailed status for dashboards and debugging
router.get("/health", async (_req, res) => {
const [db, cache] = await Promise.all([
checkDatabase(pool),
checkRedis(redis),
]);
const anyError = !db.ok || !cache.ok;
const allError = !db.ok && !cache.ok;
const result: HealthCheckResult = {
status: allError ? "error" : anyError ? "degraded" : "ok",
checks: {
database: {
status: db.ok ? "ok" : "error",
latency: db.latency,
...(!db.ok && { message: "Connection failed" }),
},
redis: {
status: cache.ok ? "ok" : "error",
latency: cache.latency,
...(!cache.ok && { message: "Connection failed" }),
},
},
uptime: process.uptime(),
timestamp: new Date().toISOString(),
version: process.env.APP_VERSION || "unknown",
};
// Return 200 for ok/degraded (service can still handle some traffic)
// Return 503 for error (service should be removed from rotation)
res.status(result.status === "error" ? 503 : 200).json(result);
});
return router;
}
Kubernetes 探针配置#
# k8s/deployment.yml
spec:
containers:
- name: api
livenessProbe:
httpGet:
path: /live
port: 3000
initialDelaySeconds: 10
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3 # Restart after 3 consecutive failures (45s)
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 2 # Remove from LB after 2 failures (20s)
startupProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30 # Give up to 150s for startup
一个常见错误:存活探针配得太激进。如果你的存活探针检查数据库,而数据库暂时挂了,Kubernetes 会重启容器。但重启并不能修好数据库。现在你在数据库故障之上又多了一个崩溃循环。保持存活探针简单——它应该只检测死锁或卡住的进程。
使用 Sentry 做错误追踪#
日志捕获你预料到的错误。Sentry 捕获你没预料到的。
这个区分很重要。你在你知道可能出错的代码周围加 try/catch。但真正重要的 bug 是那些你以为安全的代码中的。未处理的 promise rejection、来自意外 API 响应的类型错误、可选链上的空指针访问——因为它其实不够"可选"。
Node.js 的 Sentry 配置#
// src/lib/sentry.ts
import * as Sentry from "@sentry/node";
import { nodeProfilingIntegration } from "@sentry/profiling-node";
export function initSentry() {
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV || "development",
release: process.env.APP_VERSION || "unknown",
// Sample 10% of transactions for performance monitoring
// (100% in development)
tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,
// Profile 100% of sampled transactions
profilesSampleRate: 1.0,
integrations: [
nodeProfilingIntegration(),
// Filter out noisy errors
Sentry.rewriteFramesIntegration({
root: process.cwd(),
}),
],
// Don't send errors from development
enabled: process.env.NODE_ENV === "production",
// Filter out known non-issues
ignoreErrors: [
// Client disconnects aren't bugs
"ECONNRESET",
"ECONNABORTED",
"EPIPE",
// Bots sending garbage
"SyntaxError: Unexpected token",
],
// Strip PII before sending
beforeSend(event) {
// Remove IP addresses
if (event.request) {
delete event.request.headers?.["x-forwarded-for"];
delete event.request.headers?.["x-real-ip"];
delete event.request.cookies;
}
// Remove sensitive query params
if (event.request?.query_string) {
const params = new URLSearchParams(event.request.query_string);
params.delete("token");
params.delete("api_key");
event.request.query_string = params.toString();
}
return event;
},
});
}
Express 错误处理器 + Sentry#
// src/middleware/error-handler.ts
import * as Sentry from "@sentry/node";
import { getLogger } from "../lib/async-context";
import type { Request, Response, NextFunction } from "express";
// Sentry request handler must come first
export const sentryRequestHandler = Sentry.Handlers.requestHandler();
// Sentry tracing handler
export const sentryTracingHandler = Sentry.Handlers.tracingHandler();
// Your custom error handler comes last
export function errorHandler(
err: Error,
req: Request,
res: Response,
_next: NextFunction
) {
const log = getLogger();
// Add custom context to the Sentry event
Sentry.withScope((scope) => {
scope.setTag("route", req.route?.path || req.path);
scope.setTag("method", req.method);
if (req.user) {
scope.setUser({
id: req.user.id,
// Don't send email or username to Sentry
});
}
// Add breadcrumbs for debugging
scope.addBreadcrumb({
category: "request",
message: `${req.method} ${req.path}`,
level: "info",
data: {
query: req.query,
statusCode: res.statusCode,
},
});
Sentry.captureException(err);
});
// Log the error with full context
log.error(
{
err,
statusCode: 500,
route: req.route?.path || req.path,
method: req.method,
},
"Unhandled error in request handler"
);
// Send a generic error response
// Never expose error details to the client in production
res.status(500).json({
error: "Internal Server Error",
...(process.env.NODE_ENV !== "production" && {
message: err.message,
stack: err.stack,
}),
});
}
Source Map#
没有 source map,Sentry 显示的是压缩/转译后的堆栈跟踪。没用。在构建时上传 source map:
# In your CI/CD pipeline
npx @sentry/cli sourcemaps upload \
--org your-org \
--project your-project \
--release $APP_VERSION \
./dist
或在打包器中配置:
// vite.config.ts (or equivalent)
import { sentryVitePlugin } from "@sentry/vite-plugin";
export default defineConfig({
build: {
sourcemap: true, // Required for Sentry
},
plugins: [
sentryVitePlugin({
org: process.env.SENTRY_ORG,
project: process.env.SENTRY_PROJECT,
authToken: process.env.SENTRY_AUTH_TOKEN,
}),
],
});
未处理 Promise Rejection 的代价#
从 Node.js 15 开始,未处理的 promise rejection 默认会导致进程崩溃。这是好事——它迫使你处理错误。但你需要一个安全网:
// src/server.ts — near the top of your entry point
process.on("unhandledRejection", (reason, promise) => {
logger.fatal({ reason, promise }, "Unhandled promise rejection — crashing");
Sentry.captureException(reason);
// Flush Sentry events before crashing
Sentry.flush(2000).finally(() => {
process.exit(1);
});
});
process.on("uncaughtException", (error) => {
logger.fatal({ err: error }, "Uncaught exception — crashing");
Sentry.captureException(error);
Sentry.flush(2000).finally(() => {
process.exit(1);
});
});
关键点:process.exit() 之前要先 Sentry.flush()。否则进程可能在错误事件发送到 Sentry 之前就退出了。
告警:真正有意义的告警#
有 200 个 Prometheus 指标但零个告警,那只是虚荣的监控。有 50 个告警每天都在响,那是告警疲劳——你会开始忽略它们,然后错过那个真正重要的。
目标是少量高信号告警,意味着"确实有问题了,需要人来看看"。
Prometheus AlertManager 配置#
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: $SLACK_WEBHOOK_URL
route:
receiver: "slack-warnings"
group_by: ["alertname", "service"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: "pagerduty-critical"
repeat_interval: 15m
- match:
severity: warning
receiver: "slack-warnings"
receivers:
- name: "pagerduty-critical"
pagerduty_configs:
- routing_key: $PAGERDUTY_ROUTING_KEY
severity: critical
- name: "slack-warnings"
slack_configs:
- channel: "#alerts"
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
真正会把我叫醒的告警#
# prometheus/rules/node-api.yml
groups:
- name: node-api-critical
rules:
# High error rate — something is broken
- alert: HighErrorRate
expr: |
(
sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(nodeapp_http_requests_total[5m]))
) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 1% for 5 minutes"
description: "{{ $value | humanizePercentage }} of requests are returning 5xx"
# Slow responses — users are suffering
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
) > 1
for: 5m
labels:
severity: critical
annotations:
summary: "p99 latency above 1 second for 5 minutes"
description: "p99 latency is {{ $value | humanizeDuration }}"
# Memory leak — will OOM soon
- alert: HighHeapUsage
expr: |
(
nodeapp_nodejs_heap_size_used_bytes
/
nodeapp_nodejs_heap_size_total_bytes
) > 0.80
for: 10m
labels:
severity: critical
annotations:
summary: "Heap usage above 80% for 10 minutes"
description: "Heap usage is at {{ $value | humanizePercentage }}"
# Process is down
- alert: ServiceDown
expr: up{job="node-api"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Node.js API is down"
- name: node-api-warnings
rules:
# Event loop is getting slow
- alert: HighEventLoopLag
expr: |
nodeapp_nodejs_eventloop_lag_seconds{quantile="0.99"} > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Event loop lag above 100ms"
description: "p99 event loop lag is {{ $value | humanizeDuration }}"
# Traffic dropped significantly — possible routing issue
- alert: TrafficDrop
expr: |
sum(rate(nodeapp_http_requests_total[5m]))
< (sum(rate(nodeapp_http_requests_total[5m] offset 1h)) * 0.5)
for: 10m
labels:
severity: warning
annotations:
summary: "Traffic dropped by more than 50% compared to 1 hour ago"
# Database queries getting slow
- alert: SlowDatabaseQueries
expr: |
histogram_quantile(0.99,
sum(rate(nodeapp_db_query_duration_seconds_bucket[5m])) by (le, operation)
) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "p99 database query time above 500ms"
description: "Slow {{ $labels.operation }} queries: {{ $value | humanizeDuration }}"
# External API is failing
- alert: ExternalAPIFailures
expr: |
(
sum(rate(nodeapp_external_api_duration_seconds_count{status="error"}[5m])) by (service)
/
sum(rate(nodeapp_external_api_duration_seconds_count[5m])) by (service)
) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "External API {{ $labels.service }} failing >10%"

注意每个告警上的 for 子句。没有它,一次短暂的峰值就会触发告警。5 分钟的 for 意味着条件必须持续为真 5 分钟,这消除了瞬时毛刺带来的噪音。
告警卫生#
每个告警都必须通过这个测试:
- 可操作吗? 如果没人能做任何事,就别告警。记日志、放仪表盘,但别把人叫醒。
- 需要人工干预吗? 如果它会自愈(比如短暂的网络抖动),for 子句应该过滤掉它。
- 过去 30 天触发过吗? 如果没有,可能是配置有误或阈值不对。检查一下。
- 触发时团队在意吗? 如果团队经常忽略它,删掉或调整阈值。
我每季度审计一次告警。每个告警有三种结果之一:保留、调整阈值或删除。
组装到一起:Express 应用#
所有拼图是怎么在一个真实应用中组合在一起的:
// src/server.ts
import { initSentry } from "./lib/sentry";
// Initialize Sentry first — before other imports
initSentry();
import express from "express";
import * as Sentry from "@sentry/node";
import { Pool } from "pg";
import Redis from "ioredis";
import { logger } from "./lib/logger";
import { asyncContextMiddleware } from "./middleware/async-context-middleware";
import { metricsMiddleware } from "./middleware/metrics-middleware";
import { requestLogger } from "./middleware/request-logger";
import {
sentryRequestHandler,
sentryTracingHandler,
errorHandler,
} from "./middleware/error-handler";
import { createHealthRoutes } from "./routes/health";
import metricsRouter from "./routes/metrics";
import apiRouter from "./routes/api";
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);
// --- Middleware Order Matters ---
// 1. Sentry request handler (must be first)
app.use(sentryRequestHandler);
app.use(sentryTracingHandler);
// 2. Async context (creates request-scoped context)
app.use(asyncContextMiddleware);
// 3. Request logging
app.use(requestLogger);
// 4. Metrics collection
app.use(metricsMiddleware);
// 5. Body parsing
app.use(express.json({ limit: "1mb" }));
// --- Routes ---
// Health checks (no auth required)
app.use(createHealthRoutes(pool, redis));
// Metrics (basic auth protected)
app.use(metricsRouter);
// API routes
app.use("/api", apiRouter);
// --- Error Handling ---
// Sentry error handler (must be before custom error handler)
app.use(Sentry.Handlers.errorHandler());
// Custom error handler (must be last)
app.use(errorHandler);
// --- Start ---
const port = parseInt(process.env.PORT || "3000", 10);
const server = app.listen(port, () => {
  logger.info(
    {
      port,
      nodeEnv: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
    },
    "Server started"
  );
});
// Graceful shutdown
async function shutdown(signal: string) {
  logger.info({ signal }, "Shutdown signal received");
  // Stop accepting new connections; in-flight requests finish first
  await new Promise<void>((resolve) => {
    server.close(() => resolve());
  });
  // Close database pool
  await pool.end().catch((err) => {
    logger.error({ err }, "Error closing database pool");
  });
  // Close Redis connection
  await redis.quit().catch((err) => {
    logger.error({ err }, "Error closing Redis connection");
  });
  // Flush Sentry
  await Sentry.close(2000);
  logger.info("Shutdown complete");
  process.exit(0);
}
process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

最小可行技术栈#
以上是"完整"技术栈。你不需要第一天就全部用上。以下是随着项目增长逐步扩展可观测性的路线。
阶段 1:个人项目 / 独立开发者#
你需要三样东西:
- 结构化控制台日志 — 用 Pino,输出 JSON 到 stdout。即使你只是用 pm2 logs 看日志,JSON 格式也是可搜索和可解析的。
- 一个 /health 端点 — 5 分钟实现,当你排查"它到底跑没跑"的时候能救你。
- Sentry 免费版 — 捕获你没预料到的错误。免费版每月 5000 个事件,个人项目完全够用。
// This is the minimal setup. Under 50 lines. No excuses.
import pino from "pino";
import express from "express";
import * as Sentry from "@sentry/node";
const logger = pino({ level: "info" });
const app = express();
Sentry.init({ dsn: process.env.SENTRY_DSN });
app.use(Sentry.Handlers.requestHandler());
app.get("/health", (_req, res) => {
res.json({ status: "ok", uptime: process.uptime() });
});
app.use("/api", apiRoutes); // your app routes, defined elsewhere
app.use(Sentry.Handlers.errorHandler());
app.use((err: Error, _req: express.Request, res: express.Response, _next: express.NextFunction) => {
logger.error({ err }, "Unhandled error");
res.status(500).json({ error: "Internal Server Error" });
});
app.listen(3000, () => logger.info("Server started on port 3000"));

阶段 2:成长期项目 / 小团队#
加上:
- Prometheus 指标 + Grafana — 当"感觉好像变慢了"不够用,你需要数据的时候。从请求速率、错误率和 p99 延迟开始。
- 日志聚合 — 当 ssh 到服务器上 grep 文件不再能扩展的时候。如果已经在用 Grafana,就用 Loki + Promtail。
- 基础告警 — 错误率 > 1%,p99 > 1s,服务宕机。三个告警,就够了。
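第一个仪表盘可以直接从这三条查询开始。下面是基于本文 nodeapp_http_* 指标命名的 PromQL 草稿(指标名以你实际暴露的为准):

```promql
# Rate:每秒请求数
sum(rate(nodeapp_http_requests_total[5m]))

# Errors:5xx 占比
sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(nodeapp_http_requests_total[5m]))

# Duration:p99 延迟(秒)
histogram_quantile(0.99,
  sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
)
```

这三条和后面告警规则里的表达式是同一套,仪表盘和告警用同一份查询,阈值调起来也一致。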
阶段 3:生产服务 / 多服务架构#
加上:
- 用 OpenTelemetry 做分布式追踪 — 当"API 慢"变成"它调用的 5 个服务中哪个慢?"OTel 自动探针零代码改动就能给你 80% 的价值。
- 仪表盘即代码 — 把 Grafana 仪表盘纳入版本控制。需要重建它们时你会感谢自己的。
- 结构化告警 — AlertManager 配合适当的路由、升级和静默规则。
- 业务指标 — 订单/秒、转化率、队列深度。你的产品团队关心的指标。
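"仪表盘即代码"最简单的落地方式是 Grafana 的文件 provisioning:把导出的 dashboard JSON 和类似下面的 provider 配置一起提交到仓库(文件路径是示意,按你的部署调整):

```yaml
# grafana/provisioning/dashboards/node-api.yml — 路径为示例
apiVersion: 1
providers:
  - name: "node-api"
    folder: "Services"
    type: file
    allowUiUpdates: false # 只允许通过代码修改仪表盘
    options:
      # 挂载进 Grafana 容器的 dashboard JSON 目录
      path: /etc/grafana/dashboards
```

Grafana 启动时会自动加载该目录下的所有 JSON,重建环境就是重新部署一次的事。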
可以跳过的#
- 按主机收费的 APM 厂商 — 规模大了成本惊人。开源方案(Prometheus + Grafana + Tempo + Loki)能给你 95% 的功能。
- 生产环境 INFO 以下的日志级别 — 你会产生 TB 级的 DEBUG 日志并为存储付费。只在主动排查问题时用 DEBUG,然后关掉。
- 给所有东西加自定义指标 — 从 RED 方法(Rate、Errors、Duration)开始,给每个服务做。只在有具体问题要回答时才加自定义指标。
- 复杂的追踪采样 — 从简单的采样率开始(生产环境 10%)。自适应采样对大多数团队来说是过早优化。
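"简单的采样率"甚至不用写代码:OpenTelemetry SDK 规范定义了两个环境变量,直接把生产采样固定在 10%(示例值,按需调整):

```shell
# 基于 trace ID 的比例采样,并遵循父 span 的采样决定
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
# 采样率参数:0.1 = 10%
export OTEL_TRACES_SAMPLER_ARG=0.1
```

parentbased_traceidratio 保证同一条 trace 里的 span 采样决定一致,跨服务也不会出现"半截"追踪。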
最后的思考#
可观测性不是你买的产品或装的工具。它是一种实践。它是运维你的服务和期望你的服务自己运行之间的区别。
我在这里介绍的技术栈——Pino 做日志、Prometheus 做指标、OpenTelemetry 做追踪、Sentry 做错误捕获、Grafana 做可视化、AlertManager 做告警——不是最简单的方案。但每一个组件都通过回答其他组件无法回答的问题来证明自己的价值。
从结构化日志和健康检查端点开始。需要知道"有多严重"时加指标。需要知道"时间花在哪了"时加追踪。每一层都建立在上一层之上,而且没有一层要求你重写应用。
加可观测性的最佳时机是上一次生产事故之前。第二好的时机是现在。