Observability in Node.js: Logs, Metrics, and Traces Without the Complexity

I used to think observability meant "add a few console.log calls and check them when something breaks." That worked until it didn't. The breaking point was a production incident where the API returned 200 but the data was stale. No errors in the logs. No exceptions. Just silently wrong responses, because a downstream cache had gone stale and nobody noticed for four hours.

That was when I learned the difference between monitoring and observability. Monitoring tells you that something is wrong. Observability tells you why it is wrong. The gap between the two is where production incidents live.

This is the observability stack I have settled on for Node.js services after trying most of the alternatives. It is not the most sophisticated setup in the world, but it catches problems before users notice, and when something does slip through, I can diagnose it in minutes instead of hours.

The Three Pillars, and Why You Need All Three

Everyone talks about the "three pillars of observability": logs, metrics, and traces. What nobody tells you is that each pillar answers a fundamentally different question, and you need all three because no single pillar can answer every question.

Logs answer: What happened?

A log line says "at 14:23:07, user 4821 requested /api/orders and got a 500 because the database connection timed out." That is narrative. It tells you the story of one specific event.

Metrics answer: How much is happening?

A metric says "over the last 5 minutes, p99 response time was 2.3 seconds and the error rate was 4.7%." That is aggregate data. It tells you about the health of the system as a whole, not about any individual request.

Traces answer: Where did the time go?

A trace says "this request spent 12ms in Express middleware, 3ms parsing the body, 847ms waiting on PostgreSQL, and 2ms serializing the response." That is a waterfall. It tells you exactly where the bottleneck is, across service boundaries.

Here is the practical consequence: when the pager goes off at 3 a.m., the sequence is almost always the same.

1. Metrics tell you that something is wrong (error rate spiking, latency climbing)
2. Logs tell you what is happening (the specific error message, the affected endpoint)
3. Traces tell you why (which downstream service or database query is the bottleneck)

If you only have logs, you know what broke but not how bad it is. If you only have metrics, you know how bad it is but not what caused it. If you only have traces, you have beautiful waterfalls but no idea when to look at them.

Let's build each one.

Structured Logging With Pino

Why console.log Is Not Enough

I know. You have been using console.log in production and it's "fine." Let me show you why it is not.

typescript

// What you write
console.log("User login failed", email, error.message);

// What ends up in the log file
// User login failed john@example.com ECONNREFUSED

// Now try to:
// 1. Find every login failure in the past hour
// 2. Count failures per user
// 3. Filter for ECONNREFUSED errors only
// 4. Correlate this line with the request that triggered it
// Good luck. It is an unstructured string, so you are grepping through text.

Structured logging means every log entry is a JSON object with consistent fields. Instead of a string that is readable for humans but hostile to machines, you get an object that is readable for machines and still readable for humans (with the right tooling).

typescript

// What a structured log entry looks like
{
  "level": 50,
  "time": 1709312587000,
  "msg": "User login failed",
  "email": "john@example.com",
  "error": "ECONNREFUSED",
  "requestId": "req-abc-123",
  "route": "POST /api/auth/login",
  "responseTime": 1247,
  "pid": 12345
}

Now you can query. A filter such as level >= 50 AND msg = "User login failed" AND time > now() - 1h gives you exactly what you need (the exact syntax depends on your log backend).
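
To make "queryable" concrete, here is a minimal sketch that applies the same kind of filter to raw NDJSON lines in plain TypeScript. The `LogEntry` shape and the `queryLogs` helper are illustrative assumptions, not part of Pino or of any log backend.

```typescript
// Hypothetical helper: filter NDJSON log lines by level, message, and time.
interface LogEntry {
  level: number;
  time: number; // epoch milliseconds, as in the example entry above
  msg: string;
  [key: string]: unknown;
}

function queryLogs(
  ndjson: string,
  opts: { minLevel: number; msg: string; since: number }
): LogEntry[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry)
    .filter(
      (e) => e.level >= opts.minLevel && e.msg === opts.msg && e.time > opts.since
    );
}

const sampleLogs = [
  '{"level":50,"time":1709312587000,"msg":"User login failed","email":"john@example.com"}',
  '{"level":30,"time":1709312588000,"msg":"User authenticated","userId":"usr_4821"}',
  '{"level":50,"time":1709300000000,"msg":"User login failed","email":"old@example.com"}',
].join("\n");

const recentFailures = queryLogs(sampleLogs, {
  minLevel: 50,
  msg: "User login failed",
  since: 1709312587000 - 60 * 60 * 1000, // one hour before the first entry
});
console.log(recentFailures.length);
```

None of this is possible with the console.log string without writing a fragile regex first.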

Pino vs Winston

I have used both extensively. Here is the short version:

Winston is more popular, more flexible, has more transports, and is significantly slower. It also encourages a bad pattern: its "format" system makes it too easy to produce unstructured, pretty-printed logs that look nice in development but cannot be parsed in production.

Pino is faster (5-10x in benchmarks), opinionated about JSON output, and follows the Unix philosophy: do one thing well (write JSON to stdout) and let other tools handle the rest (transport, formatting, aggregation).

I use Pino. The performance difference matters when you log thousands of requests per second, and the opinionated approach means every developer on the team produces consistent logs.

Basic Pino Setup

typescript

// src/lib/logger.ts
import pino from "pino";
 
const isProduction = process.env.NODE_ENV === "production";
 
export const logger = pino({
  level: process.env.LOG_LEVEL || (isProduction ? "info" : "debug"),
  // In production, write plain JSON to stdout. PM2 or the container runtime handles the rest.
  // In development, use pino-pretty for readable output.
  ...(isProduction
    ? {}
    : {
        transport: {
          target: "pino-pretty",
          options: {
            colorize: true,
            translateTime: "HH:MM:ss",
            ignore: "pid,hostname",
          },
        },
      }),
  // Standard fields on every log line
  base: {
    service: process.env.SERVICE_NAME || "api",
    version: process.env.APP_VERSION || "unknown",
  },
  // Serialize Error objects properly
  serializers: {
    err: pino.stdSerializers.err,
    error: pino.stdSerializers.err,
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
  },
  // Redact sensitive fields
  redact: {
    paths: [
      "req.headers.authorization",
      "req.headers.cookie",
      "password",
      "creditCard",
      "ssn",
    ],
    censor: "[REDACTED]",
  },
});

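The redact option is critical. Without it, you will eventually log a password or an API key. It is not a matter of if, but when. Some developer will add logger.info({ body: req.body }, "incoming request") and suddenly you are logging credit card numbers. Redaction is the safety net. Keep in mind that Pino matches redact paths by location: "password" covers only a top-level field, so nested data such as a logged request body needs a path like "body.password" or a wildcard like "*.password".

Log Levels: Using Them Correctly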

typescript

// FATAL (60) - The process is about to crash. Wake someone up.
logger.fatal({ err }, "Unrecoverable database connection failure");

// ERROR (50) - Something failed that should not have. Investigate soon.
logger.error({ err, userId, orderId }, "Payment processing failed");

// WARN (40) - Something unexpected happened but was handled. Keep an eye on it.
logger.warn({ retryCount: 3, service: "email" }, "Retry limit approaching");

// INFO (30) - Normal operations worth recording. The "what happened" log.
logger.info({ userId, action: "login" }, "User authenticated");

// DEBUG (20) - Detailed information for debugging. Never in production.
logger.debug({ query, params }, "Database query executing");

// TRACE (10) - Extremely detailed. Only when you are desperate.
logger.trace({ headers: req.headers }, "Incoming request headers");

The rule: if you are torn between INFO and DEBUG, it is DEBUG. If you are torn between WARN and ERROR, ask yourself: "Would I want to be alerted about this at 3 a.m.?" If yes, ERROR. If no, WARN.

Child Loggers and Request Context

This is where Pino really shines. A child logger inherits all of the parent's configuration and adds context fields.

typescript

// Every log from this child logger will include userId and sessionId
const userLogger = logger.child({ userId: "usr_4821", sessionId: "ses_xyz" });

userLogger.info("User viewed dashboard");
// Output includes userId and sessionId automatically

userLogger.info({ page: "/settings" }, "User navigated");
// Output includes userId, sessionId, AND page

For an HTTP server, you want one child logger per request so that every log line in that request's lifecycle includes the request ID:

typescript

// src/middleware/request-logger.ts
import { randomUUID } from "node:crypto";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
 
export function requestLogger(req: Request, res: Response, next: NextFunction) {
  const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
  const startTime = performance.now();
 
  // Attach a child logger to the request
  // (for this to compile, Express's Request type must be augmented with a `log` property)
  req.log = logger.child({
    requestId,
    method: req.method,
    url: req.originalUrl,
    userAgent: req.headers["user-agent"],
    ip: req.headers["x-forwarded-for"]?.toString().split(",").pop()?.trim()
        || req.socket.remoteAddress,
  });
 
  // Set the request ID header on the response for correlation
  res.setHeader("x-request-id", requestId);
 
  req.log.info("Request received");
 
  res.on("finish", () => {
    const duration = Math.round(performance.now() - startTime);
    const logMethod = res.statusCode >= 500 ? "error"
                    : res.statusCode >= 400 ? "warn"
                    : "info";
 
    req.log[logMethod]({
      statusCode: res.statusCode,
      duration,
      contentLength: res.getHeader("content-length"),
    }, "Request completed");
  });
 
  next();
}

AsyncLocalStorage for Automatic Context Propagation

The child logger approach works, but it requires you to pass req.log through every function call. That gets tedious. AsyncLocalStorage solves this: it provides a context store that follows the async execution flow without being passed explicitly.

typescript

// src/lib/async-context.ts
import { AsyncLocalStorage } from "node:async_hooks";
import { logger } from "./logger";
import type { Logger } from "pino";
 
interface RequestContext {
  requestId: string;
  logger: Logger;
  userId?: string;
  startTime: number;
}
 
export const asyncContext = new AsyncLocalStorage<RequestContext>();
 
// Get the context-aware logger from anywhere in the call stack
export function getLogger(): Logger {
  const store = asyncContext.getStore();
  return store?.logger || logger;
}
 
export function getRequestId(): string | undefined {
  return asyncContext.getStore()?.requestId;
}

typescript

// src/middleware/async-context-middleware.ts
import { randomUUID } from "node:crypto";
import { asyncContext } from "../lib/async-context";
import { logger } from "../lib/logger";
import type { Request, Response, NextFunction } from "express";
 
export function asyncContextMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const requestId = req.headers["x-request-id"]?.toString() || randomUUID();
  const requestLogger = logger.child({ requestId });
 
  const context = {
    requestId,
    logger: requestLogger,
    startTime: performance.now(),
  };
 
  asyncContext.run(context, () => {
    res.setHeader("x-request-id", requestId);
    next();
  });
}

Now any function, anywhere in the call stack, can get the request-scoped logger:

typescript

// src/services/order-service.ts
import { getLogger } from "../lib/async-context";
 
export async function processOrder(orderId: string) {
  const log = getLogger(); // Automatically has requestId attached!
 
  log.info({ orderId }, "Processing order");
 
  const items = await fetchOrderItems(orderId);
  log.debug({ itemCount: items.length }, "Order items fetched");
 
  const total = calculateTotal(items);
  log.info({ orderId, total }, "Order processed successfully");
 
  return { orderId, total, items };
}
 
// No need to pass the logger as a parameter. It just works.
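
If the propagation feels like magic, a stripped-down experiment may help. The sketch below uses only node:async_hooks, with no Express or Pino; `requestStore`, `handle`, and `demo` are made-up names for illustration. Two simulated requests overlap in time, and each still reads back its own ID after an await.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative store: only a request ID, no logger.
const requestStore = new AsyncLocalStorage<{ requestId: string }>();

function currentRequestId(): string | undefined {
  return requestStore.getStore()?.requestId;
}

// Simulates a handler that awaits before reading the context.
async function handle(requestId: string, delayMs: number): Promise<string | undefined> {
  return requestStore.run({ requestId }, async () => {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return currentRequestId();
  });
}

// The two handlers overlap in time, yet each sees its own ID.
async function demo(): Promise<(string | undefined)[]> {
  return Promise.all([handle("req-a", 20), handle("req-b", 5)]);
}

demo().then((ids) => console.log(ids));
```

Outside of any run() callback, getStore() returns undefined, which is why a getLogger() helper needs a fallback to the root logger.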

Log Aggregation: Where Do Logs Go?

In development, logs go to stdout and pino-pretty makes them readable. In production, it is more involved.

The PM2 Route

If you run on a VPS with PM2 (which I covered in the VPS setup article), PM2 captures stdout automatically:

bash

# View logs in real time
pm2 logs api --lines 100

# Logs are stored in ~/.pm2/logs/
# api-out.log   - stdout (your JSON logs)
# api-error.log - stderr (uncaught exceptions, stack traces)

Log rotation through the pm2-logrotate module (installed separately, not built into PM2) prevents disk space problems:

bash

pm2 install pm2-logrotate
pm2 set pm2-logrotate:max_size 50M
pm2 set pm2-logrotate:retain 14
pm2 set pm2-logrotate:compress true

Shipping Logs to Loki or Elasticsearch

For anything beyond a single server, you need centralized log aggregation. The two main options:

Grafana Loki: "Prometheus for logs." Lightweight, indexes only labels (not full text), and works well with Grafana. My recommendation for most teams.

Elasticsearch: Full-text search over logs. More powerful, more resource-hungry, more operational overhead. Use it if you genuinely need full-text search across millions of log lines.

For Loki, the simplest setup uses Promtail to ship logs:

yaml

# /etc/promtail/config.yml
server:
  http_listen_port: 9080
 
positions:
  filename: /tmp/positions.yaml
 
clients:
  - url: http://loki:3100/loki/api/v1/push
 
scrape_configs:
  - job_name: node-api
    static_configs:
      - targets:
          - localhost
        labels:
          job: node-api
          environment: production
          __path__: /home/deploy/.pm2/logs/api-out.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: msg
            service: service
      - labels:
          level:
          service:
      - timestamp:
          source: time
          format: UnixMs

The NDJSON Format

Pino outputs Newline Delimited JSON (NDJSON) by default: one JSON object per line, separated by \n. This matters because:

Every log aggregation tool understands it
It is streamable (you can process logs line by line without buffering the whole file)
Standard Unix tools work: cat api-out.log | jq '.msg' | sort | uniq -c | sort -rn

Never configure Pino to output pretty-printed, multi-line JSON in production. You will break every tool in the pipeline.

typescript

// WRONG in production: multi-line JSON breaks line-based processing
{
  "level": 30,
  "time": 1709312587000,
  "msg": "Request completed"
}
 
// RIGHT in production: NDJSON, one object per line
{"level":30,"time":1709312587000,"msg":"Request completed"}
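
The "streamable" property is easy to see in code. Below is a minimal sketch of a line-based reader that accepts chunks cut at arbitrary positions, as a network stream or file reader would deliver them. `createNdjsonReader` is a hypothetical helper written for this example, not a Pino API.

```typescript
// Hypothetical line-by-line NDJSON reader: chunks may split a record anywhere.
function createNdjsonReader(onRecord: (record: unknown) => void) {
  let buffer = "";
  return {
    push(chunk: string): void {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // the last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim() !== "") onRecord(JSON.parse(line));
      }
    },
  };
}

const seen: unknown[] = [];
const reader = createNdjsonReader((record) => {
  seen.push(record);
});

// The first record is split across two chunks.
reader.push('{"level":30,"msg":"Request comp');
reader.push('leted"}\n{"level":50,"msg":"Payment processing failed"}\n');
console.log(seen.length);
```

With multi-line JSON, the reader could not tell where one record ends without a full JSON parser, which is exactly the problem described above.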

Metrics With Prometheus

Logs tell you what happened. Metrics tell you how the system is performing. The difference is like reading every transaction on a bank statement versus looking at the account balance.

The Four Metric Types

Prometheus has four metric types. Understanding when to use which will save you from the most common mistakes.

Counter: A value that only goes up. Request count, error count, bytes processed. It resets to zero when the process restarts.

typescript

// "How many requests have we served?"
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
});
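
Because a counter resets to zero on restart, anything that reads it has to treat a drop as a reset instead of a negative change. The sketch below is a simplified model of that idea (Prometheus's own rate() and increase() also extrapolate at the window edges, which is ignored here); `counterIncrease` is a name made up for this example.

```typescript
// Sketch of a reset-aware increase over raw counter samples (oldest first).
function counterIncrease(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1]!;
    const curr = samples[i]!;
    // A drop means the process restarted and the counter began again at zero.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// 100 -> 120 (+20), restart, 5 (+5), 25 (+20)
console.log(counterIncrease([100, 120, 5, 25]));
```

This is why you graph rate(counter) and never the raw counter value.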

Gauge: A value that can go up or down. Current connections, queue size, temperature, heap usage.

typescript

// "How many connections are active right now?"
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Number of currently active connections",
});

Histogram: Observes values and counts them in configurable buckets. Request duration, response size. This is how you get percentiles (p50, p95, p99).

typescript

// "How long do requests take?" with buckets at 10ms, 50ms, 100ms, etc.
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

Summary: Like a Histogram, but quantiles are computed on the client side. Use a Histogram instead unless you have a specific reason. Summary quantiles cannot be aggregated across instances.

Full prom-client Setup

typescript

// src/lib/metrics.ts
import {
  Registry,
  Counter,
  Histogram,
  Gauge,
  collectDefaultMetrics,
} from "prom-client";
 
// Create a custom registry to avoid polluting the global registry
export const metricsRegistry = new Registry();

// Collect the default Node.js metrics, including:
// - process_cpu_seconds_total
// - process_resident_memory_bytes
// - nodejs_heap_size_total_bytes
// - nodejs_active_handles_total
// - nodejs_eventloop_lag_seconds
// - nodejs_gc_duration_seconds
collectDefaultMetrics({
  register: metricsRegistry,
  prefix: "nodeapp_",
  // Buckets (in seconds) for the GC duration histogram
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
 
// --- HTTP Metric ---
 
export const httpRequestsTotal = new Counter({
  name: "nodeapp_http_requests_total",
  help: "Total number of HTTP requests received",
  labelNames: ["method", "route", "status_code"] as const,
  registers: [metricsRegistry],
});
 
export const httpRequestDuration = new Histogram({
  name: "nodeapp_http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"] as const,
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [metricsRegistry],
});
 
export const httpRequestSizeBytes = new Histogram({
  name: "nodeapp_http_request_size_bytes",
  help: "Size of HTTP request bodies in bytes",
  labelNames: ["method", "route"] as const,
  buckets: [100, 1000, 10000, 100000, 1000000],
  registers: [metricsRegistry],
});
 
// --- Business Metric ---
 
export const ordersProcessed = new Counter({
  name: "nodeapp_orders_processed_total",
  help: "Total number of orders processed",
  labelNames: ["status"] as const, // "success", "failed", "refunded"
  registers: [metricsRegistry],
});
 
export const activeWebSocketConnections = new Gauge({
  name: "nodeapp_active_websocket_connections",
  help: "Number of active WebSocket connections",
  registers: [metricsRegistry],
});
 
export const externalApiDuration = new Histogram({
  name: "nodeapp_external_api_duration_seconds",
  help: "Duration of external API calls",
  labelNames: ["service", "endpoint", "status"] as const,
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
  registers: [metricsRegistry],
});
 
export const dbQueryDuration = new Histogram({
  name: "nodeapp_db_query_duration_seconds",
  help: "Duration of database queries",
  labelNames: ["operation", "table"] as const,
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [metricsRegistry],
});

Metrics Middleware

typescript

// src/middleware/metrics-middleware.ts
import { httpRequestsTotal, httpRequestDuration } from "../lib/metrics";
import type { Request, Response, NextFunction } from "express";
 
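// Normalize routes to avoid cardinality explosion
// /api/users/123 → /api/users/:id
// Without this, Prometheus would create a new time series for every user ID
function normalizeRoute(req: Request): string {
  // Prefer the matched route pattern; baseUrl restores the router's mount prefix
  const route = req.route?.path ? `${req.baseUrl}${req.route.path}` : req.path;

  // Replace common dynamic segments. Each rule matches a whole path segment and
  // the numeric rule runs last, so an ObjectId that starts with digits is not
  // partially rewritten into "/:id" plus leftover characters.
  return route
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi, "/:uuid")
    .replace(/\/[a-f0-9]{24}(?=\/|$)/gi, "/:objectId")
    .replace(/\/\d+(?=\/|$)/g, "/:id");
}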
 
export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  // Do not track metrics for the metrics endpoint itself
  if (req.path === "/metrics") {
    return next();
  }
 
  const end = httpRequestDuration.startTimer();
 
  res.on("finish", () => {
    const route = normalizeRoute(req);
    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode.toString(),
    };
 
    httpRequestsTotal.inc(labels);
    end(labels);
  });
 
  next();
}
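
Route normalization is easiest to reason about as a pure function on the path string. The following is a minimal sketch under that assumption; `normalizePath` is a hypothetical standalone variant, not the middleware's own function, and the sample paths are invented.

```typescript
// Hypothetical standalone normalizer operating on a raw path string.
// Order: UUIDs, then 24-char hex ObjectIds, then plain numeric segments.
// The lookahead (?=\/|$) makes every rule match a whole segment only.
function normalizePath(path: string): string {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi, "/:uuid")
    .replace(/\/[0-9a-f]{24}(?=\/|$)/gi, "/:objectId")
    .replace(/\/\d+(?=\/|$)/g, "/:id");
}

const samples = [
  "/api/users/123",
  "/api/orders/507f1f77bcf86cd799439011/items",
  "/api/files/123e4567-e89b-12d3-a456-426614174000",
  "/api/v2/users/42",
];
for (const p of samples) {
  console.log(p, "->", normalizePath(p));
}
```

Matching whole segments matters: an unanchored numeric rule would turn the ObjectId path into "/:id" followed by leftover hex characters, which is still one series per document.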

The /metrics Endpoint

typescript

// src/routes/metrics.ts
import { Router } from "express";
import { metricsRegistry } from "../lib/metrics";
 
const router = Router();
 
router.get("/metrics", async (req, res) => {
  // Basic auth protection: do not expose metrics publicly
  const authHeader = req.headers.authorization;
  const expected = `Basic ${Buffer.from(
    `${process.env.METRICS_USER}:${process.env.METRICS_PASSWORD}`
  ).toString("base64")}`;
 
  if (!authHeader || authHeader !== expected) {
    res.status(401).set("WWW-Authenticate", "Basic").send("Unauthorized");
    return;
  }
 
  try {
    const metrics = await metricsRegistry.metrics();
    res.set("Content-Type", metricsRegistry.contentType);
    res.send(metrics);
  } catch (err) {
    res.status(500).send("Error collecting metrics");
  }
});
 
export default router;
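
Two details of the header check are worth hardening. If METRICS_USER or METRICS_PASSWORD is unset, the expected value becomes the encoding of "undefined:undefined", which a client could guess; and a plain string comparison is not constant-time. A minimal sketch of a stricter check follows, using only node:crypto. `isAuthorized` is a hypothetical helper, not part of Express or prom-client.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides so the buffers always have equal length,
// which timingSafeEqual requires.
function sha256(value: string): Buffer {
  return createHash("sha256").update(value).digest();
}

// Hypothetical helper: true only for a matching Basic auth header.
function isAuthorized(
  authHeader: string | undefined,
  user: string | undefined,
  password: string | undefined
): boolean {
  // Refuse everything when credentials are not configured.
  if (!authHeader || !user || !password) return false;
  const expected = `Basic ${Buffer.from(`${user}:${password}`).toString("base64")}`;
  return timingSafeEqual(sha256(authHeader), sha256(expected));
}

const header = `Basic ${Buffer.from("prom:secret").toString("base64")}`;
console.log(isAuthorized(header, "prom", "secret"));
```

In the route handler this would be called with req.headers.authorization and the two environment variables.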

Custom Business Metrics Are the Real Power

The default Node.js metrics (heap size, event loop lag, GC duration) are table stakes. They tell you about runtime health. Business metrics tell you about application health.

typescript

// In the order service
import { ordersProcessed, externalApiDuration } from "../lib/metrics";

export async function processOrder(order: Order) {
  // Time the call to the payment provider. The timer starts once, before the
  // try block, so the failure path records the real duration as well.
  const paymentTimer = externalApiDuration.startTimer({
    service: "stripe",
    endpoint: "charges.create",
  });

  try {
    const charge = await stripe.charges.create({
      amount: order.total,
      currency: "usd",
      source: order.paymentToken,
    });

    paymentTimer({ status: "success" });
    ordersProcessed.inc({ status: "success" });
    return charge;
  } catch (err) {
    paymentTimer({ status: "error" });
    ordersProcessed.inc({ status: "failed" });
    throw err;
  }
}

A spike in nodeapp_orders_processed_total{status="failed"} (the ordersProcessed counter) tells you something that no amount of CPU metrics can.

Label Cardinality: The Silent Killer

A word of warning. Every unique combination of label values creates a new time series. If you add a userId label to the HTTP request counter and you have 100,000 users, you have just created 100,000+ time series. Prometheus will fall over.

Rules for labels:

Low cardinality only: HTTP method (7 values), status code (5 categories), route (dozens, not thousands)
Never use user IDs, request IDs, IP addresses, or timestamps as label values
If in doubt, leave the label out. You can always add one later, but removing one means changing dashboards and alerts
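
The arithmetic behind that warning is a simple product. As a rough sketch, with an assumed 30 routes and a made-up `maxSeries` helper, the worst-case series count for one counter looks like this:

```typescript
// Worst-case series count is the product of each label's distinct values.
function maxSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

const safe = maxSeries({ method: 7, status_code: 5, route: 30 });
const withUserId = maxSeries({ method: 7, status_code: 5, route: 30, userId: 100_000 });

console.log(safe);       // 1050
console.log(withUserId); // 105000000
```

For a histogram the number is multiplied again, because every bucket is its own series. Real workloads rarely hit every combination, but the upper bound shows how quickly a single high-cardinality label dominates.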

Grafana Dashboards

Prometheus stores the data. Grafana visualizes it. These are the panels I put on every Node.js service dashboard.

The Essential Dashboard

1. Request Rate (requests/second)

promql

rate(nodeapp_http_requests_total[5m])

Shows traffic patterns. Useful for spotting sudden spikes or drops.

2. Error Rate (%)

promql

100 * (
  sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(nodeapp_http_requests_total[5m]))
)

The single most important number. If it goes above 1%, something is wrong.

3. p50 / p95 / p99 Latency

promql

histogram_quantile(0.99,
  sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
)

p50 tells you the typical experience. p99 tells you what the slowest 1% of requests experience. If p99 is 10x p50, you have a tail latency problem. Swap 0.99 for 0.5 or 0.95 in the query to get the other percentiles.
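
It helps to know that these percentiles are estimates derived from bucket counts, not exact values. The sketch below is a simplified model of the linear interpolation that histogram_quantile performs, assuming positive bucket bounds and cumulative counts; `estimateQuantile` and the sample buckets are illustrative only.

```typescript
interface Bucket {
  le: number;    // upper bound; the last bucket is +Infinity
  count: number; // cumulative count of observations <= le
}

// Simplified model of histogram_quantile: linear interpolation inside
// the bucket that contains the requested rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1]?.count ?? 0;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const bucket = sorted[i]!;
  const prev = i > 0 ? sorted[i - 1]! : { le: 0, count: 0 };

  // Rank falls in the +Inf bucket: the best answer is the highest finite bound.
  if (!Number.isFinite(bucket.le)) return prev.le;

  const inBucket = bucket.count - prev.count;
  if (inBucket === 0) return prev.le;
  return prev.le + (bucket.le - prev.le) * ((rank - prev.count) / inBucket);
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.5, latencyBuckets), estimateQuantile(0.95, latencyBuckets));
```

The estimate can only be as precise as the bucket it lands in, so it pays to place bucket boundaries near the latencies you actually care about.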

4. Event Loop Lag

promql

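nodeapp_nodejs_eventloop_lag_p99_seconds

If this goes above 100ms, the event loop is being blocked. Most likely a synchronous operation in an async path. (prom-client exposes event loop lag percentiles as separate gauges such as this one, not as a quantile label.)

5. Heap Usage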

promql

nodeapp_nodejs_heap_size_used_bytes / nodeapp_nodejs_heap_size_total_bytes * 100

Watch for a steady upward trend: that is a memory leak. Spikes around garbage collection are normal.

6. Active Handles

promql

nodeapp_nodejs_active_handles_total

Open file descriptors, sockets, and timers. A number that keeps climbing means you are leaking handles, most likely by not closing database connections or HTTP responses.

Grafana Dashboards as Code

You can version-control dashboards with Grafana's provisioning feature:

yaml

# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: "Node.js Services"
    orgId: 1
    folder: "Services"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Export the dashboard JSON from Grafana, commit it to your repo, and the dashboards survive a Grafana reinstall. This is not optional for production; it is the same principle as infrastructure as code.

Distributed Tracing With OpenTelemetry

Tracing is the pillar most teams adopt last, and the one they wish they had adopted first. Once you have multiple services talking to each other (even if it is just "API server + database + Redis + external API"), tracing gives you the full picture of a request's journey.

What Is a Trace?

A trace is a tree of spans. Each span represents a unit of work: an HTTP request, a database query, a function call. A span has a start time, an end time, a status, and attributes. Spans are linked by a trace ID that is propagated across service boundaries.

Trace: abc-123
├── [API Gateway] POST /api/orders (250ms)
│   ├── [Auth Service] validate-token (12ms)
│   ├── [Order Service] create-order (230ms)
│   │   ├── [PostgreSQL] INSERT INTO orders (15ms)
│   │   ├── [Redis] SET order:cache (2ms)
│   │   └── [Payment Service] charge (200ms)
│   │       ├── [Stripe API] POST /v1/charges (180ms)
│   │       └── [PostgreSQL] UPDATE orders SET status (8ms)
│   └── [Email Service] send-confirmation (async, 45ms)

One look tells you: the 250ms request spent 180ms waiting on Stripe. That is where to optimize.

Setting Up OpenTelemetry

OpenTelemetry (OTel) is the standard. It replaces the fragmented landscape of Jaeger clients, Zipkin clients, and vendor-specific SDKs with a single, vendor-neutral API.
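
The "one look" can also be expressed as a small computation over the span tree. The sketch below is illustrative only: `SpanNode`, `selfTime`, and `slowestSpan` are invented names, and it assumes child spans run one after another, which real traces do not guarantee.

```typescript
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time = a span's duration minus the time covered by its direct children.
// Clamped at zero because async children can exceed the parent's own budget.
function selfTime(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Depth-first walk returning the span with the largest self time.
function slowestSpan(root: SpanNode): SpanNode {
  let best = root;
  const visit = (span: SpanNode): void => {
    if (selfTime(span) > selfTime(best)) best = span;
    span.children.forEach(visit);
  };
  visit(root);
  return best;
}

const leaf = (name: string, durationMs: number): SpanNode => ({ name, durationMs, children: [] });

const traceTree: SpanNode = {
  name: "POST /api/orders",
  durationMs: 250,
  children: [
    leaf("validate-token", 12),
    {
      name: "create-order",
      durationMs: 230,
      children: [
        leaf("INSERT INTO orders", 15),
        leaf("SET order:cache", 2),
        {
          name: "charge",
          durationMs: 200,
          children: [leaf("POST /v1/charges", 180), leaf("UPDATE orders SET status", 8)],
        },
      ],
    },
    leaf("send-confirmation", 45),
  ],
};

console.log(slowestSpan(traceTree).name);
```

Tracing UIs do this visually, but the same idea works for automated checks on exported traces.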

typescript

// src/instrumentation.ts
// This file MUST be loaded before any other import.
// In Node.js, use the --require or --import flag.
 
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
 
// OTEL_EXPORTER_OTLP_ENDPOINT is a base URL by convention; each signal gets its own path.
const otlpBase = (process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318").replace(/\/$/, "");
// Paths excluded from tracing (health checks and metric scrapes)
const ignoredPaths = ["/health", "/ready", "/metrics"];

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "node-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.0.0",
    "deployment.environment": process.env.NODE_ENV || "development",
  }),

  // Send traces to a collector (Jaeger, Tempo, etc.)
  traceExporter: new OTLPTraceExporter({
    url: `${otlpBase}/v1/traces`,
  }),

  // Optionally send metrics through OTel
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${otlpBase}/v1/metrics`,
    }),
    exportIntervalMillis: 15000,
  }),

  // Auto-instrumentation: automatically creates spans for
  // HTTP requests, Express routes, PostgreSQL queries, Redis commands,
  // DNS lookups, and more
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy instrumentations
      "@opentelemetry/instrumentation-fs": { enabled: false },
      "@opentelemetry/instrumentation-dns": { enabled: false },
      // Specific configuration
      "@opentelemetry/instrumentation-http": {
        // Recent releases use this hook; older ones accepted ignoreIncomingPaths
        ignoreIncomingRequestHook: (req) =>
          ignoredPaths.includes((req.url ?? "").split("?")[0]),
      },
      "@opentelemetry/instrumentation-express": {
        ignoreLayersType: ["middleware"],
      },
    }),
  ],
});
 
sdk.start();
 
// Graceful shutdown
process.on("SIGTERM", () => {
  sdk.shutdown().then(
    () => console.log("OTel SDK shut down successfully"),
    (err) => console.error("Error shutting down OTel SDK", err)
  );
});

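Start the application with the command below. It is shown against the TypeScript sources; on Node.js versions that cannot run .ts files directly, point it at your compiled .js output or use a TypeScript loader.

bash

node --import ./src/instrumentation.ts ./src/server.ts

That's it. With zero changes to application code, you now have traces for every HTTP request, every database query, and every Redis command.

Creating Spans Manually

Auto-instrumentation covers infrastructure calls, but sometimes you want to trace business logic: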

typescript

// src/services/order-service.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
 
const tracer = trace.getTracer("order-service");
 
export async function processOrder(orderId: string): Promise<Order> {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
 
      // This span becomes the parent of every auto-instrumented
      // DB query or HTTP call made inside these functions
      const order = await fetchOrder(orderId);
      span.setAttribute("order.total", order.total);
      span.setAttribute("order.item_count", order.items.length);

      const validationResult = await tracer.startActiveSpan(
        "validateOrder",
        async (validationSpan) => {
          try {
            const result = await validateInventory(order);
            validationSpan.setAttribute("validation.passed", result.valid);
            if (!result.valid) {
              validationSpan.setStatus({
                code: SpanStatusCode.ERROR,
                message: `Validation failed: ${result.reason}`,
              });
            }
            return result;
          } finally {
            // End the span even if validateInventory throws
            validationSpan.end();
          }
        }
      );
 
      if (!validationResult.valid) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: "Order validation failed",
        });
        throw new Error(validationResult.reason);
      }
 
      const payment = await processPayment(order);
      span.setAttribute("payment.id", payment.id);
 
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (err as Error).message,
      });
      throw err;
    } finally {
      span.end();
    }
  });
}

Trace Context Propagation

The magic of distributed tracing is that the trace ID follows the request across services. When Service A calls Service B, the trace context is automatically injected into the HTTP headers (the traceparent header, per the W3C Trace Context standard).

Auto-instrumentation handles this for outgoing HTTP calls. If you use a message queue, though, you need to propagate the context manually:

typescript

import { context, propagation, trace } from "@opentelemetry/api";

// The tracer name is illustrative
const tracer = trace.getTracer("event-consumer");

// When publishing a message
function publishEvent(queue: string, payload: object) {
  const carrier: Record<string, string> = {};

  // Inject the current trace context into the carrier
  propagation.inject(context.active(), carrier);

  // Send both the payload and the trace context
  messageQueue.publish(queue, {
    payload,
    traceContext: carrier,
  });
}

// When consuming a message
async function consumeEvent(message: QueueMessage) {
  // Extract the trace context from the message
  const parentContext = propagation.extract(
    context.active(),
    message.traceContext
  );

  // Run the handler inside the extracted context.
  // Every span created here becomes a child of the original trace.
  await context.with(parentContext, () =>
    tracer.startActiveSpan("processEvent", async (span) => {
      try {
        span.setAttribute("queue.message_id", message.id);
        await handleEvent(message.payload);
      } finally {
        // End the span even if the handler throws
        span.end();
      }
    })
  );
}
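
What actually travels in the carrier is a small, fixed-format string. As a rough illustration of the version 00 traceparent layout from the W3C Trace Context specification (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags), here is a minimal parser. `parseTraceparent` is a hypothetical helper for inspection and debugging; in real code the OpenTelemetry propagator does this for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Parses a version-00 traceparent value; returns null for anything malformed.
function parseTraceparent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;

  const version = match[1]!;
  const traceId = match[2]!;
  const parentSpanId = match[3]!;
  const flags = match[4]!;

  // "ff" is an invalid version; all-zero IDs are invalid by specification.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;

  // The lowest flag bit marks the trace as sampled.
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const parsed = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parsed?.traceId, parsed?.sampled);
```

Logging the parsed trace ID next to the request ID is a cheap way to jump from a log line to the matching trace.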

Where to Send Traces

Jaeger: The classic open-source choice. Good UI, easy to run locally with Docker. Limited long-term storage.

Grafana Tempo: If you already use Grafana and Loki, Tempo is the natural choice for traces. It uses object storage (S3/GCS) for cost-effective long-term retention.

Grafana Cloud / Datadog / Honeycomb: If you do not want to run infrastructure. More expensive, less operational overhead.

For local development, Jaeger in Docker is perfect:

yaml

# docker-compose.otel.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Health Check Endpoints

Health checks are the simplest form of observability and the first thing you should implement. They answer one question: "Is this service able to serve requests right now?"

Three Kinds of Health Check

/health: General health. Is the process running and responding?

/ready: Readiness. Can this service handle traffic? (Is the database connected? Is configuration loaded? Is the cache warm?)

/live: Liveness. Is the process alive and not deadlocked? (Can it respond to a simple request within the timeout?)

The distinction matters for Kubernetes, where the liveness probe restarts a stuck container and the readiness probe takes a container out of the load balancer during startup or a dependency failure.
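
Since probes have their own timeouts, a dependency check that hangs is as bad as one that fails. One possible way to bound each check is a small timeout wrapper; the sketch below is an assumption-laden illustration, and `withTimeout` is a made-up helper, not something Express, pg, or ioredis provides.

```typescript
// Hypothetical helper: reject if a dependency check takes longer than `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: a fast check passes, a hung check is reported as failed.
const fastCheck = new Promise<string>((resolve) => setTimeout(() => resolve("ok"), 5));
const hungCheck = new Promise<string>(() => {
  /* never settles, like a stuck connection */
});

withTimeout(fastCheck, 100).then((result) => console.log("fast:", result));
withTimeout(hungCheck, 20).catch((err: Error) => console.log("hung:", err.message));
```

A readiness handler could wrap its database and cache checks this way and treat a rejection as "not ready".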

typescript

// src/routes/health.ts
import { Router } from "express";
import { Pool } from "pg";
import Redis from "ioredis";
 
const router = Router();
 
interface HealthCheckResult {
  status: "ok" | "degraded" | "error";
  checks: Record<
    string,
    {
      status: "ok" | "error";
      latency?: number;
      message?: string;
    }
  >;
  uptime: number;
  timestamp: string;
  version: string;
}
 
async function checkDatabase(pool: Pool): Promise<{ ok: boolean; latency: number }> {
  const start = performance.now();
  try {
    await pool.query("SELECT 1");
    return { ok: true, latency: Math.round(performance.now() - start) };
  } catch {
    return { ok: false, latency: Math.round(performance.now() - start) };
  }
}
 
async function checkRedis(redis: Redis): Promise<{ ok: boolean; latency: number }> {
  const start = performance.now();
  try {
    await redis.ping();
    return { ok: true, latency: Math.round(performance.now() - start) };
  } catch {
    return { ok: false, latency: Math.round(performance.now() - start) };
  }
}
 
export function createHealthRoutes(pool: Pool, redis: Redis) {
  // Liveness: only checks that the process can respond
  router.get("/live", (_req, res) => {
    res.status(200).json({ status: "ok" });
  });
 
  // Readiness: checks every dependency
  router.get("/ready", async (_req, res) => {
    const [db, cache] = await Promise.all([
      checkDatabase(pool),
      checkRedis(redis),
    ]);
 
    const allOk = db.ok && cache.ok;
 
    res.status(allOk ? 200 : 503).json({
      status: allOk ? "ok" : "not_ready",
      checks: {
        database: db,
        redis: cache,
      },
    });
  });
 
  // Full health: detailed status for dashboards and debugging
  router.get("/health", async (_req, res) => {
    const [db, cache] = await Promise.all([
      checkDatabase(pool),
      checkRedis(redis),
    ]);
 
    const anyError = !db.ok || !cache.ok;
    const allError = !db.ok && !cache.ok;
 
    const result: HealthCheckResult = {
      status: allError ? "error" : anyError ? "degraded" : "ok",
      checks: {
        database: {
          status: db.ok ? "ok" : "error",
          latency: db.latency,
          ...(!db.ok && { message: "Connection failed" }),
        },
        redis: {
          status: cache.ok ? "ok" : "error",
          latency: cache.latency,
          ...(!cache.ok && { message: "Connection failed" }),
        },
      },
      uptime: process.uptime(),
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || "unknown",
    };
 
    // Return 200 for ok/degraded (the service can still handle traffic)
    // Return 503 for error (the service should be taken out of rotation)
    res.status(result.status === "error" ? 503 : 200).json(result);
  });
 
  return router;
}
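
One gap in the checks above: if a dependency hangs instead of failing, the readiness endpoint hangs with it, and the probe only fails once its own timeout expires. A minimal sketch of bounding each check, assuming a hypothetical withTimeout helper (not part of the original code) and an arbitrary time budget:

```typescript
// Hypothetical helper: bound an async dependency check to a maximum duration.
type CheckResult = { ok: boolean; latency: number };

async function withTimeout(
  check: () => Promise<CheckResult>,
  timeoutMs: number
): Promise<CheckResult> {
  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves to a failed result once the time budget is spent.
  const timeout = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, latency: Date.now() - start }),
      timeoutMs
    );
  });

  try {
    // Whichever settles first wins. A slow check is not cancelled,
    // its result is simply ignored.
    return await Promise.race([check(), timeout]);
  } catch {
    // A check that throws counts as a failed dependency.
    return { ok: false, latency: Date.now() - start };
  } finally {
    clearTimeout(timer);
  }
}
```

In the readiness route this would wrap the existing calls, for example withTimeout(() => checkDatabase(pool), 2000), with the budget kept below the probe's timeoutSeconds.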

Kubernetes Probe Configuration#

yaml

# k8s/deployment.yml
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /live
          port: 3000
        initialDelaySeconds: 10
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3    # Restart after 3 consecutive failures (45s)
      readinessProbe:
        httpGet:
          path: /ready
          port: 3000
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 2    # Remove from the LB after 2 failures (20s)
      startupProbe:
        httpGet:
          path: /ready
          port: 3000
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30   # Allow up to 150s for startup

A common mistake: making the liveness probe too aggressive. If the liveness probe checks the database and the database goes down temporarily, Kubernetes restarts the container. But a restart does not fix the database. Now you have a crash loop on top of a database outage. Keep liveness probes simple: they should only detect a deadlocked or stuck process.
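
The durations in the YAML comments come from multiplying periodSeconds by failureThreshold. A small sketch of that arithmetic, with failureWindowSeconds as a hypothetical helper. It ignores timeoutSeconds and scheduling jitter, so treat the result as an approximation:

```typescript
interface ProbeTiming {
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time until Kubernetes acts on a failing probe:
// it must fail `failureThreshold` times in a row, one attempt per period.
function failureWindowSeconds(probe: ProbeTiming): number {
  return probe.periodSeconds * probe.failureThreshold;
}

const livenessWindow = failureWindowSeconds({ periodSeconds: 15, failureThreshold: 3 });
const readinessWindow = failureWindowSeconds({ periodSeconds: 10, failureThreshold: 2 });
const startupWindow = failureWindowSeconds({ periodSeconds: 5, failureThreshold: 30 });

console.log({ livenessWindow, readinessWindow, startupWindow });
// { livenessWindow: 45, readinessWindow: 20, startupWindow: 150 }
```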

Error Tracking With Sentry#

Logs catch the errors you expected. Sentry catches the ones you didn't.

The distinction matters. You put try/catch around code you know can fail. But the bugs that matter most live in code you thought was safe: unhandled promise rejections, type errors from an unexpected API response, null access on an optional chain that wasn't optional enough.

Sentry Setup for Node.js#

typescript

// src/lib/sentry.ts
import * as Sentry from "@sentry/node";
import { nodeProfilingIntegration } from "@sentry/profiling-node";
 
export function initSentry() {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV || "development",
    release: process.env.APP_VERSION || "unknown",
 
    // Sample 10% of transactions for performance monitoring in production.
    // The 100% branch only applies if you also enable Sentry outside production.
    tracesSampleRate: process.env.NODE_ENV === "production" ? 0.1 : 1.0,

    // Profile 100% of the sampled transactions
    profilesSampleRate: 1.0,

    integrations: [
      nodeProfilingIntegration(),
      // Rewrite stack frame paths relative to the project root
      // so they line up with the uploaded source maps
      Sentry.rewriteFramesIntegration({
        root: process.cwd(),
      }),
    ],

    // Send events only from production
    enabled: process.env.NODE_ENV === "production",

    // Filter known non-issues
    ignoreErrors: [
      // Client disconnects are not bugs
      "ECONNRESET",
      "ECONNABORTED",
      "EPIPE",
      // Bots sending garbage
      "SyntaxError: Unexpected token",
    ],

    // Strip PII before sending
    beforeSend(event) {
      // Remove IP address headers and cookies
      if (event.request) {
        delete event.request.headers?.["x-forwarded-for"];
        delete event.request.headers?.["x-real-ip"];
        delete event.request.cookies;
      }
 
      // Remove sensitive query parameters
      if (event.request?.query_string) {
        const params = new URLSearchParams(event.request.query_string);
        params.delete("token");
        params.delete("api_key");
        event.request.query_string = params.toString();
      }
 
      return event;
    },
  });
}
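
The query-string scrubbing in beforeSend is easy to test on its own. A minimal sketch as a standalone function, where scrubQueryString and SENSITIVE_PARAMS are hypothetical names and the parameter list is just the two from the config above:

```typescript
// Hypothetical list: only the two parameters scrubbed in the config above.
const SENSITIVE_PARAMS = ["token", "api_key"];

// Returns the query string with every sensitive parameter removed.
function scrubQueryString(queryString: string): string {
  const params = new URLSearchParams(queryString);
  for (const name of SENSITIVE_PARAMS) {
    params.delete(name);
  }
  return params.toString();
}

console.log(scrubQueryString("page=2&token=abc123&api_key=xyz")); // page=2
```

Matching is case-sensitive, so a parameter spelled Token would pass through. Extend the list to whatever your API actually accepts.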

Express Error Handler With Sentry#

typescript

// src/middleware/error-handler.ts
import * as Sentry from "@sentry/node";
import { getLogger } from "../lib/async-context";
import type { Request, Response, NextFunction } from "express";
 
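// Sentry.Handlers is the @sentry/node v7 API. v8 and later remove it in favor of
// Sentry.setupExpressErrorHandler(app) and automatic instrumentation.
// The Sentry request handler must be registered first
export const sentryRequestHandler = Sentry.Handlers.requestHandler();

// Sentry tracing handler
export const sentryTracingHandler = Sentry.Handlers.tracingHandler();

// The custom error handler goes last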
export function errorHandler(
  err: Error,
  req: Request,
  res: Response,
  _next: NextFunction
) {
  const log = getLogger();
 
  // Attach custom context to the Sentry event
  Sentry.withScope((scope) => {
    scope.setTag("route", req.route?.path || req.path);
    scope.setTag("method", req.method);
 
    // Assumes your auth middleware populates (and types) req.user
    if (req.user) {
      scope.setUser({
        id: req.user.id,
        // Do not send email or username to Sentry
      });
    }
 
    // Add a breadcrumb for debugging
    scope.addBreadcrumb({
      category: "request",
      message: `${req.method} ${req.path}`,
      level: "info",
      data: {
        query: req.query,
        statusCode: res.statusCode,
      },
    });
 
    Sentry.captureException(err);
  });
 
  // Log the error with full context
  log.error(
    {
      err,
      statusCode: 500,
      route: req.route?.path || req.path,
      method: req.method,
    },
    "Unhandled error in request handler"
  );
 
  // Send a generic error response
  // Never expose error details to the client in production
  res.status(500).json({
    error: "Internal Server Error",
    ...(process.env.NODE_ENV !== "production" && {
      message: err.message,
      stack: err.stack,
    }),
  });
}

Source Maps#

Without source maps, Sentry shows minified or transpiled stack traces, which are useless. Upload source maps during the build:

bash

# In the CI/CD pipeline
npx @sentry/cli sourcemaps upload \
  --org your-org \
  --project your-project \
  --release "$APP_VERSION" \
  ./dist

Or configure it in the bundler:

typescript

// vite.config.ts (or equivalent)
import { defineConfig } from "vite";
import { sentryVitePlugin } from "@sentry/vite-plugin";

export default defineConfig({
  build: {
    sourcemap: true, // Required for Sentry
  },
  plugins: [
    sentryVitePlugin({
      org: process.env.SENTRY_ORG,
      project: process.env.SENTRY_PROJECT,
      authToken: process.env.SENTRY_AUTH_TOKEN,
    }),
  ],
});

The Cost of Unhandled Promise Rejections#

Since Node.js 15, an unhandled promise rejection crashes the process by default. This is good: it forces you to handle errors. But you still need a safety net:

typescript

// src/server.ts, near the top of the entry point

process.on("unhandledRejection", (reason) => {
  // Log under `err` so the logger's error serializer captures the stack
  logger.fatal({ err: reason }, "Unhandled promise rejection, crashing");
  Sentry.captureException(reason);

  // Flush pending Sentry events before exiting
  Sentry.flush(2000).finally(() => {
    process.exit(1);
  });
});
 
process.on("uncaughtException", (error) => {
  logger.fatal({ err: error }, "Uncaught exception — crashing");
  Sentry.captureException(error);
 
  Sentry.flush(2000).finally(() => {
    process.exit(1);
  });
});

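The important part: Sentry.flush() before process.exit(). Without it, the error event may not reach Sentry before the process dies. Registering an unhandledRejection handler also replaces Node's default crash behavior, which is why the handler calls process.exit(1) itself.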

Alerting: The Alerts That Actually Matter#

Having 200 Prometheus metrics and zero alerts is just vanity monitoring. Having 50 alerts that fire every day is alert fatigue: you start ignoring them, and then you miss the one that matters.

The goal is a small number of high-signal alerts that mean "something is really wrong and a human needs to look."

Prometheus AlertManager Configuration#

yaml

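# alertmanager.yml
# Alertmanager does not expand environment variables in this file.
# Render the $VARS below with a templating step (for example envsubst)
# before starting Alertmanager.
global:
  resolve_timeout: 5m
  slack_api_url: $SLACK_WEBHOOK_URL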
 
route:
  receiver: "slack-warnings"
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: "slack-warnings"
 
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: $PAGERDUTY_ROUTING_KEY
        severity: critical
  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts"
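        # Assumes custom templates named "slack.title" and "slack.text"
        # are loaded through a top-level `templates:` entry
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'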

The Alerts That Actually Wake Me Up#

yaml

# prometheus/rules/node-api.yml
groups:
  - name: node-api-critical
    rules:
      # High error rate: something is broken
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(nodeapp_http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(nodeapp_http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          description: "{{ $value | humanizePercentage }} of requests are returning 5xx"

      # Slow responses: users are suffering
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(nodeapp_http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
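          summary: "p99 latency above 1 second for 5 minutes"
          description: "p99 latency is {{ $value | humanizeDuration }}"

      # Memory leak: likely to OOM soon.
      # heap_size_total is the heap V8 has currently allocated, not the
      # maximum heap limit, so tune this threshold against your own baseline.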
      - alert: HighHeapUsage
        expr: |
          (
            nodeapp_nodejs_heap_size_used_bytes
            /
            nodeapp_nodejs_heap_size_total_bytes
          ) > 0.80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Heap usage above 80% for 10 minutes"
          description: "Heap usage is at {{ $value | humanizePercentage }}"
 
      # Process down
      - alert: ServiceDown
        expr: up{job="node-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node.js API is down"

  - name: node-api-warnings
    rules:
      # Event loop is getting slow.
      # prom-client's default metrics expose p99 lag as its own gauge,
      # not as a `quantile` label.
      - alert: HighEventLoopLag
        expr: |
          nodeapp_nodejs_eventloop_lag_p99_seconds > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Event loop lag above 100ms"
          description: "p99 event loop lag is {{ $value | humanizeDuration }}"

      # Traffic dropped significantly: possibly a routing problem
      - alert: TrafficDrop
        expr: |
          sum(rate(nodeapp_http_requests_total[5m]))
          < (sum(rate(nodeapp_http_requests_total[5m] offset 1h)) * 0.5)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped more than 50% compared to 1 hour ago"

      # Database queries are getting slow
      - alert: SlowDatabaseQueries
        expr: |
          histogram_quantile(0.99,
            sum(rate(nodeapp_db_query_duration_seconds_bucket[5m])) by (le, operation)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 database query time above 500ms"
          description: "{{ $labels.operation }} queries are slow: {{ $value | humanizeDuration }}"

      # External API is failing
      - alert: ExternalAPIFailures
        expr: |
          (
            sum(rate(nodeapp_external_api_duration_seconds_count{status="error"}[5m])) by (service)
            /
            sum(rate(nodeapp_external_api_duration_seconds_count[5m])) by (service)
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of calls to external API {{ $labels.service }} are failing"

Note the for clause on every alert. Without it, a single spike triggers the alert. A for of 5 minutes means the condition must hold continuously for 5 minutes. This filters out noise from momentary blips.
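
To make the for semantics concrete, here is a simplified model of how a rule moves from pending to firing. It is a sketch of the idea, not Prometheus's implementation. firstFiringTime and the Sample shape are hypothetical, and samples are assumed to be sorted by time:

```typescript
interface Sample {
  timestampSeconds: number; // when the rule was evaluated
  value: number;            // e.g. the error ratio at that moment
}

// Returns the evaluation time at which the alert would start firing,
// or null if the condition never holds for `forSeconds` in a row.
function firstFiringTime(
  samples: Sample[],
  threshold: number,
  forSeconds: number
): number | null {
  let pendingSince: number | null = null;

  for (const sample of samples) {
    if (sample.value > threshold) {
      // Condition holds: start (or continue) the pending period.
      if (pendingSince === null) pendingSince = sample.timestampSeconds;
      if (sample.timestampSeconds - pendingSince >= forSeconds) {
        return sample.timestampSeconds;
      }
    } else {
      // Any evaluation below the threshold resets the pending timer.
      pendingSince = null;
    }
  }
  return null;
}

const evalTimes = [0, 60, 120, 180, 240, 300];
const sustained = evalTimes.map((t) => ({ timestampSeconds: t, value: 0.05 }));
const withBlip = evalTimes.map((t) => ({
  timestampSeconds: t,
  value: t === 120 ? 0.0 : 0.05, // one healthy evaluation in the middle
}));

console.log(firstFiringTime(sustained, 0.01, 300)); // 300
console.log(firstFiringTime(withBlip, 0.01, 300));  // null
```

The second series spends most of the window above the threshold, but the single dip resets the timer, which is exactly the noise filtering the for clause is there for.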

Alert Hygiene#

Every alert has to pass this test:

Is it actionable? If nobody can do anything about it, don't alert. Log it, put it on a dashboard, but don't wake anyone up.
Does it need human intervention? If it self-heals (like a brief network blip), the for clause should filter it out.
Has it fired in the last 30 days? If not, it may be misconfigured or have the wrong threshold. Review it.
When it fires, do people care? If the team routinely dismisses it, delete it or fix the threshold.

I audit alerts every quarter. Each alert gets one of three outcomes: keep, adjust the threshold, or delete.

Putting It All Together: The Express Application#

Here is how all the pieces fit together in a real application:

typescript

// src/server.ts
import { initSentry } from "./lib/sentry";
 
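// Initialize Sentry first, before the other imports are evaluated.
// This ordering holds when compiling to CommonJS. Under native ES modules
// imports are hoisted, so preload the init file with `node --import` instead.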
initSentry();
 
import express from "express";
import * as Sentry from "@sentry/node";
import { Pool } from "pg";
import Redis from "ioredis";
import { logger } from "./lib/logger";
import { asyncContextMiddleware } from "./middleware/async-context-middleware";
import { metricsMiddleware } from "./middleware/metrics-middleware";
import { requestLogger } from "./middleware/request-logger";
import {
  sentryRequestHandler,
  sentryTracingHandler,
  errorHandler,
} from "./middleware/error-handler";
import { createHealthRoutes } from "./routes/health";
import metricsRouter from "./routes/metrics";
import apiRouter from "./routes/api";
 
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);
 
// --- Middleware Order Matters ---

// 1. Sentry request handler (must come first)
app.use(sentryRequestHandler);
app.use(sentryTracingHandler);

// 2. Async context (creates the request-scoped context)
app.use(asyncContextMiddleware);

// 3. Request logging
app.use(requestLogger);

// 4. Metrics collection
app.use(metricsMiddleware);

// 5. Body parsing
app.use(express.json({ limit: "1mb" }));

// --- Routes ---

// Health checks (no auth required)
app.use(createHealthRoutes(pool, redis));

// Metrics (protected by basic auth)
app.use(metricsRouter);

// API routes
app.use("/api", apiRouter);

// --- Error Handling ---

// Sentry error handler (must come before the custom error handler)
app.use(Sentry.Handlers.errorHandler());

// Custom error handler (must be last)
app.use(errorHandler);

// --- Startup ---
 
const port = parseInt(process.env.PORT || "3000", 10);
 
const server = app.listen(port, () => {
  logger.info(
    {
      port,
      nodeEnv: process.env.NODE_ENV,
      version: process.env.APP_VERSION,
    },
    "Server started"
  );
});

// Graceful shutdown
async function shutdown(signal: string) {
  logger.info({ signal }, "Shutdown signal received");

  // Stop accepting new connections and wait for in-flight requests.
  // Express does not do this on its own; the HTTP server has to be closed.
  await new Promise<void>((resolve) => {
    server.close((err) => {
      if (err) logger.error({ err }, "Error closing HTTP server");
      resolve();
    });
  });

  // Close the database pool
  await pool.end().catch((err) => {
    logger.error({ err }, "Error closing database pool");
  });

  // Close the Redis connection
  await redis.quit().catch((err) => {
    logger.error({ err }, "Error closing Redis connection");
  });

  // Flush Sentry
  await Sentry.close(2000);
 
  logger.info("Shutdown complete");
  process.exit(0);
}
 
process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));
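
Closing an HTTP server can wait a long time on a connection that never finishes, so a shutdown path usually wants an upper bound. A minimal sketch using only Node's built-in http module, where closeServer and the timeout value are hypothetical and not part of the application above:

```typescript
import { createServer, Server } from "node:http";

// Stand-in for the server returned by app.listen() in a real application.
const demoServer: Server = createServer((_req, res) => {
  res.end("ok");
});

// Stop accepting new connections and wait for in-flight requests,
// but give up after `timeoutMs` so a stuck connection cannot block exit.
function closeServer(
  srv: Server,
  timeoutMs: number
): Promise<"drained" | "timeout"> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    srv.close(() => {
      clearTimeout(timer);
      resolve("drained");
    });
  });
}

// Usage inside a signal handler (sketch):
//   const outcome = await closeServer(demoServer, 10_000);
//   then close the pool, Redis and Sentry as above before exiting.
```

On a "timeout" outcome the remaining connections are still open, so the caller has to decide whether to exit anyway. For a service behind a load balancer that has already stopped routing traffic, exiting is usually acceptable.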

Minimum Viable Stack#

Everything above is the "full" stack. You don't need all of it from day one. Here is how to scale observability as the project grows.

Stage 1: Personal Project / Solo Developer#

You need three things:

Structured console logs: use Pino and output JSON to stdout. Even if you only read them with pm2 logs, JSON logs are searchable and parseable.
A /health endpoint: takes 5 minutes to implement and saves you when debugging "is it even running?"
Sentry free tier: catches the errors you didn't anticipate. The free tier gives you 5,000 events/month, more than enough for a personal project.

typescript

// This is the minimal setup. Under 50 lines. No excuses.
import pino from "pino";
import express from "express";
import * as Sentry from "@sentry/node";
import apiRoutes from "./routes/api"; // your own router module
 
const logger = pino({ level: "info" });
const app = express();
 
Sentry.init({ dsn: process.env.SENTRY_DSN });
app.use(Sentry.Handlers.requestHandler());
 
app.get("/health", (_req, res) => {
  res.json({ status: "ok", uptime: process.uptime() });
});
 
app.use("/api", apiRoutes);
 
app.use(Sentry.Handlers.errorHandler());
app.use((err: Error, _req: express.Request, res: express.Response, _next: express.NextFunction) => {
  logger.error({ err }, "Unhandled error");
  res.status(500).json({ error: "Internal Server Error" });
});
 
app.listen(3000, () => logger.info("Server started on port 3000"));

Stage 2: Growing Project / Small Team#

Add:

Prometheus metrics + Grafana: for when "it feels slow" is not enough and you need data. Start with request rate, error rate, and p99 latency.
Log aggregation: for when SSH-ing into a server and grepping through files no longer scales. Loki + Promtail if you already use Grafana.
Basic alerts: error rate > 1%, p99 > 1s, service down. Three alerts. That's it.

Stage 3: Production Service / Multiple Services#

Add:

Distributed tracing with OpenTelemetry: for when "the API is slow" becomes "of the 5 services it calls, which one is slow?" OTel auto-instrumentation gives you 80% of the value with zero code changes.
Dashboards as code: version-control your Grafana dashboards. You will thank yourself when you need to recreate them.
Structured alerting: AlertManager with proper routing, escalation, and silence rules.
Business metrics: orders/second, conversion rate, queue depth. The metrics the product team cares about.

What to Skip#

APM vendors with per-host pricing: at scale, the cost is insane. Open source (Prometheus + Grafana + Tempo + Loki) gives you 95% of the functionality.
Log levels below INFO in production: you will generate terabytes of DEBUG logs and pay to store them. Use DEBUG only while actively investigating a problem, then turn it off.
Custom metrics for everything: start with the RED method (Rate, Errors, Duration) for each service. Add custom metrics only when you have a specific question to answer.
Complex trace sampling: start with a simple sample rate (10% in production). Adaptive sampling is premature optimization for most teams.
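
As a rough illustration of what RED boils down to, the sketch below computes the three numbers from a batch of request records. In practice Prometheus derives them from counters and histograms. The RequestRecord shape and summarizeRed are hypothetical, and the percentile uses the simple nearest-rank method:

```typescript
interface RequestRecord {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate
  errorRatio: number;    // Errors, as a fraction of all requests
  p99Ms: number;         // Duration
}

// Nearest-rank percentile over a sorted copy of the input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function summarizeRed(records: RequestRecord[], windowSeconds: number): RedSummary {
  const errors = records.filter((r) => r.statusCode >= 500).length;
  return {
    ratePerSecond: records.length / windowSeconds,
    errorRatio: records.length === 0 ? 0 : errors / records.length,
    p99Ms: percentile(records.map((r) => r.durationMs), 99),
  };
}

// Example: four requests observed over a 2-second window, one of them a 5xx.
const exampleSummary = summarizeRed(
  [
    { durationMs: 12, statusCode: 200 },
    { durationMs: 40, statusCode: 200 },
    { durationMs: 850, statusCode: 500 },
    { durationMs: 25, statusCode: 201 },
  ],
  2
);
console.log(exampleSummary);
```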

Final Thoughts#

Observability is not a product you buy or a tool you install. It is a practice. It is the difference between operating a service and hoping the service operates itself.

The stack I've described here (Pino for logs, Prometheus for metrics, OpenTelemetry for traces, Sentry for errors, Grafana for visualization, AlertManager for alerts) is not the simplest possible setup. But each piece earns its place by answering a question the others cannot.

Start with structured logs and a health endpoint. Add metrics when you need to know "how bad is it." Add traces when you need to know "where is the time going." Each layer builds on the previous one, and none of them requires you to rewrite your application.

The best time to add observability was before your last production incident. The second best time is now.
