Serverless isn't just Lambda functions: event-driven architectures, fan-out patterns, saga orchestration, cold starts, vendor lock-in, and the cost surprises that hit when your traffic actually grows.
I need to start this post with a confession. In 2024, I convinced my team to go "fully serverless" for a new product. No containers. No VMs. Pure Lambda, DynamoDB, SQS, Step Functions — the works. The architecture diagrams looked beautiful. The pitch to leadership was compelling: "We only pay for what we use. It scales to zero. It scales to infinity."
Eighteen months later, our monthly bill stabilized at $20,400. The equivalent workload on three well-provisioned VPS instances would have cost roughly $600. We were processing about 2 million requests per day, which sounds like a lot until you realize that's about 23 requests per second on average — a workload any $200/month server handles while yawning.
This post is everything I learned from that experience and the three serverless projects that followed it. Some of those projects genuinely benefited from serverless. The first one didn't. The difference was understanding the patterns, the anti-patterns, and most critically, the cost model.
Before we get into patterns, let's fix the mental model. When most developers hear "serverless," they think "functions as a service" — write a function, deploy it, it runs when triggered. That's accurate but incomplete, like saying a car is "an engine with seats."
Serverless is a deployment and execution model where:
- you don't provision, patch, or manage servers;
- capacity scales automatically with demand, down to zero;
- you pay per request and per unit of execution time, not for idle capacity;
- execution environments are ephemeral, so your functions must be stateless.
That last point is the one people underestimate. Everything you know about in-process caching, connection pooling, background threads, and local file storage needs to be rethought. Your function is a stateless transformer: input comes in, output goes out, anything you want to remember lives somewhere else.
Traditional Server Mental Model:
[Long-lived process] -> in-memory cache
-> connection pool
-> local filesystem
-> background workers
-> cron scheduling
Serverless Mental Model:
[Short-lived function] -> external cache (Redis/ElastiCache)
-> per-invocation connections (or RDS Proxy)
-> S3 / EFS
-> SQS + separate functions
-> EventBridge Scheduler
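One wrinkle the diagram hides: execution environments are reused across "warm" invocations, so module scope is where expensive initialization belongs. A minimal sketch; createDbClient is a hypothetical stand-in for any costly setup (an SDK client, parsed config, a connection through RDS Proxy):

```typescript
// Module scope runs once per cold start; every warm invocation that
// lands on the same environment reuses whatever lives here.
// createDbClient is a hypothetical stand-in for expensive setup.
let initCount = 0;

function createDbClient() {
  initCount++; // counts how often the expensive path actually runs
  return {
    query: async (sql: string): Promise<unknown[]> => [],
  };
}

const db = createDbClient(); // paid once, on cold start

export const handler = async (event: { userId: string }) => {
  // Per-invocation work reuses the module-scope client.
  const rows = await db.query(`SELECT * FROM users WHERE id = '${event.userId}'`);
  return { rows, initCount };
};
```

This is why the SDK clients in the examples below sit at module scope rather than inside the handler.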
This shift isn't just architectural — it changes how you think about every problem. And that's where the patterns come in.
This is the foundational serverless pattern, and it's genuinely powerful when applied correctly. Instead of services calling each other synchronously, they emit events. Other services react to those events. Nobody waits for anybody.
Here's what a synchronous order flow looks like:
Synchronous (traditional):
Client -> API -> Validate Order
-> Charge Payment
-> Update Inventory
-> Send Confirmation Email
-> Return Response
Total latency: sum of all steps (800-2000ms)
If email service is down: entire order fails
And the event-driven equivalent:
Event-Driven (serverless):
Client -> API Gateway -> Lambda (validate + save order)
-> emit "OrderCreated" event
-> Return 202 Accepted (50ms)
EventBridge picks up "OrderCreated":
-> Lambda: charge payment -> emit "PaymentProcessed"
-> Lambda: reserve inventory -> emit "InventoryReserved"
-> Lambda: send confirmation email
Each step is independent. Email failure doesn't block payment.
The implementation with EventBridge looks like this:
// order-handler.ts — the entry point
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { APIGatewayProxyEventV2 } from "aws-lambda";
import { ulid } from "ulid";
const eb = new EventBridgeClient({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event: APIGatewayProxyEventV2) => {
const order = JSON.parse(event.body || "{}");
// Validate
const errors = validateOrder(order);
if (errors.length > 0) {
return { statusCode: 400, body: JSON.stringify({ errors }) };
}
// Save to DynamoDB
const orderId = ulid();
await ddb.send(new PutCommand({
TableName: "Orders",
Item: {
PK: `ORDER#${orderId}`,
SK: "METADATA",
...order,
status: "PENDING",
createdAt: Date.now(),
},
}));
// Emit event — this is the key
await eb.send(new PutEventsCommand({
Entries: [{
Source: "orders.service",
DetailType: "OrderCreated",
Detail: JSON.stringify({ orderId, ...order }),
EventBusName: "main-bus",
}],
}));
return {
statusCode: 202,
body: JSON.stringify({ orderId, status: "PENDING" }),
};
};

The beauty is in the decoupling. The order handler has no idea what happens after it emits that event. Payment processing, inventory management, email notifications — they're all separate Lambda functions triggered by EventBridge rules. You can add a new reaction (say, updating an analytics dashboard) without touching the order handler.
But here's where the honest part comes in: debugging this is a nightmare. When a customer says "I placed an order but never got a confirmation email," you need to trace across multiple Lambda invocations, check EventBridge delivery logs, look at the email Lambda's CloudWatch logs, and hope you logged enough correlation IDs. We'll get to observability later, but know this: the operational complexity of event-driven architectures is real and substantial.
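For completeness, here is what one downstream consumer might look like. This is a hypothetical sketch, not production code: the envelope type is defined inline so the example stands alone (the real aws-lambda types ship an EventBridgeEvent generic for this):

```typescript
// payment-handler.ts (hypothetical) consumes the "OrderCreated" events
// emitted above. The envelope type below mirrors the EventBridge event
// shape; a real project would import EventBridgeEvent from "aws-lambda".
interface OrderCreatedEvent {
  source: string;          // "orders.service"
  "detail-type": string;   // "OrderCreated"
  detail: { orderId: string; total?: number };
}

export const handler = async (event: OrderCreatedEvent) => {
  const { orderId } = event.detail;
  // Log the correlation point first: this is what you grep for when
  // tracing one order across several functions.
  console.log(JSON.stringify({ msg: "charging payment", orderId }));
  // ... call the payment provider here, then emit "PaymentProcessed" ...
  return { orderId, status: "CHARGED" as const };
};
```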
This pattern is where serverless genuinely earns its keep. You have a job that can be parallelized — processing a large file, running computations across multiple data sets, generating reports from different sources. Instead of processing sequentially, you fan out to many concurrent Lambda invocations and fan back in when they're all done.
Fan-Out / Fan-In:
-> Lambda (chunk 1) --\
/-> Lambda (chunk 2) ---\
S3 Upload -> Lambda --+--> Lambda (chunk 3) ----+--> Lambda (aggregate)
(splitter) \-> Lambda (chunk 4) ---/ |
-> Lambda (chunk 5) --/ Final Result -> S3
Processing 5GB CSV file:
Single server: 45 minutes
Fan-out to 100 Lambdas: 27 seconds
I used this pattern to process user analytics data. Every night, a 3GB file lands in S3. A "splitter" Lambda reads the file, chunks it into pieces, and drops each chunk into an SQS queue. A fleet of Lambda workers processes the chunks concurrently. When they're all done, a final Lambda aggregates the results.
// splitter.ts — triggered by S3 event
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { S3Event } from "aws-lambda";
const s3 = new S3Client({});
const sqs = new SQSClient({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event: S3Event) => {
const bucket = event.Records[0].s3.bucket.name;
const key = event.Records[0].s3.object.key;
// Stream the file to count lines and determine chunks
const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
const content = await obj.Body?.transformToString();
const lines = content?.split("\n").filter(Boolean) || [];
const CHUNK_SIZE = 10_000;
const chunks = Math.ceil(lines.length / CHUNK_SIZE);
const jobId = `job-${Date.now()}`;
// Track total chunks for the aggregator
await ddb.send(new PutCommand({
TableName: "Jobs",
Item: {
PK: `JOB#${jobId}`,
totalChunks: chunks,
completedChunks: 0,
status: "PROCESSING",
},
}));
// Fan out — send each chunk reference to SQS
for (let i = 0; i < chunks; i++) {
const batch = lines.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
// In practice, write chunks to S3 and send references
await sqs.send(new SendMessageBatchCommand({
QueueUrl: process.env.QUEUE_URL,
Entries: [{
Id: `chunk-${i}`,
MessageBody: JSON.stringify({
jobId,
chunkIndex: i,
bucket,
chunkKey: `chunks/${jobId}/chunk-${i}.json`,
}),
}],
}));
}
};

// worker.ts — triggered by SQS
import { SQSEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const eb = new EventBridgeClient({});
// getChunkFromS3 and processAnalytics are app-specific helpers defined elsewhere
export const handler = async (event: SQSEvent) => {
for (const record of event.Records) {
const { jobId, chunkIndex, bucket, chunkKey } = JSON.parse(record.body);
// Process the chunk
const data = await getChunkFromS3(bucket, chunkKey);
const result = processAnalytics(data);
// Write partial result
await ddb.send(new PutCommand({
TableName: "Results",
Item: {
PK: `JOB#${jobId}`,
SK: `CHUNK#${chunkIndex}`,
result,
},
}));
// Atomically increment completed count
const updated = await ddb.send(new UpdateCommand({
TableName: "Jobs",
Key: { PK: `JOB#${jobId}` },
UpdateExpression: "SET completedChunks = completedChunks + :one",
ExpressionAttributeValues: { ":one": 1 },
ReturnValues: "ALL_NEW",
}));
// Check if all chunks are done — trigger aggregation
if (updated.Attributes?.completedChunks === updated.Attributes?.totalChunks) {
await eb.send(new PutEventsCommand({
Entries: [{
Source: "analytics.processor",
DetailType: "AllChunksComplete",
Detail: JSON.stringify({ jobId }),
}],
}));
}
}
};

The atomic counter on DynamoDB is the trick here — it's how you know when all workers have finished without needing a central coordinator polling for completion. This is genuinely elegant, and hard to replicate cleanly with traditional infrastructure.
The cost for this specific workload: about $4.50 per nightly run. Running a server 24/7 just to process one file each night would cost more. This is the serverless sweet spot — bursty, parallelizable workloads with long idle periods.
This is where serverless architecture gets seriously complicated, and where I've seen the most teams get burned. In a monolith, if you need to charge a payment and update inventory atomically, you wrap it in a database transaction. Done. In a distributed serverless system, there's no such thing as a distributed transaction (and if someone tries to sell you one, run).
The saga pattern is the alternative. Instead of one atomic transaction, you execute a series of local transactions, each with a compensating action that can undo it if a later step fails.
Saga: Book a Trip
Step 1: Reserve Flight -> Compensate: Cancel Flight Reservation
Step 2: Reserve Hotel -> Compensate: Cancel Hotel Reservation
Step 3: Charge Credit Card -> Compensate: Refund Credit Card
Step 4: Send Confirmation -> (no compensation needed)
If Step 3 fails:
-> Execute compensation for Step 2 (cancel hotel)
-> Execute compensation for Step 1 (cancel flight)
-> Notify user of failure
AWS Step Functions is the managed way to implement this. The state machine definition looks like this:
{
"Comment": "Trip Booking Saga",
"StartAt": "ReserveFlight",
"States": {
"ReserveFlight": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:reserve-flight",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "FlightReservationFailed"
}],
"Next": "ReserveHotel"
},
"ReserveHotel": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:reserve-hotel",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "CancelFlightReservation"
}],
"Next": "ChargePayment"
},
"ChargePayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:charge-payment",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "CancelHotelReservation"
}],
"Next": "SendConfirmation"
},
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:send-confirmation",
"End": true
},
"CancelHotelReservation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:cancel-hotel",
"Next": "CancelFlightReservation"
},
"CancelFlightReservation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:cancel-flight",
"Next": "NotifyBookingFailed"
},
"FlightReservationFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
"End": true
},
"NotifyBookingFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
"End": true
}
}
}

This looks clean in a diagram. In practice, it's where the pain lives. What happens if the compensation itself fails? What if CancelHotelReservation times out? Now you have an inconsistent state: the flight was canceled but the hotel reservation is stuck. You need retry logic on compensations, dead-letter queues for failed compensations, and monitoring that alerts you when a saga is stuck in an intermediate state.
I've implemented three saga workflows in production. Every single one required manual intervention within the first month — not because the logic was wrong, but because third-party APIs are unreliable. The hotel booking API returned a 500 during cancellation. The payment refund endpoint had a different timeout than the charge endpoint. Real systems are messy, and the saga pattern doesn't hide that messiness — it just gives you a structured way to deal with it.
My honest recommendation: if you can restructure your domain to avoid distributed transactions entirely, do that. Use the saga pattern as a last resort, not a first choice.
Step Functions go beyond sagas. They're a general-purpose workflow orchestration service, and they're genuinely useful for complex multi-step processes. The two flavors matter:
Standard Workflows: Exactly-once execution, up to one year duration, priced per state transition ($0.025 per 1,000 transitions). Good for long-running processes where you need guaranteed completion.
Express Workflows: At-least-once execution, up to 5 minutes, priced per invocation and duration. Good for high-volume, short-duration workflows.
Here's a pattern I use frequently — a document processing pipeline:
Document Processing Pipeline (Step Functions Standard):
Start
|
v
[Extract Text] -- Lambda: OCR / textract
|
v
[Classify Document] -- Lambda: ML classification
|
v
<Is Sensitive?> -- Choice State
/ \
Yes No
| |
v v
[Redact PII] [Skip Redaction]
| |
v v
[Store in S3]
|
v
[Update Database]
|
v
[Notify Subscribers]
|
v
End
The Step Functions definition handles retries, timeouts, and error handling declaratively:
// CDK definition — much cleaner than raw JSON
const extractText = new tasks.LambdaInvoke(this, "ExtractText", {
lambdaFunction: extractFn,
retryOnServiceExceptions: true,
resultPath: "$.extractResult",
});
const classifyDoc = new tasks.LambdaInvoke(this, "ClassifyDocument", {
lambdaFunction: classifyFn,
retryOnServiceExceptions: true,
resultPath: "$.classification",
});
const isSensitive = new sfn.Choice(this, "IsSensitive?")
.when(
sfn.Condition.stringEquals("$.classification.Payload.type", "SENSITIVE"),
redactPII
)
.otherwise(skipRedaction);
const pipeline = extractText
.next(classifyDoc)
.next(isSensitive);
// Both paths converge
redactPII.next(storeInS3);
skipRedaction.next(storeInS3);
storeInS3.next(updateDB).next(notifySubscribers);
new sfn.StateMachine(this, "DocProcessor", {
definition: pipeline,
timeout: Duration.hours(1),
});

The cost trap with Step Functions Standard: each state transition costs money. A workflow with 10 states processing 100,000 documents per month is 1,000,000 transitions = $25. Not bad. But add error handling, retries, and parallel branches, and a single execution might have 40-50 transitions. Now you're at $125/month. Still reasonable, but it creeps up fast.
Express Workflows are better for high-volume scenarios. I switched a webhook processing pipeline from Standard to Express and the cost dropped from $340/month to $18/month. The trade-off is at-least-once semantics — your Lambda functions need to be idempotent.
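Idempotency deserves a concrete shape. The usual trick is to derive a stable key from the message and refuse to process it twice. Here's a sketch with the store behind an interface; in production the claim would be a DynamoDB PutItem with a ConditionExpression of attribute_not_exists(PK), but an in-memory store keeps the pattern visible on its own:

```typescript
// Hypothetical idempotency guard for at-least-once delivery.
interface IdempotencyStore {
  // Returns true if the key was newly claimed, false if already seen.
  claim(key: string): Promise<boolean>;
}

// In-memory stand-in; production would back this with a DynamoDB
// conditional write so the claim is atomic across concurrent Lambdas.
class InMemoryStore implements IdempotencyStore {
  private seen = new Set<string>();
  async claim(key: string): Promise<boolean> {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

export async function handleWebhook(
  payload: { deliveryId: string; body: string },
  store: IdempotencyStore
): Promise<{ status: "PROCESSED" | "DUPLICATE_IGNORED" }> {
  // At-least-once delivery means the same deliveryId can arrive twice.
  const fresh = await store.claim(`webhook#${payload.deliveryId}`);
  if (!fresh) {
    return { status: "DUPLICATE_IGNORED" };
  }
  // ... do the real work exactly once here ...
  return { status: "PROCESSED" };
}
```

The key design decision is what goes into the key: a provider-supplied delivery ID is ideal; a hash of the payload is the fallback.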
Let's talk numbers, because cold start discussions without numbers are useless.
I ran systematic cold start benchmarks across different runtimes and memory configurations in late 2025. Here's what I measured:
Cold Start Latency (p50 / p99):
Node.js 20 (128MB): 320ms / 890ms
Node.js 20 (512MB): 180ms / 410ms
Node.js 20 (1024MB): 120ms / 280ms
Node.js 20 (1769MB): 95ms / 210ms <- 1 full vCPU
Python 3.12 (128MB): 380ms / 1100ms
Python 3.12 (512MB): 210ms / 520ms
Python 3.12 (1024MB): 140ms / 340ms
Java 21 (512MB): 3200ms / 8400ms <- yes, seconds
Java 21 (1024MB): 1800ms / 4200ms
Java 21 (SnapStart): 280ms / 620ms <- SnapStart is essential
.NET 8 (512MB): 680ms / 1800ms
.NET 8 (1024MB): 420ms / 980ms
Rust (custom runtime): 12ms / 28ms <- not a typo
Go (provided.al2023): 18ms / 45ms
The Node.js numbers are the most relevant for web developers. At 512MB with a typical Express-on-Lambda setup (using a framework adapter), cold starts average 180ms. That's perceptible but not terrible. But if you're importing heavy SDKs — the full AWS SDK v3 with DynamoDB, S3, SES, and SQS clients — that jumps to 400-600ms because the module graph is massive.
Here's what actually helps with cold starts, ranked by impact:
1. Bundle size (highest impact). Use esbuild to tree-shake and bundle your Lambda code. A typical Express app with full dependencies: 45MB. After esbuild bundling: 2-4MB. Cold start improvement: 40-60%.
// esbuild config for Lambda
import { build } from "esbuild";
await build({
entryPoints: ["src/handlers/order.ts"],
bundle: true,
minify: true,
platform: "node",
target: "node20",
outfile: "dist/order/index.js",
external: ["@aws-sdk/*"], // AWS SDK v3 is included in the Lambda runtime
treeShaking: true,
});

2. Memory allocation. Lambda allocates CPU proportional to memory. At 128MB you get a fraction of a vCPU. At 1769MB you get a full vCPU. The sweet spot for Node.js is 512MB-1024MB. Going higher doesn't help cold starts much but costs more per ms.
3. Lazy initialization. Don't initialize everything at module load time. If a handler only sometimes needs the S3 client, create it on first use:
// Bad: always initializes both clients
const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event) => {
// Most invocations only use DynamoDB
// but S3 client was initialized regardless
};
// Good: lazy initialization
let s3Client: S3Client | null = null;
const getS3 = () => {
if (!s3Client) s3Client = new S3Client({});
return s3Client;
};
// DynamoDB is always needed, so eagerly initialize it
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event) => {
// S3 only initialized when actually needed
if (event.needsS3) {
const s3 = getS3();
// ...
}
};

4. Provisioned concurrency. This is the "just keep instances warm" solution. It works, but it's expensive. You're essentially paying for servers again — which defeats the "pay per use" promise. At $0.015 per GB-hour for provisioned concurrency, keeping 10 instances warm at 512MB costs about $55/month. That's on top of your invocation costs.
I use provisioned concurrency for exactly two scenarios: user-facing API endpoints where p99 latency matters, and scheduled functions that need to respond within a tight SLA. Everything else just eats the cold start.
5. Keep-alive pinging. The poor man's provisioned concurrency. An EventBridge rule (formerly CloudWatch Events) triggers your function every 5 minutes with a "warm-up" event. The function detects the warm-up and returns immediately. This keeps one instance warm for free.
export const handler = async (event: any) => {
// Warm-up detection
if (event.source === "warmup" || event.detail?.warmup) {
return { statusCode: 200, body: "warm" };
}
// Actual logic
// ...
};

This only keeps one instance warm. If you get concurrent requests, new instances still cold-start. But for low-traffic endpoints, it eliminates 90% of user-visible cold starts.
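The wiring side is short in CDK. A sketch, assuming an existing apiFn Lambda; the rule name is illustrative, and the payload matches the event.source === "warmup" check above:

```typescript
// CDK sketch: an EventBridge rule that pings the function every
// 5 minutes with a warm-up payload (apiFn is assumed to exist).
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import { Duration } from "aws-cdk-lib";

new events.Rule(this, "WarmupRule", {
  schedule: events.Schedule.rate(Duration.minutes(5)),
  targets: [
    new targets.LambdaFunction(apiFn, {
      // Static payload the handler recognizes as a warm-up ping.
      event: events.RuleTargetInput.fromObject({ source: "warmup" }),
    }),
  ],
});
```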
API Gateway is the front door to most serverless APIs, and it comes in two flavors that confuse everyone.
REST API (v1): The original. Full-featured. Request/response transformations, API keys, usage plans, WAF integration, caching, request validation, resource policies. Priced at $3.50 per million requests plus data transfer.
HTTP API (v2): The lightweight version. About 70% cheaper ($1.00 per million), lower latency (typically 5-10ms less), but fewer features. No caching, limited request validation, no usage plans.
For most new projects, start with HTTP API. You can always upgrade to REST API if you need the advanced features.
Here's a pattern I use for all my API Gateway setups — a Lambda authorizer that validates JWTs and attaches user context:
// authorizer.ts
import { APIGatewayRequestAuthorizerEventV2 } from "aws-lambda";
import { jwtVerify, createRemoteJWKSet } from "jose";
// Module scope: the JWKS is fetched lazily and cached across warm invocations
const JWKS = createRemoteJWKSet(new URL("https://auth.example.com/.well-known/jwks.json"));
export const handler = async (event: APIGatewayRequestAuthorizerEventV2) => {
const token = event.headers?.authorization?.replace("Bearer ", "");
if (!token) {
return { isAuthorized: false };
}
try {
const { payload } = await jwtVerify(token, JWKS, {
algorithms: ["RS256"],
issuer: "https://auth.example.com",
});
return {
isAuthorized: true,
context: {
userId: payload.sub,
email: payload.email,
role: payload.role,
// These are available in $context.authorizer in your Lambda
},
};
} catch {
return { isAuthorized: false };
}
};

A critical performance pattern: enable authorizer caching. By default, the authorizer runs on every single request. With caching enabled (TTL up to 3600 seconds), identical tokens reuse the cached authorization result. This can reduce your authorizer invocations by 80-95%.
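In CDK for HTTP APIs, the cache is a property on the authorizer itself. A sketch assuming an existing authorizerFn; depending on your CDK version, the module may be aws-cdk-lib/aws-apigatewayv2-authorizers or the older -alpha package:

```typescript
// CDK sketch: Lambda authorizer with result caching keyed on the
// Authorization header (authorizerFn is assumed to exist).
import { HttpLambdaAuthorizer, HttpLambdaResponseType } from "aws-cdk-lib/aws-apigatewayv2-authorizers";
import { Duration } from "aws-cdk-lib";

const authorizer = new HttpLambdaAuthorizer("JwtAuthorizer", authorizerFn, {
  responseTypes: [HttpLambdaResponseType.SIMPLE],
  identitySource: ["$request.header.Authorization"],
  // The identity source is the cache key: same token, cached result.
  resultsCacheTtl: Duration.minutes(15),
});
```

Keep the TTL shorter than your token revocation window: a cached "authorized" result outlives a revoked token until the TTL expires.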
Another pattern worth knowing: direct integrations. API Gateway can talk to DynamoDB, SQS, Step Functions, and other AWS services directly, without a Lambda function in the middle. For simple CRUD operations, this eliminates a Lambda invocation entirely:
Without direct integration:
Client -> API Gateway -> Lambda -> DynamoDB
Cost: API GW + Lambda invocation + DynamoDB read
Latency: ~50-80ms
With direct integration:
Client -> API Gateway -> DynamoDB (direct)
Cost: API GW + DynamoDB read
Latency: ~15-25ms
The trade-off is that the VTL (Velocity Template Language) mapping templates are horrific to write and debug. For anything beyond simple get/put operations, the Lambda is worth the extra cost and latency just for the developer experience.
DynamoDB is the database of serverless. It scales automatically, has single-digit millisecond latency, and integrates natively with Lambda through DynamoDB Streams. It's also the most misunderstood database in the AWS ecosystem.
The "single-table design" pattern means putting all your entities in one table, using composite primary keys and GSIs to support different access patterns. This sounds insane if you come from a relational background. It is, kind of. But it's how you get the most out of DynamoDB.
Single-Table Design for an E-Commerce App:
PK SK Data
-------------------------------------------------------------
USER#u123 METADATA {name, email, ...}
USER#u123 ORDER#o456 {total, status, ...}
USER#u123 ORDER#o789 {total, status, ...}
ORDER#o456 METADATA {userId, total, ...}
ORDER#o456 ITEM#i001 {productId, qty, ...}
ORDER#o456 ITEM#i002 {productId, qty, ...}
PRODUCT#p100 METADATA {name, price, ...}
PRODUCT#p100 REVIEW#r001 {userId, rating, ...}
GSI1:
GSI1PK GSI1SK
-------------------------------------------
USER#u123 2025-12-01T10:30:00Z (orders by date)
PRODUCT#p100 5#r001 (reviews by rating)
This design lets you:
- Get a user profile: PK = USER#u123, SK = METADATA
- List a user's orders: PK = USER#u123, SK begins_with ORDER#
- Get an order's line items: PK = ORDER#o456, SK begins_with ITEM#
- List a user's orders by date via GSI1: GSI1PK = USER#u123, sorted by GSI1SK

// Get user profile + recent orders in one query
const userProfile = await ddb.send(new QueryCommand({
TableName: "Main",
KeyConditionExpression: "PK = :pk",
ExpressionAttributeValues: { ":pk": `USER#${userId}` },
}));
// Single query returns both METADATA and ORDER# items
const metadata = userProfile.Items?.find(i => i.SK === "METADATA");
const orders = userProfile.Items?.filter(i => i.SK.startsWith("ORDER#"));

The advantage: one round-trip to DynamoDB gets you everything. No joins, no multiple queries, no connection pooling nightmares. In a Lambda function with a 50ms DynamoDB query, this is critical — you don't have a long-lived connection pool to absorb multiple sequential queries efficiently.
The disadvantage: the schema is painful to evolve. Adding a new access pattern might require a new GSI or even a data migration. And if you get the key design wrong initially, you might need to rebuild the table. I've had to do this twice. Neither time was fun.
My advice: use single-table design for high-traffic, well-understood access patterns. For exploratory features where the queries are evolving, use a separate table (or honestly, just use PostgreSQL through RDS Proxy).
Queues and topics are the connective tissue of serverless architectures. The distinction matters:
SQS (Simple Queue Service): Point-to-point. One message, one consumer. Messages persist until processed. Great for work distribution, rate limiting, and buffering.
SNS (Simple Notification Service): Pub/sub. One message, many subscribers. No persistence — if a subscriber is down, the message is lost (unless it delivers to SQS). Great for event fan-out.
The power pattern is combining them — the "SNS to SQS fan-out":
SNS + SQS Fan-Out:
Producer -> SNS Topic "OrderCreated"
|
+-> SQS Queue: Payment Processing
| -> Lambda: process-payment
|
+-> SQS Queue: Inventory Service
| -> Lambda: update-inventory
|
+-> SQS Queue: Analytics
| -> Lambda: track-analytics
|
+-> SQS Queue: Email Notifications
-> Lambda: send-email
Each queue has its own:
- Retry policy (maxReceiveCount)
- Dead letter queue
- Concurrency limit
- Batch size
This gives you the fan-out of pub/sub with the durability and retry semantics of queues. If the email Lambda is broken and failing, messages stack up in its SQS queue. The payment and inventory services are completely unaffected. When you fix the email Lambda, it processes the backlog automatically.
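A sketch of that wiring in CDK, with illustrative names; the remaining queues and their Lambda subscriptions follow the same shape:

```typescript
// CDK sketch of the SNS -> SQS fan-out above (names are illustrative).
import * as sns from "aws-cdk-lib/aws-sns";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as subs from "aws-cdk-lib/aws-sns-subscriptions";

const orderCreated = new sns.Topic(this, "OrderCreatedTopic");

// Each consumer gets its own queue, retry policy, and DLQ.
const emailDlq = new sqs.Queue(this, "EmailDLQ");
const emailQueue = new sqs.Queue(this, "EmailQueue", {
  deadLetterQueue: { queue: emailDlq, maxReceiveCount: 3 },
});

// rawMessageDelivery strips the SNS envelope so the consuming Lambda
// sees the original message body, not SNS metadata.
orderCreated.addSubscription(new subs.SqsSubscription(emailQueue, {
  rawMessageDelivery: true,
}));
```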
The SQS-to-Lambda integration has important configuration options that people get wrong:
// CDK — SQS -> Lambda event source mapping
new lambda.EventSourceMapping(this, "OrderQueueMapping", {
target: processOrderFn,
eventSourceArn: orderQueue.queueArn,
batchSize: 10, // Process up to 10 messages per invocation
maxBatchingWindow: Duration.seconds(5), // Wait up to 5s to fill batch
maxConcurrency: 50, // Maximum concurrent Lambda invocations
reportBatchItemFailures: true, // Critical: report per-item failures
});

That reportBatchItemFailures flag is essential. Without it, if one message in a batch of 10 fails, all 10 messages return to the queue and get reprocessed. With it, you report which specific messages failed, and only those go back to the queue:
// Handler with batch item failure reporting
import { SQSBatchResponse, SQSEvent } from "aws-lambda";
export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
const failures: { itemIdentifier: string }[] = [];
for (const record of event.Records) {
try {
await processMessage(JSON.parse(record.body));
} catch (error) {
console.error(`Failed to process ${record.messageId}`, error);
failures.push({ itemIdentifier: record.messageId });
}
}
return {
batchItemFailures: failures,
};
};

One lesson learned the hard way: set maxConcurrency on your SQS-to-Lambda mappings. Without it, a sudden spike of messages can trigger hundreds of concurrent Lambda invocations, which can overwhelm your downstream services (especially databases). I once had a backlog of 50,000 messages drain simultaneously and take down a downstream API with 800 concurrent connections. The maxConcurrency setting would have prevented that entirely.
Here's the honest breakdown. I've run these numbers across four production workloads.
Workload 1: Low-traffic API (1,000 requests/day)
Serverless:
API Gateway HTTP API: 1K * 30 = 30K requests/month = $0.03
Lambda (256MB, 100ms avg): 30K invocations = $0.01
DynamoDB (on-demand): ~$0.50
CloudWatch Logs: ~$0.30
Total: ~$0.84/month
VPS (smallest Hetzner):
CX22 (2 vCPU, 4GB RAM): $4.49/month
Total: $4.49/month
Winner: Serverless (5x cheaper)
Workload 2: Medium-traffic API (100,000 requests/day)
Serverless:
API Gateway HTTP API: 3M requests/month = $3.00
Lambda (512MB, 200ms avg): 3M invocations = $1.88
Lambda compute: 3M * 0.2s * 0.5GB = 300K GB-s = $5.00
DynamoDB (on-demand): ~$15.00
CloudWatch Logs: ~$8.00
NAT Gateway (if in VPC): ~$32.00 + data processing
Total: ~$65/month (without NAT) or ~$97/month (with NAT)
VPS:
Hetzner CX42 (4 vCPU, 16GB): $17.49/month
Total: $17.49/month
Winner: VPS (4-6x cheaper)
Workload 3: High-traffic API (2M requests/day)
Serverless:
API Gateway: 60M requests/month = $60.00
Lambda (1024MB, 150ms avg): 60M invocations = $12.00
Lambda compute: 60M * 0.15s * 1GB = 9M GB-s = $150.03
DynamoDB (provisioned): ~$120.00
CloudWatch Logs: ~$45.00
NAT Gateway: ~$95.00
ElastiCache (for caching): ~$50.00
Total: ~$532/month
VPS (3x Hetzner dedicated):
3x AX52 (8 core, 64GB): 3 * $63 = $189/month
Load balancer: $5.49
Total: ~$195/month
Winner: VPS (2.7x cheaper)
Workload 4: Bursty processing (idle 23 hours, massive spike for 1 hour)
Serverless:
Lambda during spike: 500 concurrent, 1 hour = ~$45
Lambda during idle: $0
SQS/EventBridge: ~$5
Total: ~$50/month
VPS (sized for peak):
Need to handle peak concurrency = expensive
Hetzner AX102 (16 core, 128GB): $128/month
Running 24/7 even when idle
Total: $128/month
Winner: Serverless (2.5x cheaper)
The pattern is clear: serverless wins for very low traffic and very bursty workloads. For anything with consistent, moderate-to-high traffic, traditional servers are significantly cheaper. The break-even point is typically around 50,000-100,000 requests per day, depending on your Lambda memory and execution duration.
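You can sanity-check your own break-even with a back-of-envelope function. This sketch uses published list prices as assumptions ($1.00 per million HTTP API requests, $0.20 per million Lambda requests, $0.0000166667 per GB-second) and deliberately ignores DynamoDB, logs, NAT, and the free tier, so it's a floor, not a bill:

```typescript
// Rough monthly compute cost of a Lambda + HTTP API workload.
// Prices are us-east-1 list prices at time of writing; treat as
// assumptions. Ignores DynamoDB, CloudWatch, NAT, and the free tier.
const HTTP_API_PER_REQ = 1.0 / 1_000_000;   // $1.00 per million requests
const LAMBDA_PER_REQ = 0.2 / 1_000_000;     // $0.20 per million requests
const LAMBDA_PER_GB_SECOND = 0.0000166667;  // compute price

function monthlyServerlessCost(
  requestsPerDay: number,
  avgDurationSeconds: number,
  memoryGb: number
): number {
  const requests = requestsPerDay * 30;
  const gbSeconds = requests * avgDurationSeconds * memoryGb;
  return (
    requests * (HTTP_API_PER_REQ + LAMBDA_PER_REQ) +
    gbSeconds * LAMBDA_PER_GB_SECOND
  );
}
```

Plugging in Workload 2 (100K requests/day at 200ms and 512MB) gives roughly $8.60/month of pure request-plus-compute cost; the rest of that workload's bill is DynamoDB, logs, and NAT, which is exactly where the surprises hide.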
The hidden cost that kills budgets: NAT Gateway. If your Lambda functions need to access resources in a VPC (which they do if you're using RDS, ElastiCache, or any private resource), you need a NAT Gateway for outbound internet access. That's $0.045 per hour ($32.40/month minimum) plus $0.045 per GB of data processed. I've seen NAT Gateway costs exceed the Lambda costs themselves.
Let me be direct about this: if you use DynamoDB, Step Functions, EventBridge, SQS, SNS, and API Gateway, you are locked into AWS. Deeply. That's not necessarily wrong, but you should make that decision consciously, not accidentally.
Here's a realistic portability assessment:
Portability Spectrum:
Easy to port:
- Lambda function code (just Node.js/Python/Go code)
- S3 object storage (standard API, many alternatives)
- SQS messaging (replace with RabbitMQ, Redis queues)
Moderate effort:
- API Gateway (replace with Express/Fastify + reverse proxy)
- SNS (replace with Redis pub/sub, NATS)
- CloudWatch Logs (replace with any logging service)
Significant rewrite:
- DynamoDB (unique data model, no direct equivalent)
- Step Functions (rewrite as code-based orchestration)
- EventBridge (custom event bus implementation)
- Cognito (entire auth system replacement)
Practically impossible without redesign:
- DynamoDB Streams -> Lambda triggers
- AppSync + DynamoDB resolvers
- Multi-service IAM-based auth
The mitigation strategy I recommend: use a hexagonal architecture (ports and adapters) for your Lambda functions. Keep your business logic completely free of AWS SDK calls. The Lambda handler is a thin adapter that translates between AWS events and your domain:
// Adapter layer (AWS-specific)
import { APIGatewayProxyEventV2 } from "aws-lambda";
import { createOrder } from "./domain/orders";
import { DynamoDBOrderRepository } from "./infra/dynamodb-order-repo";
import { SQSEventPublisher } from "./infra/sqs-event-publisher";
export const handler = async (event: APIGatewayProxyEventV2) => {
const input = JSON.parse(event.body || "{}");
// Domain logic knows nothing about AWS
const result = await createOrder(input, {
orderRepo: new DynamoDBOrderRepository(),
eventPublisher: new SQSEventPublisher(),
});
return {
statusCode: result.success ? 201 : 400,
body: JSON.stringify(result),
};
};
// Domain layer (portable)
interface OrderRepository {
save(order: Order): Promise<void>;
}
interface EventPublisher {
publish(event: DomainEvent): Promise<void>;
}
export async function createOrder(
input: CreateOrderInput,
deps: { orderRepo: OrderRepository; eventPublisher: EventPublisher }
) {
const order = Order.create(input);
await deps.orderRepo.save(order);
await deps.eventPublisher.publish(new OrderCreatedEvent(order));
return { success: true, orderId: order.id };
}

With this structure, swapping DynamoDB for PostgreSQL means implementing a new PostgresOrderRepository. Swapping SQS for RabbitMQ means implementing a new RabbitMQEventPublisher. Your domain logic — the valuable part — doesn't change.
Do I actually do this for every project? No. For small, clearly temporary services, I write AWS-coupled code because the abstraction isn't worth the effort. But for any service that might live more than a year or process critical business logic, the adapter layer pays for itself.
Debugging a serverless application in production is fundamentally harder than debugging a traditional application. There's no server to SSH into. There's no long-running process to attach a debugger to. Your function ran for 200ms, processed a message, and the execution environment might already be gone.
Here's what you need, at minimum:
Structured logging with correlation IDs. Every log line needs to be JSON with a correlation ID that ties together all the functions involved in processing a single request:
import { Logger } from "@aws-lambda-powertools/logger";
import { Tracer } from "@aws-lambda-powertools/tracer";
const logger = new Logger({ serviceName: "order-service" });
const tracer = new Tracer({ serviceName: "order-service" });
export const handler = async (event: any) => {
// Extract or generate correlation ID
const correlationId =
event.headers?.["x-correlation-id"] ||
event.detail?.correlationId ||
crypto.randomUUID();
logger.appendKeys({ correlationId });
logger.info("Processing order", {
orderId: event.detail?.orderId,
source: event.source,
});
try {
const result = await processOrder(event);
logger.info("Order processed successfully", { result });
return result;
} catch (error) {
logger.error("Order processing failed", { error });
throw error;
}
};

X-Ray tracing. AWS X-Ray traces requests across services. It's not perfect — the console is clunky and it misses some async patterns — but it's the only way to get a visual trace of a request flowing through Lambda -> SQS -> Lambda -> DynamoDB. Enable it:
// In your CDK stack
const fn = new lambda.Function(this, "OrderHandler", {
// ...
tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
environment: {
POWERTOOLS_SERVICE_NAME: "order-service",
LOG_LEVEL: "INFO",
},
});

CloudWatch Insights queries. Learn these. When something breaks at 3 AM, you'll be querying CloudWatch Logs Insights to find what happened:
# Find all errors for a specific correlation ID
fields @timestamp, @message
| filter correlationId = "abc-123-def"
| sort @timestamp asc
# Find all cold starts in the last hour
fields @timestamp, @initDuration, @memorySize
| filter @type = "REPORT" and @initDuration > 0
| sort @initDuration desc
| limit 50
# Approximate error rate per log group (ERROR lines vs. total lines)
stats count() as totalLines,
    sum(strcontains(@message, "ERROR")) as errors,
    (sum(strcontains(@message, "ERROR")) / count()) * 100 as errorRate
  by @log
| sort errorRate desc
The reality: even with all of this, debugging a 10-service event-driven architecture takes 3-5x longer than debugging a monolith. That's a genuine, permanent trade-off. You can improve it with better tooling (Lumigo, Epsagon, Thundra, and others offer serverless-specific observability), but the fundamental complexity of distributed tracing across ephemeral functions is irreducible.
This is the area where serverless developer experience still lags the most. Your code runs on AWS. Your local machine is not AWS. Bridging that gap is painful.
The options, ranked by my preference:
1. SST (Serverless Stack) with Live Lambda Dev. This is the best DX I've found. It deploys your infrastructure to AWS but proxies Lambda invocations back to your local machine. You write code locally, save the file, and the next invocation runs your updated code — no deployment needed. It's close to the hot-reload experience of local development.
2. AWS SAM Local. Runs Lambda functions locally in Docker containers. It works for simple cases but falls apart with complex triggers (EventBridge, DynamoDB Streams, Step Functions). You end up mocking half of AWS.
3. LocalStack. Emulates AWS services locally. The free tier covers S3, SQS, SNS, DynamoDB, and Lambda. The pro tier adds Step Functions, EventBridge, and others. It's good but not identical to real AWS — I've had tests pass on LocalStack and fail on real AWS due to subtle behavioral differences.
4. Deploy and test in a dev account. The "just deploy it" approach. It's the most accurate because you're testing on real AWS. It's also the slowest — even with SAM Accelerate or CDK Hotswap, a deploy cycle is 10-30 seconds. That loop of code-deploy-test-debug is brutal when you're iterating.
My actual workflow: SST for feature development, LocalStack for integration tests in CI, and a shared dev account for pre-production validation. It's three environments to maintain, and none of them perfectly replicate production.
// sst.config.ts — SST v3 configuration
export default $config({
app(input) {
return {
name: "my-service",
removal: input?.stage === "production" ? "retain" : "remove",
};
},
async run() {
const table = new sst.aws.Dynamo("Orders", {
fields: { PK: "string", SK: "string" },
primaryIndex: { hashKey: "PK", rangeKey: "SK" },
});
const api = new sst.aws.ApiGatewayV2("Api");
api.route("POST /orders", {
handler: "src/handlers/order.handler",
link: [table],
});
// During `sst dev`, Lambda invocations are proxied to your machine
// During `sst deploy`, everything runs on AWS normally
},
});

Testing serverless code requires a different strategy than testing traditional applications. You have three layers to test, and each requires a different approach.
Unit tests: Test your business logic in isolation. If you followed the hexagonal architecture pattern from the vendor lock-in section, this is straightforward. Your domain logic has no AWS dependencies and can be tested with standard test frameworks:
// domain/orders.test.ts
import { describe, it, expect, vi } from "vitest";
import { createOrder } from "./orders";
const mockRepo = {
save: vi.fn().mockResolvedValue(undefined),
};
const mockPublisher = {
publish: vi.fn().mockResolvedValue(undefined),
};
describe("createOrder", () => {
it("creates an order and publishes an event", async () => {
const result = await createOrder(
{ userId: "u123", items: [{ productId: "p1", quantity: 2 }] },
{ orderRepo: mockRepo, eventPublisher: mockPublisher }
);
expect(result.success).toBe(true);
expect(mockRepo.save).toHaveBeenCalledOnce();
expect(mockPublisher.publish).toHaveBeenCalledWith(
expect.objectContaining({ type: "OrderCreated" })
);
});
it("rejects empty orders", async () => {
const result = await createOrder(
{ userId: "u123", items: [] },
{ orderRepo: mockRepo, eventPublisher: mockPublisher }
);
expect(result.success).toBe(false);
expect(mockRepo.save).not.toHaveBeenCalled();
});
});

Integration tests: Test AWS interactions with LocalStack. Use Testcontainers to spin up LocalStack in Docker and run your actual DynamoDB queries, SQS operations, and S3 uploads against it:
// infra/dynamodb-order-repo.integration.test.ts
import { beforeAll, afterAll, it, expect } from "vitest";
import { GenericContainer, StartedTestContainer } from "testcontainers";
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";
let localstackContainer: StartedTestContainer;
let ddbClient: DynamoDBClient;
beforeAll(async () => {
localstackContainer = await new GenericContainer("localstack/localstack")
.withExposedPorts(4566)
.start();
const endpoint = `http://localhost:${localstackContainer.getMappedPort(4566)}`;
ddbClient = new DynamoDBClient({
endpoint,
region: "us-east-1",
credentials: { accessKeyId: "test", secretAccessKey: "test" },
});
// Create table
await ddbClient.send(new CreateTableCommand({
TableName: "Orders",
KeySchema: [
{ AttributeName: "PK", KeyType: "HASH" },
{ AttributeName: "SK", KeyType: "RANGE" },
],
AttributeDefinitions: [
{ AttributeName: "PK", AttributeType: "S" },
{ AttributeName: "SK", AttributeType: "S" },
],
BillingMode: "PAY_PER_REQUEST",
}));
}, 60_000);
afterAll(async () => {
await localstackContainer?.stop();
});
it("saves and retrieves an order", async () => {
const repo = new DynamoDBOrderRepository(ddbClient, "Orders");
const order = Order.create({
userId: "u123",
items: [{ productId: "p1", quantity: 2 }],
});
await repo.save(order);
const retrieved = await repo.findById(order.id);
expect(retrieved).toEqual(order);
});

End-to-end tests: Test the full flow on a real AWS account. Deploy to a test stage and run tests that hit the actual API Gateway endpoint, verify DynamoDB writes, check SQS messages, and confirm the full event chain works:
// e2e/order-flow.e2e.test.ts
const API_URL = process.env.API_URL; // Set by CI/CD after deployment
const token = process.env.TEST_AUTH_TOKEN; // Test-user credential, also provided by CI/CD
it("creates an order and processes it end-to-end", async () => {
// Create order via API
const response = await fetch(`${API_URL}/orders`, {
method: "POST",
headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
body: JSON.stringify({
items: [{ productId: "p1", quantity: 2 }],
}),
});
expect(response.status).toBe(202);
const { orderId } = await response.json();
// Poll for completion (event-driven processing takes time)
const order = await pollUntil(
() => fetch(`${API_URL}/orders/${orderId}`).then(r => r.json()),
(result) => result.status === "COMPLETED",
{ timeout: 30_000, interval: 1_000 }
);
expect(order.status).toBe("COMPLETED");
expect(order.paymentStatus).toBe("CHARGED");
expect(order.inventoryStatus).toBe("RESERVED");
});

The test pyramid for serverless is inverted compared to traditional apps. You need more integration and E2E tests because the glue between services (IAM permissions, event formats, queue configurations) is where most bugs hide. A unit test won't catch that your Lambda doesn't have permission to write to the SQS queue, or that the EventBridge rule pattern doesn't match your event format.
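The `pollUntil` helper used in the e2e test above isn't from a library; it's a small utility you'd write yourself. A minimal sketch:

```typescript
// Minimal polling helper: repeatedly calls `fn` until `predicate` accepts the
// result or the timeout elapses. Useful in e2e tests where event-driven
// processing completes asynchronously.
async function pollUntil<T>(
  fn: () => Promise<T>,
  predicate: (result: T) => boolean,
  opts: { timeout: number; interval: number }
): Promise<T> {
  const deadline = Date.now() + opts.timeout;
  for (;;) {
    const result = await fn();
    if (predicate(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error(`pollUntil timed out after ${opts.timeout}ms`);
    }
    // Wait before the next attempt
    await new Promise(resolve => setTimeout(resolve, opts.interval));
  }
}
```

Keep the interval generous (500ms-1s); hammering your own API from CI just adds noise and cost.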
I have made every one of these mistakes. Learn from my pain.
1. Lambda-lith. Putting your entire Express app inside a single Lambda function. The cold start is massive (2-5 seconds with all dependencies), you can't scale individual routes independently, and you're paying for 1GB of memory to serve a health check endpoint.
Bad: Lambda-lith
API Gateway /* -> Single Lambda (Express app)
- 50MB bundle
- 2 second cold start
- All routes, all middleware, all the time
Good: Function-per-route (or function-per-domain)
API Gateway /orders/* -> Order Lambda (3MB, 120ms cold start)
API Gateway /users/* -> User Lambda (2MB, 100ms cold start)
API Gateway /search/* -> Search Lambda (4MB, 150ms cold start)
However — and this is a nuanced point — a Lambda-lith for a low-traffic internal tool is actually fine. If you have 100 requests/day and don't care about cold starts, the simplicity of a single function outweighs the architectural purity of function-per-route. Context matters.
2. Synchronous chains. Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. Each one waits for the next. You're paying for all four simultaneously, the total latency is the sum of all four, and if any one fails, you need retry logic at every level.
Bad: Synchronous chain
Lambda A (waiting...) -> Lambda B (waiting...) -> Lambda C -> Lambda D
Total cost: A duration + B duration + C duration + D duration
Latency: A + B + C + D
If D fails: everything fails
Good: Async with queues
Lambda A -> SQS -> Lambda B -> SQS -> Lambda C -> SQS -> Lambda D
Total cost: A + B + C + D (but A finishes immediately)
Latency (user-facing): just A
If D fails: D's message goes to DLQ, everything else succeeded
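The decoupling matters more than the specific queue technology. Here's a sketch of the shape, with a hypothetical `Queue` port standing in for the SQS client so the pattern is visible without AWS wiring:

```typescript
// Hypothetical Queue port; in production this would wrap an SQS SendMessage call.
interface Queue {
  send(body: string): Promise<void>;
}

// "Lambda A": validate, enqueue, return immediately.
// User-facing latency is just this function's duration; downstream consumers
// (B, C, D) process from their queues on their own time, with their own
// retries and dead-letter queues.
async function handleRequest(
  payload: { orderId: string },
  queue: Queue
): Promise<{ statusCode: number; body: string }> {
  if (!payload.orderId) {
    return { statusCode: 400, body: JSON.stringify({ error: "orderId required" }) };
  }
  await queue.send(JSON.stringify({ type: "OrderReceived", ...payload }));
  // 202 Accepted: the work is queued, not done
  return { statusCode: 202, body: JSON.stringify({ accepted: true }) };
}

// Demo with an in-memory stand-in for the queue
const messages: string[] = [];
handleRequest({ orderId: "o1" }, { send: async m => { messages.push(m); } })
  .then(res => console.log(res.statusCode, messages.length)); // prints "202 1"
```

Note the status code: returning 202 instead of 200 is an honest contract with the client that processing continues after the response.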
3. Using Lambda for long-running tasks. Lambda has a 15-minute timeout. I've seen people chain Lambda invocations to process hour-long jobs, each function invoking the next at the 14-minute mark. This is fragile, expensive, and impossible to debug. Use ECS Fargate tasks for long-running work. You can still trigger them from Lambda.
4. Ignoring DynamoDB throughput modes. On-demand pricing is convenient but expensive at scale. A table doing 10,000 writes per second on on-demand costs roughly $22/hour, while provisioned capacity for the same throughput (with auto-scaling) costs about $6.50/hour. For predictable workloads, provisioned saves around 70%.
5. Not setting concurrency limits. By default, Lambda functions share a regional concurrency pool of 1,000. One runaway function can consume the entire pool and cause every other function in your account to throttle. Set reserved concurrency on critical functions and unreserved concurrency limits on non-critical ones.
// CDK — Set reserved concurrency
const criticalFn = new lambda.Function(this, "PaymentProcessor", {
// ...
reservedConcurrentExecutions: 100, // Guaranteed 100, max 100
});
const nonCriticalFn = new lambda.Function(this, "AnalyticsProcessor", {
// ...
reservedConcurrentExecutions: 50, // Won't starve critical functions
});

After four production serverless projects and twice as many VPS deployments, here's my honest assessment of when to use what.
Serverless is genuinely better for:
- Spiky or unpredictable traffic that would force you to over-provision servers
- Workloads that sit idle most of the time (scale-to-zero is real money saved)
- Event-driven glue: file processing, webhook ingestion, notification fan-out, scheduled jobs
- Small teams without dedicated DevOps capacity
A VPS / container is genuinely better for:
- High, sustained throughput where per-request pricing compounds
- Latency-sensitive APIs where cold starts are unacceptable
- Stateful workloads: in-memory caches, WebSocket connections, local filesystem access
- Long-running jobs that exceed Lambda's 15-minute timeout
- Tight budgets at moderate-to-high traffic
The hybrid approach that actually works:
Most production systems I build now are hybrid. The VPS handles the steady-state traffic — the API, the WebSocket connections, the background workers. Serverless handles the spiky, event-driven, parallelizable parts — file processing, notification fan-out, scheduled jobs, webhook ingestion.
Hybrid Architecture:
[VPS Cluster] [Serverless]
- Main API (Express/Fastify) - S3 upload processing (Lambda)
- WebSocket server - Email/SMS notifications (Lambda + SES)
- Background workers - Daily report generation (Lambda + Step Functions)
- Redis cache - Webhook ingestion (API Gateway + Lambda + SQS)
- PostgreSQL - Image/video thumbnailing (Lambda)
- Scheduled cleanup tasks (EventBridge + Lambda)
Communication: SQS queues between VPS and Lambda functions
This gives you the cost efficiency of servers for predictable workloads, the elastic scaling of Lambda for bursty workloads, and the operational simplicity of managed services for glue tasks.
Before you start your next project, run through this checklist:
TRAFFIC PATTERN
[ ] Consistent, predictable -> VPS / Containers
[ ] Highly variable, bursty -> Serverless
[ ] Idle most of the time -> Serverless
[ ] High sustained throughput -> VPS / Containers
LATENCY REQUIREMENTS
[ ] p99 < 50ms required -> VPS (no cold starts)
[ ] p99 < 200ms acceptable -> Serverless (with provisioned concurrency)
[ ] p99 < 500ms acceptable -> Serverless (standard)
[ ] Latency doesn't matter -> Serverless
EXECUTION DURATION
[ ] Sub-second responses -> Either
[ ] 1-15 minutes -> Lambda (with timeout awareness)
[ ] 15 minutes to hours -> ECS Fargate / VPS
[ ] Hours to days -> VPS / dedicated server
STATE REQUIREMENTS
[ ] Stateless request/response -> Serverless
[ ] In-memory caching needed -> VPS
[ ] WebSocket connections -> VPS (or API Gateway WebSocket)
[ ] Local filesystem needed -> VPS (or Lambda + EFS, with caveats)
TEAM AND OPERATIONAL FACTORS
[ ] Small team, no DevOps -> Serverless (less infra to manage)
[ ] Large team, dedicated SRE -> Either (team can manage complexity)
[ ] AWS expertise on team -> Serverless
[ ] Need to run on multiple clouds -> VPS / Containers (portability)
[ ] Tight budget -> VPS (for moderate+ traffic)
COST ESTIMATION
[ ] < 10K requests/day -> Serverless (nearly free)
[ ] 10K-100K requests/day -> Calculate both; often VPS wins
[ ] 100K-1M requests/day -> VPS almost certainly cheaper
[ ] > 1M requests/day -> VPS, unless traffic is extremely bursty
If you checked mostly items in the left column, serverless is probably your best bet. If you checked mostly right-column items, go with traditional servers. If it's mixed — and it usually is — the hybrid approach described above is likely your answer.
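If you want to make that tally mechanical, here's a toy sketch (illustrative only, not a real tool): record which way each checked item points and let the margin decide.

```typescript
// Toy encoding of the checklist tally above. Each checked item contributes
// its recommendation; a narrow margin means the honest answer is "hybrid".
type Answer = "serverless" | "vps" | "either";

function recommend(answers: Answer[]): "serverless" | "vps" | "hybrid" {
  const s = answers.filter(a => a === "serverless").length;
  const v = answers.filter(a => a === "vps").length;
  if (Math.abs(s - v) <= 1) return "hybrid"; // mixed result: split the workload
  return s > v ? "serverless" : "vps";
}

// Example: bursty traffic, relaxed latency, short executions, stateless,
// small team, low request volume
console.log(
  recommend(["serverless", "serverless", "either", "serverless", "serverless", "serverless"])
); // prints "serverless"
```

The threshold of one is arbitrary; the real point is that a near-even split is a signal to go hybrid, not to force a winner.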
Serverless is a tool, not a religion. The community around it sometimes forgets that. There are AWS Heroes who will tell you with a straight face that every application should be serverless. There are old-school sysadmins who will tell you it's a fad. Both are wrong.
The $20K monthly bill I mentioned at the start? We migrated the hot-path API to three Hetzner dedicated servers behind a load balancer. That part of the bill dropped to $200/month. We kept the event processing pipeline, the scheduled jobs, and the file processing on Lambda. That part costs about $85/month and would be a pain to manage on servers.
The total bill went from $20,400 to $285. The architecture is cleaner. The team is happier. The code is easier to debug.
Serverless is brilliant when it fits. It's expensive when it doesn't. The skill is knowing the difference before the bill arrives.