Serverless isn't just Lambda functions: event-driven architectures, fan-out patterns, saga orchestration, cold starts, vendor lock-in, and the cost surprises that hit when your traffic actually grows.
I need to start this post with a confession. In 2024, I convinced my team to go "fully serverless" for a new product. No containers. No VMs. Pure Lambda, DynamoDB, SQS, Step Functions — the works. The architecture diagrams looked beautiful. The pitch to leadership was compelling: "We only pay for what we use. It scales to zero. It scales to infinity."
Eighteen months later, our monthly bill stabilized at $20,400. The equivalent workload on three well-provisioned VPS instances would have cost roughly $600. We were processing about 2 million requests per day, which sounds like a lot until you realize that's about 23 requests per second on average — a workload any $200/month server handles while yawning.
This post is everything I learned from that experience and the three serverless projects that followed it. Some of those projects genuinely benefited from serverless. The first one didn't. The difference was understanding the patterns, the anti-patterns, and most critically, the cost model.
Before we get into patterns, let's fix the mental model. When most developers hear "serverless," they think "functions as a service" — write a function, deploy it, it runs when triggered. That's accurate but incomplete, like saying a car is "an engine with seats."
Serverless is a deployment and execution model where:
- you don't provision, patch, or manage servers;
- capacity scales automatically with demand, down to zero;
- you pay per request and per unit of execution time, not for idle capacity;
- execution environments are ephemeral, so your functions must be stateless.
That last point is the one people underestimate. Everything you know about in-process caching, connection pooling, background threads, and local file storage needs to be rethought. Your function is a stateless transformer: input comes in, output goes out, anything you want to remember lives somewhere else.
Traditional Server Mental Model:
[Long-lived process] -> in-memory cache
-> connection pool
-> local filesystem
-> background workers
-> cron scheduling
Serverless Mental Model:
[Short-lived function] -> external cache (Redis/ElastiCache)
-> per-invocation connections (or RDS Proxy)
-> S3 / EFS
-> SQS + separate functions
-> EventBridge Scheduler
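One wrinkle the diagram hides: execution environments are reused across "warm" invocations, so module scope is where expensive initialization belongs. A minimal sketch; createDbClient is a hypothetical stand-in for any costly setup (an SDK client, parsed config, a connection through RDS Proxy):

```typescript
// Module scope runs once per cold start; every warm invocation that
// lands on the same environment reuses whatever lives here.
// createDbClient is a hypothetical stand-in for expensive setup.
let initCount = 0;

function createDbClient() {
  initCount++; // counts how often the expensive path actually runs
  return {
    query: async (sql: string): Promise<unknown[]> => [],
  };
}

const db = createDbClient(); // paid once, on cold start

export const handler = async (event: { userId: string }) => {
  // Per-invocation work reuses the module-scope client.
  const rows = await db.query(`SELECT * FROM users WHERE id = '${event.userId}'`);
  return { rows, initCount };
};
```

This is why the SDK clients in the examples below sit at module scope rather than inside the handler.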
This shift isn't just architectural — it changes how you think about every problem. And that's where the patterns come in.
This is the foundational serverless pattern, and it's genuinely powerful when applied correctly. Instead of services calling each other synchronously, they emit events. Other services react to those events. Nobody waits for anybody.
Here's what a synchronous order flow looks like:
Synchronous (traditional):
Client -> API -> Validate Order
-> Charge Payment
-> Update Inventory
-> Send Confirmation Email
-> Return Response
Total latency: sum of all steps (800-2000ms)
If email service is down: entire order fails
And the event-driven equivalent:
Event-Driven (serverless):
Client -> API Gateway -> Lambda (validate + save order)
-> emit "OrderCreated" event
-> Return 202 Accepted (50ms)
EventBridge picks up "OrderCreated":
-> Lambda: charge payment -> emit "PaymentProcessed"
-> Lambda: reserve inventory -> emit "InventoryReserved"
-> Lambda: send confirmation email
Each step is independent. Email failure doesn't block payment.
The implementation with EventBridge looks like this:
// order-handler.ts — the entry point
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { APIGatewayProxyEventV2 } from "aws-lambda";
import { ulid } from "ulid";
const eb = new EventBridgeClient({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event: APIGatewayProxyEventV2) => {
const order = JSON.parse(event.body || "{}");
// Validate
const errors = validateOrder(order);
if (errors.length > 0) {
return { statusCode: 400, body: JSON.stringify({ errors }) };
}
// Save to DynamoDB
const orderId = ulid();
await ddb.send(new PutCommand({
TableName: "Orders",
Item: {
PK: `ORDER#${orderId}`,
SK: "METADATA",
...order,
status: "PENDING",
createdAt: Date.now(),
},
}));
// Emit event — this is the key
await eb.send(new PutEventsCommand({
Entries: [{
Source: "orders.service",
DetailType: "OrderCreated",
Detail: JSON.stringify({ orderId, ...order }),
EventBusName: "main-bus",
}],
}));
return {
statusCode: 202,
body: JSON.stringify({ orderId, status: "PENDING" }),
};
};

The beauty is in the decoupling. The order handler has no idea what happens after it emits that event. Payment processing, inventory management, email notifications — they're all separate Lambda functions triggered by EventBridge rules. You can add a new reaction (say, updating an analytics dashboard) without touching the order handler.
But here's where the honest part comes in: debugging this is a nightmare. When a customer says "I placed an order but never got a confirmation email," you need to trace across multiple Lambda invocations, check EventBridge delivery logs, look at the email Lambda's CloudWatch logs, and hope you logged enough correlation IDs. We'll get to observability later, but know this: the operational complexity of event-driven architectures is real and substantial.
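For completeness, here is what one downstream consumer might look like. This is a hypothetical sketch, not production code: the envelope type is defined inline so the example stands alone (the real aws-lambda types ship an EventBridgeEvent generic for this):

```typescript
// payment-handler.ts (hypothetical) consumes the "OrderCreated" events
// emitted above. The envelope type below mirrors the EventBridge event
// shape; a real project would import EventBridgeEvent from "aws-lambda".
interface OrderCreatedEvent {
  source: string;          // "orders.service"
  "detail-type": string;   // "OrderCreated"
  detail: { orderId: string; total?: number };
}

export const handler = async (event: OrderCreatedEvent) => {
  const { orderId } = event.detail;
  // Log the correlation point first: this is what you grep for when
  // tracing one order across several functions.
  console.log(JSON.stringify({ msg: "charging payment", orderId }));
  // ... call the payment provider here, then emit "PaymentProcessed" ...
  return { orderId, status: "CHARGED" as const };
};
```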
This pattern is where serverless genuinely earns its keep. You have a job that can be parallelized — processing a large file, running computations across multiple data sets, generating reports from different sources. Instead of processing sequentially, you fan out to many concurrent Lambda invocations and fan back in when they're all done.
Fan-Out / Fan-In:
-> Lambda (chunk 1) --\
/-> Lambda (chunk 2) ---\
S3 Upload -> Lambda --+--> Lambda (chunk 3) ----+--> Lambda (aggregate)
(splitter) \-> Lambda (chunk 4) ---/ |
-> Lambda (chunk 5) --/ Final Result -> S3
Processing 5GB CSV file:
Single server: 45 minutes
Fan-out to 100 Lambdas: 27 seconds
I used this pattern to process user analytics data. Every night, a 3GB file lands in S3. A "splitter" Lambda reads the file, chunks it into pieces, and drops each chunk into an SQS queue. A fleet of Lambda workers processes the chunks concurrently. When they're all done, a final Lambda aggregates the results.
// splitter.ts — triggered by S3 event
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { S3Event } from "aws-lambda";
const s3 = new S3Client({});
const sqs = new SQSClient({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event: S3Event) => {
const bucket = event.Records[0].s3.bucket.name;
const key = event.Records[0].s3.object.key;
// Stream the file to count lines and determine chunks
const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
const content = await obj.Body?.transformToString();
const lines = content?.split("\n").filter(Boolean) || [];
const CHUNK_SIZE = 10_000;
const chunks = Math.ceil(lines.length / CHUNK_SIZE);
const jobId = `job-${Date.now()}`;
// Track total chunks for the aggregator
await ddb.send(new PutCommand({
TableName: "Jobs",
Item: {
PK: `JOB#${jobId}`,
totalChunks: chunks,
completedChunks: 0,
status: "PROCESSING",
},
}));
// Fan out — send each chunk reference to SQS
for (let i = 0; i < chunks; i++) {
const batch = lines.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
// In practice, write chunks to S3 and send references
await sqs.send(new SendMessageBatchCommand({
QueueUrl: process.env.QUEUE_URL,
Entries: [{
Id: `chunk-${i}`,
MessageBody: JSON.stringify({
jobId,
chunkIndex: i,
bucket,
chunkKey: `chunks/${jobId}/chunk-${i}.json`,
}),
}],
}));
}
};

// worker.ts — triggered by SQS
import { SQSEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const eb = new EventBridgeClient({});
// getChunkFromS3 and processAnalytics are app-specific helpers defined elsewhere
export const handler = async (event: SQSEvent) => {
for (const record of event.Records) {
const { jobId, chunkIndex, bucket, chunkKey } = JSON.parse(record.body);
// Process the chunk
const data = await getChunkFromS3(bucket, chunkKey);
const result = processAnalytics(data);
// Write partial result
await ddb.send(new PutCommand({
TableName: "Results",
Item: {
PK: `JOB#${jobId}`,
SK: `CHUNK#${chunkIndex}`,
result,
},
}));
// Atomically increment completed count
const updated = await ddb.send(new UpdateCommand({
TableName: "Jobs",
Key: { PK: `JOB#${jobId}` },
UpdateExpression: "SET completedChunks = completedChunks + :one",
ExpressionAttributeValues: { ":one": 1 },
ReturnValues: "ALL_NEW",
}));
// Check if all chunks are done — trigger aggregation
if (updated.Attributes?.completedChunks === updated.Attributes?.totalChunks) {
await eb.send(new PutEventsCommand({
Entries: [{
Source: "analytics.processor",
DetailType: "AllChunksComplete",
Detail: JSON.stringify({ jobId }),
}],
}));
}
}
};

The atomic counter on DynamoDB is the trick here — it's how you know when all workers have finished without needing a central coordinator polling for completion. This is genuinely elegant, and hard to replicate cleanly with traditional infrastructure.
The cost for this specific workload: about $4.50 per nightly run. Running a server 24/7 just to process one file each night would cost more. This is the serverless sweet spot — bursty, parallelizable workloads with long idle periods.
This is where serverless architecture gets seriously complicated, and where I've seen the most teams get burned. In a monolith, if you need to charge a payment and update inventory atomically, you wrap it in a database transaction. Done. In a distributed serverless system, there's no such thing as a distributed transaction (and if someone tries to sell you one, run).
The saga pattern is the alternative. Instead of one atomic transaction, you execute a series of local transactions, each with a compensating action that can undo it if a later step fails.
Saga: Book a Trip
Step 1: Reserve Flight -> Compensate: Cancel Flight Reservation
Step 2: Reserve Hotel -> Compensate: Cancel Hotel Reservation
Step 3: Charge Credit Card -> Compensate: Refund Credit Card
Step 4: Send Confirmation -> (no compensation needed)
If Step 3 fails:
-> Execute compensation for Step 2 (cancel hotel)
-> Execute compensation for Step 1 (cancel flight)
-> Notify user of failure
AWS Step Functions is the managed way to implement this. The state machine definition looks like this:
{
"Comment": "Trip Booking Saga",
"StartAt": "ReserveFlight",
"States": {
"ReserveFlight": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:reserve-flight",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "FlightReservationFailed"
}],
"Next": "ReserveHotel"
},
"ReserveHotel": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:reserve-hotel",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "CancelFlightReservation"
}],
"Next": "ChargePayment"
},
"ChargePayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:charge-payment",
"Catch": [{
"ErrorEquals": ["States.ALL"],
"Next": "CancelHotelReservation"
}],
"Next": "SendConfirmation"
},
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:send-confirmation",
"End": true
},
"CancelHotelReservation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:cancel-hotel",
"Next": "CancelFlightReservation"
},
"CancelFlightReservation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:cancel-flight",
"Next": "NotifyBookingFailed"
},
"FlightReservationFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
"End": true
},
"NotifyBookingFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
"End": true
}
}
}

This looks clean in a diagram. In practice, it's where the pain lives. What happens if the compensation itself fails? What if CancelHotelReservation times out? Now you have an inconsistent state: the flight was canceled but the hotel reservation is stuck. You need retry logic on compensations, dead-letter queues for failed compensations, and monitoring that alerts you when a saga is stuck in an intermediate state.
I've implemented three saga workflows in production. Every single one required manual intervention within the first month — not because the logic was wrong, but because third-party APIs are unreliable. The hotel booking API returned a 500 during cancellation. The payment refund endpoint had a different timeout than the charge endpoint. Real systems are messy, and the saga pattern doesn't hide that messiness — it just gives you a structured way to deal with it.
My honest recommendation: if you can restructure your domain to avoid distributed transactions entirely, do that. Use the saga pattern as a last resort, not a first choice.
Step Functions go beyond sagas. They're a general-purpose workflow orchestration service, and they're genuinely useful for complex multi-step processes. The two flavors matter:
Standard Workflows: Exactly-once execution, up to one year duration, priced per state transition ($0.025 per 1,000 transitions). Good for long-running processes where you need guaranteed completion.
Express Workflows: At-least-once execution, up to 5 minutes, priced per invocation and duration. Good for high-volume, short-duration workflows.
Here's a pattern I use frequently — a document processing pipeline:
Document Processing Pipeline (Step Functions Standard):
Start
|
v
[Extract Text] -- Lambda: OCR / textract
|
v
[Classify Document] -- Lambda: ML classification
|
v
<Is Sensitive?> -- Choice State
/ \
Yes No
| |
v v
[Redact PII] [Skip Redaction]
| |
v v
[Store in S3]
|
v
[Update Database]
|
v
[Notify Subscribers]
|
v
End
The Step Functions definition handles retries, timeouts, and error handling declaratively:
// CDK definition — much cleaner than raw JSON
const extractText = new tasks.LambdaInvoke(this, "ExtractText", {
lambdaFunction: extractFn,
retryOnServiceExceptions: true,
resultPath: "$.extractResult",
});
const classifyDoc = new tasks.LambdaInvoke(this, "ClassifyDocument", {
lambdaFunction: classifyFn,
retryOnServiceExceptions: true,
resultPath: "$.classification",
});
const isSensitive = new sfn.Choice(this, "IsSensitive?")
.when(
sfn.Condition.stringEquals("$.classification.Payload.type", "SENSITIVE"),
redactPII
)
.otherwise(skipRedaction);
const pipeline = extractText
.next(classifyDoc)
.next(isSensitive);
// Both paths converge
redactPII.next(storeInS3);
skipRedaction.next(storeInS3);
storeInS3.next(updateDB).next(notifySubscribers);
new sfn.StateMachine(this, "DocProcessor", {
definition: pipeline,
timeout: Duration.hours(1),
});

The cost trap with Step Functions Standard: each state transition costs money. A workflow with 10 states processing 100,000 documents per month is 1,000,000 transitions = $25. Not bad. But add error handling, retries, and parallel branches, and a single execution might have 40-50 transitions. Now you're at $125/month. Still reasonable, but it creeps up fast.
Express Workflows are better for high-volume scenarios. I switched a webhook processing pipeline from Standard to Express and the cost dropped from $340/month to $18/month. The trade-off is at-least-once semantics — your Lambda functions need to be idempotent.
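Idempotency deserves a concrete shape. The usual trick is to derive a stable key from the message and refuse to process it twice. Here's a sketch with the store behind an interface; in production the claim would be a DynamoDB PutItem with a ConditionExpression of attribute_not_exists(PK), but an in-memory store keeps the pattern visible on its own:

```typescript
// Hypothetical idempotency guard for at-least-once delivery.
interface IdempotencyStore {
  // Returns true if the key was newly claimed, false if already seen.
  claim(key: string): Promise<boolean>;
}

// In-memory stand-in; production would back this with a DynamoDB
// conditional write so the claim is atomic across concurrent Lambdas.
class InMemoryStore implements IdempotencyStore {
  private seen = new Set<string>();
  async claim(key: string): Promise<boolean> {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

export async function handleWebhook(
  payload: { deliveryId: string; body: string },
  store: IdempotencyStore
): Promise<{ status: "PROCESSED" | "DUPLICATE_IGNORED" }> {
  // At-least-once delivery means the same deliveryId can arrive twice.
  const fresh = await store.claim(`webhook#${payload.deliveryId}`);
  if (!fresh) {
    return { status: "DUPLICATE_IGNORED" };
  }
  // ... do the real work exactly once here ...
  return { status: "PROCESSED" };
}
```

The key design decision is what goes into the key: a provider-supplied delivery ID is ideal; a hash of the payload is the fallback.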
Let's talk numbers, because cold start discussions without numbers are useless.
I ran systematic cold start benchmarks across different runtimes and memory configurations in late 2025. Here's what I measured:
Cold Start Latency (p50 / p99):
Node.js 20 (128MB): 320ms / 890ms
Node.js 20 (512MB): 180ms / 410ms
Node.js 20 (1024MB): 120ms / 280ms
Node.js 20 (1769MB): 95ms / 210ms <- 1 full vCPU
Python 3.12 (128MB): 380ms / 1100ms
Python 3.12 (512MB): 210ms / 520ms
Python 3.12 (1024MB): 140ms / 340ms
Java 21 (512MB): 3200ms / 8400ms <- yes, seconds
Java 21 (1024MB): 1800ms / 4200ms
Java 21 (SnapStart): 280ms / 620ms <- SnapStart is essential
.NET 8 (512MB): 680ms / 1800ms
.NET 8 (1024MB): 420ms / 980ms
Rust (custom runtime): 12ms / 28ms <- not a typo
Go (provided.al2023): 18ms / 45ms
The Node.js numbers are the most relevant for web developers. At 512MB with a typical Express-on-Lambda setup (using a framework adapter), cold starts average 180ms. That's perceptible but not terrible. But if you're importing heavy SDKs — the full AWS SDK v3 with DynamoDB, S3, SES, and SQS clients — that jumps to 400-600ms because the module graph is massive.
Here's what actually helps with cold starts, ranked by impact:
1. Bundle size (highest impact). Use esbuild to tree-shake and bundle your Lambda code. A typical Express app with full dependencies: 45MB. After esbuild bundling: 2-4MB. Cold start improvement: 40-60%.
// esbuild config for Lambda
import { build } from "esbuild";
await build({
entryPoints: ["src/handlers/order.ts"],
bundle: true,
minify: true,
platform: "node",
target: "node20",
outfile: "dist/order/index.js",
external: ["@aws-sdk/*"], // AWS SDK v3 is included in the Lambda runtime
treeShaking: true,
});

2. Memory allocation. Lambda allocates CPU proportional to memory. At 128MB you get a fraction of a vCPU. At 1769MB you get a full vCPU. The sweet spot for Node.js is 512MB-1024MB. Going higher doesn't help cold starts much but costs more per ms.
3. Lazy initialization. Don't initialize everything at module load time. If a handler only sometimes needs the S3 client, create it on first use:
// Bad: always initializes both clients
const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event) => {
// Most invocations only use DynamoDB
// but S3 client was initialized regardless
};
// Good: lazy initialization
let s3Client: S3Client | null = null;
const getS3 = () => {
if (!s3Client) s3Client = new S3Client({});
return s3Client;
};
// DynamoDB is always needed, so eagerly initialize it
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
export const handler = async (event) => {
// S3 only initialized when actually needed
if (event.needsS3) {
const s3 = getS3();
// ...
}
};

4. Provisioned concurrency. This is the "just keep instances warm" solution. It works, but it's expensive. You're essentially paying for servers again — which defeats the "pay per use" promise. At $0.015 per GB-hour for provisioned concurrency, keeping 10 instances warm at 512MB costs about $55/month. That's on top of your invocation costs.
I use provisioned concurrency for exactly two scenarios: user-facing API endpoints where p99 latency matters, and scheduled functions that need to respond within a tight SLA. Everything else just eats the cold start.
5. Keep-alive pinging. The poor man's provisioned concurrency. An EventBridge rule (formerly CloudWatch Events) triggers your function every 5 minutes with a "warm-up" event. The function detects the warm-up and returns immediately. This keeps one instance warm for free.
export const handler = async (event: any) => {
// Warm-up detection
if (event.source === "warmup" || event.detail?.warmup) {
return { statusCode: 200, body: "warm" };
}
// Actual logic
// ...
};

This only keeps one instance warm. If you get concurrent requests, new instances still cold-start. But for low-traffic endpoints, it eliminates 90% of user-visible cold starts.
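The wiring side is short in CDK. A sketch, assuming an existing apiFn Lambda; the rule name is illustrative, and the payload matches the event.source === "warmup" check above:

```typescript
// CDK sketch: an EventBridge rule that pings the function every
// 5 minutes with a warm-up payload (apiFn is assumed to exist).
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import { Duration } from "aws-cdk-lib";

new events.Rule(this, "WarmupRule", {
  schedule: events.Schedule.rate(Duration.minutes(5)),
  targets: [
    new targets.LambdaFunction(apiFn, {
      // Static payload the handler recognizes as a warm-up ping.
      event: events.RuleTargetInput.fromObject({ source: "warmup" }),
    }),
  ],
});
```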
API Gateway is the front door to most serverless APIs, and it comes in two flavors that confuse everyone.
REST API (v1): The original. Full-featured. Request/response transformations, API keys, usage plans, WAF integration, caching, request validation, resource policies. Priced at $3.50 per million requests plus data transfer.
HTTP API (v2): The lightweight version. About 70% cheaper ($1.00 per million), lower latency (typically 5-10ms less), but fewer features. No caching, limited request validation, no usage plans.
For most new projects, start with HTTP API. You can always upgrade to REST API if you need the advanced features.
Here's a pattern I use for all my API Gateway setups — a Lambda authorizer that validates JWTs and attaches user context:
// authorizer.ts
import { APIGatewayRequestAuthorizerEventV2 } from "aws-lambda";
import { jwtVerify, createRemoteJWKSet } from "jose";
// Module scope: the JWKS is fetched lazily and cached across warm invocations
const JWKS = createRemoteJWKSet(new URL("https://auth.example.com/.well-known/jwks.json"));
export const handler = async (event: APIGatewayRequestAuthorizerEventV2) => {
const token = event.headers?.authorization?.replace("Bearer ", "");
if (!token) {
return { isAuthorized: false };
}
try {
const { payload } = await jwtVerify(token, JWKS, {
algorithms: ["RS256"],
issuer: "https://auth.example.com",
});
return {
isAuthorized: true,
context: {
userId: payload.sub,
email: payload.email,
role: payload.role,
// These are available in $context.authorizer in your Lambda
},
};
} catch {
return { isAuthorized: false };
}
};

A critical performance pattern: enable authorizer caching. By default, the authorizer runs on every single request. With caching enabled (TTL up to 3600 seconds), identical tokens reuse the cached authorization result. This can reduce your authorizer invocations by 80-95%.
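In CDK for HTTP APIs, the cache is a property on the authorizer itself. A sketch assuming an existing authorizerFn; depending on your CDK version, the module may be aws-cdk-lib/aws-apigatewayv2-authorizers or the older -alpha package:

```typescript
// CDK sketch: Lambda authorizer with result caching keyed on the
// Authorization header (authorizerFn is assumed to exist).
import { HttpLambdaAuthorizer, HttpLambdaResponseType } from "aws-cdk-lib/aws-apigatewayv2-authorizers";
import { Duration } from "aws-cdk-lib";

const authorizer = new HttpLambdaAuthorizer("JwtAuthorizer", authorizerFn, {
  responseTypes: [HttpLambdaResponseType.SIMPLE],
  identitySource: ["$request.header.Authorization"],
  // The identity source is the cache key: same token, cached result.
  resultsCacheTtl: Duration.minutes(15),
});
```

Keep the TTL shorter than your token revocation window: a cached "authorized" result outlives a revoked token until the TTL expires.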
Another pattern worth knowing: direct integrations. API Gateway can talk to DynamoDB, SQS, Step Functions, and other AWS services directly, without a Lambda function in the middle. For simple CRUD operations, this eliminates a Lambda invocation entirely:
Without direct integration:
Client -> API Gateway -> Lambda -> DynamoDB
Cost: API GW + Lambda invocation + DynamoDB read
Latency: ~50-80ms
With direct integration:
Client -> API Gateway -> DynamoDB (direct)
Cost: API GW + DynamoDB read
Latency: ~15-25ms
The trade-off is that the VTL (Velocity Template Language) mapping templates are horrific to write and debug. For anything beyond simple get/put operations, the Lambda is worth the extra cost and latency just for the developer experience.
DynamoDB is the database of serverless. It scales automatically, has single-digit millisecond latency, and integrates natively with Lambda through DynamoDB Streams. It's also the most misunderstood database in the AWS ecosystem.
The "single-table design" pattern means putting all your entities in one table, using composite primary keys and GSIs to support different access patterns. This sounds insane if you come from a relational background. It is, kind of. But it's how you get the most out of DynamoDB.
Single-Table Design for an E-Commerce App:
PK SK Data
-------------------------------------------------------------
USER#u123 METADATA {name, email, ...}
USER#u123 ORDER#o456 {total, status, ...}
USER#u123 ORDER#o789 {total, status, ...}
ORDER#o456 METADATA {userId, total, ...}
ORDER#o456 ITEM#i001 {productId, qty, ...}
ORDER#o456 ITEM#i002 {productId, qty, ...}
PRODUCT#p100 METADATA {name, price, ...}
PRODUCT#p100 REVIEW#r001 {userId, rating, ...}
GSI1:
GSI1PK GSI1SK
-------------------------------------------
USER#u123 2025-12-01T10:30:00Z (orders by date)
PRODUCT#p100 5#r001 (reviews by rating)
This design lets you:
- Get a user profile: PK = USER#u123, SK = METADATA
- List a user's orders: PK = USER#u123, SK begins_with ORDER#
- Get an order's line items: PK = ORDER#o456, SK begins_with ITEM#
- List a user's orders by date via GSI1: GSI1PK = USER#u123, sorted by GSI1SK

// Get user profile + recent orders in one query
const userProfile = await ddb.send(new QueryCommand({
TableName: "Main",
KeyConditionExpression: "PK = :pk",
ExpressionAttributeValues: { ":pk": `USER#${userId}` },
}));
// Single query returns both METADATA and ORDER# items
const metadata = userProfile.Items?.find(i => i.SK === "METADATA");
const orders = userProfile.Items?.filter(i => i.SK.startsWith("ORDER#"));

The advantage: one round-trip to DynamoDB gets you everything. No joins, no multiple queries, no connection pooling nightmares. In a Lambda function with a 50ms DynamoDB query, this is critical — you don't have a long-lived connection pool to absorb multiple sequential queries efficiently.
The disadvantage: the schema is painful to evolve. Adding a new access pattern might require a new GSI or even a data migration. And if you get the key design wrong initially, you might need to rebuild the table. I've had to do this twice. Neither time was fun.
My advice: use single-table design for high-traffic, well-understood access patterns. For exploratory features where the queries are evolving, use a separate table (or honestly, just use PostgreSQL through RDS Proxy).
Queues and topics are the connective tissue of serverless architectures. The distinction matters:
SQS (Simple Queue Service): Point-to-point. One message, one consumer. Messages persist until processed. Great for work distribution, rate limiting, and buffering.
SNS (Simple Notification Service): Pub/sub. One message, many subscribers. No persistence — if a subscriber is down, the message is lost (unless it delivers to SQS). Great for event fan-out.
The power pattern is combining them — the "SNS to SQS fan-out":
SNS + SQS Fan-Out:
Producer -> SNS Topic "OrderCreated"
|
+-> SQS Queue: Payment Processing
| -> Lambda: process-payment
|
+-> SQS Queue: Inventory Service
| -> Lambda: update-inventory
|
+-> SQS Queue: Analytics
| -> Lambda: track-analytics
|
+-> SQS Queue: Email Notifications
-> Lambda: send-email
Each queue has its own:
- Retry policy (maxReceiveCount)
- Dead letter queue
- Concurrency limit
- Batch size
This gives you the fan-out of pub/sub with the durability and retry semantics of queues. If the email Lambda is broken and failing, messages stack up in its SQS queue. The payment and inventory services are completely unaffected. When you fix the email Lambda, it processes the backlog automatically.
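A sketch of that wiring in CDK, with illustrative names; the remaining queues and their Lambda subscriptions follow the same shape:

```typescript
// CDK sketch of the SNS -> SQS fan-out above (names are illustrative).
import * as sns from "aws-cdk-lib/aws-sns";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as subs from "aws-cdk-lib/aws-sns-subscriptions";

const orderCreated = new sns.Topic(this, "OrderCreatedTopic");

// Each consumer gets its own queue, retry policy, and DLQ.
const emailDlq = new sqs.Queue(this, "EmailDLQ");
const emailQueue = new sqs.Queue(this, "EmailQueue", {
  deadLetterQueue: { queue: emailDlq, maxReceiveCount: 3 },
});

// rawMessageDelivery strips the SNS envelope so the consuming Lambda
// sees the original message body, not SNS metadata.
orderCreated.addSubscription(new subs.SqsSubscription(emailQueue, {
  rawMessageDelivery: true,
}));
```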
The SQS-to-Lambda integration has important configuration options that people get wrong:
// CDK — SQS -> Lambda event source mapping
new lambda.EventSourceMapping(this, "OrderQueueMapping", {
target: processOrderFn,
eventSourceArn: orderQueue.queueArn,
batchSize: 10, // Process up to 10 messages per invocation
maxBatchingWindow: Duration.seconds(5), // Wait up to 5s to fill batch
maxConcurrency: 50, // Maximum concurrent Lambda invocations
reportBatchItemFailures: true, // Critical: report per-item failures
});

That reportBatchItemFailures flag is essential. Without it, if one message in a batch of 10 fails, all 10 messages return to the queue and get reprocessed. With it, you report which specific messages failed, and only those go back to the queue:
// Handler with batch item failure reporting
import { SQSBatchResponse, SQSEvent } from "aws-lambda";
export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
const failures: { itemIdentifier: string }[] = [];
for (const record of event.Records) {
try {
await processMessage(JSON.parse(record.body));
} catch (error) {
console.error(`Failed to process ${record.messageId}`, error);
failures.push({ itemIdentifier: record.messageId });
}
}
return {
batchItemFailures: failures,
};
};

One lesson learned the hard way: set maxConcurrency on your SQS-to-Lambda mappings. Without it, a sudden spike of messages can trigger hundreds of concurrent Lambda invocations, which can overwhelm your downstream services (especially databases). I once had a backlog of 50,000 messages drain simultaneously and take down a downstream API with 800 concurrent connections. The maxConcurrency setting would have prevented that entirely.
Here's the honest breakdown. I've run these numbers across four production workloads.
Workload 1: Low-traffic API (1,000 requests/day)
Serverless:
API Gateway HTTP API: 1K * 30 = 30K requests/month = $0.03
Lambda (256MB, 100ms avg): 30K invocations = $0.01
DynamoDB (on-demand): ~$0.50
CloudWatch Logs: ~$0.30
Total: ~$0.84/month
VPS (smallest Hetzner):
CX22 (2 vCPU, 4GB RAM): $4.49/month
Total: $4.49/month
Winner: Serverless (5x cheaper)
Workload 2: Medium-traffic API (100,000 requests/day)
Serverless:
API Gateway HTTP API: 3M requests/month = $3.00
Lambda (512MB, 200ms avg): 3M invocations = $1.88
Lambda compute: 3M * 0.2s * 0.5GB = 300K GB-s = $5.00
DynamoDB (on-demand): ~$15.00
CloudWatch Logs: ~$8.00
NAT Gateway (if in VPC): ~$32.00 + data processing
Total: ~$65/month (without NAT) or ~$97/month (with NAT)
VPS:
Hetzner CX42 (4 vCPU, 16GB): $17.49/month
Total: $17.49/month
Winner: VPS (4-6x cheaper)
Workload 3: High-traffic API (2M requests/day)
Serverless:
API Gateway: 60M requests/month = $60.00
Lambda (1024MB, 150ms avg): 60M invocations = $12.00
Lambda compute: 60M * 0.15s * 1GB = 9M GB-s = $150.03
DynamoDB (provisioned): ~$120.00
CloudWatch Logs: ~$45.00
NAT Gateway: ~$95.00
ElastiCache (for caching): ~$50.00
Total: ~$532/month
VPS (3x Hetzner dedicated):
3x AX52 (8 core, 64GB): 3 * $63 = $189/month
Load balancer: $5.49
Total: ~$195/month
Winner: VPS (2.7x cheaper)
Workload 4: Bursty processing (idle 23 hours, massive spike for 1 hour)
Serverless:
Lambda during spike: 500 concurrent, 1 hour = ~$45
Lambda during idle: $0
SQS/EventBridge: ~$5
Total: ~$50/month
VPS (sized for peak):
Need to handle peak concurrency = expensive
Hetzner AX102 (16 core, 128GB): $128/month
Running 24/7 even when idle
Total: $128/month
Winner: Serverless (2.5x cheaper)
The pattern is clear: serverless wins for very low traffic and very bursty workloads. For anything with consistent, moderate-to-high traffic, traditional servers are significantly cheaper. The break-even point is typically around 50,000-100,000 requests per day, depending on your Lambda memory and execution duration.
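You can sanity-check your own break-even with a back-of-envelope function. This sketch uses published list prices as assumptions ($1.00 per million HTTP API requests, $0.20 per million Lambda requests, $0.0000166667 per GB-second) and deliberately ignores DynamoDB, logs, NAT, and the free tier, so it's a floor, not a bill:

```typescript
// Rough monthly compute cost of a Lambda + HTTP API workload.
// Prices are us-east-1 list prices at time of writing; treat as
// assumptions. Ignores DynamoDB, CloudWatch, NAT, and the free tier.
const HTTP_API_PER_REQ = 1.0 / 1_000_000;   // $1.00 per million requests
const LAMBDA_PER_REQ = 0.2 / 1_000_000;     // $0.20 per million requests
const LAMBDA_PER_GB_SECOND = 0.0000166667;  // compute price

function monthlyServerlessCost(
  requestsPerDay: number,
  avgDurationSeconds: number,
  memoryGb: number
): number {
  const requests = requestsPerDay * 30;
  const gbSeconds = requests * avgDurationSeconds * memoryGb;
  return (
    requests * (HTTP_API_PER_REQ + LAMBDA_PER_REQ) +
    gbSeconds * LAMBDA_PER_GB_SECOND
  );
}
```

Plugging in Workload 2 (100K requests/day at 200ms and 512MB) gives roughly $8.60/month of pure request-plus-compute cost; the rest of that workload's bill is DynamoDB, logs, and NAT, which is exactly where the surprises hide.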
The hidden cost that kills budgets: NAT Gateway. If your Lambda functions need to access resources in a VPC (which they do if you're using RDS, ElastiCache, or any private resource), you need a NAT Gateway for outbound internet access. That's $0.045 per hour ($32.40/month minimum) plus $0.045 per GB of data processed. I've seen NAT Gateway costs exceed the Lambda costs themselves.
Let me be direct about this: if you use DynamoDB, Step Functions, EventBridge, SQS, SNS, and API Gateway, you are locked into AWS. Deeply. That's not necessarily wrong, but you should make that decision consciously, not accidentally.
Here's a realistic portability assessment:
Portability Spectrum:
Easy to port:
- Lambda function code (just Node.js/Python/Go code)
- S3 object storage (standard API, many alternatives)
- SQS messaging (replace with RabbitMQ, Redis queues)
Moderate effort:
- API Gateway (replace with Express/Fastify + reverse proxy)
- SNS (replace with Redis pub/sub, NATS)
- CloudWatch Logs (replace with any logging service)
Significant rewrite:
- DynamoDB (unique data model, no direct equivalent)
- Step Functions (rewrite as code-based orchestration)
- EventBridge (custom event bus implementation)
- Cognito (entire auth system replacement)
Practically impossible without redesign:
- DynamoDB Streams -> Lambda triggers
- AppSync + DynamoDB resolvers
- Multi-service IAM-based auth
The mitigation strategy I recommend: use a hexagonal architecture (ports and adapters) for your Lambda functions. Keep your business logic completely free of AWS SDK calls. The Lambda handler is a thin adapter that translates between AWS events and your domain:
// Adapter layer (AWS-specific)
import { APIGatewayProxyEventV2 } from "aws-lambda";
import { createOrder } from "./domain/orders";
import { DynamoDBOrderRepository } from "./infra/dynamodb-order-repo";
import { SQSEventPublisher } from "./infra/sqs-event-publisher";
export const handler = async (event: APIGatewayProxyEventV2) => {
const input = JSON.parse(event.body || "{}");
// Domain logic knows nothing about AWS
const result = await createOrder(input, {
orderRepo: new DynamoDBOrderRepository(),
eventPublisher: new SQSEventPublisher(),
});
return {
statusCode: result.success ? 201 : 400,
body: JSON.stringify(result),
};
};
// Domain layer (portable)
interface OrderRepository {
save(order: Order): Promise<void>;
}
interface EventPublisher {
publish(event: DomainEvent): Promise<void>;
}
export async function createOrder(
input: CreateOrderInput,
deps: { orderRepo: OrderRepository; eventPublisher: EventPublisher }
) {
const order = Order.create(input);
await deps.orderRepo.save(order);
await deps.eventPublisher.publish(new OrderCreatedEvent(order));
return { success: true, orderId: order.id };
}

With this structure, swapping DynamoDB for PostgreSQL means implementing a new PostgresOrderRepository. Swapping SQS for RabbitMQ means implementing a new RabbitMQEventPublisher. Your domain logic — the valuable part — doesn't change.
Do I actually do this for every project? No. For small, clearly temporary services, I write AWS-coupled code because the abstraction isn't worth the effort. But for any service that might live more than a year or process critical business logic, the adapter layer pays for itself.
Debugging a serverless application in production is fundamentally harder than debugging a traditional application. There's no server to SSH into. There's no long-running process to attach a debugger to. Your function ran for 200ms, processed a message, and the execution environment might already be gone.
Here's what you need, at minimum:
Structured logging with correlation IDs. Every log line needs to be JSON with a correlation ID that ties together all the functions involved in processing a single request:
import { Logger } from "@aws-lambda-powertools/logger";
import { Tracer } from "@aws-lambda-powertools/tracer";
const logger = new Logger({ serviceName: "order-service" });
const tracer = new Tracer({ serviceName: "order-service" });
export const handler = async (event: any) => {
// Extract or generate correlation ID
const correlationId =
event.headers?.["x-correlation-id"] ||
event.detail?.correlationId ||
crypto.randomUUID();
logger.appendKeys({ correlationId });
logger.info("Processing order", {
orderId: event.detail?.orderId,
source: event.source,
});
try {
const result = await processOrder(event);
logger.info("Order processed successfully", { result });
return result;
} catch (error) {
logger.error("Order processing failed", { error });
throw error;
}
};

X-Ray tracing. AWS X-Ray traces requests across services. It's not perfect — the console is clunky and it misses some async patterns — but it's the only way to get a visual trace of a request flowing through Lambda -> SQS -> Lambda -> DynamoDB. Enable it:
// In your CDK stack
const fn = new lambda.Function(this, "OrderHandler", {
// ...
tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
environment: {
POWERTOOLS_SERVICE_NAME: "order-service",
LOG_LEVEL: "INFO",
},
});

CloudWatch Insights queries. Learn these. When something breaks at 3 AM, you'll be querying CloudWatch Logs Insights to find what happened:
# Find all errors for a specific correlation ID
fields @timestamp, @message
| filter correlationId = "abc-123-def"
| sort @timestamp asc
# Find all cold starts in the last hour
fields @timestamp, @initDuration, @memorySize
| filter @type = "REPORT" and @initDuration > 0
| sort @initDuration desc
| limit 50
# Approximate error rate per log group (ERROR lines vs. total lines)
stats count() as totalLines,
    sum(strcontains(@message, "ERROR")) as errors,
    (sum(strcontains(@message, "ERROR")) / count()) * 100 as errorRate
  by @log
| sort errorRate desc
The reality: even with all of this, debugging a 10-service event-driven architecture takes 3-5x longer than debugging a monolith. That's a genuine, permanent trade-off. You can improve it with better tooling (Lumigo, Epsagon, Thundra, and others offer serverless-specific observability), but the fundamental complexity of distributed tracing across ephemeral functions is irreducible.
This is the area where serverless developer experience still lags the most. Your code runs on AWS. Your local machine is not AWS. Bridging that gap is painful.
The options, ranked by my preference:
1. SST (Serverless Stack) with Live Lambda Dev. This is the best DX I've found. It deploys your infrastructure to AWS but proxies Lambda invocations back to your local machine. You write code locally, save the file, and the next invocation runs your updated code — no deployment needed. It's close to the hot-reload experience of local development.
2. AWS SAM Local. Runs Lambda functions locally in Docker containers. It works for simple cases but falls apart with complex triggers (EventBridge, DynamoDB Streams, Step Functions). You end up mocking half of AWS.
3. LocalStack. Emulates AWS services locally. The free tier covers S3, SQS, SNS, DynamoDB, and Lambda. The pro tier adds Step Functions, EventBridge, and others. It's good but not identical to real AWS — I've had tests pass on LocalStack and fail on real AWS due to subtle behavioral differences.
4. Deploy and test in a dev account. The "just deploy it" approach. It's the most accurate because you're testing on real AWS. It's also the slowest — even with SAM Accelerate or CDK Hotswap, a deploy cycle is 10-30 seconds. That loop of code-deploy-test-debug is brutal when you're iterating.
My actual workflow: SST for feature development, LocalStack for integration tests in CI, and a shared dev account for pre-production validation. It's three environments to maintain, and none of them perfectly replicate production.
// sst.config.ts — SST v3 configuration
export default $config({
app(input) {
return {
name: "my-service",
removal: input?.stage === "production" ? "retain" : "remove",
};
},
async run() {
const table = new sst.aws.Dynamo("Orders", {
fields: { PK: "string", SK: "string" },
primaryIndex: { hashKey: "PK", rangeKey: "SK" },
});
const api = new sst.aws.ApiGatewayV2("Api");
api.route("POST /orders", {
handler: "src/handlers/order.handler",
link: [table],
});
// During `sst dev`, Lambda invocations are proxied to your machine
// During `sst deploy`, everything runs on AWS normally
},
});

Testing serverless code requires a different strategy than testing traditional applications. You have three layers to test, and each requires a different approach.
Unit tests: Test your business logic in isolation. If you followed the hexagonal architecture pattern from the vendor lock-in section, this is straightforward. Your domain logic has no AWS dependencies and can be tested with standard test frameworks:
// domain/orders.test.ts
import { describe, it, expect, vi } from "vitest";
import { createOrder } from "./orders";
const mockRepo = {
save: vi.fn().mockResolvedValue(undefined),
};
const mockPublisher = {
publish: vi.fn().mockResolvedValue(undefined),
};
describe("createOrder", () => {
it("creates an order and publishes an event", async () => {
const result = await createOrder(
{ userId: "u123", items: [{ productId: "p1", quantity: 2 }] },
{ orderRepo: mockRepo, eventPublisher: mockPublisher }
);
expect(result.success).toBe(true);
expect(mockRepo.save).toHaveBeenCalledOnce();
expect(mockPublisher.publish).toHaveBeenCalledWith(
expect.objectContaining({ type: "OrderCreated" })
);
});
it("rejects empty orders", async () => {
const result = await createOrder(
{ userId: "u123", items: [] },
{ orderRepo: mockRepo, eventPublisher: mockPublisher }
);
expect(result.success).toBe(false);
expect(mockRepo.save).not.toHaveBeenCalled();
});
});

Integration tests: Test AWS interactions with LocalStack. Use Testcontainers to spin up LocalStack in Docker and run your actual DynamoDB queries, SQS operations, and S3 uploads against it:
// infra/dynamodb-order-repo.integration.test.ts
import { beforeAll, afterAll, it, expect } from "vitest";
import { GenericContainer, StartedTestContainer } from "testcontainers";
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";
let localstackContainer: StartedTestContainer;
let ddbClient: DynamoDBClient;
beforeAll(async () => {
localstackContainer = await new GenericContainer("localstack/localstack")
.withExposedPorts(4566)
.start();
const endpoint = `http://localhost:${localstackContainer.getMappedPort(4566)}`;
ddbClient = new DynamoDBClient({
endpoint,
region: "us-east-1",
credentials: { accessKeyId: "test", secretAccessKey: "test" },
});
// Create table
await ddbClient.send(new CreateTableCommand({
TableName: "Orders",
KeySchema: [
{ AttributeName: "PK", KeyType: "HASH" },
{ AttributeName: "SK", KeyType: "RANGE" },
],
AttributeDefinitions: [
{ AttributeName: "PK", AttributeType: "S" },
{ AttributeName: "SK", AttributeType: "S" },
],
BillingMode: "PAY_PER_REQUEST",
}));
}, 60_000);
afterAll(async () => {
await localstackContainer?.stop();
});
it("saves and retrieves an order", async () => {
const repo = new DynamoDBOrderRepository(ddbClient, "Orders");
const order = Order.create({
userId: "u123",
items: [{ productId: "p1", quantity: 2 }],
});
await repo.save(order);
const retrieved = await repo.findById(order.id);
expect(retrieved).toEqual(order);
});

End-to-end tests: Test the full flow on a real AWS account. Deploy to a test stage and run tests that hit the actual API Gateway endpoint, verify DynamoDB writes, check SQS messages, and confirm the full event chain works:
// e2e/order-flow.e2e.test.ts
const API_URL = process.env.API_URL; // Set by CI/CD after deployment
const token = process.env.TEST_AUTH_TOKEN; // Test-user credential, also provided by CI/CD
it("creates an order and processes it end-to-end", async () => {
// Create order via API
const response = await fetch(`${API_URL}/orders`, {
method: "POST",
headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
body: JSON.stringify({
items: [{ productId: "p1", quantity: 2 }],
}),
});
expect(response.status).toBe(202);
const { orderId } = await response.json();
// Poll for completion (event-driven processing takes time)
const order = await pollUntil(
() => fetch(`${API_URL}/orders/${orderId}`).then(r => r.json()),
(result) => result.status === "COMPLETED",
{ timeout: 30_000, interval: 1_000 }
);
expect(order.status).toBe("COMPLETED");
expect(order.paymentStatus).toBe("CHARGED");
expect(order.inventoryStatus).toBe("RESERVED");
});

The test pyramid for serverless is inverted compared to traditional apps. You need more integration and E2E tests because the glue between services (IAM permissions, event formats, queue configurations) is where most bugs hide. A unit test won't catch that your Lambda doesn't have permission to write to the SQS queue, or that the EventBridge rule pattern doesn't match your event format.
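The `pollUntil` helper used in the e2e test above isn't from a library; it's a small utility you'd write yourself. A minimal sketch:

```typescript
// Minimal polling helper: repeatedly calls `fn` until `predicate` accepts the
// result or the timeout elapses. Useful in e2e tests where event-driven
// processing completes asynchronously.
async function pollUntil<T>(
  fn: () => Promise<T>,
  predicate: (result: T) => boolean,
  opts: { timeout: number; interval: number }
): Promise<T> {
  const deadline = Date.now() + opts.timeout;
  for (;;) {
    const result = await fn();
    if (predicate(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error(`pollUntil timed out after ${opts.timeout}ms`);
    }
    // Wait before the next attempt
    await new Promise(resolve => setTimeout(resolve, opts.interval));
  }
}
```

Keep the interval generous (500ms-1s); hammering your own API from CI just adds noise and cost.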
I have made every one of these mistakes. Learn from my pain.
1. Lambda-lith. Putting your entire Express app inside a single Lambda function. The cold start is massive (2-5 seconds with all dependencies), you can't scale individual routes independently, and you're paying for 1GB of memory to serve a health check endpoint.
Bad: Lambda-lith
API Gateway /* -> Single Lambda (Express app)
- 50MB bundle
- 2 second cold start
- All routes, all middleware, all the time
Good: Function-per-route (or function-per-domain)
API Gateway /orders/* -> Order Lambda (3MB, 120ms cold start)
API Gateway /users/* -> User Lambda (2MB, 100ms cold start)
API Gateway /search/* -> Search Lambda (4MB, 150ms cold start)
However — and this is a nuanced point — a Lambda-lith for a low-traffic internal tool is actually fine. If you have 100 requests/day and don't care about cold starts, the simplicity of a single function outweighs the architectural purity of function-per-route. Context matters.
2. Synchronous chains. Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. Each one waits for the next. You're paying for all four simultaneously, the total latency is the sum of all four, and if any one fails, you need retry logic at every level.
Bad: Synchronous chain
Lambda A (waiting...) -> Lambda B (waiting...) -> Lambda C -> Lambda D
Total cost: A duration + B duration + C duration + D duration
Latency: A + B + C + D
If D fails: everything fails
Good: Async with queues
Lambda A -> SQS -> Lambda B -> SQS -> Lambda C -> SQS -> Lambda D
Total cost: A + B + C + D (but A finishes immediately)
Latency (user-facing): just A
If D fails: D's message goes to DLQ, everything else succeeded
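The decoupling matters more than the specific queue technology. Here's a sketch of the shape, with a hypothetical `Queue` port standing in for the SQS client so the pattern is visible without AWS wiring:

```typescript
// Hypothetical Queue port; in production this would wrap an SQS SendMessage call.
interface Queue {
  send(body: string): Promise<void>;
}

// "Lambda A": validate, enqueue, return immediately.
// User-facing latency is just this function's duration; downstream consumers
// (B, C, D) process from their queues on their own time, with their own
// retries and dead-letter queues.
async function handleRequest(
  payload: { orderId: string },
  queue: Queue
): Promise<{ statusCode: number; body: string }> {
  if (!payload.orderId) {
    return { statusCode: 400, body: JSON.stringify({ error: "orderId required" }) };
  }
  await queue.send(JSON.stringify({ type: "OrderReceived", ...payload }));
  // 202 Accepted: the work is queued, not done
  return { statusCode: 202, body: JSON.stringify({ accepted: true }) };
}

// Demo with an in-memory stand-in for the queue
const messages: string[] = [];
handleRequest({ orderId: "o1" }, { send: async m => { messages.push(m); } })
  .then(res => console.log(res.statusCode, messages.length)); // prints "202 1"
```

Note the status code: returning 202 instead of 200 is an honest contract with the client that processing continues after the response.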
3. Using Lambda for long-running tasks. Lambda has a 15-minute timeout. I've seen people chain Lambda invocations to process hour-long jobs, each function invoking the next at the 14-minute mark. This is fragile, expensive, and impossible to debug. Use ECS Fargate tasks for long-running work. You can still trigger them from Lambda.
4. Ignoring DynamoDB throughput modes. On-demand pricing is convenient but expensive at scale. A table doing 10,000 writes per second on on-demand costs roughly $22/hour, while provisioned capacity for the same throughput (with auto-scaling) costs about $6.50/hour. For predictable workloads, provisioned saves around 70%.
5. Not setting concurrency limits. By default, Lambda functions share a regional concurrency pool of 1,000. One runaway function can consume the entire pool and cause every other function in your account to throttle. Set reserved concurrency on critical functions and unreserved concurrency limits on non-critical ones.
// CDK — Set reserved concurrency
const criticalFn = new lambda.Function(this, "PaymentProcessor", {
// ...
reservedConcurrentExecutions: 100, // Guaranteed 100, max 100
});
const nonCriticalFn = new lambda.Function(this, "AnalyticsProcessor", {
// ...
reservedConcurrentExecutions: 50, // Won't starve critical functions
});

After four production serverless projects and twice as many VPS deployments, here's my honest assessment of when to use what.
Serverless is genuinely better for:
- Spiky or unpredictable traffic that would force you to over-provision servers
- Workloads that sit idle most of the time (scale-to-zero is real money saved)
- Event-driven glue: file processing, webhook ingestion, notification fan-out, scheduled jobs
- Small teams without dedicated DevOps capacity
A VPS / container is genuinely better for:
- High, sustained throughput where per-request pricing compounds
- Latency-sensitive APIs where cold starts are unacceptable
- Stateful workloads: in-memory caches, WebSocket connections, local filesystem access
- Long-running jobs that exceed Lambda's 15-minute timeout
- Tight budgets at moderate-to-high traffic
The hybrid approach that actually works:
Most production systems I build now are hybrid. The VPS handles the steady-state traffic — the API, the WebSocket connections, the background workers. Serverless handles the spiky, event-driven, parallelizable parts — file processing, notification fan-out, scheduled jobs, webhook ingestion.
Hybrid Architecture:
[VPS Cluster] [Serverless]
- Main API (Express/Fastify) - S3 upload processing (Lambda)
- WebSocket server - Email/SMS notifications (Lambda + SES)
- Background workers - Daily report generation (Lambda + Step Functions)
- Redis cache - Webhook ingestion (API Gateway + Lambda + SQS)
- PostgreSQL - Image/video thumbnailing (Lambda)
- Scheduled cleanup tasks (EventBridge + Lambda)
Communication: SQS queues between VPS and Lambda functions
This gives you the cost efficiency of servers for predictable workloads, the elastic scaling of Lambda for bursty workloads, and the operational simplicity of managed services for glue tasks.
Before you start your next project, run through this checklist:
TRAFFIC PATTERN
[ ] Consistent, predictable -> VPS / Containers
[ ] Highly variable, bursty -> Serverless
[ ] Idle most of the time -> Serverless
[ ] High sustained throughput -> VPS / Containers
LATENCY REQUIREMENTS
[ ] p99 < 50ms required -> VPS (no cold starts)
[ ] p99 < 200ms acceptable -> Serverless (with provisioned concurrency)
[ ] p99 < 500ms acceptable -> Serverless (standard)
[ ] Latency doesn't matter -> Serverless
EXECUTION DURATION
[ ] Sub-second responses -> Either
[ ] 1-15 minutes -> Lambda (with timeout awareness)
[ ] 15 minutes to hours -> ECS Fargate / VPS
[ ] Hours to days -> VPS / dedicated server
STATE REQUIREMENTS
[ ] Stateless request/response -> Serverless
[ ] In-memory caching needed -> VPS
[ ] WebSocket connections -> VPS (or API Gateway WebSocket)
[ ] Local filesystem needed -> VPS (or Lambda + EFS, with caveats)
TEAM AND OPERATIONAL FACTORS
[ ] Small team, no DevOps -> Serverless (less infra to manage)
[ ] Large team, dedicated SRE -> Either (team can manage complexity)
[ ] AWS expertise on team -> Serverless
[ ] Need to run on multiple clouds -> VPS / Containers (portability)
[ ] Tight budget -> VPS (for moderate+ traffic)
COST ESTIMATION
[ ] < 10K requests/day -> Serverless (nearly free)
[ ] 10K-100K requests/day -> Calculate both; often VPS wins
[ ] 100K-1M requests/day -> VPS almost certainly cheaper
[ ] > 1M requests/day -> VPS, unless traffic is extremely bursty
If you checked mostly items in the left column, serverless is probably your best bet. If you checked mostly right-column items, go with traditional servers. If it's mixed — and it usually is — the hybrid approach described above is likely your answer.
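If you want to make that tally mechanical, here's a toy sketch (illustrative only, not a real tool): record which way each checked item points and let the margin decide.

```typescript
// Toy encoding of the checklist tally above. Each checked item contributes
// its recommendation; a narrow margin means the honest answer is "hybrid".
type Answer = "serverless" | "vps" | "either";

function recommend(answers: Answer[]): "serverless" | "vps" | "hybrid" {
  const s = answers.filter(a => a === "serverless").length;
  const v = answers.filter(a => a === "vps").length;
  if (Math.abs(s - v) <= 1) return "hybrid"; // mixed result: split the workload
  return s > v ? "serverless" : "vps";
}

// Example: bursty traffic, relaxed latency, short executions, stateless,
// small team, low request volume
console.log(
  recommend(["serverless", "serverless", "either", "serverless", "serverless", "serverless"])
); // prints "serverless"
```

The threshold of one is arbitrary; the real point is that a near-even split is a signal to go hybrid, not to force a winner.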
Serverless is a tool, not a religion. The community around it sometimes forgets that. There are AWS Heroes who will tell you with a straight face that every application should be serverless. There are old-school sysadmins who will tell you it's a fad. Both are wrong.
The $20K monthly bill I mentioned at the start? We migrated the hot-path API to three Hetzner dedicated servers behind a load balancer. That part of the bill dropped to $200/month. We kept the event processing pipeline, the scheduled jobs, and the file processing on Lambda. That part costs about $85/month and would be a pain to manage on servers.
The total bill went from $20,400 to $285. The architecture is cleaner. The team is happier. The code is easier to debug.
Serverless is brilliant when it fits. It's expensive when it doesn't. The skill is knowing the difference before the bill arrives.