The thinking frameworks, back-of-the-envelope math, and architecture patterns that actually matter in system design interviews — from someone who has been on both sides of the table.
I have sat through dozens of system design interviews, both as a candidate and as an interviewer. The pattern I see over and over is candidates who have memorized the "correct" architecture for a URL shortener or a chat application, but crumble the moment you change a single constraint. They have learned answers. They have not learned how to think.
Most system design prep material follows the same formula: here is the problem, here is the solution diagram, memorize it. This approach fails spectacularly in real interviews because interviewers are not testing whether you can regurgitate a reference architecture. They are testing whether you can reason about trade-offs under ambiguity, communicate clearly while exploring a design space, and make defensible decisions with incomplete information.
What follows is everything I wish someone had told me before my first system design interview — the mental models, the math, the patterns, and the mistakes that separate candidates who get offers from candidates who get polite rejections.
The typical study path looks like this: watch a YouTube video about designing Twitter, read a blog post about designing Uber, memorize the components, and hope the interview asks about something similar. This is like studying for a math exam by memorizing the answers to specific problems instead of understanding the underlying concepts.
System design interviews are open-ended conversations. The interviewer gives you a deliberately vague prompt — "design a notification system" — and watches how you navigate from vagueness to clarity. The skills being evaluated are:
Scoping and clarification. Do you ask what kind of notifications? Push, email, SMS, in-app? What scale? What latency requirements? Or do you just start drawing boxes?
Trade-off analysis. When you choose a database, can you articulate why you chose it over alternatives? When you add a cache, do you acknowledge the consistency implications?
Iterative refinement. Do you start simple and add complexity as needed, or do you immediately jump to a microservices architecture with Kafka, Redis, Elasticsearch, and a service mesh for a system that handles 100 requests per minute?
Operational awareness. How will you deploy this? How will you monitor it? What happens when component X fails? These questions reveal whether you have actually built systems or only read about them.
The prep that actually works is building a toolkit of reusable patterns and understanding when each pattern applies. You do not need to memorize 50 system designs. You need to deeply understand maybe 10 fundamental patterns and be able to compose them on the fly.
Every distributed system is composed of the same handful of building blocks. Understanding these deeply — not just what they are, but when they become the bottleneck — is the foundation of system design thinking.
A load balancer distributes incoming traffic across multiple servers. Simple concept, but the devil is in the details.
                 +-----------+
Clients -------->|   Load    |-------> Server A
                 | Balancer  |-------> Server B
                 +-----------+-------> Server C
Algorithms that matter: round robin (simple, a good default when servers are homogeneous), least connections (better when request costs vary widely), and consistent hashing (when the same client or key should keep hitting the same server, e.g. for cache locality or session affinity).
When the load balancer becomes the bottleneck: At extremely high throughput (millions of requests per second), Layer 7 (HTTP) load balancers become CPU-bound because they need to parse HTTP headers. The solution is either Layer 4 (TCP) load balancing, which is much cheaper, or DNS-based load balancing. In practice, most systems use a two-tier approach: DNS/L4 at the edge, L7 internally.
Memory is roughly 100x faster than SSD and thousands of times faster than a network round-trip to a database. Caching exploits this hierarchy.
Request --> Cache HIT? --> Return cached response
              |
             MISS
              |
              v
           Database --> Store in cache --> Return response
When the cache becomes the bottleneck: Cache stampede — when a popular cache key expires and hundreds of concurrent requests all miss the cache simultaneously, hammering the database. Also, memory limits: when your working set exceeds available memory, eviction starts and hit rates drop. I will cover caching strategies in depth later.
Queues decouple producers from consumers, absorb traffic spikes, and enable asynchronous processing.
Producer --> [ Queue ] --> Consumer A
                       --> Consumer B
                       --> Consumer C
When the queue becomes the bottleneck: When consumers cannot keep up with producers for an extended period, the queue grows unboundedly. This manifests as increasing end-to-end latency and, eventually, disk pressure on the broker. The fix is either adding consumers, implementing backpressure, or dropping/dead-lettering messages.
The persistent source of truth. The single most important decision in many system designs.
When the database becomes the bottleneck: Almost always reads. Read-heavy workloads can be addressed with read replicas, caching, or denormalization. Write-heavy workloads require sharding, write-behind patterns, or choosing a database designed for write throughput.
A CDN is a geographically distributed cache for static and semi-static content.
User (Tokyo) --> CDN Edge (Tokyo) --> HIT --> Response (5ms)
                        |
                       MISS
                        |
                        v
                 Origin (US-East) --> Response (200ms) + cache at edge
When the CDN becomes the bottleneck: When your content is too dynamic to cache, when cache invalidation latency is unacceptable, or when you are serving personalized content to every user. At that point, you are paying CDN costs without getting CDN benefits.
Interviewers do not expect you to compute exact numbers. They expect you to demonstrate that you can estimate whether a design is feasible. Here are the calculations that come up repeatedly.
Start with daily active users (DAU), estimate actions per user per day, and convert to requests per second.
QPS = (DAU * actions_per_user_per_day) / 86400
Example: Twitter-like service
- 200M DAU
- Average user views 50 tweets/day, posts 2 tweets/day
- Read QPS: (200M * 50) / 86400 ≈ 115,000 read QPS
- Write QPS: (200M * 2) / 86400 ≈ 4,600 write QPS
- Peak QPS: ~2-3x average ≈ 230,000-345,000 read QPS
These numbers immediately tell you: this is a read-heavy system (25:1 ratio), you need serious caching and read replicas, and a single database will not handle this.
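The arithmetic is simple enough to sanity-check in a few lines. Here is a sketch of the estimate above (the helper names and the 2.5x peak factor are my own choices, not a standard):

```typescript
// Back-of-the-envelope traffic estimation (illustrative helper, not a library).
const SECONDS_PER_DAY = 86_400;

function estimateQps(dau: number, actionsPerUserPerDay: number): number {
  return (dau * actionsPerUserPerDay) / SECONDS_PER_DAY;
}

// Peak traffic is commonly assumed at 2-3x the daily average.
function peakQps(averageQps: number, peakFactor: number = 2.5): number {
  return averageQps * peakFactor;
}

const readQps = estimateQps(200_000_000, 50); // ~115,740 reads/sec
const writeQps = estimateQps(200_000_000, 2); // ~4,630 writes/sec
const readWriteRatio = readQps / writeQps;    // 25:1, a read-heavy system
```

Writing the formula down once like this makes it easy to re-run when the interviewer changes a constraint ("what if DAU is 10x?") without redoing the mental math.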
Estimate the size of a single record, multiply by the number of records over your time horizon (usually 5 years).
Example: URL shortener
- 100M new URLs per month
- Each record: short_url (7 bytes) + long_url (avg 200 bytes) + metadata (50 bytes) ≈ 257 bytes
- 5 years: 100M * 12 * 5 * 257 bytes ≈ 1.54 TB
That fits comfortably on a single machine. No need for sharding yet.
Bandwidth = QPS * average_response_size
Example: Image serving
- 50,000 QPS for images
- Average image size: 500 KB
- Bandwidth: 50,000 * 500 KB = 25 GB/s
That is a LOT of bandwidth. You absolutely need a CDN.
These are approximate and vary by hardware, but they give you the right order of magnitude:
L1 cache reference: 1 ns
L2 cache reference: 4 ns
Main memory reference: 100 ns
SSD random read: 16,000 ns (16 us)
HDD random read: 2,000,000 ns (2 ms)
Round trip within same datacenter: 500,000 ns (0.5 ms)
Round trip US coast to coast: 40,000,000 ns (40 ms)
Single machine throughput:
- Redis: ~100,000 ops/sec
- PostgreSQL: ~10,000 queries/sec (simple queries)
- Kafka: ~1,000,000 messages/sec (per broker, small messages)
Most candidates describe consistency as a binary: either you have it or you do not. Reality is a spectrum, and understanding where your system falls on that spectrum is critical.
After a write completes, every subsequent read (from any node) returns that write. This is what you get with a single database, and what you give up the moment you replicate.
Use when: Correctness matters more than availability. Banking transactions, inventory counts, anything involving money or legal compliance.
Client A: WRITE balance = $500 ------> [PRIMARY]
Client B: READ balance --------------> [REPLICA]
                                       Must return $500
With strong consistency, the system blocks Client B's read
until the replica has received and applied the write.
This adds latency but guarantees correctness.
The cost of strong consistency is latency and availability. If the replica is unreachable, the system must either block (reducing availability) or return an error.
After a write, replicas will eventually converge to the same value, but reads in the interim may return stale data.
Use when: Availability and latency matter more than immediate consistency. Social media feeds, like counts, view counters, recommendation systems.
Client A: WRITE new_post ----> [PRIMARY]
Client B: READ feed ---------> [REPLICA]
                               Might not include new_post yet
                               Will include it eventually (usually ms to seconds)
"Eventually" is doing a lot of work in that sentence. In well-configured systems, "eventually" means milliseconds. In poorly configured systems, it can mean minutes or even hours during network partitions.
Preserves the ordering of causally related operations while allowing concurrent operations to be seen in different orders.
Use when: You need something stronger than eventual consistency but cannot afford strong consistency. Collaborative editing, comment threads, messaging systems.
Alice: POST "Anyone want lunch?"  (timestamp T1)
Bob:   POST "Sure, where?"        (timestamp T2, caused by T1)
Carol: READ feed
       Must see Alice's post before Bob's (causal ordering)
       May not see either yet (eventual delivery)
This is the sweet spot for many real-world systems. Google Docs uses a variant of causal consistency (via operational transforms / CRDTs) to allow concurrent editing without locks while preserving the "intent" of edits.
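One common mechanism for tracking causal order is a vector clock: each node keeps a counter per participant, and an event A causally precedes B exactly when A's clock is dominated by B's. A minimal sketch of the general technique (this is not what Google Docs specifically does; OT and CRDTs are considerably more involved):

```typescript
// Minimal vector-clock sketch for causal ordering (illustrative, not production).
type VectorClock = Record<string, number>;

// Increment the local node's counter before emitting an event.
function tick(clock: VectorClock, node: string): VectorClock {
  return { ...clock, [node]: (clock[node] ?? 0) + 1 };
}

// On receiving a remote clock: take the element-wise max.
function merge(a: VectorClock, b: VectorClock): VectorClock {
  const out: VectorClock = { ...a };
  for (const [node, n] of Object.entries(b)) {
    out[node] = Math.max(out[node] ?? 0, n);
  }
  return out;
}

// a happened-before b iff a <= b element-wise and a != b.
function happenedBefore(a: VectorClock, b: VectorClock): boolean {
  const nodes = Object.keys({ ...a, ...b });
  let strictlyLess = false;
  for (const node of nodes) {
    const x = a[node] ?? 0;
    const y = b[node] ?? 0;
    if (x > y) return false;
    if (x < y) strictlyLess = true;
  }
  return strictlyLess;
}

// Alice posts; Bob replies after seeing it, so Bob's clock dominates Alice's.
const alicePost = tick({}, "alice");                // { alice: 1 }
const bobReply = tick(merge(alicePost, {}), "bob"); // { alice: 1, bob: 1 }
```

A replica that sees `bobReply` before `alicePost` can detect the causal dependency and delay displaying the reply until the post arrives.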
"Which database should I use?" is the single most common question in system design interviews. Here is a framework for reasoning about it instead of memorizing a chart.
The right database depends on how you will read and write data, not on what kind of data you have. Ask these questions:
Choose when: You have structured data with relationships, need ACID transactions, your access patterns involve joins and complex queries, and your dataset fits on a single machine or a small number of replicas.
Avoid when: Your data has no fixed schema, you need horizontal write scaling beyond what a single master can handle, or your access patterns are simple key-value lookups (you are paying for SQL parsing overhead you do not need).
-- Relational excels at this: complex queries with joins
SELECT u.name, COUNT(o.id) as order_count, SUM(o.total) as lifetime_value
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > NOW() - INTERVAL '30 days'
GROUP BY u.id
HAVING COUNT(o.id) > 5
ORDER BY lifetime_value DESC
LIMIT 100;

Choose when: Your data is naturally hierarchical (user profiles with nested preferences, product catalogs with variable attributes), you want schema flexibility, or your access pattern is "fetch entire document by key."
Avoid when: You need complex joins, multi-document transactions are central to your business logic, or you have highly relational data.
// Document stores excel at this: nested, variable-schema data
{
"_id": "product_12345",
"name": "Wireless Headphones",
"category": "electronics",
"attributes": {
"battery_life": "40 hours",
"noise_cancellation": true,
"driver_size": "40mm",
"bluetooth_version": "5.3"
},
"reviews": [
{ "user": "alice", "rating": 5, "text": "Great sound" },
{ "user": "bob", "rating": 4, "text": "Comfortable fit" }
],
"pricing": {
"base": 199.99,
"sale": 159.99,
"currency": "USD"
}
}

Choose when: You need massive write throughput, horizontal scaling across data centers, and your queries are known in advance (you model your tables around your queries). Time-series data, event logs, IoT sensor data.
Avoid when: You need ad-hoc queries, your access patterns are unpredictable, or you need strong consistency across partitions.
Choose when: Relationships ARE the data. Social networks, recommendation engines, fraud detection, knowledge graphs. Any problem where you need to traverse relationships of variable depth.
Avoid when: Your relationships are simple (one or two levels of joins in SQL). Graph databases add operational complexity; do not use them just because your data "has relationships."
Choose when: Your data is append-only with a timestamp, queries are time-range based, and you need efficient aggregation over time windows. Monitoring, IoT, financial market data.
Avoid when: Your data is not fundamentally time-ordered, or you need complex relational queries alongside time-series data (consider TimescaleDB, which is PostgreSQL-based).
Need ACID transactions + complex queries?
    YES --> PostgreSQL / MySQL
    NO  --> continue

Data is hierarchical with variable schema?
    YES --> MongoDB / DynamoDB
    NO  --> continue

Need massive write throughput + horizontal scaling?
    YES --> Cassandra / ScyllaDB
    NO  --> continue

Traversing deep relationships is the core query?
    YES --> Neo4j / Neptune
    NO  --> continue

Append-only, time-stamped, aggregation queries?
    YES --> InfluxDB / TimescaleDB / ClickHouse
    NO  --> continue

Simple key-value lookups at extreme scale?
    YES --> Redis / DynamoDB / Memcached
    NO  --> Just use PostgreSQL. Seriously.
That last line is not a joke. PostgreSQL handles an enormous range of workloads well. When in doubt, start with PostgreSQL and specialize later when you hit a concrete bottleneck.
Caching is conceptually simple and operationally nightmarish. Here are the four core patterns and when each makes sense.
The application manages the cache explicitly. On read: check cache, if miss then query database and populate cache. On write: update database, then invalidate (or update) cache.
async function getUser(userId: string): Promise<User> {
// 1. Check cache
const cached = await redis.get(`user:${userId}`);
if (cached) {
return JSON.parse(cached);
}
// 2. Cache miss - query database
const user = await db.query("SELECT * FROM users WHERE id = $1", [userId]);
// 3. Populate cache with TTL
await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));
return user;
}
async function updateUser(userId: string, data: Partial<User>): Promise<void> {
// 1. Update database (source of truth)
await db.query("UPDATE users SET ... WHERE id = $1", [userId, ...]);
// 2. Invalidate cache (NOT update - avoids race conditions)
await redis.del(`user:${userId}`);
}

Pros: Only caches data that is actually requested. Cache failure does not block reads (just slower). Simple to implement.
Cons: Cache miss penalty (cold start). Possible stale data between database write and cache invalidation. N+1 problem if you are not careful.
Every write goes to both the cache and the database synchronously. Reads always hit the cache.
async function updateUser(userId: string, data: Partial<User>): Promise<void> {
// Write to database AND cache atomically (or as close as possible)
const updatedUser = await db.query(
"UPDATE users SET ... RETURNING *",
[userId, ...]
);
await redis.setex(`user:${userId}`, 3600, JSON.stringify(updatedUser));
}

Pros: Cache is always up to date. Reads are always fast.
Cons: Write latency increases (two writes per operation). Caches data that may never be read. If the cache write fails after the database write succeeds, you have an inconsistency window.
Writes go to the cache immediately, and the cache asynchronously flushes to the database in batches.
Write Request --> Cache (immediate) --> Background flush --> Database
                                        (batched, async)
Pros: Extremely fast writes. Can batch database writes for efficiency. Great for write-heavy workloads.
Cons: Risk of data loss if the cache crashes before flushing. Complexity of managing the write queue. Reads from the database will return stale data. This pattern terrifies me in most contexts, but it is the right choice for things like analytics counters or rate limit tracking.
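A toy write-behind cache makes the trade-off concrete. This sketch (class and method names are mine) uses a Map as a stand-in database and a manual flush(); a real implementation would flush on a timer or queue depth, and carries the crash-loss risk described above:

```typescript
// Write-behind sketch: writes land in memory; flush() batches them to the
// "database" (a Map here). Data written but not yet flushed is lost on crash.
class WriteBehindCache<V> {
  private cache = new Map<string, V>();
  private dirty = new Set<string>();

  constructor(private db: Map<string, V>) {}

  set(key: string, value: V): void {
    this.cache.set(key, value); // fast path: memory only
    this.dirty.add(key);        // remember what still needs persisting
  }

  get(key: string): V | undefined {
    return this.cache.get(key) ?? this.db.get(key);
  }

  // Batched and asynchronous in real life; synchronous here for clarity.
  flush(): number {
    let written = 0;
    this.dirty.forEach((key) => {
      this.db.set(key, this.cache.get(key)!);
      written++;
    });
    this.dirty.clear();
    return written;
  }
}
```

Note that between set() and flush(), the database is stale by design; that window is exactly why this pattern suits counters and rate-limit state, not money.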
The cache itself is responsible for loading data from the database on a miss. The application only talks to the cache, never directly to the database for reads.
Application --> Cache --> HIT?  Return data
                      --> MISS? Cache loads from DB, caches, returns
Pros: Clean application code (single data access layer). Cache handles its own population.
Cons: Requires a cache that supports read-through natively (or a wrapper). First request for each key is slow. Less flexibility in how data is loaded.
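The pattern is easy to sketch with the loader baked into the cache. The class and loader below are illustrative stand-ins, not a real library:

```typescript
// Read-through sketch: the cache owns its loader, so the application never
// queries the database directly for reads.
class ReadThroughCache<V> {
  private store = new Map<string, V>();
  loads = 0; // how many times we fell through to the loader

  constructor(private loader: (key: string) => V) {}

  get(key: string): V {
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    this.loads++;
    const value = this.loader(key); // the cache, not the app, loads from the "DB"
    this.store.set(key, value);
    return value;
  }
}

// Usage: the application only ever talks to the cache.
const userCache = new ReadThroughCache<string>((id) => `user-record-for-${id}`);
```

Compare with cache-aside above: the logic is the same, but here it lives inside the cache layer, which is what keeps application code clean.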
Cache invalidation is where every caching strategy eventually breaks down. Here are the problems and partial solutions:
Problem 1: Race conditions during invalidation.
Thread A: UPDATE user SET name='Alice' (DB)
Thread B: UPDATE user SET name='Bob' (DB)
Thread A: SET cache user = 'Alice' (Cache)
Thread B: SET cache user = 'Bob' (Cache)
Cache has 'Bob', which is correct. But what if the cache writes are reordered?
Thread A: UPDATE user SET name='Alice' (DB)
Thread B: UPDATE user SET name='Bob' (DB)
Thread B: SET cache user = 'Bob' (Cache)
Thread A: SET cache user = 'Alice' (Cache) <-- STALE!
Cache now has 'Alice' but DB has 'Bob'.
Solution: Invalidate (delete) instead of updating the cache. Let the next read repopulate it. This eliminates the reordering problem at the cost of one extra cache miss.
Problem 2: Cache stampede.
A popular key expires. 1,000 concurrent requests all see a cache miss simultaneously. All 1,000 hit the database.
Solution: Lock-based repopulation (only one request queries the database, others wait for the cache to be populated), probabilistic early expiration (refresh the cache before it actually expires, with some randomness to avoid synchronized refreshes), or stale-while-revalidate (serve the expired value while refreshing in the background).
async function getWithStampedeProtection(key: string): Promise<string | null> {
const cached = await redis.get(key);
if (cached) return cached;
// Try to acquire a lock
const lockKey = `lock:${key}`;
const acquired = await redis.set(lockKey, "1", "EX", 5, "NX");
if (acquired) {
// We got the lock - fetch from DB and populate cache
const value = await fetchFromDatabase(key);
await redis.setex(key, 3600, value);
await redis.del(lockKey);
return value;
} else {
// Another request is populating the cache - wait and retry
await sleep(50);
return getWithStampedeProtection(key);
}
}

Message queues are the duct tape of distributed systems. They solve problems of coupling, reliability, and scale — and introduce problems of ordering, idempotency, and operational complexity.
Ask yourself: does this work need to happen synchronously in the request path?
If a user signs up and you need to send a welcome email, update analytics, and provision resources — none of that needs to block the HTTP response. Put it on a queue.
HTTP Request --> API Server --> Response (200 OK, fast)
                     |
                     v
               Message Queue
                     |
           +---------+---------+
           |         |         |
      Send Email  Analytics  Provision
These are not interchangeable. They solve fundamentally different problems.
Apache Kafka is a distributed commit log. Messages are persisted, ordered within partitions, and can be replayed. Consumers track their own offset.
Use Kafka when:
- You need event replay (reprocessing historical events)
- You need strict ordering within a partition
- Multiple consumers need the same events (fan-out)
- You are building event sourcing or CQRS
- Throughput matters more than latency (batching is core to Kafka)
Do NOT use Kafka when:
- You have simple task queues (it is massive overkill)
- You need per-message routing/filtering (Kafka's model is partitions)
- You cannot operate a distributed system (Kafka clusters need care)
- Message volume is low (< 1000/sec, use something simpler)
RabbitMQ is a traditional message broker. It supports complex routing (topic exchanges, headers exchanges, fan-out), per-message acknowledgment, and priority queues.
Use RabbitMQ when:
- You need complex routing (message A goes to queue X, message B to queue Y)
- You need per-message acknowledgment and redelivery
- You need priority queues
- Your message patterns are request/reply or work distribution
- You want something battle-tested and relatively simple to operate
Do NOT use RabbitMQ when:
- You need message replay (consumed messages are gone)
- You need extreme throughput (Kafka wins here by an order of magnitude)
- You need multi-datacenter replication (possible but painful)
Amazon SQS is a managed queue service. No cluster to operate, no broker to monitor, auto-scales to any throughput.
Use SQS when:
- You want zero operational overhead
- You are already on AWS
- Your queuing needs are straightforward (FIFO or standard)
- You do not need complex routing or message replay
Do NOT use SQS when:
- You need strict ordering across all messages (FIFO queues have throughput limits)
- You need message replay
- You want to avoid AWS lock-in
- You need sub-second latency (SQS long-polling adds latency)
With any queue, messages can be delivered more than once. Network timeouts, consumer crashes, and broker retries all cause duplicate delivery. Your consumers MUST be idempotent.
async function processPayment(message: PaymentMessage): Promise<void> {
const idempotencyKey = message.paymentId;
// Check if we already processed this payment
const existing = await db.query(
"SELECT id FROM processed_payments WHERE payment_id = $1",
[idempotencyKey]
);
if (existing.rows.length > 0) {
// Already processed - acknowledge and skip
console.log(`Payment ${idempotencyKey} already processed, skipping`);
return;
}
// Process the payment
await db.query("BEGIN");
try {
await db.query("INSERT INTO payments ...", [/* payment data */]);
await db.query(
"INSERT INTO processed_payments (payment_id) VALUES ($1)",
[idempotencyKey]
);
await db.query("COMMIT");
} catch (error) {
await db.query("ROLLBACK");
throw error; // Message will be redelivered
}
}

Every system that faces the internet needs rate limiting. Every system that depends on downstream services needs backpressure. They are related but different.
Rate limiting protects your system from being overwhelmed, whether by legitimate traffic spikes, misbehaving clients, or denial-of-service attacks.
Token bucket is the most common algorithm. Each user has a bucket of tokens that refills at a constant rate. Each request consumes a token. When the bucket is empty, requests are rejected.
async function checkRateLimit(
userId: string,
maxTokens: number,
refillRate: number // tokens per second
): Promise<boolean> {
const key = `ratelimit:${userId}`;
const now = Date.now();
// Lua script for atomic token bucket
const script = `
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or max_tokens
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens based on elapsed time
local elapsed = (now - last_refill) / 1000
tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 0
end
`;
const result = await redis.eval(script, 1, key, maxTokens, refillRate, now);
return result === 1;
}

Sliding window is better for APIs where you want to express limits as "100 requests per minute" and have them apply smoothly rather than resetting at fixed intervals.
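A sliding-window-counter limiter can be sketched in a few lines: keep counts for the current and previous fixed windows, and weight the previous window by how far into the current one you are. The class below is illustrative (names are mine, and the clock is injected so the logic can be tested deterministically):

```typescript
// Sliding-window-counter rate limiter sketch (single-process, illustrative).
class SlidingWindowLimiter {
  private current = 0;   // requests in the current fixed window
  private previous = 0;  // requests in the previous fixed window
  private windowStart: number;

  constructor(
    private limit: number,
    private windowMs: number,
    private now: () => number = Date.now
  ) {
    this.windowStart = this.now();
  }

  allow(): boolean {
    const t = this.now();
    const elapsed = t - this.windowStart;
    if (elapsed >= this.windowMs) {
      // Roll forward; if 2+ windows passed, the old counts are irrelevant.
      this.previous = elapsed >= 2 * this.windowMs ? 0 : this.current;
      this.current = 0;
      this.windowStart = t - (elapsed % this.windowMs);
    }
    // Weight the previous window by its remaining overlap with the rolling window.
    const fractionIntoWindow = (t - this.windowStart) / this.windowMs;
    const weighted = this.previous * (1 - fractionIntoWindow) + this.current;
    if (weighted >= this.limit) return false;
    this.current++;
    return true;
  }
}
```

For a distributed deployment the two counters would live in Redis, much like the token bucket above, but the weighting idea is identical.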
Backpressure is the system telling upstream "slow down, I cannot keep up." Without backpressure, queues grow without bound, memory fills up, and eventually something crashes.
Without backpressure:
Producer (fast) --> Queue (growing) --> Consumer (slow)
                    ^^^^^^^^^^^^^^^
                    Eventually: OOM, disk full, crash

With backpressure:
Producer (fast) --> Queue (bounded) --> Consumer (slow)
                         |
                    Queue full? --> Reject / slow producer / drop oldest
Backpressure strategies include bounded queues that reject when full, reactive streams that signal demand upstream, circuit breakers that stop calling failing services, and load shedding that deliberately drops low-priority requests to protect high-priority ones.
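The simplest of these, a bounded queue, fits in a short sketch. Here offer() returning false is the backpressure signal, and the optional dropOldest flag turns the same structure into a load-shedding queue (the names are mine, not a standard API):

```typescript
// Bounded queue sketch: reject-when-full is the simplest backpressure policy;
// dropOldest sheds the oldest work instead of refusing the newest.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private capacity: number, private dropOldest = false) {}

  offer(item: T): boolean {
    if (this.items.length >= this.capacity) {
      if (!this.dropOldest) return false; // tell the producer to back off
      this.items.shift();                 // shed oldest to make room
    }
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

The key property is that memory use is capped no matter how fast the producer runs; the pain is pushed back upstream, where it can actually be handled.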
The circuit breaker pattern deserves special attention:
class CircuitBreaker {
private failures: number = 0;
private lastFailure: number = 0;
private state: "closed" | "open" | "half-open" = "closed";
constructor(
private threshold: number = 5,
private resetTimeout: number = 30000
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === "open") {
if (Date.now() - this.lastFailure > this.resetTimeout) {
this.state = "half-open";
} else {
throw new Error("Circuit breaker is open");
}
}
try {
const result = await fn();
if (this.state === "half-open") {
this.state = "closed";
this.failures = 0;
}
return result;
} catch (error) {
this.failures++;
this.lastFailure = Date.now();
if (this.failures >= this.threshold) {
this.state = "open";
}
throw error;
}
}
}

You have heard it a hundred times: in a distributed system, you can only have two of three properties: Consistency, Availability, and Partition tolerance. Since network partitions are inevitable, you must choose between CP (consistent but unavailable during partitions) and AP (available but inconsistent during partitions).
This is technically true but practically misleading for several reasons.
First, the "choice" is not system-wide. Different parts of your system can make different trade-offs. Your user authentication can be CP (you absolutely need consistency for password changes) while your news feed can be AP (showing a slightly stale feed is fine).
Second, CAP says nothing about what happens when there is no partition, which is the normal state of affairs. A system can be both consistent and available when the network is healthy — the question is what it sacrifices during the rare partition events.
PACELC extends CAP by asking: when there is no Partition, what do you trade off between Latency and Consistency?
IF Partition:
    Choose between Availability and Consistency (same as CAP)
ELSE (normal operation):
    Choose between Latency and Consistency
This is much more useful because you are making the latency/consistency trade-off constantly, not just during rare partition events.
System | During Partition | Normal Operation
---------------|--------------------|-----------------
DynamoDB | AP (available) | EL (low latency)
Cassandra | AP (available) | EL (low latency)
MongoDB | CP (consistent) | EC (consistent)
PostgreSQL | CP (consistent) | EC (consistent)
Cosmos DB | Configurable | Configurable
DynamoDB and Cassandra sacrifice consistency for both availability during partitions AND low latency during normal operations. PostgreSQL and MongoDB sacrifice availability during partitions AND accept higher latency during normal operations (synchronous replication). This maps much more closely to the actual decisions you make when designing systems.
Let me walk through designing a URL shortener to demonstrate how these patterns compose. But instead of the typical "draw these boxes" walkthrough, I will focus on the edge cases and decisions that matter.
Before drawing a single box, I would ask:
Let me assume: 100M new URLs per month, 10B redirects per month, URLs expire after 5 years, basic analytics (click count), no custom URLs for now.
Write QPS: 100M / (30 * 86400) ≈ 40 writes/sec (easy)
Read QPS: 10B / (30 * 86400) ≈ 3,800 reads/sec (moderate)
Peak read: ~10,000 reads/sec (manageable)
Storage:
- 100M * 12 months * 5 years = 6B records
- Each record: short_code (7B) + URL (200B) + metadata (50B) ≈ 257B
- Total: 6B * 257B ≈ 1.5 TB (fits on one machine, but sharding for availability)
Short URL length:
- 7 characters using [a-zA-Z0-9] = 62^7 ≈ 3.5 trillion combinations
- With 6B URLs, collision probability is negligible
This is where most candidates stumble. How do you generate unique, short, URL-safe strings?
Option A: Hash the URL, take first 7 characters.
Problem: collisions. MD5 and SHA-256 produce uniformly distributed output, but truncating to 7 characters shrinks the space enough that collisions become likely at scale (the birthday bound). You must handle collisions (check the database, regenerate).
Option B: Pre-generate keys.
Generate all possible 7-character strings, store them in a database, and pop one off when needed. Eliminates collision handling but requires managing a key pool.
Option C: Counter-based with base62 encoding.
Use a distributed counter (or range-based allocation) and encode the counter value in base62.
function encodeBase62(num: number): string {
const chars =
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
let result = "";
while (num > 0) {
result = chars[num % 62] + result;
num = Math.floor(num / 62);
}
return result.padStart(7, "0");
}
// Counter 1 -> "0000001"
// Counter 1000000 -> "0004c92"
// Counter 62^7 - 1 -> "ZZZZZZZ" (the 7-character ceiling)

I would go with Option C using range-based allocation: each application server requests a range of IDs (say 10,000 at a time) from a coordination service (could be a simple database sequence). This avoids hot-spotting the counter and eliminates the need for collision checking.
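Range-based allocation is straightforward to sketch. In the code below (all names are my own), the "coordination service" is just a shared counter closure; in practice it would be a database sequence incremented by the block size:

```typescript
// Range-based ID allocation sketch: each server grabs a block of IDs and
// hands them out locally, with one coordination call per block, not per ID.
class RangeAllocator {
  private next = 0;
  private end = 0; // exclusive upper bound of the current block

  constructor(
    private fetchRange: (size: number) => number, // returns start of a fresh block
    private blockSize = 10_000
  ) {}

  nextId(): number {
    if (this.next >= this.end) {
      // Block exhausted: one round-trip to the coordinator for the next block.
      this.next = this.fetchRange(this.blockSize);
      this.end = this.next + this.blockSize;
    }
    return this.next++;
  }
}

// Stand-in coordinator: a monotonically advancing shared counter.
let counter = 0;
const coordinator = (size: number): number => {
  const start = counter;
  counter += size;
  return start;
};
```

Each ID would then be fed through encodeBase62 from Option C. The trade-off: if a server crashes mid-block, its unused IDs are simply wasted, which is fine when the keyspace is 3.5 trillion.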
                           +-------------+
Client --> CDN/LB -------->| API Servers |
                           | (stateless) |
                           +------+------+
                                  |
                       +----------+----------+
                       |                     |
                  +----+----+         +------+------+
                  |  Redis  |         | PostgreSQL  |
                  |  Cache  |         |  (sharded)  |
                  +---------+         +-------------+
Read path (redirect):
GET /abc1234 --> cache lookup --> if HIT, 301 redirect (sub-millisecond)

Write path (create short URL):
POST /shorten with the long URL

301 vs 302 redirect. A 301 (permanent redirect) means browsers will cache the redirect and never hit your server again. Great for performance, terrible for analytics. A 302 (temporary redirect) forces every click through your server. Most URL shorteners use 302 or 307 despite the performance cost, because analytics data is the product.
Hot URLs. A single URL that goes viral can receive millions of hits per second. Your cache must handle this gracefully. Since the data is immutable (the mapping never changes), you can cache aggressively with long TTLs and even replicate hot keys across multiple cache shards.
Abuse prevention. URL shorteners are frequently used for phishing and malware distribution. You need URL scanning before accepting a new URL, rate limiting on URL creation, and a reporting/takedown mechanism.
Database sharding strategy. Shard by the short code (hash-based sharding). This distributes load evenly and ensures each redirect query hits exactly one shard. Do not shard by user ID unless analytics queries by user are the primary access pattern.
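Hash-based routing is a one-liner on top of any decent string hash; FNV-1a is shown here purely as an illustration. Note that plain modulo makes changing the shard count painful, which is why fixed virtual shards or consistent hashing are often layered on top:

```typescript
// Route a short code to a shard: FNV-1a hash modulo shard count.
function fnv1a(s: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    hash ^= s.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep u32
  }
  return hash;
}

function shardFor(shortCode: string, shardCount: number): number {
  return fnv1a(shortCode) % shardCount;
}
```

Every redirect computes shardFor(shortCode, N) and queries exactly one shard; because the hash is deterministic, no lookup table is needed.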
After interviewing many candidates, these are the patterns I see most often in failed interviews.
Candidate immediately introduces microservices, Kubernetes, Kafka, Redis, Elasticsearch, a service mesh, and a data lake for a system that handles 100 requests per minute. Start simple. A monolith with a PostgreSQL database handles an enormous range of workloads. Add complexity only when you can articulate why the simpler solution fails.
A good interviewer will ask "what happens when we need to scale 100x?" That is your opportunity to evolve the design. But starting with a distributed system architecture for a low-scale problem signals that you do not understand the cost of complexity.
The prompt "design a chat application" is deliberately vague. Is it 1:1 or group chat? How many users? Do messages need to be persistent? Are there media attachments? Is there end-to-end encryption? What about message delivery guarantees (read receipts)?
Jumping straight into the design without scoping is the single biggest red flag. It suggests you build solutions before understanding problems.
Your beautiful architecture is worthless if you cannot deploy, monitor, and debug it. Good candidates discuss:
Drawing a box labeled "Load Balancer" is not enough. What kind of load balancer? What algorithm? Is it layer 4 or layer 7? What are the failure modes? Similarly, "add a cache" is not a strategy — what caching pattern, what eviction policy, what consistency guarantees?
In the real world, every architectural decision has a dollar cost. Running a Kafka cluster with 3 brokers and 100 partitions costs real money. If your message volume is 50 messages per second (roughly 130M messages per month), SQS at $0.40 per million requests comes to on the order of $50/month; that Kafka cluster costs hundreds of dollars per month in infrastructure before you count the operational time. The interviewer wants to see that you can make pragmatic decisions, not gold-plated ones.
System design is not about memorizing solutions. It is about building a repertoire of patterns and knowing when to apply them. Every design interview is fundamentally the same conversation:
The candidates who stand out are not the ones who draw the most complex diagrams. They are the ones who make deliberate, well-reasoned decisions and can articulate the trade-offs. They say things like "I chose eventual consistency here because this data is read-heavy and staleness of a few seconds is acceptable — if we needed strong consistency, we would use synchronous replication at the cost of higher write latency."
That sentence demonstrates more system design skill than any memorized architecture diagram ever could.
The best system designers I know share one trait: they are deeply uncomfortable with complexity. They add components reluctantly, only when the simpler approach demonstrably fails. They know that every box on the diagram is a potential failure point, a monitoring burden, and a maintenance cost. Simplicity is not a sign of inexperience — it is a sign of someone who has been paged at 3 AM because of a cascading failure in an over-engineered system and has the scars to prove it.