Everything I learned building real-time features with WebSockets. Connection management, scaling beyond one server, heartbeats, reconnection strategies, and why most tutorials skip the hard parts.
Every WebSocket tutorial starts the same way. Open a connection, send a message, receive a message, celebrate. Thirty lines of code and you've got "real-time." Congratulations. Now deploy that to production, put it behind a load balancer, handle 10,000 concurrent connections, deal with flaky mobile networks, and watch everything fall apart.
I've built real-time features across multiple projects — multiplayer games, live dashboards, collaborative editors, notification systems, status monitoring. The pattern is always the same: the initial implementation takes a day. Making it reliable takes weeks. Making it scale takes months. And the gap between "works on localhost" and "works in production" is where most teams lose their minds.
This is everything I've learned about WebSockets in production. Not the happy path. The real path — with reconnections, heartbeats, scaling headaches, and the subtle bugs that only show up when thousands of users are connected at 2 AM.
A WebSocket connection isn't just "open" or "closed." In production, you're dealing with at least six distinct states, and most applications only handle two of them:
type ConnectionState =
| 'connecting' // Handshake in progress
| 'connected' // Open and healthy
| 'stale' // Open but no recent heartbeat
| 'reconnecting' // Actively trying to reconnect
| 'disconnected' // Clean close or max retries exceeded
| 'suspended'; // Tab backgrounded, connection paused
The browser's WebSocket API gives you four events: onopen, onmessage, onclose, onerror. That's it. No "stale" detection. No "reconnecting" state. No visibility change handling. You build all of that yourself, or you suffer.
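Those extra states are easiest to keep honest with an explicit transition table. A sketch; the allowed transitions are my own assumption, not anything the browser enforces, and the type is repeated so the snippet stands alone:

```typescript
// Sketch: a transition guard for the six connection states.
// The transition table is an assumption -- adapt it to your manager.
type ConnectionState =
  | 'connecting' | 'connected' | 'stale'
  | 'reconnecting' | 'disconnected' | 'suspended';

const allowedTransitions: Record<ConnectionState, ConnectionState[]> = {
  connecting: ['connected', 'reconnecting', 'disconnected'],
  connected: ['stale', 'reconnecting', 'disconnected', 'suspended'],
  stale: ['connected', 'reconnecting', 'disconnected'],
  reconnecting: ['connecting', 'disconnected', 'suspended'],
  disconnected: ['connecting'],
  suspended: ['connecting', 'disconnected'],
};

function canTransition(from: ConnectionState, to: ConnectionState): boolean {
  return allowedTransitions[from].includes(to);
}
```

A setState that throws (or at least logs) on an illegal transition catches bugs like "reconnect scheduled while already connected" long before they reach production.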
Here's what a proper connection manager looks like. Not a toy — something that actually survives the real world:
interface ConnectionConfig {
url: string;
protocols?: string[];
heartbeatInterval: number; // How often to ping (ms)
heartbeatTimeout: number; // How long to wait for pong (ms)
reconnectBaseDelay: number; // Starting reconnect delay (ms)
reconnectMaxDelay: number; // Cap on reconnect delay (ms)
maxReconnectAttempts: number; // Give up after this many tries
}
class WebSocketManager {
private ws: WebSocket | null = null;
private state: ConnectionState = 'disconnected';
private reconnectAttempts = 0;
private heartbeatTimer: ReturnType<typeof setInterval> | null = null;
private heartbeatTimeoutTimer: ReturnType<typeof setTimeout> | null = null;
private reconnectTimer: ReturnType<typeof setTimeout> | null = null;
private messageQueue: string[] = [];
private listeners = new Map<string, Set<Function>>();
constructor(private config: ConnectionConfig) {
// Handle tab visibility changes
if (typeof document !== 'undefined') {
document.addEventListener('visibilitychange', () => {
if (document.hidden) {
this.suspend();
} else {
this.resume();
}
});
}
}
connect(): void {
if (this.state === 'connecting' || this.state === 'connected') return;
this.setState('connecting');
try {
this.ws = new WebSocket(this.config.url, this.config.protocols);
} catch (err) {
this.handleConnectionFailure();
return;
}
this.ws.onopen = () => {
this.setState('connected');
this.reconnectAttempts = 0;
this.startHeartbeat();
this.flushMessageQueue();
this.emit('connected');
};
this.ws.onmessage = (event) => {
// Any message resets the heartbeat timeout
this.resetHeartbeatTimeout();
if (event.data === 'pong') return; // Heartbeat response
try {
const message = JSON.parse(event.data);
this.emit('message', message);
if (message.type) {
this.emit(message.type, message.payload);
}
} catch {
this.emit('message', event.data);
}
};
this.ws.onclose = (event) => {
this.cleanup();
if (event.code === 1000 || event.code === 1001) {
// Normal closure — don't reconnect
this.setState('disconnected');
this.emit('disconnected', { clean: true, code: event.code });
} else {
// Abnormal closure — attempt reconnect
this.scheduleReconnect();
}
};
this.ws.onerror = () => {
// onerror always fires before onclose, so we don't
// need to handle reconnection here. Just log it.
this.emit('error', { state: this.state });
};
}
// ...methods continued below
}
The visibility change handler is the thing most people miss. When a user switches tabs, the browser may throttle or kill the WebSocket connection. On mobile it's even worse — iOS Safari aggressively kills background connections. If you don't handle this, your users come back to a tab that looks connected but is actually dead.
TCP keepalives exist, but they're not enough. They operate at the TCP level, with intervals measured in hours by default. Your application needs to know within seconds if a connection is dead, not within two hours.
A heartbeat is simple: send a ping, expect a pong, kill the connection if the pong doesn't come back in time. The implementation matters more than the concept:
private startHeartbeat(): void {
this.stopHeartbeat();
this.heartbeatTimer = setInterval(() => {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send('ping');
this.startHeartbeatTimeout();
}
}, this.config.heartbeatInterval);
}
private startHeartbeatTimeout(): void {
this.clearHeartbeatTimeout();
this.heartbeatTimeoutTimer = setTimeout(() => {
// No pong received — connection is dead
console.warn('[WS] Heartbeat timeout, closing connection');
this.ws?.close(4000, 'Heartbeat timeout');
}, this.config.heartbeatTimeout);
}
private resetHeartbeatTimeout(): void {
// Any message from the server counts as proof of life
this.clearHeartbeatTimeout();
}
private clearHeartbeatTimeout(): void {
if (this.heartbeatTimeoutTimer) {
clearTimeout(this.heartbeatTimeoutTimer);
this.heartbeatTimeoutTimer = null;
}
}
private stopHeartbeat(): void {
if (this.heartbeatTimer) {
clearInterval(this.heartbeatTimer);
this.heartbeatTimer = null;
}
this.clearHeartbeatTimeout();
}
The numbers matter. After a lot of trial and error, I've settled on roughly a 30-second heartbeat interval with a 35-second timeout, the same values the server example below uses.
One thing that bit me: using the WebSocket protocol-level ping/pong frames instead of application-level messages. Protocol pings are great in theory, but the browser WebSocket API doesn't expose them. You can't send a protocol ping from JavaScript, and you can't listen for protocol pongs. So you end up with a server that thinks it's pinging clients, and clients that have no idea. Use application-level heartbeats. Always.
Server side, the heartbeat handler is equally important:
import { WebSocketServer, WebSocket } from 'ws';
const wss = new WebSocketServer({ port: 8080 });
const HEARTBEAT_INTERVAL = 30_000;
const CLIENT_TIMEOUT = 35_000;
wss.on('connection', (ws: WebSocket) => {
let isAlive = true;
let lastPong = Date.now();
ws.on('message', (data) => {
const message = data.toString();
if (message === 'ping') {
ws.send('pong');
isAlive = true;
lastPong = Date.now();
return;
}
// Handle other messages...
});
const interval = setInterval(() => {
if (Date.now() - lastPong > CLIENT_TIMEOUT) {
console.log('[WS] Client timed out, terminating');
clearInterval(interval);
ws.terminate(); // Hard close — don't wait for handshake
return;
}
// Also send server-side ping as backup
if (ws.readyState === WebSocket.OPEN) {
ws.ping();
}
}, HEARTBEAT_INTERVAL);
ws.on('close', () => {
clearInterval(interval);
});
});
Notice ws.terminate() instead of ws.close(). The close() method initiates a graceful close handshake — it sends a close frame and waits for the client to acknowledge. If the client is already dead (which is why we're here), that handshake will never complete. terminate() kills the socket immediately. Use it when you've already decided the connection is gone.
When a connection drops, the worst thing you can do is immediately try to reconnect. If your server is down, a thousand clients all reconnecting at once will keep it down. This is the thundering herd problem, and it's not theoretical — I've seen it bring down production servers.
Exponential backoff with jitter is the standard solution, but the implementation details matter:
private scheduleReconnect(): void {
if (this.reconnectAttempts >= this.config.maxReconnectAttempts) {
this.setState('disconnected');
this.emit('disconnected', {
clean: false,
reason: 'max_retries_exceeded',
attempts: this.reconnectAttempts,
});
return;
}
this.setState('reconnecting');
// Exponential backoff: 1s, 2s, 4s, 8s, 16s, 30s (capped)
const exponentialDelay = this.config.reconnectBaseDelay *
Math.pow(2, this.reconnectAttempts);
const cappedDelay = Math.min(
exponentialDelay,
this.config.reconnectMaxDelay
);
// Add jitter: +/- 25% randomization
const jitter = cappedDelay * 0.25 * (Math.random() * 2 - 1);
const delay = Math.max(0, cappedDelay + jitter);
this.emit('reconnecting', {
attempt: this.reconnectAttempts + 1,
maxAttempts: this.config.maxReconnectAttempts,
delay,
});
this.reconnectTimer = setTimeout(() => {
this.reconnectAttempts++;
this.connect();
}, delay);
}The jitter is critical. Without it, all your clients will retry at exactly the same intervals, creating synchronized spikes. With 25% jitter, a 16-second base delay becomes something between 12 and 20 seconds, spreading the load across a window.
I've also learned the hard way that you need to expose the reconnection state to your UI. Users need to know what's happening. A silent reconnection that succeeds is fine. A silent reconnection that keeps failing for 30 seconds while the user types messages into a void is terrible:
// In your React component
function ConnectionStatus() {
const { state, reconnectInfo } = useWebSocket();
if (state === 'connected') return null;
if (state === 'reconnecting') {
return (
<div className="bg-yellow-900/50 text-yellow-200 px-4 py-2 text-sm">
Reconnecting... (attempt {reconnectInfo.attempt}/
{reconnectInfo.maxAttempts})
</div>
);
}
if (state === 'disconnected') {
return (
<div className="bg-red-900/50 text-red-200 px-4 py-2 text-sm">
Connection lost.{' '}
<button onClick={manualReconnect} className="underline">
Try again
</button>
</div>
);
}
return null;
}
And always provide a manual reconnect button after automatic retries are exhausted. Nothing is more frustrating than an app that gives up and offers no way to try again.
Here's something that catches almost everyone: what happens to messages sent while the connection is down? If you're just calling ws.send(), they're gone. The browser will throw an error and the message vanishes.
A proper implementation queues messages during disconnection and flushes them when the connection is restored:
send(data: unknown): void {
const message = typeof data === 'string' ? data : JSON.stringify(data);
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(message);
} else {
// Queue for later delivery
this.messageQueue.push(message);
// Prevent unbounded queue growth
if (this.messageQueue.length > 1000) {
this.messageQueue.shift(); // Drop oldest
console.warn('[WS] Message queue overflow, dropping oldest message');
}
}
}
private flushMessageQueue(): void {
while (
this.messageQueue.length > 0 &&
this.ws?.readyState === WebSocket.OPEN
) {
const message = this.messageQueue.shift()!;
this.ws.send(message);
}
}
The queue cap is important. If a user goes offline for 10 minutes in a chat app, you don't want to blast 500 queued messages at the server the instant they reconnect. Set a reasonable limit, drop the oldest messages, and let the application decide how to handle the gap (usually by fetching missed messages from the server via HTTP after reconnection).
This brings up an important architectural principle: WebSockets are for pushing real-time updates, not for guaranteeing delivery. If you need guaranteed message delivery, you need acknowledgments, sequence numbers, and a server-side message store. At that point, you're building a messaging protocol on top of WebSockets, which is a much bigger project. For most applications, the pattern is: reconnect, fetch the current state from the server via REST, then start receiving live updates again.
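If you do go partway down that road, the cheapest useful piece is client-side gap detection on server-assigned sequence numbers. A sketch; makeGapDetector and onGap are hypothetical names, and onGap is where you'd kick off the HTTP backfill for the missed range:

```typescript
// Sketch: minimal gap detection using server-assigned sequence numbers.
// Returns an accept function: true = deliver, false = duplicate/stale.
function makeGapDetector(onGap: (from: number, to: number) => void) {
  let lastSeq = 0;
  return (seq: number): boolean => {
    if (seq <= lastSeq) return false;                    // duplicate or out of date
    if (seq > lastSeq + 1) onGap(lastSeq + 1, seq - 1);  // messages were missed
    lastSeq = seq;
    return true;
  };
}
```

Pair this with the reconnect-then-fetch pattern: on a detected gap, request the missing range over REST rather than trying to make the socket itself reliable.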
Here's where things get interesting. On a single server, WebSockets are straightforward. Every connection lives on the same process, and broadcasting to all connected clients is a simple loop. But the moment you add a second server — whether for redundancy, load balancing, or capacity — everything breaks.
Imagine a chat room with 100 users. 50 are connected to Server A, 50 to Server B. User Alice, connected to Server A, sends a message. Server A can broadcast it to its 50 connections. But the 50 users on Server B never see it.
The solution is a message broker. Redis Pub/Sub is the most common choice for this, and it works well:
import Redis from 'ioredis';
const pubClient = new Redis({ host: 'redis.internal', port: 6379 });
const subClient = new Redis({ host: 'redis.internal', port: 6379 });
// Track which rooms each connection is in
const connectionRooms = new Map<WebSocket, Set<string>>();
const roomConnections = new Map<string, Set<WebSocket>>();
// Subscribe to room channels
async function joinRoom(ws: WebSocket, room: string): Promise<void> {
// Track locally
if (!connectionRooms.has(ws)) {
connectionRooms.set(ws, new Set());
}
connectionRooms.get(ws)!.add(room);
if (!roomConnections.has(room)) {
roomConnections.set(room, new Set());
// First local member — subscribe to Redis channel
await subClient.subscribe(`room:${room}`);
}
roomConnections.get(room)!.add(ws);
}
async function leaveRoom(ws: WebSocket, room: string): Promise<void> {
connectionRooms.get(ws)?.delete(room);
roomConnections.get(room)?.delete(ws);
if (roomConnections.get(room)?.size === 0) {
roomConnections.delete(room);
// No more local members — unsubscribe from Redis
await subClient.unsubscribe(`room:${room}`);
}
}
// Broadcast to a room (publishes to Redis so all servers receive it)
function broadcastToRoom(room: string, message: object): void {
pubClient.publish(`room:${room}`, JSON.stringify(message));
}
// Handle messages from Redis
subClient.on('message', (channel, rawMessage) => {
const room = channel.replace('room:', '');
const connections = roomConnections.get(room);
if (!connections) return;
for (const ws of connections) {
if (ws.readyState === WebSocket.OPEN) {
ws.send(rawMessage);
}
}
});
This works, but there are sharp edges:
Redis Pub/Sub doesn't buffer. If a server goes down and comes back, it misses all messages published during the downtime. For chat-like applications, this might be acceptable — users can fetch history via HTTP. For financial data or critical notifications, it's not. Consider Redis Streams instead, which provide consumer groups and message persistence.
Channel explosion. If you have 50,000 rooms and subscribe to each one individually, Redis will handle it fine — it's designed for this. But if your room scheme is more like "one channel per user" (for private notifications), you're looking at hundreds of thousands of subscriptions. At that scale, consider pattern subscriptions or a different architecture.
Serialization cost. Every message goes through JSON serialize on publish, deserialize on receive, then JSON serialize again for the WebSocket send. For high-throughput systems, this double serialization adds up. You can skip the intermediate parse by treating Redis messages as opaque strings and forwarding them directly to WebSocket clients.
Here's the optimized version:
// Publish already-serialized JSON — don't parse on the receiving end
function broadcastToRoom(room: string, payload: string): void {
pubClient.publish(`room:${room}`, payload);
}
subClient.on('message', (channel, rawMessage) => {
const room = channel.replace('room:', '');
const connections = roomConnections.get(room);
if (!connections) return;
// Forward the raw string directly — no parsing
for (const ws of connections) {
if (ws.readyState === WebSocket.OPEN) {
ws.send(rawMessage);
}
}
});
WebSockets and load balancers don't play nicely together. The HTTP upgrade handshake starts as a normal HTTP request, then upgrades to a persistent connection. Your load balancer needs to understand this, and many don't handle it well by default.
The main problem is: the initial HTTP request hits one server, but after the upgrade, the connection is persistent. If your load balancer is doing round-robin, subsequent HTTP requests from the same client will hit different servers. This matters because most applications use both WebSocket and REST endpoints, and you often need them to agree on state.
Sticky sessions solve this by routing all requests from the same client to the same backend. The approaches, ranked by my preference:
1. Cookie-based stickiness (best for most cases):
upstream websocket_backend {
ip_hash; # Simple but breaks behind CDN/proxy
# Better: use a cookie
# sticky cookie srv_id expires=1h domain=.example.com path=/;
server 10.0.0.1:8080;
server 10.0.0.2:8080;
server 10.0.0.3:8080;
}
server {
listen 443 ssl;
server_name ws.example.com;
location /ws {
proxy_pass http://websocket_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Critical: increase timeouts for long-lived connections
proxy_read_timeout 86400s;
proxy_send_timeout 86400s;
# Don't buffer WebSocket frames
proxy_buffering off;
}
}
2. Connection-ID based routing:
Have the client include a connection ID (assigned by the server on first connect) in subsequent requests. The load balancer routes based on a hash of this ID. This is more reliable than IP-based hashing, which breaks when multiple users share a NAT gateway.
3. Separate WebSocket servers:
Run your WebSocket servers on a different subdomain (ws.example.com) from your REST API (api.example.com). This lets you scale them independently and avoids the stickiness problem for REST requests entirely. This is what I do on most projects now.
The proxy_read_timeout 86400s line is something you'll find in every "fix WebSocket behind Nginx" Stack Overflow answer, and for good reason. Nginx's default read timeout is 60 seconds. Your WebSocket connection will be killed every minute without this.
Every real-time application has some concept of "groups" — chat rooms, game lobbies, document editing sessions, notification channels. The abstraction matters because it determines how your server manages connections and how your scaling strategy works.
I've settled on this model after several iterations:
interface Room {
id: string;
type: 'public' | 'private' | 'direct';
members: Map<string, MemberInfo>;
metadata: Record<string, unknown>;
createdAt: number;
lastActivity: number;
}
interface MemberInfo {
userId: string;
connectionId: string;
ws: WebSocket;
joinedAt: number;
role: 'owner' | 'admin' | 'member' | 'spectator';
}
class RoomManager {
private rooms = new Map<string, Room>();
private userConnections = new Map<string, Set<string>>(); // userId -> roomIds
createRoom(id: string, type: Room['type'], ownerId: string): Room {
if (this.rooms.has(id)) {
throw new Error(`Room ${id} already exists`);
}
const room: Room = {
id,
type,
members: new Map(),
metadata: {},
createdAt: Date.now(),
lastActivity: Date.now(),
};
this.rooms.set(id, room);
return room;
}
join(roomId: string, member: MemberInfo): void {
const room = this.rooms.get(roomId);
if (!room) throw new Error(`Room ${roomId} not found`);
room.members.set(member.userId, member);
room.lastActivity = Date.now();
if (!this.userConnections.has(member.userId)) {
this.userConnections.set(member.userId, new Set());
}
this.userConnections.get(member.userId)!.add(roomId);
// Notify other members
this.broadcastToRoom(roomId, {
type: 'member_joined',
payload: {
userId: member.userId,
role: member.role,
memberCount: room.members.size,
},
}, member.userId); // Exclude the joining user
}
leave(roomId: string, userId: string): void {
const room = this.rooms.get(roomId);
if (!room) return;
room.members.delete(userId);
this.userConnections.get(userId)?.delete(roomId);
if (room.members.size === 0) {
// Auto-cleanup empty rooms
this.rooms.delete(roomId);
} else {
this.broadcastToRoom(roomId, {
type: 'member_left',
payload: { userId, memberCount: room.members.size },
});
}
}
broadcastToRoom(
roomId: string,
message: object,
excludeUserId?: string
): void {
const room = this.rooms.get(roomId);
if (!room) return;
const data = JSON.stringify(message);
for (const [userId, member] of room.members) {
if (userId === excludeUserId) continue;
if (member.ws.readyState === WebSocket.OPEN) {
member.ws.send(data);
}
}
}
// Clean up all rooms for a disconnected user
handleDisconnect(userId: string): void {
const rooms = this.userConnections.get(userId);
if (!rooms) return;
for (const roomId of rooms) {
this.leave(roomId, userId);
}
this.userConnections.delete(userId);
}
}
A few lessons embedded in this code:
Auto-cleanup empty rooms. If you don't, you'll have a memory leak. I've seen servers accumulate hundreds of thousands of empty room objects because nobody bothered to clean them up.
Track user-to-rooms mapping bidirectionally. When a user disconnects, you need to remove them from all rooms quickly. Without the reverse mapping, you'd have to scan every room — O(n) per disconnect is not acceptable at scale.
Serialize once, send many. Call JSON.stringify once and send the same string to every connection. I've seen code that stringifies inside the loop, wasting CPU on identical serialization for every recipient.
WebSockets support two frame types: text (UTF-8 strings) and binary (raw bytes). Most applications use text frames with JSON, and that's fine for 90% of use cases. But the other 10% — games, audio/video, file transfer, high-frequency data feeds — benefit enormously from binary.
JSON has overhead. A simple game state update like {"x": 142.5, "y": 89.3, "angle": 2.14, "health": 85} is 52 bytes as JSON. The same data in a packed binary format is 14 bytes (three 32-bit floats + one 16-bit int). At 60 updates per second to 100 players, that's the difference between 312 KB/s and 84 KB/s. Not life-changing at small scale, but it adds up.
Here's a practical binary protocol implementation:
// Shared between client and server
enum MessageType {
POSITION_UPDATE = 1,
PLAYER_ACTION = 2,
GAME_STATE = 3,
CHAT = 4,
}
// Encode a position update into a compact binary format
function encodePositionUpdate(
playerId: number,
x: number,
y: number,
angle: number,
health: number
): ArrayBuffer {
const buffer = new ArrayBuffer(16);
const view = new DataView(buffer);
view.setUint8(0, MessageType.POSITION_UPDATE); // 1 byte: type
view.setUint8(1, playerId); // 1 byte: player ID
view.setFloat32(2, x, true); // 4 bytes: x position
view.setFloat32(6, y, true); // 4 bytes: y position
view.setFloat32(10, angle, true); // 4 bytes: angle
view.setUint16(14, health, true); // 2 bytes: health
return buffer;
}
function decodeMessage(data: ArrayBuffer): object {
const view = new DataView(data);
const type = view.getUint8(0);
switch (type) {
case MessageType.POSITION_UPDATE:
return {
type: 'position_update',
playerId: view.getUint8(1),
x: view.getFloat32(2, true),
y: view.getFloat32(6, true),
angle: view.getFloat32(10, true),
health: view.getUint16(14, true),
};
// ... other message types
default:
throw new Error(`Unknown message type: ${type}`);
}
}
// Client-side usage
ws.binaryType = 'arraybuffer'; // Important! Default is 'blob'
ws.onmessage = (event) => {
if (event.data instanceof ArrayBuffer) {
const message = decodeMessage(event.data);
handleGameMessage(message);
} else {
// Text frame — JSON
const message = JSON.parse(event.data);
handleControlMessage(message);
}
};
The ws.binaryType = 'arraybuffer' line is essential. By default, browsers deliver binary frames as Blob objects, which require async reading. Setting it to 'arraybuffer' gives you synchronous access to the data, which is what you want for game state updates.
My recommendation: use JSON for control messages (join room, leave room, authentication, errors) and binary for high-frequency data (game state, sensor readings, audio). You can freely mix both on the same connection — the WebSocket protocol distinguishes them at the frame level.
Most WebSocket error handling I see in the wild looks like this:
ws.onerror = (err) => {
console.log('WebSocket error:', err);
};
This tells you nothing. The error event on browser WebSockets doesn't include useful information — no error code, no message, no reason. It just fires, followed immediately by a close event. The close event is where the information is.
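Since the close event is where the information lives, it's worth translating the common codes into something readable in logs. A sketch; the 1000/1001/1006/1009 meanings come from RFC 6455, and 4000 matches the application-level heartbeat-timeout code used earlier:

```typescript
// Sketch: interpret the close event, where the real information lives.
// 1000-1009 are standard codes (RFC 6455); 4000+ is free for app use.
function describeClose(code: number, wasClean: boolean): string {
  const known: Record<number, string> = {
    1000: 'normal closure',
    1001: 'going away (tab closed, server shutdown)',
    1006: 'abnormal closure, no close frame received',
    1009: 'message too large',
    4000: 'application: heartbeat timeout',
  };
  const base = known[code] ?? `unknown code ${code}`;
  return wasClean ? base : `${base} (unclean)`;
}

// Usage: ws.onclose = (e) => console.warn('[WS]', describeClose(e.code, e.wasClean));
```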
Server-side, you need to be much more careful:
wss.on('connection', (ws, req) => {
const clientIp = req.headers['x-forwarded-for'] || req.socket.remoteAddress;
const connectionId = crypto.randomUUID();
ws.on('error', (err) => {
// Don't log ECONNRESET as errors — it's just a disconnect
if ((err as NodeJS.ErrnoException).code === 'ECONNRESET') {
console.debug(`[WS] Client ${connectionId} disconnected abruptly`);
return;
}
// EPIPE means we tried to write to a dead socket
if ((err as NodeJS.ErrnoException).code === 'EPIPE') {
console.debug(`[WS] Broken pipe for ${connectionId}`);
return;
}
// Anything else is unexpected
console.error(`[WS] Unexpected error for ${connectionId}:`, {
code: (err as NodeJS.ErrnoException).code,
message: err.message,
clientIp,
});
});
ws.on('message', (rawData) => {
try {
// Guard against oversized messages
if (rawData.length > 1_048_576) { // 1 MB
ws.close(1009, 'Message too large');
return;
}
const message = JSON.parse(rawData.toString());
// Validate message structure
if (!message.type || typeof message.type !== 'string') {
ws.send(JSON.stringify({
type: 'error',
payload: { code: 'INVALID_MESSAGE', message: 'Missing type field' },
}));
return;
}
handleMessage(ws, connectionId, message);
} catch (err) {
if (err instanceof SyntaxError) {
ws.send(JSON.stringify({
type: 'error',
payload: { code: 'INVALID_JSON', message: 'Failed to parse message' },
}));
return;
}
throw err;
}
});
});
ECONNRESET floods are the number one noise source in WebSocket server logs. Every time a client closes their browser, their phone loses signal, or their WiFi drops, you get an ECONNRESET. These are normal. Filter them out of your error logs or you'll drown in noise and miss the actual problems.
WebSocket security is an afterthought in most implementations. I've reviewed codebases where the WebSocket server accepts connections from any origin, has no authentication, no rate limiting, and no message validation. In production. Handling user data.
Origin validation:
const wss = new WebSocketServer({
port: 8080,
verifyClient: (info, callback) => {
const origin = info.origin || info.req.headers.origin;
const allowedOrigins = [
'https://example.com',
'https://www.example.com',
];
if (process.env.NODE_ENV === 'development') {
allowedOrigins.push('http://localhost:3000');
}
if (!origin || !allowedOrigins.includes(origin)) {
callback(false, 403, 'Forbidden');
return;
}
callback(true);
},
});
Authentication:
WebSockets don't support custom headers in the browser API. You can't send an Authorization header with the initial handshake (there's a protocols parameter hack, but don't). The two clean approaches:
1. Query parameter: wss://example.com/ws?token=jwt_here. Simple, but tokens end up in access logs, so use short-lived tokens if you go this route.
2. First-message authentication: open the connection, send the token as the first message, and have the server close the socket if it doesn't arrive in time.
I prefer option 2:
const AUTHENTICATION_TIMEOUT = 5_000; // 5 seconds to authenticate
wss.on('connection', (ws, req) => {
let authenticated = false;
let userId: string | null = null;
// Start the auth countdown
const authTimer = setTimeout(() => {
if (!authenticated) {
ws.close(4001, 'Authentication timeout');
}
}, AUTHENTICATION_TIMEOUT);
ws.on('message', async (rawData) => {
let message: any;
try {
message = JSON.parse(rawData.toString());
} catch {
ws.close(1003, 'Invalid JSON'); // malformed frame from an unauthenticated client
return;
}
if (!authenticated) {
if (message.type !== 'authenticate') {
ws.close(4002, 'Must authenticate first');
return;
}
try {
const user = await verifyToken(message.payload.token);
authenticated = true;
userId = user.id;
clearTimeout(authTimer);
ws.send(JSON.stringify({
type: 'authenticated',
payload: { userId: user.id },
}));
} catch {
ws.close(4003, 'Invalid token');
}
return;
}
// Normal message handling (only reached after authentication)
handleAuthenticatedMessage(ws, userId!, message);
});
});
Rate limiting:
WebSocket rate limiting is different from HTTP rate limiting. You're not limiting requests per second — you're limiting messages per second on a persistent connection. A simple token bucket works:
class MessageRateLimiter {
private buckets = new Map<string, { tokens: number; lastRefill: number }>();
constructor(
private maxTokens: number = 50,
private refillRate: number = 10, // tokens per second
) {}
consume(userId: string): boolean {
const now = Date.now();
let bucket = this.buckets.get(userId);
if (!bucket) {
bucket = { tokens: this.maxTokens, lastRefill: now };
this.buckets.set(userId, bucket);
}
// Refill tokens based on elapsed time
const elapsed = (now - bucket.lastRefill) / 1000;
bucket.tokens = Math.min(
this.maxTokens,
bucket.tokens + elapsed * this.refillRate
);
bucket.lastRefill = now;
if (bucket.tokens < 1) {
return false; // Rate limited
}
bucket.tokens--;
return true;
}
}
const rateLimiter = new MessageRateLimiter(50, 10);
// In your message handler:
if (!rateLimiter.consume(userId)) {
ws.send(JSON.stringify({
type: 'error',
payload: {
code: 'RATE_LIMITED',
message: 'Too many messages. Slow down.',
},
}));
return;
}
50 tokens with a 10/second refill means a user can burst 50 messages, then is limited to 10 per second. Adjust based on your use case — a chat app needs maybe 5/second, a multiplayer game might need 60/second.
I see teams reach for WebSockets when Server-Sent Events (SSE) would be simpler, cheaper, and more reliable. Here's my decision framework:
Use SSE when data only flows one way, from server to client, and you want the browser's built-in reconnection for free.
Use WebSocket when clients need to send as well as receive, or when latency and message frequency are high enough that per-request overhead matters.
The honest truth: most "real-time" features don't need WebSockets. A notification system? SSE. A live dashboard? SSE. A stock ticker? SSE. An activity feed? SSE with a REST fallback. These are all one-way data flows.
Here's a minimal SSE implementation for comparison:
// Server (Express/Node.js)
app.get('/events', (req, res) => {
res.writeHead(200, {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
'X-Accel-Buffering': 'no', // Disable Nginx buffering
});
const sendEvent = (event: string, data: object) => {
res.write(`event: ${event}\n`);
res.write(`data: ${JSON.stringify(data)}\n\n`);
};
// Send initial state
sendEvent('connected', { timestamp: Date.now() });
// Heartbeat to keep connection alive
const heartbeat = setInterval(() => {
res.write(': heartbeat\n\n'); // SSE comment, keeps connection open
}, 25_000);
// Subscribe to events (your application logic here)
const unsubscribe = eventBus.subscribe((event) => {
sendEvent(event.type, event.data);
});
req.on('close', () => {
clearInterval(heartbeat);
unsubscribe();
});
});
// Client
const eventSource = new EventSource('/events');
eventSource.addEventListener('notification', (event) => {
const data = JSON.parse(event.data);
showNotification(data);
});
// Automatic reconnection is built in!
// EventSource will reconnect with Last-Event-ID header
The Last-Event-ID feature of SSE is genuinely excellent. If the connection drops, the browser includes the last event ID it received in the reconnection request. Your server can use this to replay missed events. You'd have to build this from scratch with WebSockets.
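If you buffer recent events server-side, the replay itself is a few lines. A sketch; the buffer shape and the eventsSince name are assumptions, and in practice you'd cap the buffer or back it with a real store:

```typescript
// Sketch: server-side replay for SSE reconnects. The browser sends the
// Last-Event-ID header automatically when it reconnects.
interface BufferedEvent {
  id: number;
  type: string;
  data: unknown;
}

function eventsSince(buffer: BufferedEvent[], lastEventId: number): BufferedEvent[] {
  return buffer.filter((e) => e.id > lastEventId);
}

// In the request handler (hypothetical wiring):
// const lastId = Number(req.headers['last-event-id'] ?? 0);
// for (const e of eventsSince(buffer, lastId)) {
//   res.write(`id: ${e.id}\nevent: ${e.type}\ndata: ${JSON.stringify(e.data)}\n\n`);
// }
```

Note the `id:` line in each event: without it the browser has nothing to send back as Last-Event-ID.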
That said, when you do need bidirectional communication, don't try to fake it with SSE + HTTP POST. I've seen architectures where the client sends data via POST requests and receives responses via SSE. It works, but you're maintaining two connections, dealing with ordering issues, and adding latency for every "send." Just use WebSockets.
A common question: how many WebSocket connections can one server handle? The answer depends on what you're doing with those connections, but the raw connection limit is higher than most people think.
Each idle WebSocket connection uses roughly 10-50 KB of memory (depending on your buffers and application state). On a 4 GB server, that's somewhere between 80,000 and 400,000 connections. In practice, you'll hit other limits first — file descriptors, CPU for message processing, bandwidth.
The important configuration:
# Increase file descriptor limit (Linux)
# /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000
# Or for the current process
ulimit -n 1000000
# Increase ephemeral port range
echo "10000 65535" > /proc/sys/net/ipv4/ip_local_port_range
# Increase socket backlog
echo 65535 > /proc/sys/net/core/somaxconn
On the Node.js side, the ws library handles connections efficiently, but you need to watch for memory leaks. The most common leak: storing references to closed connections. Always clean up:
wss.on('connection', (ws) => {
const userId = authenticate(ws);
// Register
connections.set(userId, ws);
roomManager.addConnection(userId, ws);
ws.on('close', () => {
// Always clean up ALL references
connections.delete(userId);
roomManager.handleDisconnect(userId);
rateLimiter.cleanup(userId);
// Any other Maps, Sets, or arrays holding this ws or userId
});
});
I once debugged a server that was leaking 200 MB per hour. The cause: a Map<string, WebSocket> that was never cleaned up on disconnect. The connections were long dead, but the Map entries remained, each pinning the closed socket, the userId string, and whatever closure scope was attached to event handlers, so none of it could be garbage-collected. After 24 hours, the server had 2 million dead entries and was using 5 GB of RAM.
Testing real-time systems is hard. Unit tests for message handlers are straightforward, but integration tests that verify the full connection lifecycle, reconnection behavior, and multi-server broadcasting are a pain.
Here's the approach I've landed on:
import { WebSocketServer } from 'ws';
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
function createTestServer(port: number): WebSocketServer {
return new WebSocketServer({ port });
}
function connectClient(port: number): Promise<WebSocket> {
return new Promise((resolve, reject) => {
const ws = new WebSocket(`ws://localhost:${port}`);
ws.onopen = () => resolve(ws);
ws.onerror = reject;
});
}
function waitForMessage(ws: WebSocket): Promise<any> {
return new Promise((resolve) => {
ws.onmessage = (event) => {
resolve(JSON.parse(event.data.toString()));
};
});
}
describe('WebSocket Room Broadcasting', () => {
let server: WebSocketServer;
const PORT = 9876;
beforeAll(() => {
server = createTestServer(PORT);
setupRoomHandlers(server);
});
afterAll(() => {
server.close();
});
it('should broadcast to all room members except sender', async () => {
const client1 = await connectClient(PORT);
const client2 = await connectClient(PORT);
const client3 = await connectClient(PORT);
// All join the same room
client1.send(JSON.stringify({ type: 'join', room: 'test-room' }));
client2.send(JSON.stringify({ type: 'join', room: 'test-room' }));
client3.send(JSON.stringify({ type: 'join', room: 'test-room' }));
// Wait for join confirmations
await waitForMessage(client1); // join ack
await waitForMessage(client2);
await waitForMessage(client3);
// Client 1 sends a message
const messagePromise2 = waitForMessage(client2);
const messagePromise3 = waitForMessage(client3);
client1.send(JSON.stringify({
type: 'chat',
room: 'test-room',
text: 'Hello everyone',
}));
const msg2 = await messagePromise2;
const msg3 = await messagePromise3;
expect(msg2.type).toBe('chat');
expect(msg2.text).toBe('Hello everyone');
expect(msg3.type).toBe('chat');
expect(msg3.text).toBe('Hello everyone');
// Cleanup
client1.close();
client2.close();
client3.close();
});
});
For reconnection testing, you need to be creative. Simulate network failures by killing the server mid-connection:
it('should reconnect after server restart', async () => {
let server = createTestServer(PORT);
const manager = new WebSocketManager({
url: `ws://localhost:${PORT}`,
reconnectBaseDelay: 100,
reconnectMaxDelay: 1000,
maxReconnectAttempts: 5,
heartbeatInterval: 5000,
heartbeatTimeout: 3000,
});
// Connect
const connected = new Promise((r) => manager.on('connected', r));
manager.connect();
await connected;
// Kill the server
server.close();
// Wait for reconnection attempts
const reconnecting = new Promise((r) => manager.on('reconnecting', r));
await reconnecting;
// Restart the server
server = createTestServer(PORT);
// Wait for reconnection success
const reconnected = new Promise((r) => manager.on('connected', r));
await reconnected;
expect(manager.getState()).toBe('connected');
server.close();
manager.disconnect();
});
You can't manage what you don't measure. These are the metrics I track for every WebSocket deployment:
interface WebSocketMetrics {
// Connection metrics
activeConnections: number;
totalConnectionsToday: number;
connectionRate: number; // New connections per minute
disconnectionRate: number;
// Message metrics
messagesInPerSecond: number;
messagesOutPerSecond: number;
averageMessageSize: number;
largestMessageSize: number;
// Health metrics
heartbeatTimeoutRate: number; // % of connections timing out
reconnectionRate: number; // Reconnections per minute
averageConnectionDuration: number;
// Error metrics
authenticationFailures: number;
rateLimitHits: number;
invalidMessages: number;
// Room metrics
activeRooms: number;
averageRoomSize: number;
largestRoom: number;
}
Expose these on a /metrics endpoint in Prometheus format, or push them to whatever monitoring system you use. The heartbeatTimeoutRate metric is especially valuable — if it suddenly spikes, something is wrong with your network or your server is overloaded and can't process heartbeats in time.
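The Prometheus text exposition format is simple enough that a handful of gauges doesn't need a client library. A sketch — the `ws_`-prefixed metric names are my own convention, not a standard:

```typescript
// Render a flat map of gauge values in the Prometheus text
// exposition format: a `# TYPE` line plus one sample per metric.
function toPrometheus(metrics: Record<string, number>): string {
  return (
    Object.entries(metrics)
      .map(([name, value]) => `# TYPE ${name} gauge\n${name} ${value}`)
      .join('\n') + '\n'
  );
}

const output = toPrometheus({
  ws_active_connections: 12043,
  ws_heartbeat_timeout_rate: 0.004,
});
// Serve `output` from GET /metrics with Content-Type: text/plain
```

For anything beyond gauges (histograms, labels), reach for the official prom-client library instead of hand-rolling the format.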
I also log connection durations as a histogram. A healthy distribution looks like a long tail — most connections last minutes to hours (users with the tab open). If you see a spike at very short durations (under 5 seconds), clients are connecting and immediately dropping, which usually means a deployment broke something.
Messages sent over a single WebSocket connection arrive in order. But once you introduce Redis Pub/Sub and multiple servers, ordering guarantees get complicated.
Consider this scenario: User A sends message 1, then message 2. Both hit Server A. Server A publishes both to Redis. Server B receives them from Redis. Do they arrive at Server B in order? Usually yes, but not guaranteed. Redis Pub/Sub delivers messages to a subscriber in the order they were published, so within a single publisher the order is preserved. But if messages come from different servers, there's no global ordering.
For chat applications, this is usually fine — messages have timestamps, and a few milliseconds of reordering is invisible to users. For collaborative editing or financial transactions, it's a dealbreaker. Solutions:
Sequence numbers: Include a monotonically increasing sequence number in each message. Clients buffer out-of-order messages and replay them in sequence. Simple, effective, but adds complexity.
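The client-side half of the sequence-number approach is a small reorder buffer. A sketch, assuming each message carries a `seq` field starting at 1 (the field and class names are illustrative):

```typescript
interface SequencedMessage {
  seq: number;
  payload: string;
}

// Buffers out-of-order messages and delivers them strictly in
// sequence. Duplicates and already-delivered messages are dropped.
class SequenceBuffer {
  private expected = 1;
  private pending = new Map<number, SequencedMessage>();

  constructor(private deliver: (msg: SequencedMessage) => void) {}

  receive(msg: SequencedMessage): void {
    if (msg.seq < this.expected) return; // duplicate or stale
    this.pending.set(msg.seq, msg);
    // Flush every consecutive message starting at `expected`.
    while (this.pending.has(this.expected)) {
      this.deliver(this.pending.get(this.expected)!);
      this.pending.delete(this.expected);
      this.expected++;
    }
  }
}

const delivered: number[] = [];
const buf = new SequenceBuffer((m) => delivered.push(m.seq));
buf.receive({ seq: 2, payload: 'b' }); // held back
buf.receive({ seq: 1, payload: 'a' }); // flushes 1, then 2
buf.receive({ seq: 3, payload: 'c' });
// delivered is now [1, 2, 3]
```

In production you'd also want a timeout: if a gap never fills, request a resync from the server rather than buffering forever.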
Single-writer per entity: Route all writes for a given entity (document, game, channel) to the same server. This gives you total ordering within that entity. Use consistent hashing on the entity ID to determine the server.
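A bare-bones version of that routing: hash the entity ID onto a ring of virtual nodes, so adding or removing a server only remaps a fraction of entities. This is a sketch — the FNV-1a hash and the replica count are arbitrary choices, not requirements:

```typescript
// FNV-1a: a simple, fast, non-cryptographic string hash.
function fnv1a(str: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Consistent hash ring: each server gets many virtual nodes so
// load spreads evenly; an entity maps to the first node at or
// after its hash position (wrapping around to the start).
class HashRing {
  private ring: { hash: number; server: string }[] = [];

  constructor(servers: string[], replicas = 100) {
    for (const server of servers) {
      for (let i = 0; i < replicas; i++) {
        this.ring.push({ hash: fnv1a(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.hash - b.hash);
  }

  serverFor(entityId: string): string {
    const h = fnv1a(entityId);
    const node = this.ring.find((n) => n.hash >= h) ?? this.ring[0];
    return node.server;
  }
}

// Every server computes the same answer for the same document ID,
// so all writes for that document land on one backend.
const ring = new HashRing(['ws-1', 'ws-2', 'ws-3']);
const owner = ring.serverFor('doc-42');
```

Clients that connect to the "wrong" server can either be redirected at connect time or have their messages proxied to the owner internally.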
CRDTs (Conflict-Free Replicated Data Types): For collaborative editing, CRDTs like Y.js or Automerge eliminate ordering concerns entirely. Every peer can apply operations in any order and converge to the same state. This is the right answer for collaborative editing, but it's a significant architectural commitment.
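To make the order-independence property concrete without pulling in Y.js, here's the "hello world" of CRDTs, a grow-only counter. This is a toy for illustration, not how Y.js represents documents:

```typescript
// G-Counter: each replica increments its own slot, and merge takes
// the per-replica maximum. Merges commute and are idempotent, so
// replicas can exchange state in any order and still converge.
type GCounter = Record<string, number>;

function increment(counter: GCounter, replicaId: string): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + 1 };
}

function merge(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const [id, count] of Object.entries(b)) {
    merged[id] = Math.max(merged[id] ?? 0, count);
  }
  return merged;
}

function value(counter: GCounter): number {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

// Two replicas diverge, then sync in opposite orders...
const a: GCounter = increment({}, 'a'); // replica a: {a: 1}
let b: GCounter = increment({}, 'b');
b = increment(b, 'b');                  // replica b: {b: 2}
// ...and both arrive at the same state, value 3.
const mergedAB = merge(a, b);
const mergedBA = merge(b, a);
```

Real collaborative-text CRDTs are far more involved, which is exactly why you should use Y.js or Automerge rather than building one.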
After years of building real-time features, here's what I wish I'd known earlier:
Start with SSE, upgrade to WebSockets. Most "real-time" features start as server-to-client pushes. A notification system, a live feed, a dashboard. SSE handles all of these with less complexity, better browser support, automatic reconnection, and no proxy headaches. You can always add WebSockets later when you genuinely need bidirectional communication.
Don't build your own protocol. I spent months building custom message formats, acknowledgment systems, and presence tracking. Then I found that libraries like Socket.IO (love it or hate it) handle most of this out of the box. Yes, Socket.IO adds overhead. Yes, it's opinionated. But it also handles reconnection, rooms, acknowledgments, binary support, and fallback transports. For most applications, the overhead is worth the saved development time. Roll your own only if you have specific performance requirements that a library can't meet.
Invest in observability from day one. The first time your WebSocket server has a problem in production, you'll wish you had metrics. Connection counts, message rates, error rates, latency percentiles. Set up dashboards before you launch, not after the first incident.
Design for disconnection. The number one mistake in real-time systems: assuming the connection is always there. Design your client as if disconnection is the default state. All state should be recoverable from the server. All user actions should be safe to retry. The WebSocket is an optimization that makes things feel instant — it should never be the only path to correct behavior.
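One concrete way to make actions safe to retry: attach a client-generated idempotency key to every action and deduplicate on the receiving side. A sketch — the class and key names are illustrative, not from any library:

```typescript
// Deduplicates actions by a client-generated idempotency key, so
// retrying the same action after a reconnect can never apply twice.
class IdempotentProcessor<T> {
  private seen = new Set<string>();
  public applied: T[] = [];

  // Returns true if the action was applied, false if it was a
  // duplicate of an already-processed key (a safe no-op).
  process(key: string, action: T): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.applied.push(action);
    return true;
  }
}

const processor = new IdempotentProcessor<string>();
processor.process('act-1', 'send message');
processor.process('act-1', 'send message'); // retried after reconnect: ignored
processor.process('act-2', 'delete draft');
// processor.applied.length === 2
```

In a real system the seen-key set needs an expiry (a TTL in Redis, say), or it grows without bound.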
Test with real network conditions. Chrome DevTools has network throttling, but it doesn't simulate the kind of network failures you see in production — half-open connections, connections that silently stop delivering data, sudden latency spikes. Use tools like tc (traffic control) on Linux or Clumsy on Windows to simulate real-world conditions. Or just test on a mobile phone on a train. That'll show you every gap in your error handling within about ten minutes.
Keep your message payloads small. Every byte you send is multiplied by the number of recipients. A 2 KB message broadcast to 1,000 users is 2 MB of bandwidth. A 200-byte message is 200 KB. At scale, this difference is your server budget. Be aggressive about trimming payloads — send IDs instead of full objects, use short field names in high-frequency messages, and consider binary encoding for anything sent more than a few times per second.
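The difference is easy to see by measuring the serialized size of a full object against a trimmed one. The fields below are made up for illustration:

```typescript
// A "full object" broadcast: convenient for the client, expensive
// to fan out.
const fullMessage = {
  type: 'presence_update',
  user: {
    id: 'u_8f3a2b',
    displayName: 'Ada Lovelace',
    avatarUrl: 'https://example.com/avatars/u_8f3a2b.png',
    bio: 'Mathematician. Writer. First programmer.',
  },
  room: { id: 'r_42', name: 'General Discussion', topic: 'Anything goes' },
  status: 'online',
};

// Recipients already have user and room objects cached by ID, so
// the broadcast only needs the IDs and the field that changed.
const trimmedMessage = { t: 'p', u: 'u_8f3a2b', r: 'r_42', s: 'online' };

const fullBytes = JSON.stringify(fullMessage).length;
const trimmedBytes = JSON.stringify(trimmedMessage).length;
// Broadcast to 1,000 users, the saving per message is
// (fullBytes - trimmedBytes) * 1000 bytes.
```

The trade-off is readability: keep short field names confined to your highest-frequency message types and document the mapping in one place.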
Plan for server restarts. Deploying a new version of your WebSocket server means killing all connections. Every connected client will disconnect and reconnect. If you have 50,000 connected clients, that's 50,000 simultaneous reconnections hitting your new server. Use rolling deploys, drain connections gracefully (send a "server shutting down, please reconnect in X seconds" message with staggered delays), and make sure your reconnection backoff has enough jitter to spread the load.
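The staggered-delay part can be as simple as assigning each client a random reconnect time inside a drain window. A sketch, with an assumed message shape:

```typescript
// During a graceful shutdown, tell each client to reconnect at a
// random point inside a drain window, so 50,000 clients don't all
// reconnect in the same second.
function drainDelayMs(drainWindowMs: number): number {
  return Math.floor(Math.random() * drainWindowMs);
}

function shutdownNotice(drainWindowMs: number): string {
  return JSON.stringify({
    type: 'server_shutdown',
    reconnectInMs: drainDelayMs(drainWindowMs),
  });
}

// With a 30-second window, each client gets a delay in [0, 30000),
// spreading the reconnection load across the whole window.
const notice = JSON.parse(shutdownNotice(30_000));
```

Send this to every connection, keep serving for the length of the window, then close whatever is left; clients that miss the notice fall back to their normal reconnection backoff.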
WebSockets are powerful, but they're also one of the most operationally complex things you can add to a web application. Every persistent connection is a resource your server is holding onto, a state you need to manage, and a failure mode you need to handle. When you need them, they're worth the complexity. When you don't, save yourself the trouble.
The real skill isn't making WebSockets work. It's knowing when they're the right tool — and building them so they degrade gracefully when they inevitably break.