The hard-won lessons from years of migrating production databases without downtime. Lock contention, expand-contract patterns, online schema tools, ORM pitfalls, and the incident stories nobody talks about.
It was a Friday afternoon — because it is always a Friday afternoon — when a junior developer ran a migration on our main orders table. The migration looked perfectly innocent: add a NOT NULL column with a default value to a table with 80 million rows. In PostgreSQL 10, this meant a full table rewrite. The entire table locked for 3 hours and 47 minutes. Every API endpoint that touched orders returned 500s. The on-call engineer's phone exploded. We lost an estimated $340,000 in failed transactions.
The migration itself? Five lines of SQL. The postmortem? Forty-seven pages.
I have been doing database migrations in production for over a decade now, across PostgreSQL, MySQL, and a brief painful stint with SQL Server. The single most important thing I have learned is this: the migration that terrifies you is rarely the one that causes the outage. It is the "simple" one that nobody reviews because it looks trivial. An ALTER TABLE here, an ADD COLUMN there. What could go wrong?
Everything. Everything can go wrong.
This post is the guide I wish I had when I started. Every pattern, every tool recommendation, every warning comes from real production incidents — either mine or ones I helped debug for other teams.
To understand why schema changes cause outages, you need to understand how PostgreSQL handles locks. Almost every form of ALTER TABLE acquires an ACCESS EXCLUSIVE lock on the table. This is the strongest lock type in PostgreSQL's lock hierarchy. While this lock is held, nothing else can read from or write to that table. Not SELECT, not INSERT, not UPDATE. Nothing.
Here is where it gets really nasty. Even if your ALTER TABLE is fast — say, adding a nullable column with no default, which is metadata-only and takes milliseconds — it still needs to acquire that ACCESS EXCLUSIVE lock. And it has to wait in the lock queue behind any currently running queries. And while it is waiting, it blocks every new query that comes in behind it.
Let me illustrate with a timeline:
T=0s Long-running SELECT starts (holds ACCESS SHARE lock)
T=1s ALTER TABLE requests ACCESS EXCLUSIVE lock — BLOCKED by the SELECT
T=2s New SELECT comes in — BLOCKED by the pending ALTER TABLE
T=3s Another SELECT — BLOCKED
T=4s INSERT — BLOCKED
T=5s 100 more queries — ALL BLOCKED
...
T=300s Original SELECT finishes
T=300s ALTER TABLE acquires lock, runs in 5ms, releases lock
T=300s All 500+ queued queries flood in simultaneously
That last part is the killer. Even after the ALTER TABLE completes, the thundering herd of queued queries can overwhelm your connection pool and cascade into an outage. I have seen this pattern take down services for 20 minutes even though the actual migration ran in under a second.
The first defense is always to set a lock timeout:
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN tracking_url TEXT;

If the lock cannot be acquired within 5 seconds, the statement aborts instead of sitting in the queue forever, blocking everything behind it. This single setting has prevented more outages than any other technique I know.
In your migration scripts, always set this:
BEGIN;
SET LOCAL lock_timeout = '3s';
SET LOCAL statement_timeout = '30s';
-- your DDL here
COMMIT;

The SET LOCAL ensures these settings only apply within the transaction. If the lock acquisition times out, you retry during a quieter period — maybe during a traffic valley at 3 AM, or after you have killed that one analytics query that someone left running for 45 minutes.
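If the retry should happen automatically rather than by hand, the pattern can live in a small migration-runner helper. A sketch, not a library recommendation — it assumes a node-postgres-style `exec` callback where errors expose the SQLSTATE as `err.code` (55P03 is `lock_not_available`, which is what a lock_timeout abort raises):

```javascript
// Run one DDL statement with lock_timeout/statement_timeout set, retrying
// with a pause whenever the lock cannot be acquired in time.
async function runDdlWithLockRetry(exec, ddl, { retries = 5, backoffMs = 30_000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      await exec('BEGIN');
      await exec("SET LOCAL lock_timeout = '3s'");
      await exec("SET LOCAL statement_timeout = '30s'");
      await exec(ddl);
      await exec('COMMIT');
      return attempt; // how many attempts it took
    } catch (err) {
      await exec('ROLLBACK');
      // Only retry on lock acquisition timeouts; anything else is a real error.
      if (err.code !== '55P03' || attempt >= retries) throw err;
      // Wait for a quieter moment before trying again.
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```

The retry budget and backoff interval depend on your traffic shape; the point is that a timed-out lock attempt is harmless and cheap to repeat.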
Before running any migration on a hot table, check what is currently holding locks:
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE datname = current_database()
AND state != 'idle'
AND pid != pg_backend_pid()
ORDER BY duration DESC;

Kill the long-runners if you can. But be aware that some long-running queries are legitimate — reporting queries, batch jobs, replication slots. You need a strategy beyond "hope the table is quiet."
The expand-contract pattern (sometimes called "parallel change" or "expand-migrate-contract") is the fundamental technique for zero-downtime schema changes. The idea is simple: never make a breaking change in a single step. Instead, break it into three phases that can each be deployed independently.
Phase 1 — Expand: Add the new structure alongside the old one. Both coexist. The application writes to both but reads from the old.
Phase 2 — Migrate: Backfill data into the new structure. Switch the application to read from the new structure. The old structure is still there as a safety net.
Phase 3 — Contract: Once you are confident the new structure is correct and the old one is no longer needed, remove the old structure.
Here is a concrete example. Say you want to rename a column from email to email_address:
-- Phase 1: Expand (deploy #1)
ALTER TABLE users ADD COLUMN email_address TEXT;
-- Phase 2: Migrate (deploy #2, after application code handles both)
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Application now reads from email_address, writes to both
-- Phase 3: Contract (deploy #3, after verification period)
ALTER TABLE users DROP COLUMN email;

Each phase gets its own deployment. Each deployment is independently safe to roll back. At no point does the application break.
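On the application side, the phase-by-phase read and write behavior of the rename can be made explicit. A minimal sketch in plain functions — the phase value would come from config or a feature flag, and the column names follow the example above:

```javascript
// What the application writes at each expand-contract phase of the
// email -> email_address rename.
function userWritePayload(email, phase) {
  if (phase === 'expand') return { email };                        // old column only
  if (phase === 'migrate') return { email, email_address: email }; // dual write
  return { email_address: email };                                 // contract: new only
}

// What the application reads at each phase. During migrate, the old
// column is the fallback until the backfill has completed.
function userEmail(row, phase) {
  if (phase === 'expand') return row.email;
  return row.email_address ?? row.email;
}
```

Keeping this logic in one place makes it trivial to audit which phase the code is in, and to delete the branching once the contract phase ships.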
The expand-contract pattern applies to almost every schema change: renaming columns or tables, changing column types, splitting or merging tables, and tightening constraints.
The cost is that every "simple" change becomes three deployments spread over days or weeks. That is the price of zero downtime. There is no shortcut.
PostgreSQL has some specific behaviors and improvements that matter a lot for migrations.
Before PostgreSQL 11, ALTER TABLE ... ADD COLUMN ... DEFAULT rewrote the entire table. Every single row had to be updated to include the new default value. On a table with 100 million rows, this could take hours — all while holding an ACCESS EXCLUSIVE lock.
PostgreSQL 11 changed this. Now, when you add a column with a non-volatile default, PostgreSQL stores the default value in the catalog metadata. Existing rows are not rewritten. They get the default value lazily when they are next read or updated. The ALTER TABLE completes in milliseconds regardless of table size.
-- This is now fast in PostgreSQL 11+ (milliseconds, not hours)
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';

This was one of the most important improvements in PostgreSQL's history for operational teams. But there are caveats:
A volatile default, such as DEFAULT clock_timestamp(), will still rewrite the table. DEFAULT now() is fine: now() is stable, not volatile, so it is evaluated once at ALTER time and stored in the catalog, which means every pre-existing row gets the same timestamp. That may or may not be what you want.

Regular CREATE INDEX acquires a lock that blocks writes (though not reads in PostgreSQL). On a large table, index creation can take minutes or hours.
CREATE INDEX CONCURRENTLY is the solution:
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

This builds the index without blocking writes. It takes longer than a regular index build (roughly 2-3x) because it has to scan the table twice and reconcile any changes that happened during the build. But it does not lock the table.
Important gotchas with concurrent index creation: it cannot run inside a transaction block, so migration frameworks that wrap every migration in a transaction need special handling; and if the build fails partway (a deadlock, a unique violation, a cancellation), it leaves behind an INVALID index that still occupies the name. Find invalid indexes with:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

Drop invalid indexes and retry. Do not leave them — they consume disk space, slow down writes (PostgreSQL still updates them), and confuse the query planner.
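A migration runner can automate the drop-and-retry dance. A sketch under the same assumption as before — an `exec` callback that runs one statement in autocommit mode, which is required here since CREATE INDEX CONCURRENTLY cannot run inside a transaction:

```javascript
// Build an index concurrently; on failure, drop the leftover INVALID
// index (a failed concurrent build always leaves one behind) and retry.
async function createIndexConcurrentlyWithRetry(exec, indexName, ddl, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await exec(ddl);
      return attempt;
    } catch (err) {
      // Free the index name before the next attempt.
      await exec(`DROP INDEX CONCURRENTLY IF EXISTS ${indexName}`);
      if (attempt === retries) throw err;
    }
  }
}
```

DROP INDEX CONCURRENTLY avoids taking a blocking lock during the cleanup itself, so the whole loop stays safe to run against a hot table.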
Adding a NOT NULL constraint to an existing column requires a full table scan to verify no existing rows violate the constraint. On a large table, this scan holds a lock for the entire duration.
The workaround in PostgreSQL 12+ is to use a CHECK constraint with NOT VALID, then validate it separately:
-- Step 1: Add the constraint without validating existing rows (fast, brief lock)
ALTER TABLE orders
ADD CONSTRAINT orders_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
-- Step 2: Validate existing rows (slow, but doesn't block reads/writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
-- Step 3: Now you can safely convert to a real NOT NULL
-- PostgreSQL recognizes the CHECK constraint and skips the full scan
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Step 4: Clean up the CHECK constraint
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;

Step 2 acquires only a ShareUpdateExclusiveLock, which allows reads and writes. The full table scan happens without blocking your application. Step 3 is then instant because PostgreSQL sees the existing CHECK constraint and knows the column cannot contain NULLs.
When the built-in PostgreSQL techniques are not enough — or when you are on MySQL — online schema migration tools become essential.
gh-ost is GitHub's tool for MySQL online schema changes. Unlike pt-online-schema-change, it does not use triggers. Instead, it reads from the binary log to track changes.
How it works internally:

1. gh-ost creates a "ghost" table with the new schema alongside the original.
2. It copies existing rows into the ghost table in small chunks.
3. It tails the binary log and applies ongoing writes to the ghost table as they happen.
4. When the copy is done and the ghost table has caught up, it performs an atomic cut-over: the original table is renamed to _old and the ghost table takes its name.

The binary log approach is brilliant because triggers have serious performance implications — they run synchronously within the original transaction, doubling write latency. gh-ost's approach is asynchronous and adds nearly zero overhead to the original table's write path.
gh-ost \
--host=db-replica.internal \
--database=myapp \
--table=orders \
--alter="ADD COLUMN tracking_url VARCHAR(512)" \
--execute

Key operational details: gh-ost can read the binary log from a replica instead of the primary, keeping the extra read load off the master; it throttles itself based on replica lag; and the cut-over can be postponed behind a flag file, so you decide exactly when the final rename happens.
Percona's pt-online-schema-change is the older MySQL approach. It uses triggers: it creates a copy of the table with the new schema, installs INSERT, UPDATE, and DELETE triggers on the original so ongoing writes are mirrored into the copy, copies existing rows over in chunks, and finally swaps the two tables with an atomic rename.
The trigger-based approach is simpler to reason about but has performance costs. Every write to the original table now also triggers a write to the new table. On write-heavy tables, this can double your write IOPS and significantly increase transaction latency.
pgroll is a newer tool from Xata specifically for PostgreSQL. It takes a different approach — instead of copying tables, it uses views and triggers to present multiple schema versions simultaneously.
When you start a migration with pgroll, it creates a new versioned schema containing views over the underlying tables. Physical changes and backfills happen on the real table, while the views present both the old and the new shape of the schema at the same time. Old application code keeps reading the old version; new code reads the new one. When you complete the migration, the old version is removed.
This is particularly powerful for blue-green deployments because both the old and new application versions can run against the same database simultaneously, each seeing the schema they expect. Migrations are declared as JSON:
{
"name": "add_tracking_url",
"operations": [
{
"add_column": {
"table": "orders",
"column": {
"name": "tracking_url",
"type": "text",
"nullable": true
}
}
}
]
}

ORMs generate migrations. This is convenient during development and dangerous in production. Let me be specific about the three big ones in the Node.js ecosystem.
Prisma generates SQL migrations from schema diffs. You change your schema.prisma file, run prisma migrate dev, and it produces a SQL migration file. In development, this workflow is fantastic.
In production, it has sharp edges. Consider changing a nullable column to a required one:
// Before
model Order {
id Int @id @default(autoincrement())
status String?
}
// After — you changed String? to String (nullable to required)
model Order {
id Int @id @default(autoincrement())
status String @default("pending")
}

Prisma will generate something like:
-- This will FAIL if any rows have NULL status
ALTER TABLE "Order" ALTER COLUMN "status" SET NOT NULL;
ALTER TABLE "Order" ALTER COLUMN "status" SET DEFAULT 'pending';No backfill. No expand-contract. Just a raw ALTER TABLE that will fail if any rows have NULL values, or lock the table while scanning all rows to verify the NOT NULL constraint. On a table with 50 million rows, you just bought yourself an outage.
The right approach is to edit the generated migration file before applying it:
-- Step 1: Backfill NULLs
UPDATE "Order" SET status = 'pending' WHERE status IS NULL;
-- Step 2: Add NOT NULL with the CHECK constraint trick
ALTER TABLE "Order"
ADD CONSTRAINT order_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE "Order" VALIDATE CONSTRAINT order_status_not_null;
ALTER TABLE "Order" ALTER COLUMN status SET NOT NULL;
ALTER TABLE "Order" DROP CONSTRAINT order_status_not_null;
ALTER TABLE "Order" ALTER COLUMN status SET DEFAULT 'pending';But now you are maintaining hand-written SQL alongside Prisma, which defeats much of the purpose.
Drizzle's migration generator is more transparent because you can see and modify the SQL directly. It generates migration files that you can edit freely. The downside is the same: the generated SQL is naive about production concerns.
// Drizzle schema change
export const orders = pgTable('orders', {
id: serial('id').primaryKey(),
status: text('status').notNull().default('pending'),
// New column
trackingUrl: text('tracking_url'),
});

Drizzle generates:

ALTER TABLE "orders" ADD COLUMN "tracking_url" text;

This particular migration is fine — adding a nullable column is metadata-only in PostgreSQL. But Drizzle does not generate CREATE INDEX CONCURRENTLY. It does not split migrations into expand-contract phases. It does not set lock timeouts. You have to add all of that yourself.
TypeORM's migration story is the weakest of the three for production use. Its synchronize: true mode will happily drop columns and tables to match your entity definitions. I have personally seen a production database lose a column because someone deployed with synchronize: true in the production config. The data was gone. Permanently.
Never, under any circumstances, use synchronize: true in production. Use TypeORM's migration generator, review every generated file, and add the safety measures manually.
Use ORM migration generators as a starting point, not as the final product. Every generated migration should be reviewed and edited for production safety before it runs against a real database. The generators do not know about your table sizes, your traffic patterns, your lock contention issues, or your uptime requirements.
When you add a new column that needs to be populated from existing data, you have a backfill problem. The naive approach — a single UPDATE on millions of rows — will blow out your WAL, spike your disk I/O, and potentially lock the table for minutes.
Break the backfill into small batches:
-- Backfill in batches of 10,000
DO $$
DECLARE
batch_size INT := 10000;
rows_updated INT;
BEGIN
LOOP
UPDATE orders
SET tracking_url = compute_tracking_url(carrier, tracking_number)
WHERE tracking_url IS NULL
AND id IN (
SELECT id FROM orders
WHERE tracking_url IS NULL
LIMIT batch_size
FOR UPDATE SKIP LOCKED
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
RAISE NOTICE 'Updated % rows', rows_updated;
IF rows_updated = 0 THEN
EXIT;
END IF;
-- Commit each batch so locks and WAL are released incrementally.
-- (Transaction control inside DO blocks requires PostgreSQL 11+ and
-- only works when the block is not run inside an outer transaction.)
COMMIT;
-- Brief pause to let other transactions breathe
PERFORM pg_sleep(0.1);
END LOOP;
END $$;

The FOR UPDATE SKIP LOCKED is critical — it prevents the batch update from blocking on rows that are currently locked by application transactions.
For very large backfills, I prefer running them as background jobs in the application layer:
async function backfillTrackingUrls() {
const BATCH_SIZE = 5000;
let totalUpdated = 0;
while (true) {
const result = await db.execute(sql`
WITH batch AS (
SELECT id, carrier, tracking_number
FROM orders
WHERE tracking_url IS NULL
LIMIT ${BATCH_SIZE}
)
UPDATE orders o
SET tracking_url = CASE
WHEN b.carrier = 'fedex' THEN 'https://track.fedex.com/' || b.tracking_number
WHEN b.carrier = 'ups' THEN 'https://track.ups.com/' || b.tracking_number
ELSE NULL
END
FROM batch b
WHERE o.id = b.id
RETURNING o.id
`);
totalUpdated += result.rowCount;
console.log(`Backfilled ${totalUpdated} rows`);
if (result.rowCount === 0) break;
// Throttle based on replication lag or server load
await sleep(200);
}
}

This approach gives you better control: you can monitor progress, throttle based on server load, pause and resume, and log the progress.
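The "throttle based on replication lag" comment can be made concrete. A sketch — the lag reading would come from pg_stat_replication (replay lag in seconds), and the thresholds here are illustrative assumptions, not recommendations:

```javascript
// Compute the pause between backfill batches from current replica lag:
// no lag -> base delay; growing lag -> proportionally longer pauses;
// past a hard threshold -> back off for a full minute so replicas catch up.
function backfillDelayMs(replicaLagSeconds, { baseMs = 200, hardLagSeconds = 30 } = {}) {
  if (replicaLagSeconds >= hardLagSeconds) return 60_000;
  return Math.round(baseMs * (1 + replicaLagSeconds));
}
```

The shape of the curve matters less than the hard cutoff: a backfill that keeps hammering a primary while its replicas fall behind can turn a routine migration into a failover risk.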
For columns that need to be populated going forward while you backfill old data, use the dual-write pattern:

1. Deploy application code that writes the new column on every INSERT and UPDATE, while still treating it as optional on reads.
2. Backfill historical rows in batches.
3. Once the backfill is verified complete, switch reads over to the new column.
The tricky part is step 1. Your application code needs to handle the case where the new column does not exist yet (during the deployment rollout) and where it does. Feature flags help here.
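One way to make the write path safe during the rollout window is to catch the "column does not exist" error and fall back to the old write shape. A sketch — the `exec` query runner and the orders columns follow the earlier examples, and it assumes driver errors expose the SQLSTATE as `err.code` (42703 is `undefined_column`):

```javascript
// Dual-write that tolerates the new column not existing yet (mid-rollout)
// by falling back to the old INSERT shape on SQLSTATE 42703.
async function insertOrder(exec, order) {
  try {
    return await exec(
      'INSERT INTO orders (customer_id, tracking_url) VALUES ($1, $2)',
      [order.customerId, order.trackingUrl],
    );
  } catch (err) {
    if (err.code !== '42703') throw err; // only swallow undefined_column
    // New column not deployed yet: write the old shape.
    return await exec('INSERT INTO orders (customer_id) VALUES ($1)', [order.customerId]);
  }
}
```

A feature flag is cleaner than exception-driven fallback for the long term, but this kind of guard covers the window where flag state and schema state can briefly disagree.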
This is one of the most common migration patterns and one of the most frequently botched. You have a column that was originally nullable, and now you want to make it required.
The safe multi-step process:
Step 1: Add a default value at the application level. Every INSERT and UPDATE should now provide a value for this column. Deploy this first and verify it is working.
Step 2: Backfill all existing NULL rows with the appropriate default value.
-- Do this in batches for large tables
UPDATE users SET timezone = 'UTC' WHERE timezone IS NULL;

Step 3: Add a CHECK constraint without validation to prevent new NULLs immediately:
ALTER TABLE users
ADD CONSTRAINT users_timezone_not_null
CHECK (timezone IS NOT NULL) NOT VALID;

Step 4: Validate the constraint (this scans all rows but does not block writes):
ALTER TABLE users VALIDATE CONSTRAINT users_timezone_not_null;

Step 5: Now add the actual NOT NULL constraint (instant because PostgreSQL sees the CHECK):
ALTER TABLE users ALTER COLUMN timezone SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_timezone_not_null;

If you skip to step 5 directly, PostgreSQL scans every row while holding a lock. On a table with 100 million rows, that scan takes minutes. With the CHECK constraint trick, the lock duration is milliseconds.
Adding a foreign key to a large table is another migration that looks simple but can cause serious problems.
-- This looks innocent. It is not.
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id);

This acquires a SHARE ROW EXCLUSIVE lock on both orders and customers — blocking all writes to both tables — then scans every row to validate that all customer_id values reference valid rows in customers. If orders has 200 million rows, this can take a very long time.
The same NOT VALID trick works here:
-- Step 1: Add constraint without validation (brief lock)
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id)
NOT VALID;
-- Step 2: Validate separately (scans table but allows concurrent reads/writes)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

The NOT VALID constraint still enforces the foreign key for all new and updated rows immediately — it just skips the validation scan for existing rows. Step 2 then validates existing rows without blocking.
Foreign key constraints create implicit locks on referenced tables. When you INSERT into orders (which has an FK to customers), PostgreSQL takes a share lock on the referenced row in customers to ensure it is not deleted during the transaction. This can cause deadlocks when two transactions are modifying related rows in both tables simultaneously.
I once debugged a production incident where adding a foreign key between orders and products caused a spike in deadlocks. The root cause was a background job that was updating product prices (locking product rows) at the same time as the checkout flow was creating orders (which now needed to lock the same product rows for FK validation). The solution was to adjust the background job to run during off-peak hours and to use shorter transactions.
When adding foreign keys, always check for patterns in your application that touch both tables in the same transaction. Add monitoring for deadlocks before and after the change:
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname = current_database();

Blue-green deployments are standard for application code. You have two environments — blue (current production) and green (new version). You switch traffic from blue to green, and if something goes wrong, you switch back.
When database schema changes are involved, this gets complicated. The database is shared between blue and green, so both application versions must be compatible with the database schema at all times during the transition.
This is the compatibility window concept. At any point during the deployment, you might have old and new application code running against the same database. Your schema must be compatible with both.
This rules out certain changes during a blue-green deploy: dropping or renaming a column that the old version still reads, changing a column's type in place, and adding a NOT NULL column without a default (the old version's INSERTs would immediately start failing).
The solution is always the expand-contract pattern, with each phase aligned to a deployment:
Deploy #1 (green): Add new column, application writes to both, reads from old
-> Switch traffic to green
-> Both blue and green compatible with current schema
Deploy #2 (green): Backfill complete, switch reads to new column
-> Switch traffic to green
-> Old column still exists for rollback
Deploy #3 (green): Drop old column
-> Switch traffic to green
-> No rollback possible to pre-migration state (but you shouldn't need it by now)
Each deploy maintains backward compatibility. Each deploy can be rolled back independently. The cost is three deployments instead of one.
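A deployment pipeline can enforce the compatibility window mechanically by classifying each change before it is allowed to run alongside a rollout. A sketch — the change categories are my own illustrative taxonomy, not an exhaustive one:

```javascript
// Returns true if a schema change is safe while old and new application
// versions share the database (the blue-green compatibility window).
function safeDuringRollout(change) {
  const alwaysSafe = new Set([
    'add_nullable_column',       // old code simply ignores it
    'add_table',                 // nothing references it yet
    'create_index_concurrently', // invisible to application logic
  ]);
  const neverSafe = new Set([
    'drop_column',                  // old code still reads it
    'rename_column',                // breaks one version or the other
    'change_column_type',           // old code writes the old type
    'add_not_null_without_default', // old code's INSERTs start failing
  ]);
  if (alwaysSafe.has(change)) return true;
  if (neverSafe.has(change)) return false;
  return false; // unknown changes default to "needs expand-contract"
}
```

Defaulting unknown changes to unsafe is deliberate: the cost of a false negative is an extra expand-contract cycle, while the cost of a false positive is an outage.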
The migration that works perfectly on your development database with 1,000 rows will destroy production with 100 million rows. You absolutely must test against production-sized data. Here are the approaches I have used:
If you are on managed PostgreSQL (RDS, Cloud SQL, Supabase), create a snapshot of production and restore it to a test instance. Run your migration against the test instance, measure the time, check for lock contention.
# RDS example
aws rds create-db-snapshot \
--db-instance-identifier prod-db \
--db-snapshot-identifier migration-test-20260322
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier migration-test \
--db-snapshot-identifier migration-test-20260322

When you cannot use production data (compliance, GDPR, sensitivity), generate synthetic data at production scale:
-- Generate 100M realistic-looking orders
INSERT INTO orders (customer_id, status, created_at, total_amount)
SELECT
(random() * 1000000)::int,
(ARRAY['pending', 'shipped', 'delivered', 'cancelled'])[floor(random() * 4 + 1)],
now() - (random() * interval '365 days'),
(random() * 10000)::numeric(10,2)
FROM generate_series(1, 100000000);

This takes a while, but it gives you a table that matches production's row count and data distribution.
For quick sanity checks, note that PostgreSQL cannot EXPLAIN DDL statements. What you can do is EXPLAIN the scan a migration implies. For a CHECK constraint, that is the query that would look for violating rows:
-- See what validating the constraint would cost, without running the DDL
EXPLAIN SELECT 1 FROM orders
WHERE status NOT IN ('pending', 'shipped', 'delivered', 'cancelled');

This does not tell you how long the validation will take, but it tells you whether PostgreSQL will scan the whole table.
Before running any migration, know your table sizes:
SELECT
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
n_live_tup AS row_count,
n_dead_tup AS dead_rows,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

If the table has 500 million rows and 50 GB of data, you should not be running naive ALTER TABLE statements against it.
Here is an uncomfortable truth that most migration guides gloss over: rollback migrations almost never work in production.
Think about it. You run a forward migration that adds a column and starts populating it with data. The rollback drops that column. But during the time between the forward migration and the rollback, new rows were inserted with data in that column. Data you might need. Data that is now gone.
Or consider a more complex scenario: you split a table into two. The forward migration creates the new table, sets up dual-writes, and starts backfilling. The rollback would need to... merge the tables back? What about data that was written only to the new table during the split?
After the third incident where a rollback migration caused more damage than the original problem, I adopted a different philosophy: rollback migrations are a code smell. If your forward migration requires a complex rollback, your forward migration is too aggressive.
Instead, I use these strategies:
Fix forward: If a migration causes problems, deploy a new forward migration that fixes the issue. This is safer because you are always moving in one direction and you can test the fix before deploying it.
Expand-contract with long verification periods: Between each phase of an expand-contract migration, wait days or weeks. Run both old and new structures in parallel. Monitor for issues. Only proceed to the next phase when you are confident.
Logical rollbacks: Instead of undoing DDL, undo the application behavior. If the new column is causing problems, deploy application code that ignores it. The column still exists in the database, but nothing uses it. Fix the issue, then start using it again.
// "Logical rollback" — feature flag controls whether we use the new column
const useTrackingUrl = await featureFlags.isEnabled('use_tracking_url');
const order = useTrackingUrl
? await getOrderWithTracking(orderId)
  : await getOrderLegacy(orderId);

The one case where rollback migrations are safe and reasonable is simple additive changes. If your forward migration added a nullable column with no default, the rollback is just DROP COLUMN. No data loss that matters (the column was just added and is likely empty or irrelevant). No complex logic. This is fine.
-- Forward
ALTER TABLE orders ADD COLUMN tracking_url TEXT;
-- Rollback (safe because the column was just added)
ALTER TABLE orders DROP COLUMN tracking_url;

Let me tell you about the worst migration incident I have ever been involved with. It is a case study in everything that can go wrong.
The setup: a SaaS platform with a sessions table. 200 million rows. PostgreSQL 12. The table was hot — every API request read from or wrote to it. Average traffic: 3,000 queries per second hitting this table.
The migration: add a device_type column with a default of 'unknown'.
The developer knew about the PostgreSQL 11 improvement for column defaults. They tested on a staging database with 1 million rows. The migration took 100 milliseconds. They figured production would be similarly fast. They were right — the DDL itself would have taken milliseconds.
But they did not account for the lock acquisition.
Here is what happened:
2:14 PM — A reporting query started a sequential scan of the sessions table. Estimated runtime: 8 minutes. This query held an ACCESS SHARE lock.
2:15 PM — The developer ran the migration. The ALTER TABLE requested an ACCESS EXCLUSIVE lock. It was blocked by the reporting query.
2:15 PM — Immediately, all new queries to the sessions table began queuing behind the pending ALTER TABLE.
2:16 PM — The application's connection pool maxed out. New requests started getting connection timeouts. Error rates spiked to 40%.
2:17 PM — The on-call engineer was paged. They saw the error rate spike but did not immediately connect it to a migration. They started investigating application-level issues.
2:20 PM — The engineer found the lock contention in pg_stat_activity. They killed the reporting query.
2:20 PM — The ALTER TABLE acquired its lock, ran in 12 milliseconds, and released the lock. The queued queries flooded in.
2:20 PM — The thundering herd overwhelmed PostgreSQL. 800 simultaneous queries hit the table. PostgreSQL's CPU went to 100%. The connection pool overflowed again.
2:25 PM — The team started killing queries to reduce load. But new queries kept coming in. The application's retry logic was creating even more load.
2:30 PM — They made the decision to restart the application servers with reduced concurrency. This took 5 minutes across all pods.
2:35 PM — Servers came back with reduced connection pools. Load stabilized.
2:40 PM — They discovered that the thundering herd had caused several deadlocks, and some transactions had been rolled back but not properly retried by the application.
3:00 PM — Full incident response. They found 847 orders that were in an inconsistent state due to the partial transaction failures.
4:00 PM — Manual data reconciliation began.
6:00 PM — All inconsistent orders were fixed. The incident was declared over.
Total impact: 3 hours and 45 minutes of degraded service. 847 orders required manual intervention. The migration itself was 12 milliseconds.
The postmortem identified several fixes:
Always set lock_timeout. If the developer had used SET lock_timeout = '3s', the migration would have failed harmlessly after 3 seconds instead of queuing hundreds of queries.
Kill long-running queries before migrating hot tables. Or better: schedule migrations during maintenance windows.
The application needed better circuit breaking. When the connection pool maxed out, the retry logic made things worse.
Connection pool limits should be lower than PostgreSQL's max_connections. This prevents the thundering herd from overwhelming the database.
Monitor pg_stat_activity during migrations. Have a dashboard showing lock waits in real time.
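The retry-storm problem in particular has a cheap, well-known mitigation: exponential backoff with full jitter, so queued clients do not all retry in lockstep the moment the database comes back. A sketch — the base delay and cap are illustrative, and the injectable `random` exists only to make the function testable:

```javascript
// Full-jitter exponential backoff: the delay before retry attempt n is a
// uniform random value between 0 and min(cap, base * 2^n). Randomizing the
// whole window spreads retries out instead of synchronizing a thundering herd.
function retryDelayMs(attempt, { baseMs = 100, capMs = 10_000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Pair this with a hard retry limit and a circuit breaker: backoff slows the herd down, but only a breaker stops clients from piling new work onto a database that is already saturated.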
After all these incidents and lessons, I now use this checklist for every production migration:
Know your table size. Run pg_stat_user_tables for row counts. Run pg_total_relation_size() for disk size. If either number is large, proceed with extreme caution.
Set lock_timeout and statement_timeout. Always. No exceptions.
Check for long-running queries before starting the migration. Kill them if possible.
Test on production-sized data. A staging database with 1,000 rows tells you nothing about a production table with 100 million rows.
Use CONCURRENTLY for index creation. Every time. Even if the table is small now.
Use NOT VALID for constraint additions. Add the constraint without validation, then validate separately.
Backfill in batches. Never update millions of rows in a single statement.
Deploy in expand-contract phases. Never make a breaking change in one step.
Monitor after deployment. Watch for deadlocks, slow queries, increased lock wait times.
Do not trust rollback migrations. Plan to fix forward instead.
Zero-downtime migrations are not about clever tricks or fancy tools. They are about discipline. Every schema change, no matter how small, deserves a moment of thought: "What will this do on a table with 200 million rows during peak traffic?" If the answer is anything other than "complete in milliseconds without blocking," you need a plan.
The expand-contract pattern adds overhead to every migration. Lock timeouts add complexity to your deployment scripts. Testing against production-sized data takes time and money. But the alternative is the 3 AM phone call, the weekend-long incident, the 47-page postmortem, and the $340,000 in lost transactions.
I know which one I prefer.
The database does not care about your deployment schedule, your sprint commitments, or your Friday evening plans. It will take a lock and hold it for as long as it needs to. Your job is to make sure that "as long as it needs to" is measured in milliseconds, not hours.
Every migration in this post has one thing in common: it trades developer convenience for operational safety. You write more SQL. You deploy more often. You wait longer between phases. And your users never notice a thing. That is the goal. That is zero-downtime migration done right.