The hard-won lessons from years of migrating production databases without downtime. Lock contention, expand-contract patterns, online schema tools, ORM pitfalls, and the incident stories nobody talks about.
It was a Friday afternoon — because it is always a Friday afternoon — when a junior developer ran a migration on our main orders table. The migration looked perfectly innocent: add a NOT NULL column with a default value to a table with 80 million rows. In PostgreSQL 10, this meant a full table rewrite. The entire table locked for 3 hours and 47 minutes. Every API endpoint that touched orders returned 500s. The on-call engineer's phone exploded. We lost an estimated $340,000 in failed transactions.
The migration itself? Five lines of SQL. The postmortem? Forty-seven pages.
I have been doing database migrations in production for over a decade now, across PostgreSQL, MySQL, and a brief painful stint with SQL Server. The single most important thing I have learned is this: the migration that terrifies you is rarely the one that causes the outage. It is the "simple" one that nobody reviews because it looks trivial. An ALTER TABLE here, an ADD COLUMN there. What could go wrong?
Everything. Everything can go wrong.
This post is the guide I wish I had when I started. Every pattern, every tool recommendation, every warning comes from real production incidents — either mine or ones I helped debug for other teams.
To understand why schema changes cause outages, you need to understand how PostgreSQL handles locks. Almost every form of ALTER TABLE acquires an ACCESS EXCLUSIVE lock on the table. This is the strongest lock type in PostgreSQL's lock hierarchy. While this lock is held, nothing else can read from or write to that table. Not SELECT, not INSERT, not UPDATE. Nothing.
Here is where it gets really nasty. Even if your ALTER TABLE is fast — say, adding a nullable column with no default, which is metadata-only and takes milliseconds — it still needs to acquire that ACCESS EXCLUSIVE lock. And it has to wait in the lock queue behind any currently running queries. And while it is waiting, it blocks every new query that comes in behind it.
Let me illustrate with a timeline:
T=0s Long-running SELECT starts (holds ACCESS SHARE lock)
T=1s ALTER TABLE requests ACCESS EXCLUSIVE lock — BLOCKED by the SELECT
T=2s New SELECT comes in — BLOCKED by the pending ALTER TABLE
T=3s Another SELECT — BLOCKED
T=4s INSERT — BLOCKED
T=5s 100 more queries — ALL BLOCKED
...
T=300s Original SELECT finishes
T=300s ALTER TABLE acquires lock, runs in 5ms, releases lock
T=300s All 500+ queued queries flood in simultaneously
That last part is the killer. Even after the ALTER TABLE completes, the thundering herd of queued queries can overwhelm your connection pool and cascade into an outage. I have seen this pattern take down services for 20 minutes even though the actual migration ran in under a second.
The first defense is always to set a lock timeout:
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN tracking_url TEXT;

If the lock cannot be acquired within 5 seconds, the statement aborts instead of sitting in the queue forever, blocking everything behind it. This single setting has prevented more outages than any other technique I know.
In your migration scripts, always set this:
BEGIN;
SET LOCAL lock_timeout = '3s';
SET LOCAL statement_timeout = '30s';
-- your DDL here
COMMIT;

The SET LOCAL ensures these settings only apply within the transaction. If the lock acquisition times out, you retry during a quieter period — maybe during a traffic valley at 3 AM, or after you have killed that one analytics query that someone left running for 45 minutes.
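If the retry should happen automatically rather than by hand, the pattern can live in a small migration-runner helper. A sketch, not a library recommendation — it assumes a node-postgres-style `exec` callback where errors expose the SQLSTATE as `err.code` (55P03 is `lock_not_available`, which is what a lock_timeout abort raises):

```javascript
// Run one DDL statement with lock_timeout/statement_timeout set, retrying
// with a pause whenever the lock cannot be acquired in time.
async function runDdlWithLockRetry(exec, ddl, { retries = 5, backoffMs = 30_000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      await exec('BEGIN');
      await exec("SET LOCAL lock_timeout = '3s'");
      await exec("SET LOCAL statement_timeout = '30s'");
      await exec(ddl);
      await exec('COMMIT');
      return attempt; // how many attempts it took
    } catch (err) {
      await exec('ROLLBACK');
      // Only retry on lock acquisition timeouts; anything else is a real error.
      if (err.code !== '55P03' || attempt >= retries) throw err;
      // Wait for a quieter moment before trying again.
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```

The retry budget and backoff interval depend on your traffic shape; the point is that a timed-out lock attempt is harmless and cheap to repeat.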
Before running any migration on a hot table, check what is currently holding locks:
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE datname = current_database()
AND state != 'idle'
AND pid != pg_backend_pid()
ORDER BY duration DESC;

Kill the long-runners if you can. But be aware that some long-running queries are legitimate — reporting queries, batch jobs, replication slots. You need a strategy beyond "hope the table is quiet."
The expand-contract pattern (sometimes called "parallel change" or "expand-migrate-contract") is the fundamental technique for zero-downtime schema changes. The idea is simple: never make a breaking change in a single step. Instead, break it into three phases that can each be deployed independently.
Phase 1 — Expand: Add the new structure alongside the old one. Both coexist. The application writes to both but reads from the old.
Phase 2 — Migrate: Backfill data into the new structure. Switch the application to read from the new structure. The old structure is still there as a safety net.
Phase 3 — Contract: Once you are confident the new structure is correct and the old one is no longer needed, remove the old structure.
Here is a concrete example. Say you want to rename a column from email to email_address:
-- Phase 1: Expand (deploy #1)
ALTER TABLE users ADD COLUMN email_address TEXT;
-- Phase 2: Migrate (deploy #2, after application code handles both)
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Application now reads from email_address, writes to both
-- Phase 3: Contract (deploy #3, after verification period)
ALTER TABLE users DROP COLUMN email;

Each phase gets its own deployment. Each deployment is independently safe to roll back. At no point does the application break.
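On the application side, the phase-by-phase read and write behavior of the rename can be made explicit. A minimal sketch in plain functions — the phase value would come from config or a feature flag, and the column names follow the example above:

```javascript
// What the application writes at each expand-contract phase of the
// email -> email_address rename.
function userWritePayload(email, phase) {
  if (phase === 'expand') return { email };                        // old column only
  if (phase === 'migrate') return { email, email_address: email }; // dual write
  return { email_address: email };                                 // contract: new only
}

// What the application reads at each phase. During migrate, the old
// column is the fallback until the backfill has completed.
function userEmail(row, phase) {
  if (phase === 'expand') return row.email;
  return row.email_address ?? row.email;
}
```

Keeping this logic in one place makes it trivial to audit which phase the code is in, and to delete the branching once the contract phase ships.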
The expand-contract pattern applies to almost every schema change: renaming columns or tables, changing column types, splitting or merging tables, and tightening constraints.
The cost is that every "simple" change becomes three deployments spread over days or weeks. That is the price of zero downtime. There is no shortcut.
PostgreSQL has some specific behaviors and improvements that matter a lot for migrations.
Before PostgreSQL 11, ALTER TABLE ... ADD COLUMN ... DEFAULT rewrote the entire table. Every single row had to be updated to include the new default value. On a table with 100 million rows, this could take hours — all while holding an ACCESS EXCLUSIVE lock.
PostgreSQL 11 changed this. Now, when you add a column with a non-volatile default, PostgreSQL stores the default value in the catalog metadata. Existing rows are not rewritten. They get the default value lazily when they are next read or updated. The ALTER TABLE completes in milliseconds regardless of table size.
-- This is now fast in PostgreSQL 11+ (milliseconds, not hours)
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';

This was one of the most important improvements in PostgreSQL's history for operational teams. But there are caveats:
A volatile default, such as DEFAULT clock_timestamp(), will still rewrite the table. DEFAULT now() is fine: now() is stable, not volatile, so it is evaluated once at ALTER time and stored in the catalog, which means every pre-existing row gets the same timestamp. That may or may not be what you want.

Regular CREATE INDEX acquires a lock that blocks writes (though not reads in PostgreSQL). On a large table, index creation can take minutes or hours.
CREATE INDEX CONCURRENTLY is the solution:
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

This builds the index without blocking writes. It takes longer than a regular index build (roughly 2-3x) because it has to scan the table twice and reconcile any changes that happened during the build. But it does not lock the table.
Important gotchas with concurrent index creation: it cannot run inside a transaction block, so migration frameworks that wrap every migration in a transaction need special handling; and if the build fails partway (a deadlock, a unique violation, a cancellation), it leaves behind an INVALID index that still occupies the name. Find invalid indexes with:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

Drop invalid indexes and retry. Do not leave them — they consume disk space, slow down writes (PostgreSQL still updates them), and confuse the query planner.
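A migration runner can automate the drop-and-retry dance. A sketch under the same assumption as before — an `exec` callback that runs one statement in autocommit mode, which is required here since CREATE INDEX CONCURRENTLY cannot run inside a transaction:

```javascript
// Build an index concurrently; on failure, drop the leftover INVALID
// index (a failed concurrent build always leaves one behind) and retry.
async function createIndexConcurrentlyWithRetry(exec, indexName, ddl, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await exec(ddl);
      return attempt;
    } catch (err) {
      // Free the index name before the next attempt.
      await exec(`DROP INDEX CONCURRENTLY IF EXISTS ${indexName}`);
      if (attempt === retries) throw err;
    }
  }
}
```

DROP INDEX CONCURRENTLY avoids taking a blocking lock during the cleanup itself, so the whole loop stays safe to run against a hot table.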
Adding a NOT NULL constraint to an existing column requires a full table scan to verify no existing rows violate the constraint. On a large table, this scan holds a lock for the entire duration.
The workaround in PostgreSQL 12+ is to use a CHECK constraint with NOT VALID, then validate it separately:
-- Step 1: Add the constraint without validating existing rows (fast, brief lock)
ALTER TABLE orders
ADD CONSTRAINT orders_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
-- Step 2: Validate existing rows (slow, but doesn't block reads/writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;
-- Step 3: Now you can safely convert to a real NOT NULL
-- PostgreSQL recognizes the CHECK constraint and skips the full scan
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Step 4: Clean up the CHECK constraint
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;

Step 2 acquires only a ShareUpdateExclusiveLock, which allows reads and writes. The full table scan happens without blocking your application. Step 3 is then instant because PostgreSQL sees the existing CHECK constraint and knows the column cannot contain NULLs.
When the built-in PostgreSQL techniques are not enough — or when you are on MySQL — online schema migration tools become essential.
gh-ost is GitHub's tool for MySQL online schema changes. Unlike pt-online-schema-change, it does not use triggers. Instead, it reads from the binary log to track changes.
How it works internally:

1. gh-ost creates a "ghost" table with the new schema alongside the original.
2. It copies existing rows into the ghost table in small chunks.
3. It tails the binary log and applies ongoing writes to the ghost table as they happen.
4. When the copy is done and the ghost table has caught up, it performs an atomic cut-over: the original table is renamed to _old and the ghost table takes its name.

The binary log approach is brilliant because triggers have serious performance implications — they run synchronously within the original transaction, doubling write latency. gh-ost's approach is asynchronous and adds nearly zero overhead to the original table's write path.
gh-ost \
--host=db-replica.internal \
--database=myapp \
--table=orders \
--alter="ADD COLUMN tracking_url VARCHAR(512)" \
--execute

Key operational details: gh-ost can read the binary log from a replica instead of the primary, keeping the extra read load off the master; it throttles itself based on replica lag; and the cut-over can be postponed behind a flag file, so you decide exactly when the final rename happens.
Percona's pt-online-schema-change is the older MySQL approach. It uses triggers: it creates a copy of the table with the new schema, installs INSERT, UPDATE, and DELETE triggers on the original so ongoing writes are mirrored into the copy, copies existing rows over in chunks, and finally swaps the two tables with an atomic rename.
The trigger-based approach is simpler to reason about but has performance costs. Every write to the original table now also triggers a write to the new table. On write-heavy tables, this can double your write IOPS and significantly increase transaction latency.
pgroll is a newer tool from Xata specifically for PostgreSQL. It takes a different approach — instead of copying tables, it uses views and triggers to present multiple schema versions simultaneously.
When you start a migration with pgroll, it creates a new versioned schema containing views over the underlying tables. Physical changes and backfills happen on the real table, while the views present both the old and the new shape of the schema at the same time. Old application code keeps reading the old version; new code reads the new one. When you complete the migration, the old version is removed.
This is particularly powerful for blue-green deployments because both the old and new application versions can run against the same database simultaneously, each seeing the schema they expect. Migrations are declared as JSON:
{
"name": "add_tracking_url",
"operations": [
{
"add_column": {
"table": "orders",
"column": {
"name": "tracking_url",
"type": "text",
"nullable": true
}
}
}
]
}

ORMs generate migrations. This is convenient during development and dangerous in production. Let me be specific about the three big ones in the Node.js ecosystem.
Prisma generates SQL migrations from schema diffs. You change your schema.prisma file, run prisma migrate dev, and it produces a SQL migration file. In development, this workflow is fantastic.
In production, it has sharp edges. Consider changing a nullable column to a required one:
// Before
model Order {
id Int @id @default(autoincrement())
status String?
}
// After — you changed String? to String (nullable to required)
model Order {
id Int @id @default(autoincrement())
status String @default("pending")
}

Prisma will generate something like:
-- This will FAIL if any rows have NULL status
ALTER TABLE "Order" ALTER COLUMN "status" SET NOT NULL;
ALTER TABLE "Order" ALTER COLUMN "status" SET DEFAULT 'pending';No backfill. No expand-contract. Just a raw ALTER TABLE that will fail if any rows have NULL values, or lock the table while scanning all rows to verify the NOT NULL constraint. On a table with 50 million rows, you just bought yourself an outage.
The right approach is to edit the generated migration file before applying it:
-- Step 1: Backfill NULLs
UPDATE "Order" SET status = 'pending' WHERE status IS NULL;
-- Step 2: Add NOT NULL with the CHECK constraint trick
ALTER TABLE "Order"
ADD CONSTRAINT order_status_not_null
CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE "Order" VALIDATE CONSTRAINT order_status_not_null;
ALTER TABLE "Order" ALTER COLUMN status SET NOT NULL;
ALTER TABLE "Order" DROP CONSTRAINT order_status_not_null;
ALTER TABLE "Order" ALTER COLUMN status SET DEFAULT 'pending';But now you are maintaining hand-written SQL alongside Prisma, which defeats much of the purpose.
Drizzle's migration generator is more transparent because you can see and modify the SQL directly. It generates migration files that you can edit freely. The downside is the same: the generated SQL is naive about production concerns.
// Drizzle schema change
export const orders = pgTable('orders', {
id: serial('id').primaryKey(),
status: text('status').notNull().default('pending'),
// New column
trackingUrl: text('tracking_url'),
});

Drizzle generates:

ALTER TABLE "orders" ADD COLUMN "tracking_url" text;

This particular migration is fine — adding a nullable column is metadata-only in PostgreSQL. But Drizzle does not generate CREATE INDEX CONCURRENTLY. It does not split migrations into expand-contract phases. It does not set lock timeouts. You have to add all of that yourself.
TypeORM's migration story is the weakest of the three for production use. Its synchronize: true mode will happily drop columns and tables to match your entity definitions. I have personally seen a production database lose a column because someone deployed with synchronize: true in the production config. The data was gone. Permanently.
Never, under any circumstances, use synchronize: true in production. Use TypeORM's migration generator, review every generated file, and add the safety measures manually.
Use ORM migration generators as a starting point, not as the final product. Every generated migration should be reviewed and edited for production safety before it runs against a real database. The generators do not know about your table sizes, your traffic patterns, your lock contention issues, or your uptime requirements.
When you add a new column that needs to be populated from existing data, you have a backfill problem. The naive approach — a single UPDATE on millions of rows — will blow out your WAL, spike your disk I/O, and potentially lock the table for minutes.
Break the backfill into small batches:
-- Backfill in batches of 10,000
DO $$
DECLARE
batch_size INT := 10000;
rows_updated INT;
BEGIN
LOOP
UPDATE orders
SET tracking_url = compute_tracking_url(carrier, tracking_number)
WHERE tracking_url IS NULL
AND id IN (
SELECT id FROM orders
WHERE tracking_url IS NULL
LIMIT batch_size
FOR UPDATE SKIP LOCKED
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
RAISE NOTICE 'Updated % rows', rows_updated;
IF rows_updated = 0 THEN
EXIT;
END IF;
-- Commit each batch so locks and WAL are released incrementally.
-- (Transaction control inside DO blocks requires PostgreSQL 11+ and
-- only works when the block is not run inside an outer transaction.)
COMMIT;
-- Brief pause to let other transactions breathe
PERFORM pg_sleep(0.1);
END LOOP;
END $$;

The FOR UPDATE SKIP LOCKED is critical — it prevents the batch update from blocking on rows that are currently locked by application transactions.
For very large backfills, I prefer running them as background jobs in the application layer:
async function backfillTrackingUrls() {
const BATCH_SIZE = 5000;
let totalUpdated = 0;
while (true) {
const result = await db.execute(sql`
WITH batch AS (
SELECT id, carrier, tracking_number
FROM orders
WHERE tracking_url IS NULL
LIMIT ${BATCH_SIZE}
)
UPDATE orders o
SET tracking_url = CASE
WHEN b.carrier = 'fedex' THEN 'https://track.fedex.com/' || b.tracking_number
WHEN b.carrier = 'ups' THEN 'https://track.ups.com/' || b.tracking_number
ELSE NULL
END
FROM batch b
WHERE o.id = b.id
RETURNING o.id
`);
totalUpdated += result.rowCount;
console.log(`Backfilled ${totalUpdated} rows`);
if (result.rowCount === 0) break;
// Throttle based on replication lag or server load
await sleep(200);
}
}

This approach gives you better control: you can monitor progress, throttle based on server load, pause and resume, and log the progress.
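The "throttle based on replication lag" comment can be made concrete. A sketch — the lag reading would come from pg_stat_replication (replay lag in seconds), and the thresholds here are illustrative assumptions, not recommendations:

```javascript
// Compute the pause between backfill batches from current replica lag:
// no lag -> base delay; growing lag -> proportionally longer pauses;
// past a hard threshold -> back off for a full minute so replicas catch up.
function backfillDelayMs(replicaLagSeconds, { baseMs = 200, hardLagSeconds = 30 } = {}) {
  if (replicaLagSeconds >= hardLagSeconds) return 60_000;
  return Math.round(baseMs * (1 + replicaLagSeconds));
}
```

The shape of the curve matters less than the hard cutoff: a backfill that keeps hammering a primary while its replicas fall behind can turn a routine migration into a failover risk.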
For columns that need to be populated going forward while you backfill old data, use the dual-write pattern:

1. Deploy application code that writes the new column on every INSERT and UPDATE, while still treating it as optional on reads.
2. Backfill historical rows in batches.
3. Once the backfill is verified complete, switch reads over to the new column.
The tricky part is step 1. Your application code needs to handle the case where the new column does not exist yet (during the deployment rollout) and where it does. Feature flags help here.
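One way to make the write path safe during the rollout window is to catch the "column does not exist" error and fall back to the old write shape. A sketch — the `exec` query runner and the orders columns follow the earlier examples, and it assumes driver errors expose the SQLSTATE as `err.code` (42703 is `undefined_column`):

```javascript
// Dual-write that tolerates the new column not existing yet (mid-rollout)
// by falling back to the old INSERT shape on SQLSTATE 42703.
async function insertOrder(exec, order) {
  try {
    return await exec(
      'INSERT INTO orders (customer_id, tracking_url) VALUES ($1, $2)',
      [order.customerId, order.trackingUrl],
    );
  } catch (err) {
    if (err.code !== '42703') throw err; // only swallow undefined_column
    // New column not deployed yet: write the old shape.
    return await exec('INSERT INTO orders (customer_id) VALUES ($1)', [order.customerId]);
  }
}
```

A feature flag is cleaner than exception-driven fallback for the long term, but this kind of guard covers the window where flag state and schema state can briefly disagree.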
This is one of the most common migration patterns and one of the most frequently botched. You have a column that was originally nullable, and now you want to make it required.
The safe multi-step process:
Step 1: Add a default value at the application level. Every INSERT and UPDATE should now provide a value for this column. Deploy this first and verify it is working.
Step 2: Backfill all existing NULL rows with the appropriate default value.
-- Do this in batches for large tables
UPDATE users SET timezone = 'UTC' WHERE timezone IS NULL;

Step 3: Add a CHECK constraint without validation to prevent new NULLs immediately:
ALTER TABLE users
ADD CONSTRAINT users_timezone_not_null
CHECK (timezone IS NOT NULL) NOT VALID;

Step 4: Validate the constraint (this scans all rows but does not block writes):
ALTER TABLE users VALIDATE CONSTRAINT users_timezone_not_null;

Step 5: Now add the actual NOT NULL constraint (instant because PostgreSQL sees the CHECK):
ALTER TABLE users ALTER COLUMN timezone SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_timezone_not_null;

If you skip to step 5 directly, PostgreSQL scans every row while holding a lock. On a table with 100 million rows, that scan takes minutes. With the CHECK constraint trick, the lock duration is milliseconds.
Adding a foreign key to a large table is another migration that looks simple but can cause serious problems.
-- This looks innocent. It is not.
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id);

This acquires a SHARE ROW EXCLUSIVE lock on both orders and customers — blocking all writes to both tables — then scans every row to validate that all customer_id values reference valid rows in customers. If orders has 200 million rows, this can take a very long time.
The same NOT VALID trick works here:
-- Step 1: Add constraint without validation (brief lock)
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id)
NOT VALID;
-- Step 2: Validate separately (scans table but allows concurrent reads/writes)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

The NOT VALID constraint still enforces the foreign key for all new and updated rows immediately — it just skips the validation scan for existing rows. Step 2 then validates existing rows without blocking.
Foreign key constraints create implicit locks on referenced tables. When you INSERT into orders (which has an FK to customers), PostgreSQL takes a share lock on the referenced row in customers to ensure it is not deleted during the transaction. This can cause deadlocks when two transactions are modifying related rows in both tables simultaneously.
I once debugged a production incident where adding a foreign key between orders and products caused a spike in deadlocks. The root cause was a background job that was updating product prices (locking product rows) at the same time as the checkout flow was creating orders (which now needed to lock the same product rows for FK validation). The solution was to adjust the background job to run during off-peak hours and to use shorter transactions.
When adding foreign keys, always check for patterns in your application that touch both tables in the same transaction. Add monitoring for deadlocks before and after the change:
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname = current_database();

Blue-green deployments are standard for application code. You have two environments — blue (current production) and green (new version). You switch traffic from blue to green, and if something goes wrong, you switch back.
When database schema changes are involved, this gets complicated. The database is shared between blue and green, so both application versions must be compatible with the database schema at all times during the transition.
This is the compatibility window concept. At any point during the deployment, you might have old and new application code running against the same database. Your schema must be compatible with both.
This rules out certain changes during a blue-green deploy: dropping or renaming a column that the old version still reads, changing a column's type in place, and adding a NOT NULL column without a default (the old version's INSERTs would immediately start failing).
The solution is always the expand-contract pattern, with each phase aligned to a deployment:
Deploy #1 (green): Add new column, application writes to both, reads from old
-> Switch traffic to green
-> Both blue and green compatible with current schema
Deploy #2 (green): Backfill complete, switch reads to new column
-> Switch traffic to green
-> Old column still exists for rollback
Deploy #3 (green): Drop old column
-> Switch traffic to green
-> No rollback possible to pre-migration state (but you shouldn't need it by now)
Each deploy maintains backward compatibility. Each deploy can be rolled back independently. The cost is three deployments instead of one.
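A deployment pipeline can enforce the compatibility window mechanically by classifying each change before it is allowed to run alongside a rollout. A sketch — the change categories are my own illustrative taxonomy, not an exhaustive one:

```javascript
// Returns true if a schema change is safe while old and new application
// versions share the database (the blue-green compatibility window).
function safeDuringRollout(change) {
  const alwaysSafe = new Set([
    'add_nullable_column',       // old code simply ignores it
    'add_table',                 // nothing references it yet
    'create_index_concurrently', // invisible to application logic
  ]);
  const neverSafe = new Set([
    'drop_column',                  // old code still reads it
    'rename_column',                // breaks one version or the other
    'change_column_type',           // old code writes the old type
    'add_not_null_without_default', // old code's INSERTs start failing
  ]);
  if (alwaysSafe.has(change)) return true;
  if (neverSafe.has(change)) return false;
  return false; // unknown changes default to "needs expand-contract"
}
```

Defaulting unknown changes to unsafe is deliberate: the cost of a false negative is an extra expand-contract cycle, while the cost of a false positive is an outage.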
The migration that works perfectly on your development database with 1,000 rows will destroy production with 100 million rows. You absolutely must test against production-sized data. Here are the approaches I have used:
If you are on managed PostgreSQL (RDS, Cloud SQL, Supabase), create a snapshot of production and restore it to a test instance. Run your migration against the test instance, measure the time, check for lock contention.
# RDS example
aws rds create-db-snapshot \
--db-instance-identifier prod-db \
--db-snapshot-identifier migration-test-20260322
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier migration-test \
--db-snapshot-identifier migration-test-20260322

When you cannot use production data (compliance, GDPR, sensitivity), generate synthetic data at production scale:
-- Generate 100M realistic-looking orders
INSERT INTO orders (customer_id, status, created_at, total_amount)
SELECT
(random() * 1000000)::int,
(ARRAY['pending', 'shipped', 'delivered', 'cancelled'])[floor(random() * 4 + 1)],
now() - (random() * interval '365 days'),
(random() * 10000)::numeric(10,2)
FROM generate_series(1, 100000000);

This takes a while, but it gives you a table that matches production's row count and data distribution.
For quick sanity checks, note that PostgreSQL cannot EXPLAIN DDL statements. What you can do is EXPLAIN the scan a migration implies. For a CHECK constraint, that is the query that would look for violating rows:
-- See what validating the constraint would cost, without running the DDL
EXPLAIN SELECT 1 FROM orders
WHERE status NOT IN ('pending', 'shipped', 'delivered', 'cancelled');

This does not tell you how long the validation will take, but it tells you whether PostgreSQL will scan the whole table.
Before running any migration, know your table sizes:
SELECT
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
n_live_tup AS row_count,
n_dead_tup AS dead_rows,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

If the table has 500 million rows and 50 GB of data, you should not be running naive ALTER TABLE statements against it.
Here is an uncomfortable truth that most migration guides gloss over: rollback migrations almost never work in production.
Think about it. You run a forward migration that adds a column and starts populating it with data. The rollback drops that column. But during the time between the forward migration and the rollback, new rows were inserted with data in that column. Data you might need. Data that is now gone.
Or consider a more complex scenario: you split a table into two. The forward migration creates the new table, sets up dual-writes, and starts backfilling. The rollback would need to... merge the tables back? What about data that was written only to the new table during the split?
After the third incident where a rollback migration caused more damage than the original problem, I adopted a different philosophy: rollback migrations are a code smell. If your forward migration requires a complex rollback, your forward migration is too aggressive.
Instead, I use these strategies:
Fix forward: If a migration causes problems, deploy a new forward migration that fixes the issue. This is safer because you are always moving in one direction and you can test the fix before deploying it.
Expand-contract with long verification periods: Between each phase of an expand-contract migration, wait days or weeks. Run both old and new structures in parallel. Monitor for issues. Only proceed to the next phase when you are confident.
Logical rollbacks: Instead of undoing DDL, undo the application behavior. If the new column is causing problems, deploy application code that ignores it. The column still exists in the database, but nothing uses it. Fix the issue, then start using it again.
// "Logical rollback" — feature flag controls whether we use the new column
const useTrackingUrl = await featureFlags.isEnabled('use_tracking_url');
const order = useTrackingUrl
? await getOrderWithTracking(orderId)
  : await getOrderLegacy(orderId);

The one case where rollback migrations are safe and reasonable is simple additive changes. If your forward migration added a nullable column with no default, the rollback is just DROP COLUMN. No data loss that matters (the column was just added and is likely empty or irrelevant). No complex logic. This is fine.
-- Forward
ALTER TABLE orders ADD COLUMN tracking_url TEXT;
-- Rollback (safe because the column was just added)
ALTER TABLE orders DROP COLUMN tracking_url;

Let me tell you about the worst migration incident I have ever been involved with. It is a case study in everything that can go wrong.
The setup: a SaaS platform with a sessions table. 200 million rows. PostgreSQL 12. The table was hot — every API request read from or wrote to it. Average traffic: 3,000 queries per second hitting this table.
The migration: add a device_type column with a default of 'unknown'.
The developer knew about the PostgreSQL 11 improvement for column defaults. They tested on a staging database with 1 million rows. The migration took 100 milliseconds. They figured production would be similarly fast. They were right — the DDL itself would have taken milliseconds.
But they did not account for the lock acquisition.
Here is what happened:
2:14 PM — A reporting query started a sequential scan of the sessions table. Estimated runtime: 8 minutes. This query held an ACCESS SHARE lock.
2:15 PM — The developer ran the migration. The ALTER TABLE requested an ACCESS EXCLUSIVE lock. It was blocked by the reporting query.
2:15 PM — Immediately, all new queries to the sessions table began queuing behind the pending ALTER TABLE.
2:16 PM — The application's connection pool maxed out. New requests started getting connection timeouts. Error rates spiked to 40%.
2:17 PM — The on-call engineer was paged. They saw the error rate spike but did not immediately connect it to a migration. They started investigating application-level issues.
2:20 PM — The engineer found the lock contention in pg_stat_activity. They killed the reporting query.
2:20 PM — The ALTER TABLE acquired its lock, ran in 12 milliseconds, and released the lock. The queued queries flooded in.
2:20 PM — The thundering herd overwhelmed PostgreSQL. 800 simultaneous queries hit the table. PostgreSQL's CPU went to 100%. The connection pool overflowed again.
2:25 PM — The team started killing queries to reduce load. But new queries kept coming in. The application's retry logic was creating even more load.
2:30 PM — They made the decision to restart the application servers with reduced concurrency. This took 5 minutes across all pods.
2:35 PM — Servers came back with reduced connection pools. Load stabilized.
2:40 PM — They discovered that the thundering herd had caused several deadlocks, and some transactions had been rolled back but not properly retried by the application.
3:00 PM — Full incident response. They found 847 orders that were in an inconsistent state due to the partial transaction failures.
4:00 PM — Manual data reconciliation began.
6:00 PM — All inconsistent orders were fixed. The incident was declared over.
Total impact: 3 hours and 45 minutes of degraded service. 847 orders required manual intervention. The migration itself was 12 milliseconds.
The postmortem identified several fixes:
Always set lock_timeout. If the developer had used SET lock_timeout = '3s', the migration would have failed harmlessly after 3 seconds instead of queuing hundreds of queries.
Kill long-running queries before migrating hot tables. Or better: schedule migrations during maintenance windows.
The application needed better circuit breaking. When the connection pool maxed out, the retry logic made things worse.
Connection pool limits should be lower than PostgreSQL's max_connections. This prevents the thundering herd from overwhelming the database.
Monitor pg_stat_activity during migrations. Have a dashboard showing lock waits in real time.
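The retry-storm problem in particular has a cheap, well-known mitigation: exponential backoff with full jitter, so queued clients do not all retry in lockstep the moment the database comes back. A sketch — the base delay and cap are illustrative, and the injectable `random` exists only to make the function testable:

```javascript
// Full-jitter exponential backoff: the delay before retry attempt n is a
// uniform random value between 0 and min(cap, base * 2^n). Randomizing the
// whole window spreads retries out instead of synchronizing a thundering herd.
function retryDelayMs(attempt, { baseMs = 100, capMs = 10_000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Pair this with a hard retry limit and a circuit breaker: backoff slows the herd down, but only a breaker stops clients from piling new work onto a database that is already saturated.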
After all these incidents and lessons, I now use this checklist for every production migration:
Know your table size. Run pg_stat_user_tables for row counts. Run pg_total_relation_size() for disk size. If either number is large, proceed with extreme caution.
Set lock_timeout and statement_timeout. Always. No exceptions.
Check for long-running queries before starting the migration. Kill them if possible.
Test on production-sized data. A staging database with 1,000 rows tells you nothing about a production table with 100 million rows.
Use CONCURRENTLY for index creation. Every time. Even if the table is small now.
Use NOT VALID for constraint additions. Add the constraint without validation, then validate separately.
Backfill in batches. Never update millions of rows in a single statement.
Deploy in expand-contract phases. Never make a breaking change in one step.
Monitor after deployment. Watch for deadlocks, slow queries, increased lock wait times.
Do not trust rollback migrations. Plan to fix forward instead.
Zero-downtime migrations are not about clever tricks or fancy tools. They are about discipline. Every schema change, no matter how small, deserves a moment of thought: "What will this do on a table with 200 million rows during peak traffic?" If the answer is anything other than "complete in milliseconds without blocking," you need a plan.
The expand-contract pattern adds overhead to every migration. Lock timeouts add complexity to your deployment scripts. Testing against production-sized data takes time and money. But the alternative is the 3 AM phone call, the weekend-long incident, the 47-page postmortem, and the $340,000 in lost transactions.
I know which one I prefer.
The database does not care about your deployment schedule, your sprint commitments, or your Friday evening plans. It will take a lock and hold it for as long as it needs to. Your job is to make sure that "as long as it needs to" is measured in milliseconds, not hours.
Every migration in this post has one thing in common: it trades developer convenience for operational safety. You write more SQL. You deploy more often. You wait longer between phases. And your users never notice a thing. That is the goal. That is zero-downtime migration done right.