Why PostgreSQL Disaster Recovery Plans Fail (And How to Fix Yours)
After 20+ years managing production databases, I can tell you the pattern is distressingly consistent: companies invest in backups, pat themselves on the back, and then discover during an actual incident that their "backup strategy" was theater. The backup job ran. The data wasn't actually recoverable.
PostgreSQL disaster recovery is a discipline, not a checkbox. This guide covers the architectural decisions, tools, and testing habits that separate a real recovery plan from a false sense of security.
Logical vs. Physical Backups: Know the Difference
PostgreSQL gives you two fundamentally different backup approaches, and most teams mix them up. Understanding the tradeoffs is the foundation of any DR plan.
Logical backups (pg_dump, pg_dumpall) export your data as SQL statements or custom-format archives. They're human-readable, selective (you can dump a single table), and portable across PostgreSQL versions. But they're slow to restore at scale — a 500GB database dump can take 6–8 hours to restore even on fast hardware. For RPO/RTO targets under 4 hours, logical backups alone won't cut it.
Physical backups (pg_basebackup, file system snapshots) capture the raw data directory. They're faster to restore at scale, and — crucially — they can be combined with WAL archiving for point-in-time recovery. The downside: they're version-specific and you can't selectively restore a single table without additional tooling.
The right answer for most production PostgreSQL deployments is both: physical backups as your primary DR mechanism, supplemented by periodic logical exports for long-term archival and selective recovery scenarios.
WAL Archiving: The Key to Point-in-Time Recovery
Write-Ahead Log (WAL) archiving is where PostgreSQL's disaster recovery story gets genuinely powerful. WAL is the transaction log — every change to your database is written to WAL before it's applied to the data files. Archive that log continuously, and you can recover to any point in time, not just your last backup.
To enable WAL archiving, set these parameters in postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /pg/wal_archive/%f'
In production, your archive_command should ship WAL files to durable off-site storage — S3, GCS, or Azure Blob — not a local path. Tools like pgBackRest and Barman handle this with compression, encryption, and retry logic built in.
Critical gotcha: If your archive_command fails, PostgreSQL keeps WAL files locally until archiving succeeds. If archiving is broken for 24 hours and your pg_wal volume fills, your database will stop accepting writes. Monitor your WAL archive lag as a production metric.
pg_basebackup Best Practices
pg_basebackup is PostgreSQL's built-in tool for taking physical base backups. It's reliable, ships with every PostgreSQL installation, and works well as part of a scripted backup routine. Here's what 20 years of production experience tells me about using it correctly:
- Use
--checkpoint=fastjudiciously. Without it, pg_basebackup waits for the next scheduled checkpoint, which can add minutes of delay. With it, it forces an immediate checkpoint, adding I/O load. On busy OLTP systems, schedule base backups during off-peak windows. - Enable WAL streaming (
--wal-method=stream). This ensures WAL generated during the backup is included, giving you a self-consistent backup without relying on a separate WAL archive. - Compress your backups.
--compress=9reduces backup size 60–70% for typical OLTP data. At scale, storage costs matter. - Verify the backup. Run pg_basebackup with
--verify-checksumsif data checksums are enabled. Enable checksums on new clusters withinitdb --data-checksums. - Test your restore path, not just the backup. A backup that takes 30 minutes to create but 8 hours to restore may not meet your RTO. Measure restore time quarterly.
For large databases (100GB+), evaluate pgBackRest as a pg_basebackup alternative. It adds incremental and differential backup modes, parallel backup/restore, and built-in retention management — capabilities pg_basebackup doesn't offer natively.
PostgreSQL Point-in-Time Recovery in Practice
Point-in-time recovery (PITR) is what saves you when someone runs DELETE FROM orders without a WHERE clause at 2pm on a Tuesday. You don't restore from last night's backup and lose a day of transactions — you restore to 1:59pm.
The mechanics: restore a physical base backup, then replay WAL archives up to your target time. In PostgreSQL 12+, the recovery configuration lives in postgresql.conf:
restore_command = 'cp /pg/wal_archive/%f %p'
recovery_target_time = '2026-05-02 13:59:00'
recovery_target_action = 'promote'
Older versions use a separate recovery.conf file. Know which version you're running and keep your recovery runbook current.
Things that trip up PITR in production:
- WAL archive gaps — if even one WAL segment is missing, recovery stops at that point
- Timeline confusion — after a PITR restore you're on a new timeline; subsequent WAL archiving references the new timeline
- Unlogged tables — data in UNLOGGED tables is not WAL-logged and won't survive PITR
- Logical replication slots — these don't survive PITR by default; re-establish replication after recovery
Your PostgreSQL Disaster Recovery Checklist
Paste this into your runbook. If you can't answer "yes" to every item, you have gaps in your DR posture.
- ✅ Physical base backups running daily — stored off-site, not on the same server
- ✅ WAL archiving configured and monitored — archive lag alert set, archive storage durable
- ✅ Logical backups (pg_dump) running weekly — for selective table recovery and long-term archival
- ✅ Restore tested monthly — not just "backup ran successfully," but data verified after restore
- ✅ PITR procedure documented and tested quarterly — can your team execute it at 3am under pressure?
- ✅ RTO and RPO defined and met — if you don't know your targets, you can't design a plan to meet them
- ✅ Data checksums enabled — catches storage corruption before it silently corrupts backups
- ✅ Backup retention policy defined — how long you keep daily, weekly, and monthly snapshots
- ✅ Streaming replica or standby — for fast failover that doesn't require a full restore
- ✅ DR runbook updated — includes connection strings, archive locations, and who to call
The Part Most Teams Skip: Failover Testing
Your PostgreSQL disaster recovery plan is only as good as your last successful test. I've seen teams with technically correct DR configurations discover — at the worst possible moment — that their recovery procedure had a 4-hour manual step nobody documented.
Run a full DR drill at least twice a year. Restore to a separate environment, validate the data, measure how long it actually takes, and update your runbook based on what you learn. The first time you run a recovery should not be during an actual incident.
If this sounds like more PostgreSQL operational overhead than your team can sustain, that's exactly where a remote DBA adds disproportionate value. You get 20+ years of DR expertise applied to your specific environment — backups tested, PITR validated, runbook maintained — without the overhead of a full-time hire. If you're not sure where your current backup strategy stands, the free database health assessment is a good starting point.
Related: Database Backup Strategies: RPO, RTO, and What Actually Matters · SQL Server to PostgreSQL Migration: A DBA's Practical Guide