PostgreSQL at Scale: HA, Performance, and DR

PostgreSQL high availability is not achieved by adding one replica and hoping failover works. At scale, PostgreSQL becomes a reliability system: storage behavior, replication lag, connection storms, backups, failover automation, observability, and application retry logic all interact. A database can look healthy from the outside while p99 latency grows, locks accumulate, autovacuum falls behind, and user requests start timing out.

For SteadyOps work, PostgreSQL at scale means designing for predictable failure. The goal is not only more throughput. The goal is to keep latency stable, failover controlled, restore tested, and operational decisions clear enough that an on-call engineer can act without guessing.

Start with workload and failure model

Before choosing bare metal, Kubernetes, cloud managed PostgreSQL, Patroni, or a database operator, define the workload. Is it read-heavy or write-heavy? Does the application tolerate stale reads? How many connections can appear during a traffic spike? What is the acceptable RPO/RTO? Which tables grow fastest? Which queries dominate p95 and p99 latency?

Scaling without this model usually creates expensive fragility. Teams add replicas, but the application still sends all writes and most reads to the primary. They increase instance size, but the bottleneck is connection count or lock contention. They add Kubernetes automation, but persistent storage and failover behavior are not tested.

A practical PostgreSQL scaling review starts with:

Active connections and connection spikes.
Slow queries and query plans.
Lock waits and long transactions.
Replication lag and replay delay.
Checkpoint frequency and disk latency.
Autovacuum health and table bloat.
Backup duration and restore test status.

Bare metal versus Kubernetes for PostgreSQL

Bare metal often gives the most predictable I/O path. There are fewer layers between PostgreSQL and disks, easier control over kernel parameters, and clearer ownership of network and storage behavior. For latency-sensitive databases, this simplicity can be a major advantage.

Kubernetes can still be useful, especially when the team already operates a mature platform with strong storage classes, node isolation, backup automation, and tested operators. But Kubernetes does not remove database engineering. It adds scheduling, networking, and storage abstractions that must be understood during failover.

The decision is less about ideology and more about operational maturity:

Platform	Best for	Stability impact	Complexity
Bare metal PostgreSQL	Predictable I/O and critical databases	Strong control over latency and storage	Medium
VM-based PostgreSQL	Teams needing isolation and simpler operations	Good balance of manageability and control	Medium
Kubernetes operator	Mature platform teams with tested storage	Consistent automation when designed well	High
Managed PostgreSQL	Teams prioritizing operational offload	Good baseline, less low-level control	Low/Medium

Patroni, etcd, and failover discipline

Patroni is a strong choice for PostgreSQL high availability when the team understands the failure model. It coordinates leader election through a distributed consensus store such as etcd or Consul and manages promotion of replicas. But Patroni is not magic. If etcd is unstable, network partitions are ignored, or health checks are wrong, failover can become dangerous.

Useful checks:

patronictl list
etcdctl endpoint health --cluster
curl -s http://127.0.0.1:8008/cluster
psql -c "select pg_is_in_recovery();"
psql -c "select client_addr, state, sync_state, replay_lag from pg_stat_replication;"

The runbook must define who approves manual failover, when automatic failover is allowed, and how to validate the new primary. After promotion, check application writes, replication rebuild, backup continuity, and monitoring targets.

PgBouncer and connection control

Many PostgreSQL incidents are connection incidents. Application pods scale out, each pod opens too many connections, deploys restart workers simultaneously, and PostgreSQL spends resources managing sessions instead of queries. PgBouncer can protect the database, but it must be configured for the application behavior.

Typical checks:

psql -c "select count(*) from pg_stat_activity;"
psql -c "select state, count(*) from pg_stat_activity group by state;"
psql -c "select wait_event_type, wait_event, count(*) from pg_stat_activity group by 1,2 order by 3 desc;"

Connection pooling should be paired with sane application pool limits. PgBouncer cannot fix unbounded concurrency, missing indexes, or long transactions. It gives the database breathing room while you correct the workload.

Monitoring PostgreSQL before users feel pain

A reliable PostgreSQL monitoring setup should show saturation early. CPU is useful, but it is rarely enough. Watch disk latency, checkpoint behavior, cache hit ratio, replication lag, locks, dead tuples, autovacuum, query latency, and connection count.

For production dashboards, I want to see:

p95/p99 query latency for critical paths.
Replication lag by standby.
Active sessions and waiting sessions.
Long transactions and idle-in-transaction sessions.
WAL generation rate.
Backup success and restore test age.
Disk latency and free space trend.

Alerts should point to action. “Database CPU high” is weak. “Primary has 90% connection saturation and API p99 doubled after deploy” is useful.

Backup and restore testing

Backups are not a recovery strategy until restore is tested. PostgreSQL disaster recovery must include base backups, WAL availability, restore procedure, target recovery point, credentials path, and application validation.

A practical restore test should answer:

Can we restore into a clean environment?
How long did restore take?
What data point did we recover to?
Did application smoke tests pass?
Did backup monitoring detect the latest backup?

Restore tests are also SEO-worthy because many teams search for “PostgreSQL disaster recovery” after learning the hard way that backup success does not mean recovery success.

HA & DR Runbooks — PostgreSQL HA is only reliable when failover and restore steps are documented and rehearsed.
SOC 2-ready Ops Model — backup evidence, access control, and incident response matter for audit-ready infrastructure.
Load Balancing: Comparative Architectures — database failover usually depends on clear routing and health-check behavior.

Key takeaways

PostgreSQL scaling starts with workload analysis, not bigger hardware.
Patroni improves HA only when consensus, health checks, and failover runbooks are correct.
PgBouncer protects PostgreSQL from connection storms but does not replace query tuning.
Monitoring must include replication lag, locks, disk latency, and restore test status.
Backups are not real until a clean restore has passed.

Operational takeaway

Treat PostgreSQL as a reliability product. Measure saturation, control connections, test failover, rehearse restore, and make every recovery step boring before production is under pressure.

Need PostgreSQL HA help?

SteadyOps can review your PostgreSQL architecture, Patroni setup, PgBouncer configuration, backups, and restore process to produce a practical reliability plan.

Ask DevOps Copilot Request audit