

Tuesday, 31 March 2026

PostgreSQL hot_standby_feedback impact on Primary node!

Have you ever encountered a scenario where a long-running transaction on a PostgreSQL standby node caused performance degradation or table bloat on the primary node?

It seems strange: a secondary node is supposed to be read-only, so querying it shouldn't impact the primary at all, right? It is a common misconception that asynchronous replication protects the primary from everything happening on the secondary. While async mode protects your primary's write latency, it does not protect your primary's disk space or its vacuum processes.

However, in PostgreSQL's architecture, a long-running transaction (even a plain SELECT) on a secondary node can severely impact, or even bring down, the primary. This happens through three specific mechanisms designed to keep the two databases synchronized and consistent.


Here is exactly why this happens:

The hot_standby_feedback Trap (Bloating on Primary Node):

When the primary updates a row (under MVCC, an UPDATE writes a new row version and leaves the old one behind) or deletes one, VACUUM will eventually clean up the old "dead" row version. But what if a query on the secondary is currently reading that exact row? If the primary vacuums it away, the query on the secondary is cancelled with a "canceling statement due to conflict with recovery" error. To prevent such cancellations on the secondary, PostgreSQL provides the hot_standby_feedback setting.

  • What happens if we enable hot_standby_feedback: The secondary now continuously reports back to the primary: "I have a transaction open with transaction ID XXX, so don't vacuum away any row versions it might still need."
  • The impact on the primary: If a long report is running on the secondary (say, 1-2 hours), the primary is forbidden from cleaning up dead tuples created since that report began, across the entire cluster, for those 1-2 hours. Tables and indexes on the primary accumulate heavy bloat, disk I/O spikes, and overall performance degrades.
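You can watch this pinning happen from the primary: when hot_standby_feedback is on, the standby's oldest open transaction is exposed as backend_xmin in the pg_stat_replication view. A quick monitoring sketch (the view and the age() function are standard PostgreSQL; the alerting threshold is up to you):

```sql
-- On the primary: is a standby's feedback pinning VACUUM?
-- backend_xmin is only populated when hot_standby_feedback = on;
-- an old value means dead tuples newer than it cannot be cleaned up.
SELECT application_name, state, backend_xmin,
       age(backend_xmin) AS xmin_age   -- in transactions; higher = more bloat accumulating
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL;
```

If xmin_age keeps climbing while a report runs on the replica, that is exactly the bloat window described above.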

Replication Slots & Disk Space Exhaustion:

If the secondary node is running a massive transaction, it consumes heavy CPU, memory, and disk I/O.

  • What happens: The secondary becomes so starved for resources that its startup process (which replays the Write-Ahead Log from the primary) slows to a crawl or stops entirely.
  • The impact on the primary: If you are using physical replication slots, the primary is forced to hold onto every WAL file until the secondary confirms it has replayed them. If the secondary is bogged down by a large query and stops replaying, WAL files pile up in the primary's pg_wal directory until the disk fills. When the primary hits 100% disk usage, it shuts down with a PANIC.
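Two practical defenses here: monitor how much WAL each slot is retaining, and (on PostgreSQL 13 and later) cap that retention with max_slot_wal_keep_size so a stalled standby invalidates its slot instead of crashing the primary. A sketch, with an illustrative cap value:

```sql
-- On the primary: WAL each replication slot is forcing us to keep.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- PostgreSQL 13+: bound slot-retained WAL. If a standby falls further behind
-- than this, its slot is invalidated (the standby must be resynced),
-- but the primary's disk is protected. '50GB' is an example value.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```

The trade-off is deliberate: rebuilding a replica is painful, but far less painful than a primary outage from a full pg_wal.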

Synchronous Replication Blocking (synchronous_commit):

This is the most severe scenario. If we have configured our cluster for high availability using synchronous replication (synchronous_commit set to on, remote_write, or remote_apply, with a standby listed in synchronous_standby_names), the primary must wait for the secondary to acknowledge each transaction.

  • What happens on the secondary: A huge query on the secondary hogs disk I/O or holds resources that delay WAL replay.
  • The impact on the primary: Write transactions (INSERT, UPDATE, DELETE) on the primary hang at commit time. The write executes locally, but the COMMIT sits in a waiting state until the sluggish secondary processes the WAL and acknowledges it back to the primary. Application connections pile up on the primary, eventually exhausting max_connections, at which point the application can no longer connect at all.
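You can spot this condition directly: commits stalled on a synchronous standby show up in pg_stat_activity with wait_event = 'SyncRep'. A sketch of the diagnosis, plus the emergency escape hatch (which trades away synchronous durability until the standby recovers):

```sql
-- On the primary: sessions whose COMMIT is stuck waiting for the
-- synchronous standby's acknowledgement.
SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';

-- Emergency only: drop the synchronous requirement so commits stop blocking.
-- This accepts potential data loss on failover until replication catches up.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```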

How to Mitigate This?

In an OLTP production environment, the ultimate goal is to keep the primary database up, running, and available. To protect the primary from abusive workloads on the secondary, we have to establish the following guardrails:

  • Tune max_standby_streaming_delay: Set this to a reasonable limit (e.g., 30s or 1min). If a query on the replica blocks replication for longer than this, PostgreSQL forcefully cancels the conflicting query on the secondary node. The user's query fails, but the primary stays healthy.
  • Use statement_timeout on the replica: Enforce a strict time limit for read-only queries on the secondary so that long-running queries and reports are killed automatically.
  • Keep hot_standby_feedback off: If we can tolerate occasional query cancellations on the replica ("canceling statement due to conflict with recovery"), leaving hot_standby_feedback off ensures the primary's VACUUM is never blocked.
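Putting the guardrails above together, a minimal standby-side configuration might look like this (the timeout values are illustrative, not recommendations; tune them to your workload):

```sql
-- Run on the standby. All three settings take effect on reload.

-- Cancel replica queries that block WAL replay for more than 30 seconds:
ALTER SYSTEM SET max_standby_streaming_delay = '30s';

-- Kill any read query or report that runs longer than 5 minutes:
ALTER SYSTEM SET statement_timeout = '5min';

-- Do not let replica queries pin VACUUM on the primary:
ALTER SYSTEM SET hot_standby_feedback = off;

SELECT pg_reload_conf();
```

With these in place, a runaway report can only hurt itself: it gets cancelled, and the primary keeps vacuuming, keeps recycling WAL, and keeps serving writes.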