DisCopy


Thursday, 2 April 2026

PostgreSQL Logical Replication: Configuration and Recovery Protocols!

Deploying and Rescuing PostgreSQL Logical Replication: Handling Dropped Publications & Subscriptions!

I’ve configured replication/high availability on every RDBMS from Oracle (GoldenGate) to MS SQL, Sybase, and MySQL. After all that trauma, setting up PostgreSQL replication is a walk in the park at sunrise. 😊

First things First:

Anytime logical replication breaks for any reason, we need to immediately check the replication slots on the primary. If replication is broken and we cannot fix it immediately, we must drop the slot manually to save the primary database from an out-of-disk outage, because an inactive slot forces the primary to retain WAL files indefinitely.

SQL to find orphaned slots (on the Primary Node)

 

SELECT slot_name, plugin, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag_size
FROM pg_replication_slots;

If wal_lag_size is growing into hundreds of gigabytes and active is false, drop the slot (SELECT pg_drop_replication_slot('slot_name');) to save the primary, and accept that a full data re-sync (recreating the subscription with copy_data = true) will be needed later.

Quick glance at setting up logical replication:


1.     The Pre-requisite (Database Configuration requirements for logical replication)

Set the wal_level to logical in the Primary/Source Database.

On the Publisher (Primary Node), check the postgresql.conf file or run SHOW wal_level; to verify whether wal_level is set to logical (the default is replica). Note that changing wal_level requires a server restart.
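If wal_level still shows replica, it can be changed without hand-editing postgresql.conf; a minimal sketch (assuming superuser access; the restart command depends on how the instance is managed):

Sample SQL:

ALTER SYSTEM SET wal_level = 'logical';
-- wal_level is a restart-only parameter; a config reload is not enough.
-- Restart the server (e.g., pg_ctl restart or systemctl restart postgresql),
-- then verify:
SHOW wal_level;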

2.     The Publisher (Primary Node)

Log into the source database where Primary/active data lives.

a.     Create the Table: Create a test table, or skip this step to set up replication for an existing table.

Sample SQL:

CREATE TABLE TEST
(
    id SERIAL PRIMARY KEY,
    description VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

(Important: Logical replication requires a Primary Key or a Replica Identity on the table).
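For a table without a primary key, replicated UPDATEs and DELETEs will fail unless a replica identity is set. A hedged sketch (the table name test_nopk and column id are hypothetical):

Sample SQL:

-- Option 1 (preferred): add a unique index and use it as the replica identity
ALTER TABLE test_nopk ADD CONSTRAINT test_nopk_uq UNIQUE (id);
ALTER TABLE test_nopk REPLICA IDENTITY USING INDEX test_nopk_uq;

-- Option 2 (last resort): log the entire old row for every change (WAL-heavy)
ALTER TABLE test_nopk REPLICA IDENTITY FULL;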

 

b.     Create the Publication: This notifies PostgreSQL to start tracking changes for this table.

Sample SQL:

CREATE PUBLICATION test_pub FOR TABLE TEST;

 

3.     The Subscriber (Secondary/Target Database)

Log into the target database where the data is to be replicated.

a.     Create the Table in the Secondary: Logical replication does not create the target table or replicate schema changes. We must create the target table with the exact source table DDL before subscribing.

Sample SQL:

CREATE TABLE TEST
(
    id SERIAL PRIMARY KEY,
    description VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

 

b.     Create the Subscription: The CREATE SUBSCRIPTION command connects to the Primary/Publisher, materializes the initial data (full load), and establishes the continuous stream (CDC). Replace the connection string with the actual source database connection details.

Sample SQL:

CREATE SUBSCRIPTION test_sub
CONNECTION 'host=XXX.XX.XX.XX port=5432 dbname=source_db user=postgres password=XXXXXX'
PUBLICATION test_pub;

 

4.     Test, Validate and Verify the Replication: Let’s insert and perform some DMLs on the Primary/Source TEST table.

a.     Run a couple of Insert statements on Publisher:

Sample SQL:

INSERT INTO TEST (description) VALUES ('Hello from Kasi V Dogga!');

INSERT INTO TEST (description) VALUES ('Hope the Logical replication is active.');

 

b.     Check the data on the Subscriber:

Sample SQL:

SELECT * FROM TEST;

The two rows appearing on the target database confirm that we have successfully established logical replication for the table.
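Beyond eyeballing rows, replication health can be checked from the catalog views on both sides; a minimal sketch:

Sample SQL:

-- On the Subscriber: apply worker status and last received LSN
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;

-- On the Publisher: the walsender feeding the subscriber
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;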

 

Fixing the replication issues (When Publication or Subscription Dropped):

When logical replication is disrupted, such as by an accidental drop of a publication or subscription, the subscriber's data becomes stale and inconsistent. Until replication is restored, only the primary node can be relied upon to handle both OLTP (DML operations) and OLAP (read/query workloads). Dropping a publication or subscription is a destructive administrative action, not just a "pause."

Here are the DBA's actions/options for fixing both these scenarios.

Scenario A: The Publication is Dropped (On the Primary Node)

If someone executes DROP PUBLICATION on the source database, the subscriber will immediately start throwing errors in the logs (e.g., ERROR: publication "my_table_pub" does not exist), and replication will halt. Unlike dropping a subscription, dropping a publication does not automatically drop the replication slot, as the publisher still thinks a subscriber is out there, so it will start aggressively hoarding Write-Ahead Logs (WAL) on the primary node. If we do not fix this quickly, the primary server's disk will fill up to 100% and the database will crash.

The Fix: If we identify this issue quickly and the replication slot is still intact, we can seamlessly resume replication without having to recopy all GB/TB of data.

  1. Recreate the Publication (On Primary Node): We need to recreate the publication exactly as it was created before.

Sample SQL:

CREATE PUBLICATION test_pub FOR TABLE test;

  2. Refresh the Subscription (On Secondary/Subscriber Node): Refresh the subscriber to re-establish replication against the newly created publication and resume pulling from the exact LSN (Log Sequence Number) where it left off.

Sample SQL:

ALTER SUBSCRIPTION test_sub REFRESH PUBLICATION;

This ensures the subscriber reconnects to the existing replication slot and quickly drains the hoarded WAL files. Replication is restored and data comes back in sync.

Scenario B: The Subscription is Dropped (On the Secondary/Subscriber Node)

As mentioned earlier, dropping the subscription is a destructive action that, by default, also drops the replication slot on the primary and releases the retained WAL history. So we cannot simply resume/refresh the replication.

The Fix:

  1. Verify the Slot is dropped (On Primary/Publisher Node): Ensure the slot was actually dropped to prevent disk bloat.

Sample SQL:

SELECT slot_name, active FROM pg_replication_slots;

-- If the old slot is still there and active=f, drop it:

-- SELECT pg_drop_replication_slot('slot_name');

  2. Re-establish Replication (On Secondary/Subscriber Node): Truncate the target table, then recreate the subscription so the initial sync (copy_data = true) materializes the entire table's data afresh. Note that CREATE SUBSCRIPTION does not truncate the target for us; skipping this step can cause conflicts or duplicates during the initial copy.

Sample SQL:

CREATE SUBSCRIPTION test_sub
CONNECTION 'CONNINFO'
PUBLICATION test_pub
WITH (copy_data = true);
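To confirm the full re-sync actually finished, the per-table sync state can be checked on the subscriber; a minimal sketch (state r means "ready", i.e. streaming normally):

Sample SQL:

SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;
-- srsubstate: i = initialize, d = data copy, s = synchronized, r = ready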


Tuesday, 31 March 2026

PostgreSQL hot_standby_feedback impact on Primary node!

Have you ever encountered a scenario where a long-running transaction on a PostgreSQL standby node caused performance degradation or table bloat on the primary node?

It seems strange: a secondary node is supposed to be read-only, so querying it shouldn't impact the primary at all, right? It is a common misconception that asynchronous replication protects the primary from everything happening on the secondary. Async mode protects the primary's write speed, but it does not protect the primary's disk space or vacuum processes.

However, in the PostgreSQL architecture, a long-running transaction (even a plain SELECT) on a secondary node can severely impact or even bring down the primary. This happens primarily through three mechanisms designed to keep the databases synchronized and preserve integrity.


Here is exactly why this strange thing happens:

The hot_standby_feedback Trap (Bloating on Primary Node):

If the primary node updates a row (under MVCC, an UPDATE writes a new row version and leaves the old one dead) or deletes it, VACUUM will eventually clean up the old "dead" row version. But what if a query on the secondary is currently reading that exact row? If the primary vacuums it away, the query on the secondary is cancelled with "canceling statement due to conflict with recovery". To prevent queries on the secondary from being cancelled, PostgreSQL provides hot_standby_feedback.

  • What happens if we enable hot_standby_feedback: The secondary node keeps messaging the primary: "I have a transaction open using transaction ID XXX, so don't vacuum anything newer than XXX."
  • What is the impact on the primary: If a long report runs on the secondary (typically 1-2 hours), the primary is forbidden from cleaning up dead tuples across the entire cluster for those 1-2 hours. Primary tables and indexes bloat heavily, disk I/O spikes, and overall performance degrades.
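The feedback horizon holding back VACUUM is visible directly on the primary; a minimal sketch:

Sample SQL:

-- On the Primary: a non-null backend_xmin means standby feedback is
-- pinning the vacuum horizon to that transaction ID.
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;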

Replication Slots & Disk Space Exhaustion:

If the secondary node is running a massive transaction, it is consuming heavy CPU, Memory, and Disk I/O.

  • What happens: The secondary becomes so starved for resources that the startup process (which replays the Write-Ahead Logs from the primary) slows down to a grinding halt or stops entirely.
  • The Impact on Primary: If you are using physical replication slots, the primary is forced to hold onto all WAL files until the secondary confirms it has replayed them. If the secondary is bogged down by a large query and stops replaying, WAL files pile up in the primary's pg_wal directory until it runs out of disk space. When the primary hits 100% disk usage, the database crashes.

Synchronous Replication Blocking (synchronous_commit):

This is the most severe scenario. If the cluster is configured for high availability using synchronous replication (where synchronous_commit is set to on, remote_write, or remote_apply), the primary must wait for the secondary to acknowledge transactions.

  • What happens in the Secondary: A huge query on the secondary hogs the disk I/O or locks resources.
  • The Impact on Primary: Write transactions (INSERT, UPDATE, DELETE) on the primary will hang. The primary might successfully execute the write locally, but the COMMIT will sit in a waiting state until the sluggish secondary processes the WAL and acknowledges it back to the primary. Application connections then pile up on the primary, eventually exhausting max_connections and effectively taking the service down.

How to Mitigate This?

In an OLTP production environment, the ultimate goal is to make sure the primary database is always up, running, and available. To protect the primary from secondary-node abuse, we have to establish the following guardrails:

  • Tune max_standby_streaming_delay: Set this to a reasonable limit (e.g., 30s or 1min). If a query on the replica blocks replication for longer than this, PostgreSQL forcefully cancels the conflicting query on the secondary node. The user's query fails, but the primary stays alive.
  • Use statement_timeout on the Replica: Enforce a strict time limit for read-only queries on the secondary so long-running queries/reports are killed automatically.
  • Keep hot_standby_feedback off: If we can tolerate occasional query cancellations on the replica ("canceling statement due to conflict with recovery"), leaving hot_standby_feedback off ensures the primary's VACUUM is never blocked.
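The guardrails above can be sketched as settings on the replica (the values are illustrative, not tuned recommendations):

Sample SQL:

-- On the Replica: cancel conflicting queries after 30s rather than stall WAL replay
ALTER SYSTEM SET max_standby_streaming_delay = '30s';
-- Hard cap for any single statement on the replica (e.g., 15 minutes)
ALTER SYSTEM SET statement_timeout = '15min';
-- Do not let replica queries pin the primary's vacuum horizon
ALTER SYSTEM SET hot_standby_feedback = off;
-- These three settings take effect on a config reload (no restart needed)
SELECT pg_reload_conf();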

 

Friday, 27 March 2026

PostgreSQL Cluster Switch-Over streaming issues!

How to handle replication after a failover of PostgreSQL Cluster from Primary to Secondary and back to Primary for the data streaming using Debezium and Kafka?

In a standard PostgreSQL and Debezium architecture, Debezium will not automatically pick up the new WAL and resume seamlessly after a true failover and failback. It almost always requires manual intervention or a reset.

The reason for this comes down to how PostgreSQL handles logical replication.

The Core Problem: Logical Replication Slots

Debezium does not read the raw physical WAL files directly. It relies on a Logical Replication Slot (usually using the pgoutput plugin) created on the source database.

1.     Slots are Node-Specific: By default, logical replication slots are tied exclusively to the primary server. They are not physically replicated to the DR standby node alongside the data.

2.     The Failover (To DR): When you fail over to the DR node, that DR database does not have Debezium's logical replication slot. The Debezium connector immediately throws an error and fails.

3.     The Timeline Change: When a DR node is promoted to Primary, PostgreSQL increments its "timeline."

4.     The Failback (To Primary): When you failed back to the original primary, the primary had to reconcile with the DR node's new timeline. In most failback scenarios (especially if pg_rewind or a fresh backup was used), the original logical replication slot is destroyed or rendered invalid because the Log Sequence Numbers (LSNs) no longer perfectly align.

How to Resolve and Reset

If the Debezium connector is currently in a FAILED state, then we will likely need to reset it. Here is the standard recovery path:

Step 1: Check the Slot on the Primary. First, verify whether the slot survived the failback on the source PostgreSQL database:

Query:

SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'your_debezium_slot_name';

•       If the query returns zero rows, the slot was destroyed.

•       If it returns a row but active is false and Debezium refuses to connect, the LSNs are out of sync.

Step 2: Drop the Corrupted Slot (if it exists). If the slot is there but broken, drop it manually so Debezium can recreate it:

Query:

SELECT pg_drop_replication_slot('your_debezium_slot_name');

Step 3: Reset the Debezium Connector. Because the LSN pointers have changed and the slot is gone, we cannot just restart the connector. We have to force Debezium to take a new initial snapshot to ensure zero data loss.

•       Delete the connector's offsets in the Kafka Connect offsets topic (or via the Kafka Connect REST API if you are using a tool that supports offset deletion).

•       Reconfigure the Debezium connector with snapshot.mode set to initial (or always).

•       Restart the connector. It will recreate the replication slot, perform a fresh SELECT * baseline read of your tables, and then begin streaming the new WAL changes.
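Step 3 boils down to re-registering the connector with a fresh snapshot; a hedged sketch of the relevant Debezium PostgreSQL connector settings (the hostname, connector name, and topic prefix are placeholders, not values from this setup):

Sample config (JSON):

{
  "name": "pg-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "primary-host",
    "database.port": "5432",
    "database.dbname": "source_db",
    "slot.name": "your_debezium_slot_name",
    "publication.name": "test_pub",
    "snapshot.mode": "initial",
    "topic.prefix": "pgcdc"
  }
}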

 

 

Thursday, 26 March 2026

PostgreSQL Databases Historical Data Archival and Storage!

Hot, Warm, Cold: Architecting a 3-Tier Data Lifecycle for PostgreSQL




Managing multi-terabyte databases requires a planned separation of live data from warm and cold storage. Keeping terabytes of historical data on premium database disks, even when the partitions are detached and no longer add query-planner overhead, wastes expensive IOPS and inflates our backup/restore RTO windows.

Implementing a robust historical data archival strategy is one of the most impactful architectural improvements you can make for a large-scale PostgreSQL estate. It directly solves the pain points of bloated indexes, extended backup windows, and the aggressive locking issues we experience in day-to-day workloads.

I'm considering a use case of 13 servers averaging ~1 TB of raw data per database/server, of which ~750 GB is archivable historical data. The most robust architecture shifts this data off the relational engine into highly compressed, immutable object storage (AWS S3 or Azure Blob Storage), with a middle OLAP tier such as ClickHouse.

The options below provide a comprehensive blueprint for the extraction, storage, and retrieval phases, along with the estimated storage footprint and cost projections.

Phase 1: The Extraction & Transformation Strategy

Once a partition rolls past the historical retention period and is detached from the main table, it must be extracted efficiently without causing I/O spikes that compete with our active DML operations.

  • The Format (Parquet vs. CSV/ZSTD):

o   Zstandard (ZSTD) Compressed CSV: If this data is strictly for compliance and rarely touched, streaming the detached partition via COPY to a ZSTD-compressed CSV is the fastest and most CPU-efficient method.

o   Apache Parquet: If there is a chance the business or analytics teams will need to query this historical data, extract it into Parquet format. Parquet is columnar, highly compressed, and can be queried directly in object storage.

  • The Workflow:

    1. Detach the partition from the parent table (instantly removes it from live query plans).
    2. Execute a background worker script to export the detached table to the target format.
    3. Stream the output directly to Object Storage (e.g., Azure Blob Storage or AWS S3) to bypass local disk staging.
    4. Verify the checksum of the uploaded data before dropping from the source.
    5. DROP TABLE on the PostgreSQL server to reclaim ~500 GB of premium block storage on each DB server.
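Steps 1-2 can be sketched in SQL; a hedged sketch (the partition and path names are hypothetical, and COPY ... TO PROGRAM requires superuser or the pg_execute_server_program role):

Sample SQL:

-- 1. Detach the monthly partition (CONCURRENTLY avoids a blocking lock, PG 14+)
ALTER TABLE events DETACH PARTITION events_2024_01 CONCURRENTLY;

-- 2. Export the detached table as ZSTD-compressed CSV on the server
COPY events_2024_01 TO PROGRAM 'zstd -19 -o /staging/events_2024_01.csv.zst'
WITH (FORMAT csv, HEADER true);

-- 5. Only after checksum verification and upload, reclaim the space:
-- DROP TABLE events_2024_01;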

Phase 2: Storage & Compression Estimates

Relational database data is highly compressible. When you extract ~10 TB of raw PostgreSQL data and exclude the B-Tree indexes, the actual storage footprint drops dramatically.

Assuming a standard compression ratio of 4:1 (75% reduction) using ZSTD or Parquet:

Metric/Parameter                         Size Per Server    Total (13 Servers)
Raw PostgreSQL Data                      ~750 GB            ~10 TB
Index Overhead Removed                   ~100 GB            ~2.5 TB
Data Payload to Compress                 ~500 GB            ~7.5 TB
Estimated Target Storage (Compressed)    ~150 GB            ~2 TB

 

Phase 3: The Cost Estimates & Tiering

By shifting this data off Premium SSDs (which typically cost around $150 to $200 per provisioned TB per month depending on the cloud provider or SAN tier) and moving it to Object Storage, the ROI is immediate.

Here is the estimated monthly cost to store the resulting ~2 TB of compressed archive data across standard cloud object storage tiers:

Storage Tier                Best For                    Approx. Monthly Cost (2 TB)    Retrieval (per-GB) Characteristics
Hot / Standard Object       Data queried frequently     ~$50.00 / month                Millisecond access. Standard retrieval costs.
Cool / Infrequent Access    Queried once a month        ~$20.00 / month                Millisecond access. Higher retrieval fee.
Cold / Archive / Glacier    Audits only (1-2x a year)   ~$2.50 / month                 Hours to retrieve. Highest retrieval fees.

 

Phase 4: The Retrieval Architecture

When compliance or business users inevitably ask for data from this historical archival period, we need to serve it without restoring 1 TB back into the live transactional database.

  • Option A (The Native FDW): If the data was archived as CSV, you can temporarily create a file_fdw (Foreign Data Wrapper) or use s3_fdw/azure_storage_fdw to map the remote file as a read-only table in a staging PostgreSQL instance.
  • Option B (The Data Lake Approach): If the data was archived as Parquet, point an analytical query engine like DuckDB or a ClickHouse DB directly at the storage bucket. This allows you to run standard SQL against the deep archive without touching our primary PostgreSQL clusters.
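Option B in practice; a minimal DuckDB sketch (the bucket and column names are hypothetical):

Sample SQL (DuckDB):

-- Query the Parquet archive in place; nothing is restored into PostgreSQL
INSTALL httpfs;
LOAD httpfs;

SELECT date_trunc('month', created_at) AS month, count(*) AS row_count
FROM read_parquet('s3://my-archive-bucket/events_2024_01.parquet')
GROUP BY 1
ORDER BY 1;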


OLAP phase: 

To bridge the gap between active transactional data and deep archival storage, we have implemented ClickHouse as a dedicated OLAP tier. This ensures high-performance, low-latency querying for frequently accessed historical data, mitigating the latency, indexing, and functional limitations of querying cold data directly from S3 object storage.

This can be best achieved through an event-driven Change Data Capture (CDC) pipeline that transforms raw transactional logs into query-ready analytical states.

Here are the critical architectural steps to build this pipeline:

  • Source Configuration (PostgreSQL): Configure the database for logical decoding by setting wal_level = logical, establishing a publication for the specific tables, and creating a logical replication slot to track WAL consumption.
  • Event Capture (Debezium): Attach a CDC connector to the replication slot to read the WAL stream, translating every INSERT, UPDATE, and DELETE into a structured event payload (typically JSON or Avro).
  • The Streaming Buffer (Kafka): Route these event payloads into dedicated Apache Kafka topics. This decoupling layer provides durable storage and acts as a shock absorber during massive transactional bursts on the source.
  • The Ingestion Engine (ClickHouse): Provision a table using the Kafka engine within ClickHouse. This acts as an active consumer group, continuously pulling batches of raw events directly from the Kafka broker.
  • State Materialization (ClickHouse MVs): Deploy a Materialized View to act as the transformation pipeline. It automatically reads from the Kafka engine table, applies necessary transformations, and routes the final data into a persistent storage table—most often a ReplacingMergeTree or CollapsingMergeTree to efficiently process mutations and deduplicate records.
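The ingestion and materialization steps above can be sketched in ClickHouse SQL; a hedged sketch (the topic, broker, and column names are hypothetical):

Sample SQL (ClickHouse):

-- Kafka engine table: an active consumer, not durable storage
CREATE TABLE events_queue
(
    id UInt64,
    description String,
    created_at DateTime,
    _version UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'pgcdc.public.test',
         kafka_group_name = 'ch_events_consumer',
         kafka_format = 'JSONEachRow';

-- Persistent store; ReplacingMergeTree deduplicates rows sharing the ORDER BY key
CREATE TABLE events
(
    id UInt64,
    description String,
    created_at DateTime,
    _version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY id;

-- Materialized view: the pipeline that moves rows from the queue into the store
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT id, description, created_at, _version
FROM events_queue;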