Deploying and Rescuing PostgreSQL Logical Replication: Handling Dropped Publications & Subscriptions and Fixing Replication Errors!
I’ve configured replication/high-availability on every RDBMS from Oracle (GoldenGate) to MS SQL Server and Sybase to MySQL. After all that trauma, setting up PostgreSQL replication is a walk in the park at sunrise. 😊
First things First:
Whenever logical replication breaks, for any reason, immediately check the replication slots on the primary. If replication is broken and cannot be fixed right away, drop the slot manually to save the primary database from an out-of-disk outage caused by WAL file retention.
SQL to find orphaned slots (on the Primary Node)
SELECT slot_name, plugin, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag_size
FROM pg_replication_slots;
If wal_lag_size is growing into hundreds of gigabytes and active is false, drop the slot (SELECT pg_drop_replication_slot('slot_name');) to save the primary, and accept that a full data re-sync (copy_data = true) will be needed later.
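As an additional safety net, PostgreSQL 13 and later can cap how much WAL any replication slot is allowed to retain, so an orphaned slot invalidates itself instead of filling the disk. A sketch (the '50GB' threshold is illustrative, not a recommendation):

```sql
-- Cap WAL retained by replication slots (PostgreSQL 13+). A slot that falls
-- further behind than this limit is invalidated rather than hoarding WAL.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();  -- this setting is reloadable, no restart needed
```

The trade-off: an invalidated slot forces a full re-sync of that subscriber, but the primary stays up.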
Quick glance at setting up logical replication:
1. The Pre-requisite (Database Configuration requirements for logical replication)
Set wal_level to logical on the Primary/Source database. On the Publisher (Primary Node), check the postgresql.conf file or run SHOW wal_level; to verify that wal_level is set to logical (by default it is replica).
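If it is still replica, a minimal sketch of the change follows; note that wal_level is a postmaster-level setting, so it only takes effect after a full server restart:

```sql
-- On the publisher: enable logical decoding. Written to postgresql.auto.conf;
-- requires a server restart (e.g. pg_ctl restart or your service manager).
ALTER SYSTEM SET wal_level = logical;

SHOW wal_level;  -- should report 'logical' once the server has been restarted
```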
2. The Publisher (Primary Node)
Log into the source database where the Primary/active data lives.
a. Create the Table: Create a test table (or skip this step to set up replication for an existing table).
Sample SQL:
CREATE TABLE test (
    id          SERIAL PRIMARY KEY,
    description VARCHAR(100),
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
(Important: logical replication requires a Primary Key or a Replica Identity on the table.)
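For a table without a primary key, UPDATE and DELETE can still replicate if we set a replica identity explicitly. A hedged sketch (the index name test_desc_uidx is hypothetical; REPLICA IDENTITY FULL logs the entire old row and is the most expensive option):

```sql
-- Option 1: use an existing unique, non-partial index as the identity
ALTER TABLE test REPLICA IDENTITY USING INDEX test_desc_uidx;

-- Option 2: log the full old row for every UPDATE/DELETE (costly, last resort)
ALTER TABLE test REPLICA IDENTITY FULL;
```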
b. Create the Publication: This notifies PostgreSQL to start tracking changes for this table.
Sample SQL:
CREATE PUBLICATION test_pub FOR TABLE test;
3. The Subscriber (Secondary/Target Database)
Log into the target database where the data is to be replicated.
a. Create the Table on the Subscriber: Logical replication does not create the target table or replicate schema changes. We must create the target table with the exact source table DDL before subscribing.
Sample SQL:
CREATE TABLE test (
    id          SERIAL PRIMARY KEY,
    description VARCHAR(100),
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
b. Create the Subscription: The CREATE SUBSCRIPTION command connects to the Primary/Publisher, materializes the initial data (Full Load), and establishes the continuous change stream (CDC). Replace the connection string with the actual source database conninfo details.
Sample SQL:
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=XXX.XX.XX.XX port=5432 dbname=source_db user=postgres password=XXXXXX'
    PUBLICATION test_pub;
4. Test, Validate and Verify the Replication: Let’s insert and perform some DML on the Primary/Source test table.
a. Run a couple of INSERT statements on the Publisher:
Sample SQL:
INSERT INTO test (description) VALUES ('Hello from Kasi V Dogga!');
INSERT INTO test (description) VALUES ('Hope the Logical replication is active.');
b. Check the data on the Subscriber:
Sample SQL:
SELECT * FROM test;
Seeing the two tuples appear on the target database confirms that we have successfully established logical replication for a table.
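Beyond spot-checking rows, the built-in statistics views report replication health on both sides; a quick sketch:

```sql
-- On the publisher: one row per connected subscriber (walsender process)
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;

-- On the subscriber: apply-worker progress for each subscription
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
```

A healthy stream shows state = 'streaming' on the publisher and a steadily advancing received_lsn on the subscriber.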
Fixing the replication issues:
When logical replication is disrupted, such as by an accidental drop of a publication or subscription, the data becomes inconsistent; until it is repaired, only the primary node can be relied upon to handle both OLTP (DML operations) and OLAP (read/query) workloads. Dropping either object is a destructive administrative action, not just a "pause."
Here are the DBA's actions/options for fixing these scenarios.
Scenario A: The Publication is Dropped (On the Primary Node)
If someone executes DROP PUBLICATION on the source database, the subscriber will immediately start throwing errors in the logs (e.g., ERROR: publication "test_pub" does not exist) and replication will halt. Unlike dropping a subscription, dropping a publication does not automatically drop the replication slot; the publisher still thinks a subscriber is out there, so it will aggressively hoard Write-Ahead Log (WAL) files on the primary node. If we do not fix this quickly, the primary server's disk will fill up to 100% and the database will crash.
The Fix: If we identify this issue quickly and the replication slot is still intact, we can seamlessly resume replication without having to recopy GBs/TBs of data.
- Recreate the Publication (On Primary Node): Recreate the publication exactly as it was created before.
Sample SQL:
CREATE PUBLICATION test_pub FOR TABLE test;
- Refresh the Subscription (On Secondary/Subscriber Node): Refresh the subscriber to re-establish replication against the newly created publication and resume pulling from the exact LSN (Log Sequence Number) where it left off.
Sample SQL:
ALTER SUBSCRIPTION test_sub REFRESH PUBLICATION;
This makes sure the subscriber reconnects to the existing replication slot and instantly drains the hoarded WAL files. Replication is restored and the data will be back in sync.
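To confirm the recovery took, re-check the slot on the primary; the lag figure should now be shrinking:

```sql
-- On the primary: active should be true again and wal_lag_size trending down
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag_size
FROM pg_replication_slots;
```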
Scenario B: The Subscription is Dropped (On the Secondary/Subscriber Node)
As mentioned earlier, dropping the subscription is a destructive action that usually drops the replication slot on the primary and discards the WAL history. So, we cannot simply resume/refresh the replication.
The Fix:
- Verify the Slot is dropped (On Primary/Publisher Node): Ensure the slot was actually dropped, to prevent disk bloat.
Sample SQL:
SELECT slot_name, active FROM pg_replication_slots;
-- If the old slot is still there and active = f, drop it:
-- SELECT pg_drop_replication_slot('slot_name');
- Re-establish Replication (On Secondary/Subscriber Node): Truncate the target table first if it still holds stale rows (CREATE SUBSCRIPTION does not truncate for us), then recreate the subscription so it re-copies and syncs the entire table's data.
Sample SQL:
TRUNCATE TABLE test;
CREATE SUBSCRIPTION test_sub
    CONNECTION 'CONNINFO'
    PUBLICATION test_pub
    WITH (copy_data = true);
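The initial copy can be watched per table via the pg_subscription_rel catalog; state codes: i = initialize, d = copying data, s = synchronized, r = ready (normal streaming):

```sql
-- On the subscriber: per-table sync state for the recreated subscription
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;
```

Once every table reports r, the full re-sync is complete.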
Scenario C: Transaction Failed in Subscriber/Secondary Node
A duplicate key error during PostgreSQL logical replication typically occurs when the subscriber tries to apply an INSERT/UPDATE that violates a PRIMARY KEY or UNIQUE constraint.
Error:
ERROR: duplicate key value violates unique constraint "pk_test"
DETAIL: Key (id)=(521) already exists.
Common causes:
- Data already exists on the subscriber but is not in sync with the publisher
- Manual changes done on the subscriber (not recommended)
- Replication restarted after an inconsistency
- Dropped/recreated publication or subscription
- Missing initial data sync
Fix Options (Based on the failure scenario and data inconsistency)
Option 1: Delete conflicting row on Subscriber (Quick Fix)
DELETE FROM test_table WHERE id = 521;
Then restart replication:
ALTER SUBSCRIPTION test_sub ENABLE;
Option 2: Skip the conflicting transaction (Supported in PostgreSQL 15+). The LSN to skip is the failed transaction's finish LSN, reported in the subscriber's server log.
ALTER SUBSCRIPTION test_sub SKIP (lsn = '0/21D6A905');  -- placeholder LSN; use the one from your log
Option 3: Truncate and Resync Table
ALTER SUBSCRIPTION test_sub DISABLE;
TRUNCATE TABLE test_table;
ALTER SUBSCRIPTION test_sub ENABLE;
(Note: re-enabling resumes the change stream but does not re-run the initial copy; to re-copy an existing table, drop it from and re-add it to the publication, then run ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_data = true).)
Option 4: Refresh Subscription
ALTER SUBSCRIPTION test_sub REFRESH PUBLICATION;
This re-syncs the publication metadata and (with copy_data = true, the default) copies data for tables newly added to the publication; it does not re-copy tables that are already being replicated.
Option 5: Recreate Subscription (Full Reset)
DROP SUBSCRIPTION test_sub;
CREATE SUBSCRIPTION test_sub
    CONNECTION '...'
    PUBLICATION test_pub
    WITH (copy_data = true);
| Scenario | Recommended Action |
| --- | --- |
| Few duplicate errors | Delete the conflicting rows and restart the replication |
| Frequent errors | Disable the subscription and re-enable it after re-syncing the affected tables |
| Major inconsistency | Drop and recreate the subscription with copy_data = true |