DisCopy


Friday, 27 March 2026

PostgreSQL Cluster Switch-Over streaming issues!

How do you handle Debezium/Kafka data streaming after a PostgreSQL cluster fails over from Primary to Secondary and then fails back to Primary?

In a standard PostgreSQL and Debezium architecture, Debezium will not automatically pick up the new WAL and resume seamlessly after a true failover and failback. It almost always requires manual intervention or a reset.

The reason for this comes down to how PostgreSQL handles logical replication.

Ø  The Core Problem: Logical Replication Slots

Debezium does not read the raw physical WAL files directly. It relies on a Logical Replication Slot (usually using the pgoutput plugin) created on the source database.

1.     Slots are Node-Specific: By default, logical replication slots are tied exclusively to the primary server. They are not physically replicated to the DR standby node alongside the data.

2.     The Failover (To DR): When you failed over to the DR node, that DR database did not have Debezium's logical replication slot. The Debezium connector immediately threw an error and failed.

3.     The Timeline Change: When a DR node is promoted to Primary, PostgreSQL increments its "timeline."

4.     The Failback (To Primary): When you failed back to the original primary, the primary had to reconcile with the DR node's new timeline. In most failback scenarios (especially if pg_rewind or a fresh backup was used), the original logical replication slot is destroyed or rendered invalid because the Log Sequence Numbers (LSNs) no longer perfectly align.

Ø  How to Resolve and Reset

If the Debezium connector is currently in a FAILED state, then we will likely need to reset it. Here is the standard recovery path:

ü          Step 1: Check the Slot on the Primary. First, verify whether the slot survived the failback on the source PostgreSQL database:

Query:

SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'your_debezium_slot_name';

·       If the query returns zero rows, the slot was destroyed.

·       If it returns a row but active is false and Debezium refuses to connect, the LSNs are out of sync.

ü          Step 2: Drop the Corrupted Slot (if it exists). If the slot is there but broken, we must drop it manually so Debezium can recreate it:

Query:

               SELECT pg_drop_replication_slot('your_debezium_slot_name');

ü          Step 3: Reset the Debezium Connector. Because the LSN pointers have changed and the slot is gone, we cannot just restart the connector. We have to force Debezium to take a new initial snapshot to ensure zero data loss.

·       Delete the connector's offsets in the Kafka Connect offsets topic (or via the Kafka Connect REST API if you are using a tool that supports offset deletion).

·       Reconfigure the Debezium connector with snapshot.mode set to initial (or always).

·       Restart the connector. It will recreate the replication slot, perform a fresh SELECT * baseline read of your tables, and then begin streaming the new WAL changes.

 

 

Thursday, 26 March 2026

PostgreSQL Databases Historical Data Archival and Storage!

Hot, Warm, Cold: Architecting a 3-Tier Data Lifecycle for PostgreSQL




Managing multi-terabyte data requires a planned separation of live data from warm and cold storage. Keeping terabytes of historical data on premium database disks, even when the partitions are detached and no longer add query-planner overhead, still consumes expensive IOPS and inflates our backup/restore RTO windows.

Implementing a robust historical data archival strategy is one of the most impactful architectural improvements you can make for a large-scale PostgreSQL estate. It directly solves the pain points of bloated indexes, extended backup windows, and the aggressive locking issues we experience in our day-to-day workloads.

Consider a use case of 13 servers with ~1 TB of raw data and ~750 GB of archivable data per database/server on average. The most robust architecture shifts this data off the relational engine into highly compressed, immutable object storage (such as AWS S3 or Azure Blob Storage), with a middle OLAP tier like ClickHouse for analytical access.

The options below provide a comprehensive blueprint for the extraction, storage, and retrieval phases, along with the estimated storage footprint and cost projections.

Phase 1: The Extraction & Transformation Strategy

Once a partition rolls past the historical retention period and is detached from the main table, it must be extracted efficiently without causing I/O spikes that compete with our active DML operations.

  • The Format (Parquet vs. CSV/ZSTD):

o   Zstandard (ZSTD)-compressed CSV: If this data is strictly for compliance and rarely touched, streaming the detached partition via COPY to a ZSTD-compressed CSV is the fastest and most CPU-efficient method.

o   Apache Parquet: If there is a chance the business or analytics teams will need to query this historical data, extract it into Parquet format. Parquet is columnar, highly compressed, and can be queried directly in object storage.

·       The Workflow:

    1. Detach the partition from the parent table (this instantly removes it from live query plans).
    2. Execute a background worker script to export the detached table to the target format.
    3. Stream the output directly to Object Storage (e.g., Azure Blob Storage or AWS S3) to bypass local disk staging.
    4. Verify the checksum of the uploaded data before dropping it from the source.
    5. DROP TABLE on the PostgreSQL server to reclaim ~500 GB of premium block storage on each DB server.
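The workflow above can be sketched as a helper that emits the per-partition steps. This is an illustrative sketch only: the parent/partition/bucket names are hypothetical placeholders, and the COPY ... TO PROGRAM pipeline (which runs server-side via the shell) assumes zstd and the AWS CLI are installed on the database host.

```python
import hashlib

# Sketch of the archive workflow for one detached partition.
# Table, bucket, and path names are hypothetical placeholders.

def archive_commands(parent: str, partition: str, bucket: str) -> list[str]:
    """Return the ordered steps to archive one detached partition."""
    return [
        f"ALTER TABLE {parent} DETACH PARTITION {partition} CONCURRENTLY;",
        # COPY TO PROGRAM streams straight through zstd to S3; no local staging
        f"COPY {partition} TO PROGRAM "
        f"'zstd -T0 | aws s3 cp - s3://{bucket}/{partition}.csv.zst' "
        f"WITH (FORMAT csv);",
        f"-- verify checksum of s3://{bucket}/{partition}.csv.zst before dropping",
        f"DROP TABLE {partition};",
    ]

def checksum(data: bytes) -> str:
    """SHA-256 digest used to verify the uploaded object matches the export."""
    return hashlib.sha256(data).hexdigest()
```

The checksum comparison is what makes step 5 safe: the DROP TABLE only runs after the uploaded object's digest matches the export.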

Phase 2: Storage & Compression Estimates

Relational database data is highly compressible. When you extract 10TB of raw PostgreSQL data (excluding the B-Tree indexes), the actual storage footprint drops dramatically.

Assuming a standard compression ratio of 4:1 (75% reduction) using ZSTD or Parquet:

Metric/Parameter                      | Size Per Server | Total (13 Servers)
--------------------------------------|-----------------|-------------------
Raw PostgreSQL Data                   | ~750 GB         | ~10 TB
Index Overhead Removed                | ~250 GB         | ~2.5 TB
Data Payload to Compress              | ~500 GB         | ~7.5 TB
Estimated Target Storage (Compressed) | ~150 GB         | ~2 TB
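The arithmetic behind these estimates is simple enough to sanity-check in a few lines. A minimal sketch, assuming the ~750 GB raw size per server, index overhead excluded from the export, and the ~4:1 ZSTD/Parquet ratio stated above (a 4:1 ratio on the ~500 GB payload lands around 125 GB, in the same ballpark as the ~150 GB table estimate):

```python
def archive_footprint(raw_gb: float, index_gb: float, ratio: float = 4.0,
                      servers: int = 13) -> dict:
    """Estimate the compressed archive size per server and fleet-wide."""
    payload = raw_gb - index_gb        # indexes are rebuilt, never exported
    compressed = payload / ratio       # ZSTD/Parquet at ~4:1 on relational data
    return {
        "payload_gb": payload,
        "compressed_gb": compressed,
        "fleet_compressed_tb": compressed * servers / 1000,
    }
```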

 

Phase 3: The Cost Estimates & Tiering

By shifting this data off Premium SSDs (which typically cost around $150 to $200 per provisioned TB per month depending on the cloud provider or SAN tier) and moving it to Object Storage, the ROI is immediate.

Here is the estimated monthly cost to store the resulting ~2 TB of compressed archive data across standard cloud object storage tiers:

Storage Tier             | Best For                  | Approx. Monthly Cost (2 TB) | Retrieval (per-GB) Characteristics
-------------------------|---------------------------|-----------------------------|------------------------------------------------
Hot / Standard Object    | Data queried frequently   | ~$50.00 / month             | Millisecond access. Standard retrieval costs.
Cool / Infrequent Access | Queried once a month      | ~$20.00 / month             | Millisecond access. Higher retrieval fee.
Cold / Archive / Glacier | Audits only (1-2x a year) | ~$2.50 / month              | Hours to retrieve. Highest retrieval fees.
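These tier costs are just per-GB list prices times the archive size. A sketch with illustrative prices (the per-GB figures below approximate published S3 rates and vary by provider and region; they are assumptions, not quotes):

```python
# Illustrative per-GB monthly prices (USD); actual prices vary by
# provider, region, and redundancy options.
TIER_PRICE_PER_GB = {
    "hot": 0.023,       # e.g. S3 Standard
    "cool": 0.010,      # e.g. S3 Standard-Infrequent Access
    "archive": 0.00099, # e.g. S3 Glacier Deep Archive
}

def monthly_cost(size_tb: float, tier: str) -> float:
    """Monthly storage cost in USD for size_tb terabytes in the given tier."""
    return size_tb * 1000 * TIER_PRICE_PER_GB[tier]
```

Note that retrieval and API request fees (which dominate on the archive tiers) are excluded here; the table above calls those out separately.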

 

Phase 4: The Retrieval Architecture

When compliance or business users inevitably ask for data from this historical archive, we need to serve it without restoring a terabyte back into our live transactional database.

  • Option A (The Native FDW): If the data was archived as CSV, you can temporarily create a file_fdw (Foreign Data Wrapper) or use s3_fdw/azure_storage_fdw to map the remote file as a read-only table in a staging PostgreSQL instance.
  • Option B (The Data Lake Approach): If the data was archived as Parquet, point an analytical query engine like DuckDB or a ClickHouse DB directly at the storage bucket. This allows you to run standard SQL against the deep archive without touching our primary PostgreSQL clusters.


OLAP phase: 

To bridge the gap between active transactional data and deep archival storage, we have implemented ClickHouse as a dedicated OLAP tier. This ensures high-performance, low-latency querying for frequently accessed historical data, mitigating the latency, indexing, and functional limitations of querying cold data directly from S3 object storage.

This can be best achieved through an event-driven Change Data Capture (CDC) pipeline that transforms raw transactional logs into query-ready analytical states.

Here are the critical architectural steps to build this pipeline:

  • Source Configuration (PostgreSQL): Configure the database for logical decoding by setting wal_level = logical, establishing a publication for the specific tables, and creating a logical replication slot to track WAL consumption.
  • Event Capture (Debezium): Attach a CDC connector to the replication slot to read the WAL stream, translating every INSERT, UPDATE, and DELETE into a structured event payload (typically JSON or Avro).
  • The Streaming Buffer (Kafka): Route these event payloads into dedicated Apache Kafka topics. This decoupling layer provides durable storage and acts as a shock absorber during massive transactional bursts on the source.
  • The Ingestion Engine (ClickHouse): Provision a table using the Kafka engine within ClickHouse. This acts as an active consumer group, continuously pulling batches of raw events directly from the Kafka broker.
  • State Materialization (ClickHouse MVs): Deploy a Materialized View to act as the transformation pipeline. It automatically reads from the Kafka engine table, applies necessary transformations, and routes the final data into a persistent storage table—most often a ReplacingMergeTree or CollapsingMergeTree to efficiently process mutations and deduplicate records.
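The ClickHouse side of the pipeline above can be sketched with illustrative DDL. This is a sketch under stated assumptions: the topic, table, and column names are hypothetical, and the statements are generated as strings here purely so they are easy to inspect.

```python
# Hypothetical DDL for the ClickHouse ingestion tier; topic, table,
# and column names are illustrative assumptions.

def clickhouse_pipeline_ddl(topic: str = "pg.public.orders") -> dict:
    """Return the three DDL statements: Kafka source, target table, MV."""
    return {
        # 1) Kafka engine table: an active consumer pulling raw events
        "kafka_source": f"""
            CREATE TABLE orders_queue (id UInt64, status String, _ts DateTime)
            ENGINE = Kafka
            SETTINGS kafka_broker_list = 'kafka:9092',
                     kafka_topic_list = '{topic}',
                     kafka_group_name = 'ch_orders',
                     kafka_format = 'JSONEachRow';""",
        # 2) Persistent table: ReplacingMergeTree dedupes on background merges
        "target": """
            CREATE TABLE orders (id UInt64, status String, _ts DateTime)
            ENGINE = ReplacingMergeTree(_ts)
            ORDER BY id;""",
        # 3) Materialized view: the transformation pipeline between the two
        "mv": """
            CREATE MATERIALIZED VIEW orders_mv TO orders
            AS SELECT id, status, _ts FROM orders_queue;""",
    }
```

The ReplacingMergeTree versioning column (`_ts` here) is what lets UPDATE events from the CDC stream supersede earlier rows with the same key.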

Wednesday, 9 October 2024

PostgreSQL Performance Tuning and Optimization!

Availability > Functionality > Performance! 



Performance tuning is a continuous process in any database environment. Many organizations spend more time and resources on tuning the database than on the original development, because data growth and usage expose a poor database design; roughly 80% of performance issues can be avoided with proper design. In addition, developing SQL code is very different from developing front-end application code and middle-tier components, so SQL written by front-end or middle-tier experts can often be optimized further, depending on the transparency of the code. Finally, optimization becomes a necessity when the volume of operations grows and an application that worked great with 5-10 users and 100-200 GB no longer lives up to the expectations of terabytes of data and thousands of users. In short, as all of us are aware, scalability is the root cause.

A common place where most people like to start tuning is the server/database configuration. Generally, developers who are unhappy with application performance will demand more resources, like additional CPU and memory, and reconfiguration of the dynamic configuration parameters. But extending memory, adding CPUs, or repeatedly changing configuration parameters only helps up to a certain point. In other words, if you keep adding memory and do not tune the application or the SQL code in any other way, you will reach the point where additional memory or reconfiguration produces marginal or no performance improvement.

It is also important to realize that improvement in one area often means compromising others. If you can optimize 90% of the critical queries by slowing down the other 10%, it might well be worth your time. Sometimes you can improve the response time of online transactions at the expense of concurrency or throughput. Therefore it is important to determine the application's performance requirements. For example, if we improve the performance of SELECTs by adding extra indexes, we obviously compromise DML statements/transactions that must maintain those indexes.

It is common to spend more time identifying the cause of a problem than actually troubleshooting and fixing it. If all other areas of the application are working properly and you are sure there is a problem with the database code, then you need to investigate your code modules and decide which one is causing problems. Many times an improvement in only one stored procedure or trigger can fix most of the issues, with cascading performance benefits from reduced locking/blocking.

The optimizer of the PostgreSQL database engine takes a query and finds the best way to execute it. The optimization is done based on the statistics and the indexes available for a view or table. The optimized plan stays in effect until the statistics are updated or the query changes, so vacuuming and analyzing frequently, and correctly, is crucial for the optimizer to choose an optimal query plan.

Carefully considered indexes, built on top of a good database design, are the foundation of a high-performance PostgreSQL configuration. However, adding indexes without proper analysis can reduce the overall performance of your system. Insert, update, and delete operations can take longer when a large number of indexes need to be updated.

 

What to observe - high level (We have many more ... :)  Here top 5:

  1. Server error logs
  2. OS error logs
  3. Configuration files
  4. Workload pattern
  5. Peak hours and Off-peak Hours processes

 

What to set/configure - high level (We have many ... :)  Here top 5:

1.   shared_buffers

The most important parameter: it caches frequently accessed data in shared memory blocks (measured by the cache hit ratio). This allows PostgreSQL to read data from memory instead of the storage disk, which can speed up query execution significantly.

Recommended value :  15%-25% of Total RAM 

2.   work_mem

Defines the amount of memory used for internal sort operations and hash tables before spilling to disk. Insufficient memory allocation can lead to slower query performance: sort operations are used for ORDER BY, DISTINCT, and merge joins, while hash tables are used in hash joins and hash-based aggregation.

Recommended value :  25% of Total RAM divided by max_connections

3.   effective_cache_size

Informs the optimizer about the amount of cache available in the kernel/OS. Higher values make the planner more likely to choose index scans; setting this value too low can cause the query planner to avoid useful indexes and fall back to sequential scans.

Recommended value :  50% of Total RAM 

4.   wal_buffers

The wal_buffers setting controls the amount of shared memory used for the Write Ahead Log (WAL) data that has not yet been written to disk. The default size is 16 MB, but a higher value can improve performance on a busy server.

Recommended value :  16MB or higher

5.   effective_io_concurrency

Defines the number of concurrent disk I/O operations that PostgreSQL expects the underlying storage to handle (used for read-ahead/prefetching). The allowed range is 1 to 1000, or zero to disable asynchronous I/O requests.

Recommended value :  HDD - 2, SSD – 200 and SAN - 300
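The five rules of thumb above can be turned into a starting-point calculator. This is a sketch of the recommendations as stated in this post, not authoritative sizing guidance; real values should be validated against the actual workload.

```python
def recommended_settings(total_ram_gb: float, max_connections: int,
                         storage: str = "ssd") -> dict:
    """Starting-point values from the rules of thumb above; tune from here."""
    io_concurrency = {"hdd": 2, "ssd": 200, "san": 300}[storage]
    return {
        "shared_buffers_gb": round(total_ram_gb * 0.25, 1),  # 15-25% of RAM
        # 25% of RAM divided across the connection slots
        "work_mem_mb": round(total_ram_gb * 1024 * 0.25 / max_connections),
        "effective_cache_size_gb": round(total_ram_gb * 0.50, 1),  # 50% of RAM
        "wal_buffers_mb": 16,  # 16 MB default; raise on busy servers
        "effective_io_concurrency": io_concurrency,
    }
```

For example, a 64 GB server capped at 200 connections lands at shared_buffers = 16 GB, work_mem ≈ 82 MB, and effective_cache_size = 32 GB.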

 

Query tuning, Object optimization, maintenance window for vacuum and analyze, monitoring and reacting on blocked processes/locks, analyzing and fine-tuning long running and slow running Queries including index optimization, access path and right join type will be discussed in the next blog post.

-HTH :)

Monday, 7 October 2024

What to choose between PostgreSQL and MySQL

What to choose between PostgreSQL and MySQL (Best Opensource Cloud RDBMS offerings)



What to choose? If two products are nearly identical in quality and functionality, we will probably choose the free one; cost is the primary factor, and free is very hard to beat!  😇
But what if both are free :)? Then we have to consider all parameters and facilities: availability, performance, functionality, support, updates, and additional features of both systems.

When we need to migrate and modernize an on-prem legacy brown-field database environment, the key factors to consider are the compatibility of the source and proposed target database engines and the code-conversion complexity. Most RDBMSs are ~95% compatible for storage objects (tables, views, and indexes), but code-object compatibility varies anywhere from 60% to 80%, and complexity depends on how much customization or atypical code lives in stored procedures, functions, and triggers. There are multiple schema conversion tools, including AWS SCT, Azure DMA, GCP DMS, ispirer, migVisor, StarM, DMAP, and the most popular and free Ora2PG, to assess the source database system and choose the target database engine.

This blog post is more about choosing between PostgreSQL and MySQL for a green-field database system to be designed and developed with no license costs attached :)

We are living at a new scale, i.e. hyperscale, and in this cloud world the open-source databases PostgreSQL and MySQL stand out as the two most popular choices. Both are supported by DBaaS offerings from every hyperscaler, but PostgreSQL has an edge over MySQL. Take Amazon Aurora, the premier PaaS/DBaaS offering from Amazon: it supports both of these databases, while GCP AlloyDB is PostgreSQL-only. Although both databases offer robust features and share many similarities, deeper down they possess noteworthy differences in features and functionality, and each has an edge over the other for specific workloads.

Both PostgreSQL and MySQL are widely used open-source databases that power a variety of real-time applications. MySQL is recognized as the world's most popular RDBMS; it was created by a Swedish company, MySQL AB, founded by Swedes David Axmark and Allan Larsson and Finn Michael "Monty" Widenius, with original development by Widenius and Axmark beginning in 1994. On the other side, PostgreSQL is often described as the world's most advanced object-relational database management system (ORDBMS), and the implementation of POSTGRES began in 1986, almost 8 years before MySQL.

 

Let’s review and compare key factors, functionalities of these TWO most popular open-source RDBMS services (Key differences: PostgreSQL vs MySQL)

MySQL and PostgreSQL are two of the most widely used and widely offered open-source relational database management systems. MySQL is known for its speed and ease of use, making it ideal for web applications and read-heavy workloads. PostgreSQL offers advanced features, richer data types, and extensions, making it suitable for complex queries and transactions. Below are some of the key differences:

Ø  ACID compliance (Winner: PostgreSQL)

Atomicity, consistency, isolation, and durability (ACID) are database properties that ensure a database remains in a consistent state even after system failures.

MySQL offers ACID compliance only when used with the InnoDB storage engine (or another transactional engine). PostgreSQL, by contrast, is fully ACID compliant in all configurations.

Ø  Concurrency control (Winner: PostgreSQL)

Multiversion concurrency control (MVCC) is an advanced database feature that keeps multiple versions of a table's records so data can be safely read and updated in parallel. With MVCC, multiple users can read (SELECT) and modify (DML) the same data simultaneously without locking the table, while preserving data integrity.

MVCC support varies by storage engine in MySQL: it is fully supported with the InnoDB storage engine, but not with MyISAM. PostgreSQL, on the other hand, supports MVCC in all configurations.

Ø  Indexes (Winner: PostgreSQL)

Indexes are database objects that can be created for a table to give direct access to specific data rows, thus improving performance. Indexes store the values of the key(s) named when the index was created, plus logical pointers to the data.

MySQL supports B-tree and R-tree indexing, whereas PostgreSQL supports multiple index types, including B-tree, hash, expression, and partial indexes (plus GIN and GiST), to fine-tune database performance.

Ø  Data types (Winner: PostgreSQL)

MySQL is a relational database providing the various data types of a typical RDBMS to cater to regular business needs, but PostgreSQL is an object-relational database that can store data as objects with properties, much like many programming languages such as Java. PostgreSQL supports all MySQL data types plus additional ones, like geometric, enumerated, network address, array, range, XML, hstore, and composite types, to facilitate optimal storage for new data entities; a clear winner.

Ø  Views (Winner: PostgreSQL)

A view is a subset of one or more tables, i.e. an alternative way of looking at the data in one or more tables, used to simplify joins or to enforce security by restricting access to base tables.

MySQL supports regular views, but PostgreSQL also offers advanced options like materialized views, which improve performance for queries that repeatedly access the same set of data.

Ø  Stored procedures (Winner: PostgreSQL)

Stored procedures are named collections of SQL statements and control-of-flow language. We can create stored procedures for commonly used operations and to improve performance.

MySQL and PostgreSQL both support stored procedures, but the versatility of PostgreSQL allows you to write stored procedures in languages other than SQL.

Ø  Triggers (Winner: PostgreSQL)

A trigger is a stored procedure that runs automatically when a user attempts a specified data modification statement on a specified table to enforce integrity constraints.

MySQL supports both BEFORE and AFTER triggers for DML statements, i.e. the associated procedure runs automatically before or after the user modifies the data. In addition, PostgreSQL supports INSTEAD OF triggers (on views), so we can run complex SQL statements using functions.

Ø  Ease of Use (Winner: MySQL)

MySQL is relatively easy to install and configure compared with PostgreSQL.

 

PostgreSQL vs MySQL – what to choose?

Based on the above classification of functionality, we need to choose the right RDBMS. The following factors also play key roles in choosing the right RDBMS.

1.    Workload type

More Selects/READs    –    MySQL
More DMLs/Inserts     –    PostgreSQL

2.    Application scope

PostgreSQL is better suited for enterprise-level applications with frequent write operations and complex queries.

However, MySQL is the best fit to create internal applications with fewer users for the workloads with more reads and infrequent data updates.

3.    Database development experience

MySQL is much simpler and easier to start with or learn, hence it's more suitable for beginners. MySQL also needs less time, as it's simple to set up a MySQL database environment.

PostgreSQL, on the other hand, is a bit more complex than MySQL for beginners as PostgreSQL requires more experience to setup and configure the database environment.

 

Final word:

ü  If we need to build a relatively small database system with more reads than writes, and to maintain/manage/administer the database with less experienced manpower, then MySQL is the best bet.

ü  If we need to build a complex database system with frequent DMLs, i.e. a typical OLTP system with generic reads (not OLAP-style reports) but frequent writes, and we have moderately experienced DBAs, then PostgreSQL is the enterprise-level RDBMS... oh no, ORDBMS!

Monday, 27 June 2022

PostgreSQL VACUUM!

VACUUM

is the garbage collector of a PostgreSQL database: it discards obsolete (updated) or deleted records of tables and materialized views, and optionally analyzes the statistics of a database/object.

If a table doesn’t get vacuumed, it will get bloated, which wastes disk space and also hampers the performance.

 

PostgreSQL's VACUUM command is a maintenance command that needs to be run periodically to perform the following vital tasks:

1.      To recover or reuse disk space occupied by updated or deleted rows.

2.      To analyze/update data statistics used by the PostgreSQL query planner.

3.      To protect against loss of very old data due to transaction ID wraparound.

 

In normal PostgreSQL operation, tuples/records/rows that are deleted or obsoleted by an update are not physically removed from their table until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables as VACUUM reclaims storage occupied by dead tuples.

Normally we don’t need to take care of all that, because the autovacuum daemon does that with some limitations based on the configuration.

The frequency and scope of the VACUUM operations performed for each of the above reasons will vary depending on the needs of each database environment. Tables with heavy updates and deletes need frequent vacuuming to discard obsolete and deleted records.

 

  • Without a table_and_columns list specified, VACUUM processes every table and materialized view in the current database that the current user has permission to vacuum. With a list, VACUUM processes only the specified table(s).
  • VACUUM ANALYZE performs a VACUUM and then an ANALYZE for each selected table. This is a handy combination form for routine maintenance scripts. This option Updates statistics used by the planner to determine the most efficient way to execute a query thus the Optimizer choose right plan including join and indexes.

VACUUM (without FULL) simply reclaims space and makes it available for re-use. This command can operate in parallel with normal reading and writing of the table, as it does not take an exclusive lock. However, the extra space is not returned to the operating system immediately; it is just kept available for re-use within the same table. VACUUM can also leverage multiple CPUs to process indexes, a feature known as parallel vacuum; we can use the PARALLEL option, specifying zero workers to disable parallelism.

  • VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on each table while it is being processed. 

BTW, a “full” vacuum, which can reclaim more space, takes much longer and holds an exclusive lock, impacting the performance and availability of the specific table. This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete. It should usually only be used when a significant amount of space needs to be reclaimed from within the table, for example after a huge purge.
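The choice between plain VACUUM and VACUUM FULL can be sketched as a simple decision on the dead-tuple ratio from pg_stat_user_tables. This is an illustrative helper only; the thresholds are assumptions (autovacuum's own trigger is governed by autovacuum_vacuum_scale_factor and related settings), and a real script would query n_live_tup/n_dead_tup from the catalog.

```python
def vacuum_strategy(live_tuples: int, dead_tuples: int,
                    full_threshold: float = 0.5,
                    plain_threshold: float = 0.1) -> str:
    """Pick a vacuum approach from pg_stat_user_tables-style counters.

    Thresholds are illustrative assumptions, not official guidance.
    """
    total = live_tuples + dead_tuples
    if total == 0:
        return "none"
    dead_ratio = dead_tuples / total
    if dead_ratio >= full_threshold:
        return "VACUUM FULL"  # e.g. after a huge purge; takes ACCESS EXCLUSIVE
    if dead_ratio >= plain_threshold:
        return "VACUUM"       # reclaims space for reuse without exclusive lock
    return "none"             # leave it to the autovacuum daemon
```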

 

  • VACUUM FREEZE selects aggressive “freezing” of tuples.

Specifying FREEZE with VACUUM is useful against transaction ID wraparound and is equivalent to performing VACUUM with the vacuum_freeze_min_age and vacuum_freeze_table_age parameters set to zero.

 Aggressive freezing is always performed when the table is rewritten, so this option is redundant when FULL is specified.

 

Equivalent Commands in other RDBMS:

  • REORG and UPDATE STATISTICS in Sybase
  • REORG and GATHER STATS in Oracle