Before going into the various types of Postgres CDC and their relative strengths and weaknesses, it is necessary to understand what Postgres and CDC as separate entities are all about.
Postgres
Postgres, also referred to as PostgreSQL, is one of the most widely used and preferred open-source relational databases. The platform can be used for multiple activities such as OLTP workloads and data warehousing and analytics. Generally, it is common for businesses to use relational databases for handling transactional workloads and then perform aggregated reporting and analytics use cases through a separate data warehouse.
However, it is also critical to ensure that a data warehouse has only the latest version of the data in its transactional database. It is often seen that time-bound reporting in organizations is adversely affected because hourly or daily batch requirements are not synced between these databases. The solution to this issue is to implement the Change Data Capture feature so that there is continuous sync between the instances.
Change Data Capture
Change Data Capture or CDC is a software pattern. Its task is to track changes made to a database so that some action can be taken based on the changes. There are many aspects to Postgres CDC with several benefits as follows:
- Captures changes in a database in real time thereby ensuring that data warehouses and downstream systems are in sync with PostgreSQL.
- Reduces the workload of PostgreSQL as processing is done only of the changes made.
- Ensures that use cases that need access to changes made in PostgreSQL without modifying the application code are implemented efficiently
Types of Postgres CDC – The Pros And Cons
We will now go in-depth into the types of Postgres CDC and analyze the benefits and pitfalls of each of them.
Query-based Postgres CDC
Let us start with the query-based model of Postgres CDC.
In this method, when the schema of the database that is being monitored has a timestamp column, PostgreSQL needs to be repeatedly queried using that column. The timestamp on the column indicates the time of the last change of a row. Users are then provided with all changes that have been done since the last time that Postgres was queried. This query-based Postgres CDC can only capture Insert and Update instances and not Delete changes.
Pros
There is no need to make changes to PostgreSQL for deploying query-based CDC since the schema has a timestamp column indicating when the modification of the rows was done.
Cons
# This form of Postgres CDC extracts data through the query layer and hence additional workload is put on PostgreSQL.
# Query-based Postgres CDC requires continual polling of the monitored table. This often leads to wasted resources if there is no change to data in the tables.
# This form of Postgres CDC needs an additional column that tracks the time when the records were last modified.
# Delete changes cannot be captured by this form of CDC unless they are soft instances.
Trigger-based Postgres CDC
In this version of the Postgres CDC, Insert, Update, and Delete changes happening in the table of interest can be identified by the users. They can insert one row for each change made into a change table and build a changelog. This Trigger-based function is supported by the 9.1 version and later of PostgreSQL and can store all changes in the audit.logged function. Though all the changes are stored in PostgreSQL only, users can repeatedly query the change table if they want to save the changes to a data warehouse or other database systems.
Pros
# All change events are captured immediately by Postgres CDC in real time.
# Useful metadata
- The PostgreSQL Trigger function automatically adds useful metadata to the change events. Examples include the transaction ID, the session user name, or the statement that caused the change.
Cons
# Postgres CDC increases the execution time of the original statement, hence adversely affecting the performance of PostgreSQL.
# Only if changes are made to the PostgreSQL database will the triggers in it work.
# A separate data pipeline has to be set up that will poll the table filled with the trigger function. However, this is necessary only when change events are required to be synced to a data store or data warehouse apart from the PostgreSQL database.
Logical Replication-based Postgres CDC
The Logical Replication-based Postgres CDC is an improved model that was launched with the 9.4 version of PostgreSQL and can effortlessly replicate data to various PostgreSQL instances on separate systems. Primarily, it is a write-ahead log on disk that includes all changes to the data in the PostgreSQL database such as Delete, Insert, and Update.
This replication model of the Postgres CDC is not enabled on all database systems by default. It has to be done by enabling changes to the configuration file. While a decoding plugin implements logical replication automatically, the same has to be done manually for PostgreSQL versions older than 10. Most managed PostgreSQL functions provide logical replication support such as AWS RDS, Google Cloud SQL, or Azure Database.
Pros
# Data is captured in real-time by log-based CDC, thereby providing continuous access to the current data from PostgreSQL to downstream applications.
# All types of changes like Update, Delete, and Insert are detected by this method, unlike the other two types.
# Since logical replication-based Postgres CDC leads to direct access to file systems, implementing changes to it does not affect the performance of the PostgreSQL database.
Cons
# Versions of PostgreSQL that were released prior to the launch of the 9.4 version do not support logical replication.
Among the three methods of Postgres CDC, the latest logical-based replication model is the most technologically advanced and the most effective one.