Dive deep into common Change Data Capture (CDC) techniques to efficiently track and understand database changes and their impacts.
Updated May 18, 2025
Data drives business today, but maintaining consistency is a constant challenge. As information multiplies across databases, lakes, and warehouses, discrepancies become inevitable. The shift to cloud platforms only adds another layer of complexity.
This is where Change Data Capture (CDC) shines. It's gaining popularity for good reason - CDC keeps your systems in sync regardless of their location or architecture. Whether you're connecting different cloud environments, integrating new platforms, or updating existing systems, CDC provides the real-time data consistency modern businesses demand. For data teams facing today's complex data landscape, CDC offers a practical, powerful solution that scales with your needs.
In this post, we'll explore CDC as a concept and provide practical steps for implementation.
Change Data Capture (CDC) is a common technique for tracking data changes in a database system. The goal is simple but critical: capture these changes and relay them in real time to downstream processes or systems. Instead of performing a full resync, downstream systems can then focus solely on the incremental changes, streamlining data management.
CDC serves multiple purposes. Historically, it gained traction as a more efficient substitute for batch data replication, especially when populating data warehouses for ETL jobs. Its importance has only grown in the cloud era, and today CDC is a cornerstone for many organizations making the strategic shift to cloud infrastructure.
In the world of Change Data Capture (CDC), various mechanisms help identify and capture alterations in data; understanding these techniques is pivotal for effective and efficient data synchronization.
This method involves adding a dedicated column to each table that reflects the time of the most recent change (e.g., `LAST_MODIFIED` or `LAST_UPDATED`). Downstream systems can then query this field to fetch records updated since their last check. One notable limitation is that this approach cannot capture `DELETE` operations, since a deleted row no longer exists to carry a timestamp.
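As a rough sketch of this approach (using sqlite3 purely so it runs anywhere; the `orders` table, `last_modified` column, and timestamps are hypothetical), each sync fetches only the rows modified after the previous checkpoint:

```python
import sqlite3

# A minimal sketch of timestamp-column CDC; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, last_modified TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'new',     '2024-01-01T00:00:00Z')")
conn.execute("INSERT INTO orders VALUES (2, 'shipped', '2024-01-02T12:30:00Z')")

def fetch_changes_since(checkpoint: str):
    """Pull only rows touched since the previous sync, plus the new checkpoint."""
    rows = conn.execute(
        "SELECT id, status, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (checkpoint,),
    ).fetchall()
    # The new checkpoint is the greatest timestamp seen so far.
    new_checkpoint = rows[-1][2] if rows else checkpoint
    return rows, new_checkpoint

changes, checkpoint = fetch_changes_since("2024-01-01T06:00:00Z")
print(changes)      # only row 2: it changed after the checkpoint
print(checkpoint)   # '2024-01-02T12:30:00Z'
```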
Databases also provide trigger functions: automated stored procedures activated upon specific events (e.g., `INSERT`, `UPDATE`, or `DELETE`). To capture data changes, one trigger is set per operation per table, that is, one each for `INSERT`, `UPDATE`, and `DELETE`. The captured changes are stored in a dedicated table, sometimes known as a shadow or event table.
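A minimal sketch of the trigger approach, again in sqlite3 for illustration, captures `UPDATE`s into a shadow table; the table and trigger names are hypothetical, and a full setup would add matching `INSERT` and `DELETE` triggers:

```python
import sqlite3

# Illustrative only: one trigger per operation per table; shown here for UPDATE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Shadow (event) table that accumulates captured changes.
CREATE TABLE orders_changes (
    id         INTEGER,
    operation  TEXT,
    old_status TEXT,
    new_status TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- In a full setup you would also create INSERT and DELETE triggers.
CREATE TRIGGER orders_update_capture AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changes (id, operation, old_status, new_status)
    VALUES (OLD.id, 'update', OLD.status, NEW.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
print(conn.execute("SELECT * FROM orders_changes").fetchall())
# [(1, 'update', 'new', 'shipped', '<timestamp>')]
```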
Databases log all changes into transaction logs. While these logs exist primarily for backup and recovery, they can also propagate changes to target systems. Data changes arrive in real time, and because target systems read directly from the transaction logs, there is no computational strain on the source database.
When people refer to CDC casually, this is usually the mechanism they mean. It goes by many names: logical replication in PostgreSQL, change streams in MongoDB, the binary log (binlog) in MySQL, and so on.
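For a flavor of what consuming the log looks like, the sketch below uses psycopg2's logical replication support against PostgreSQL; the connection string, slot name, and the `wal2json` output plugin are assumptions about your environment:

```python
import psycopg2
import psycopg2.extras

# Sketch of reading PostgreSQL's logical replication stream with psycopg2.
# The DSN, slot name, and wal2json plugin below are placeholders for your setup.
conn = psycopg2.connect(
    "dbname=appdb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.create_replication_slot("cdc_demo_slot", output_plugin="wal2json")
cur.start_replication(slot_name="cdc_demo_slot", decode=True)

def handle_change(msg):
    # msg.payload is a JSON document describing the INSERT/UPDATE/DELETE.
    print(msg.payload)
    # Acknowledge so the server can recycle WAL up to this point.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, invoking handle_change per event
```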
Operating on the database-provided replication log has several advantages over other methods of achieving the same result (for instance, taking periodic table snapshots): changes arrive in order and in near real time, every operation is captured, and reading the log adds no query load to the source database.
Capturing this change stream does have some tradeoffs: the raw log formats are database-specific and nontrivial to decode, and most databases retain their logs for only a limited window, so consumers must keep up or risk losing history.
Change Data Capture (CDC) operates as a watchdog for databases, ensuring that modifications within one platform seamlessly integrate into other systems. At its core, CDC is about mirroring the life of data: its creation, modifications, and deletions. But how exactly does this process work?
When there's an alteration — whether it's a new entry (INSERT), a modification (UPDATE), or a deletion (DELETE) — in a primary database like MySQL or Oracle, the ripple effect of this change needs to extend to secondary platforms. These can range from a cache or search index to broader storage repositories like data lakes or warehouses.
At the highest level, CDC's operation falls into two main methodologies: the 'push' model and the 'pull' model.
Here, the initiator is the source database. Whenever there's a change, it actively sends this modified data to the corresponding target systems. It's analogous to a news broadcaster providing live updates. Benefits include near real-time data synchronization, ensuring secondary systems always have the freshest information.
However, this immediacy does have a flip side. If the target platform is temporarily inaccessible, there's a risk of data loss. To mitigate this, many setups use a messaging system, acting as a bridge, ensuring data waits safely until it can reach its intended destination.
Conversely, in the pull approach, the onus is on the downstream systems. These systems keep checking the source database at regular intervals to see if there's new data to fetch. The source, in this approach, has a more passive role, simply logging the changes. To draw a parallel, it's akin to checking your mailbox for new letters periodically.
While this method might introduce a slight delay (as data updates are contingent on the frequency of checks), it does reduce the workload on the source database. And, similar to the push method, an intermediary messaging system is often used to shield against potential data loss if the target system is temporarily unreachable.
In essence, the choice between push and pull Change Data Capture methods boils down to the immediacy of your data needs. For applications craving real-time updates, the push approach is ideal. However, if your systems can afford a slight delay and prioritize bulk data transfers, the pull methodology would be more apt.
Debezium (pronounced "dee-BEE-zee-uhm") is an open-source tool that abstracts away the complexities of working with low-level database events. In a common Debezium architecture, Debezium connectors run on Kafka Connect, read the source database's transaction log, and publish change events to Kafka topics for downstream consumers.
Debezium currently ships connectors for MySQL, MongoDB, PostgreSQL, SQL Server, Oracle, and Cassandra, with work-in-progress connectors for Vitess and Google Spanner.
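To make this concrete, a Debezium connector is typically registered by POSTing a JSON configuration to the Kafka Connect REST API. The sketch below targets the PostgreSQL connector; the hostnames and credentials are placeholders, and the exact property names can vary between Debezium versions:

```python
import json
import requests

# Illustrative config for Debezium's PostgreSQL connector; property names vary by version.
connector = {
    "name": "inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.internal",   # hypothetical host
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",          # events land on inventory.<schema>.<table>
        "table.include.list": "public.orders",
    },
}

# Kafka Connect exposes a REST endpoint (commonly on port 8083) for managing connectors.
resp = requests.post(
    "http://connect.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```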
If your use case is low-latency, that is, you need near real-time updates from database change events, Debezium is a good fit. However, if you would prefer a 'pull' rather than 'push' model, i.e., polling for database changes at a regular interval and acting on them in batches, Debezium may require more work downstream of the Kafka connectors.
Overall, while Debezium offloads the labor of decoding raw log formats from various databases, it still demands significant hands-on work: for example, setting up and maintaining a cluster of Kafka brokers, along with the ZooKeeper nodes that manage them. In addition, downstream subscribers must do extra work to handle duplicate events, which can appear if the Debezium service goes down and restarts.
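As a rough sketch of that downstream work, a consumer can read Debezium's change events from Kafka and apply them idempotently, so replayed duplicates are harmless. The topic name and broker address are hypothetical, and the envelope fields (`op`, `before`, `after`) follow Debezium's documented event format:

```python
import json
from kafka import KafkaConsumer  # kafka-python; any Kafka client would do

consumer = KafkaConsumer(
    "inventory.public.orders",                # hypothetical topic produced by Debezium
    bootstrap_servers="kafka.internal:9092",
    group_id="orders-materializer",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

# Keyed upsert target; applying the same event twice leaves the same state,
# which is how duplicates (e.g., after a Debezium restart) are absorbed.
materialized = {}

for message in consumer:
    event = message.value
    if event is None:                          # tombstone records carry no payload
        continue
    payload = event.get("payload", event)      # envelope shape depends on converter settings
    op, before, after = payload["op"], payload.get("before"), payload.get("after")
    if op in ("c", "u", "r"):                  # create, update, snapshot read
        materialized[after["id"]] = after
    elif op == "d":                            # delete
        materialized.pop(before["id"], None)
```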
Ascend.io provides a unified platform for data engineering work, and bringing in CDC data takes just a few clicks.
Internally, Ascend reads and parses the appropriate change stream provided by the database (for example, using logical replication in Postgres). Users can configure these readers (Ascend Connectors) to fetch at a given frequency; Ascend then stores the last read checkpoint, pulls the events that occurred after it, and updates the checkpoint.

Because Change Data Capture is a strictly incremental dataset, that is, historical data is never mutated, it maps naturally onto Ascend's notion of data partitions. Each Ascend partition contains a series of change events, and users can compute over them in parallel to construct powerful solutions.
Regardless of whether you bring in CDC data via Debezium or Ascend, here are some common patterns for putting it to use.
Most databases only retain change logs for a limited period (typically a few days). This practice lets databases serve basic replication use cases while minimizing storage costs on the database servers. ETLing the changelog enables cost-effective retention of the full change stream in cheaper storage tiers, such as a cloud blob store, rather than on database disks.
This archived copy of the full change stream enables time-travel across the database's history and deeper analytics than can be performed on the re-materialized state alone.
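A hedged sketch of this archival pattern, assuming the change events for a table have already landed somewhere readable by Spark and that an object-store bucket is available, simply appends each batch to date-partitioned Parquet files:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("changelog-archive").getOrCreate()

# `server_ts` and the bucket paths below are hypothetical placeholders.
changes_df = spark.read.json("s3a://my-bucket/cdc/incoming/")

(changes_df
    .withColumn("event_date", F.to_date("server_ts"))
    .write
    .mode("append")                   # the archive only grows; history is never rewritten
    .partitionBy("event_date")
    .parquet("s3a://my-bucket/cdc/archive/orders/"))
```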
In Ascend, the replicated change streams are already stored in the blob store that backs your Ascend deployment, and they remain fully accessible for the more complex use cases below.
It may sound counter-intuitive: why would we reconstruct a table from a CDC event stream when we could just pull the latest table as-is?
As mentioned earlier, a full "SELECT *" is potentially a heavy operation to run against a production database. Compared to a full table sync, the "incremental column" approach puts much less load on the system: each sync fetches only the rows changed since the last recorded checkpoint. These changed rows are then combined with the previous table snapshot to reconstruct the latest state of the table, a recursive model that continuously applies new changes to the existing dataset.
Change Data Capture builds further on this incrementality: because it reads log files instead of querying the database, it reduces system load even more.
The following diagram illustrates the steps to re-materialize a database table to its latest state:
In Ascend, this recursive type of transformation (snapshot + delta) is natively supported in PySpark and Scala transforms. In Python, the code looks similar to this:
```python
from pyspark.sql import DataFrame, SparkSession, Window
import pyspark.sql.functions as F
from typing import List


def transform(spark_session: SparkSession, inputs: List[DataFrame], credentials=None) -> DataFrame:
    df = inputs[0]       # the new batch of change events
    prior = inputs[1]    # the previously materialized snapshot
    # On the very first run there is no prior snapshot, so the change batch is the table.
    if len(prior.columns) == 0:
        return df
    # Union new changes with the prior snapshot, keep only the newest version of each
    # row (by primary key and server timestamp), and drop rows whose latest event is a delete.
    return df \
        .union(prior) \
        .withColumn("versions_ago",
                    F.row_number().over(
                        Window.partitionBy("primary_key_col_name")
                              .orderBy(F.col("server_ts").desc()))) \
        .where("versions_ago == 1") \
        .where("operation != 'delete'") \
        .drop("versions_ago")
```
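In this transform, the new change events are unioned with the prior snapshot, a window over the primary key orders every version of a row by its server timestamp, and only the most recent version of each row survives; rows whose latest operation is a delete are dropped entirely. The column names `primary_key_col_name`, `server_ts`, and `operation` are placeholders for the key, change-timestamp, and operation-type columns in your own change stream.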
It is also straightforward to "time-travel" with CDC, that is, to reconstruct a source table as it existed at some point in the past (say, t1). To do that, we start from the initial table state and build the re-materialized table using only the change events recorded earlier than t1.
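Keeping the placeholder column names from the transform above, a sketch of time-travel is just the same re-materialization applied to a time-filtered change stream:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, Window


def materialize_as_of(changes: DataFrame, t1: str) -> DataFrame:
    """Rebuild the table as it looked at time t1 by replaying only earlier change events."""
    latest = Window.partitionBy("primary_key_col_name").orderBy(F.col("server_ts").desc())
    return (changes
            .where(F.col("server_ts") <= F.lit(t1))   # ignore anything after t1
            .withColumn("versions_ago", F.row_number().over(latest))
            .where("versions_ago == 1")
            .where("operation != 'delete'")
            .drop("versions_ago"))
```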
Change Data Capture is also valuable for answering questions about a dataset that other methods can hardly address, for example how often individual records change, or what a record looked like at any prior point in time.
These advanced use cases and more are all possible in Ascend by adding additional transformations on top of the change stream (since Ascend makes the full change stream accessible). Similar to the re-materialize example above, the same strategies of aggregating changes apply to creating slowly changing dimension tables, snapshot tables, or whatever other data set needs to be constructed.
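For instance, a Type 2 slowly changing dimension table can be derived from the same change stream by turning each version's timestamp into a validity interval; as before, the column names are placeholders rather than anything prescribed by Ascend:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, Window


def to_scd2(changes: DataFrame) -> DataFrame:
    """Derive valid_from / valid_to ranges for every historical version of each row."""
    by_key = Window.partitionBy("primary_key_col_name").orderBy(F.col("server_ts"))
    return (changes
            .withColumn("valid_from", F.col("server_ts"))
            # A version stays valid until the next change to the same key (open-ended if none).
            .withColumn("valid_to", F.lead("server_ts").over(by_key))
            .withColumn("is_current", F.col("valid_to").isNull()))
```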
Change Data Capture is a powerful pattern because it fully harnesses incrementality, which matters more and more as data volumes in transactional databases keep growing. At Ascend, we are excited to build with customers and explore the full potential of automated CDC dataflows.