In an era where data is at the core of business operations, ensuring consistency and accuracy is paramount. However, the constant flux in data can lead to inconsistencies among databases, data lakes, and data warehouses. Add to this the current trend of cloud migration, and the risk of data silos grows even more.
Change Data Capture (CDC) is becoming a go-to solution. It’s essential for keeping systems in sync, especially when bridging on-premises setups with cloud environments. Whether planning a phased cloud transition or maintaining a hybrid system, CDC offers the robust solution data engineers need.
What is Change Data Capture?
Change Data Capture (CDC) is a common technique for tracking data changes in a database system. The goal is simple but critical: capture these changes and relay them in real time to downstream processes or systems. This ensures that instead of performing a full resync, downstream systems can focus solely on the incremental changes, streamlining data management.
At its core, CDC serves multiple purposes:
- Data Integrity & Consistency: CDC ensures that data remains consistent and integral across diverse systems and deployment scenarios, eliminating the risks of data discrepancies.
- Optimal Tool Utilization: CDC empowers organizations to transition data from traditional databases to specialized platforms like document databases, search databases, or data warehouses. This flexibility allows companies to leverage the best tools for specific tasks.
Historically, CDC gained traction as a more efficient substitute to batch data replication, especially when populating data warehouses for ETL jobs. However, its importance has grown exponentially in the cloud era. Today, CDC stands as a cornerstone for many organizations making the strategic shift to cloud infrastructure.
CDC Common Mechanisms
In the world of Change Data Capture (CDC), various mechanisms help identify and capture alterations in data; understanding these techniques is pivotal for effective and efficient data synchronization.
Timestamp-Based CDC
This method involves adding a dedicated column to tables which reflects the time of the most recent change (e.g., `LAST_MODIFIED` or `LAST_UPDATED`). Downstream systems can then query this field to fetch records updated since their last check.
- Pros:
  - Straightforward to implement.
- Cons:
  - Only captures soft deletes, not `DELETE` operations.
  - Imposes computational load on the source system due to full-row scans.
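A minimal sketch of a timestamp-based sync, assuming a hypothetical `orders` table with a `last_modified` column and a generic DB-API connection (none of these names come from a specific system):

```python
# Minimal sketch of a timestamp-based incremental pull (hypothetical table and column names).
# `conn` is any DB-API connection, e.g. one returned by psycopg2.connect(...).
def fetch_changes_since(conn, last_checkpoint):
    cursor = conn.cursor()
    # Only rows touched after the previous checkpoint are returned; rows removed with a
    # hard DELETE leave nothing behind, so this query never sees them.
    cursor.execute(
        "SELECT id, payload, last_modified FROM orders "
        "WHERE last_modified > %s ORDER BY last_modified",
        (last_checkpoint,),
    )
    rows = cursor.fetchall()
    # Advance the checkpoint to the newest timestamp seen, or keep the previous one.
    new_checkpoint = max((row[2] for row in rows), default=last_checkpoint)
    return rows, new_checkpoint
```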
Trigger-Based CDC
Databases have trigger functions, which are automated stored procedures activated upon specific events (e.g., `INSERT`, `UPDATE`, or `DELETE`). To capture data changes, one trigger per operation per table is set. Changes are stored in a dedicated table, sometimes known as a shadow or event table.
- Pros:
  - Detects all change types: `INSERT`, `UPDATE`, and `DELETE`.
  - Supported by most databases.
- Cons:
  - Impacts source database performance due to additional writes.
  - Requires alterations to the source database schema.
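To make this concrete, here is a rough sketch of a trigger-based setup for a hypothetical `orders` table, written as PostgreSQL-style DDL wrapped in Python for consistency with the examples later in this article; the table, function, and trigger names are illustrative:

```python
# Hypothetical trigger-based CDC setup: a shadow table plus one trigger on the source table.
SHADOW_TABLE_DDL = """
CREATE TABLE orders_changes (
    change_id  BIGSERIAL PRIMARY KEY,
    operation  TEXT NOT NULL,               -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_at TIMESTAMPTZ DEFAULT now(),
    row_data   JSONB                        -- full copy of the affected row
);

CREATE FUNCTION capture_orders_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO orders_changes (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
        RETURN OLD;
    ELSE
        INSERT INTO orders_changes (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_orders_change();
"""

def install_cdc_triggers(conn):
    # After this runs, every write to `orders` also writes to `orders_changes`,
    # which is exactly the extra write load and schema change noted in the cons above.
    with conn.cursor() as cur:
        cur.execute(SHADOW_TABLE_DDL)
    conn.commit()
```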
Log-Based CDC
Databases log all changes into transaction logs. While primarily for backup and recovery, these can propagate changes to target systems. Data changes are real-time, and as target systems read directly from transaction logs, there’s no computational strain on source databases.
- Pros:
  - No added load on source databases.
  - Captures all change types without needing schema changes.
- Cons:
  - Lack of standardization in transaction log formats across vendors.
  - Target systems must discern and negate changes rolled back in source databases.
A Deeper View Into Log-Based Change Data Capture
When people refer casually to CDC, this is usually the mechanism they mean. It goes by many names: logical replication in PostgreSQL, change streams in MongoDB, binary logs in MySQL, etc.
Operating on the database-provided replication log has several advantages over other methods of achieving the same results (for instance, taking periodic table snapshots):
- Less latency: Subscribers of Change Data Capture event streams can get near real-time updates of transactions going on in the database if needed.
- Less load on source systems: Pulling entire table snapshots is expensive for large tables where older records are only sporadically updated. If such pulls happen at a frequent interval, it may significantly impact databases’ normal functionalities. Reading from the database logs stored in the file system, on the other hand, does not rely on querying production data.
- Capturing deletes: Record deletion is often hard to track and may require building an external lookup table or modifying application code. Reading from the event stream, however, makes capturing deletes no different than insertion and updates—just another event type to deal with separately.
Capturing this change stream does have some tradeoffs:
- Setup / Different Security Boundaries: Most databases make capturing the change stream a different setup process than issuing regular queries, and typically require additional setup because the change stream is often scoped to the entire database rather than to a single table or schema. For use cases with strict security requirements, access to the event stream may not be allowed, or may require re-architecting so that the data needing an event stream is split out into a separate, isolated database.
- Maintenance Overhead: Some databases can run out of disk space and crash if a replication slot is created to store change stream events and the consumer either stops consuming or cannot keep up. At other times, a connectivity issue lasting longer than the change stream's retention period can leave gaps in the stream. Issues like these increase maintenance overhead (see the PostgreSQL sketch below for the moving parts involved).
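As a simplified illustration of what log-based capture can look like in practice, the sketch below uses PostgreSQL's built-in logical replication SQL functions with the `test_decoding` output plugin. It assumes `wal_level = logical`, sufficient replication privileges, and illustrative connection and slot names:

```python
# Hypothetical sketch: consuming PostgreSQL's logical replication stream (log-based CDC).
import psycopg2

conn = psycopg2.connect("dbname=app user=cdc_reader")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: a replication slot pins WAL on the server until it is consumed.
# If nothing drains the slot, the database can run out of disk space, which is the
# maintenance overhead described above.
cur.execute("SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');")

# Periodic consumption: read all pending change events and advance the slot.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);"
)
for lsn, xid, data in cur.fetchall():
    # `data` is a text description such as:
    #   table public.orders: UPDATE: id[integer]:42 status[text]:'shipped'
    print(lsn, xid, data)
```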
How Change Data Capture Works
Change Data Capture (CDC) operates as a watchdog for databases, ensuring that modifications within one platform seamlessly integrate into other systems. At its core, CDC is about mirroring the life of data: its creation, modifications, and deletions. But how exactly does this process work?
When there’s an alteration — whether it’s a new entry (INSERT), a modification (UPDATE), or a deletion (DELETE) — in a primary database like MySQL or Oracle, the ripple effect of this change needs to extend to secondary platforms. These can range from a cache or search index to broader storage repositories like data lakes or warehouses.
At the highest level, CDC’s operation can be bifurcated into two main methodologies: the ‘push’ model and the ‘pull’ model.
1. The Push Method
Here, the initiator is the source database. Whenever there’s a change, it actively sends this modified data to the corresponding target systems. It’s analogous to a news broadcaster providing live updates. Benefits include near real-time data synchronization, ensuring secondary systems always have the freshest information.
However, this immediacy does have a flip side. If the target platform is temporarily inaccessible, there’s a risk of data loss. To mitigate this, many setups use a messaging system, acting as a bridge, ensuring data waits safely until it can reach its intended destination.
2. The Pull Method
Conversely, in the pull approach, the onus is on the downstream systems. These systems keep checking the source database at regular intervals to see if there’s new data to fetch. The source, in this approach, has a more passive role, simply logging the changes. To draw a parallel, it’s akin to checking your mailbox for new letters periodically.
While this method might introduce a slight delay (as data updates are contingent on the frequency of checks), it does reduce the workload on the source database. And, similar to the push method, an intermediary messaging system is often used to shield against potential data loss if the target system is temporarily unreachable.
In essence, the choice between push and pull Change Data Capture methods boils down to the immediacy of your data needs. For applications craving real-time updates, the push approach is ideal. However, if your systems can afford a slight delay and prioritize bulk data transfers, the pull methodology would be more apt.
Acquiring CDC Data
CDC with Debezium
Currently, Debezium supports six databases: MySQL, MongoDB, PostgreSQL, SQL Server, Oracle, and Cassandra. In addition, there are work-in-progress Debezium connectors for Vitess and Google Spanner.
If your use case is a low-latency one, that is, getting near real-time updates from database change events, Debezium would be a good fit. However, if you would prefer a “pull” rather than a “push” model, i.e., polling database changes at a regular interval and acting upon them in batches, Debezium may require more work downstream of the Kafka connectors.
Overall, while Debezium offloads the labor of decoding raw log formats from various databases, it still requires significant hands-on work: for example, setting up and maintaining a cluster of Kafka brokers, as well as the ZooKeeper nodes that manage them. In addition, downstream subscribers must do extra work to handle duplicate events in the event of the Debezium service going down, as sketched below.
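One common way to handle those duplicates is to track a monotonically increasing position from each event's `source` block and skip anything already seen. The sketch below is only an outline: it assumes Debezium's PostgreSQL connector (whose events carry a `source.lsn`), an illustrative topic name, and a user-defined `handle_change` function:

```python
# Hypothetical sketch: deduplicating at-least-once Debezium deliveries from Kafka.
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "dbserver1.public.orders",                 # illustrative Debezium topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

last_seen_lsn = -1                             # in production, persist this checkpoint durably
for message in consumer:
    envelope = message.value
    event = envelope.get("payload", envelope)  # unwrap if the converter includes schemas
    lsn = event["source"]["lsn"]
    if lsn <= last_seen_lsn:
        continue                               # duplicate from a redelivery; skip it
    handle_change(event["op"], event["before"], event["after"])  # user-defined handler
    last_seen_lsn = lsn
```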
CDC with Ascend
Ascend.io provides a unified platform for data engineering work. Grabbing CDC data in Ascend takes just a few clicks.
Internally, Ascend reads and parses the appropriate change stream provided by the database (for example, using logical replication in Postgres). Users can configure these readers (Ascend Connectors) to fetch at a given frequency. Ascend will then store the last read checkpoint, pull events occurring after the checkpoint, and then update the checkpoint. Because Change Data Capture is strictly an incremental dataset (historical data is never mutated), it maps naturally onto Ascend’s notion of data partitions. Each Ascend partition contains a series of change events, and users can compute over them in parallel to construct powerful solutions.
Common Change Data Capture Usage Patterns
Backup
Most databases only retain change logs for a limited period (typically a few days). This practice allows databases to serve basic replication use cases while minimizing storage costs on the database servers. ETLing this changelog enables cost-effective retention of the full change stream in cheaper storage tiers, such as a cloud blob store instead of the database’s local disk.
This archived copy of the full change stream allows for more complex use cases of time-travel across the database and deeper analytics than can be performed on the re-materialized state.
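As a rough sketch of the pattern, a periodic job might append each batch of change events to object storage as newline-delimited JSON. The bucket, key layout, and `events` structure below are hypothetical:

```python
# Hypothetical sketch: archiving a batch of change events to a cloud blob store (S3 here).
import json
import boto3

def archive_change_batch(events, batch_start_lsn):
    # Keying objects by the first log position in the batch keeps the archive ordered
    # and makes re-uploading the same batch idempotent.
    body = "\n".join(json.dumps(event) for event in events)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="cdc-archive",
        Key=f"orders/changelog/{batch_start_lsn:020d}.jsonl",
        Body=body.encode("utf-8"),
    )
```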
In Ascend, the replicated change streams are already stored in the blob store that backs your Ascend deployment. These change streams are then fully accessible to enable the more complex use cases below.
Reconstruction and Time-Traveling
It may sound counter-intuitive: why would we reconstruct a table from the CDC event stream when we could just pull the latest table as-is?
As mentioned earlier, “SELECT *” is potentially a heavy operation to be performed on a production database. Compared to full table sync, the “incremental column” approach puts much less load on the system, where each sync only fetches rows changed or modified since the last recorded checkpoint. These changed rows then get combined with the previous table snapshot, reconstructing the latest state of table data—a recursive model to continuously apply new changes on the existing dataset.
Change Data Capture builds on this incrementality idea. Since it reads log files instead of querying the database, it reduces system load even further.
The following diagram illustrates the steps to re-materialize a database table to its latest state:
1. First, take a one-time table snapshot before the CDC stream is enabled, and add an “operation_ts” column to all rows with the value of the current timestamp.
2. Read in a batch of CDC events from the database logs. This is usually configured at an interval based on how frequently downstream use cases need the dataset available.
3. For each batch of CDC events, construct an “incremental table” containing all changed rows. This mini-table has the same schema as the source table, plus an “operation_ts” column recording the timestamp of the last transaction applied to each changed row.
4. Union the initial snapshot with the “incremental table”. For any row that appears in both the “snapshot table” and the “incremental table”, compare their “operation_ts” values and keep the newer version; if a row appears in only one table, use that version.
5. The result of step 4 is the final, reconstructed table, representing the latest state of the original database. This state becomes the initial snapshot for the next iteration of step 4, making the approach recursive.
In Ascend, this recursive type of transformation (snapshot + delta) is native to PySpark and Scala transformations. The code (in Python) looks similar to:
```python
from pyspark.sql import DataFrame, SparkSession, Window
import pyspark.sql.functions as F
from typing import List


def transform(spark_session: SparkSession, inputs: List[DataFrame], credentials=None) -> DataFrame:
    df = inputs[0]     # the new batch of change events (the delta)
    prior = inputs[1]  # the previously reconstructed table (the snapshot)
    # On the first run there is no prior snapshot yet, so the delta is the full state.
    if len(prior.columns) == 0:
        return df
    # Combine snapshot and delta, keep only the newest version of each primary key,
    # and drop rows whose latest change was a delete.
    return df \
        .union(prior) \
        .withColumn("versions_ago",
                    F.row_number().over(
                        Window.partitionBy("primary_key_col_name").orderBy(F.col("server_ts").desc()))) \
        .where("versions_ago == 1") \
        .where("operation != 'delete'") \
        .drop("versions_ago")
```
Advanced Usage Patterns
Change Data Capture is also valuable for answering questions about a dataset that other methods can hardly address:
- Auditing: Databases reflect the most up-to-date state of data and lose the trace of how they got there. CDC data can serve as bread crumbs and help answer compliance-related questions, sometimes even performing true-ups when records in various systems do not match.
- Analytics: Questions like “How many times did users change their addresses in the past 6 months?” or “How often did tickets get reopened last month?” cannot be answered with a simple “current state” table. Unlocking this analytical value typically requires application developers to store history; alternatively, a CDC event stream brings in this data without additional development work (see the sketch after this list).
- Anomaly detection: Corrupt data or unintended schema changes in source databases often lead to unexpected chaos downstream. With the CDC event stream, one can build a notification system to alert on these undesired events. Even better, we can build logic to filter out or quarantine these changes so they never flow downstream when reconstructing tables for analytical purposes. This pattern is possible because the CDC dataset’s schema is fixed by the schema of the change event rather than by the target table’s own schema. Within the change event, certain columns, such as the record values or the record’s schema, are encoded in JSON, yielding a semi-structured format that isn’t brittle in the face of schema changes.
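For example, the address-change question above can be answered with a small aggregation over the change stream. This sketch assumes a PySpark DataFrame `cdc_events` of change events with hypothetical `user_id`, `address`, `previous_address`, `operation`, and `server_ts` columns:

```python
# Hypothetical sketch: counting address changes per user over the last 6 months,
# computed from the CDC event stream rather than from the current-state table.
import pyspark.sql.functions as F

address_changes_per_user = (
    cdc_events
    .where(F.col("operation") == "update")
    .where(F.col("server_ts") >= F.add_months(F.current_timestamp(), -6))
    .where(F.col("address") != F.col("previous_address"))   # keep only rows where the address changed
    .groupBy("user_id")
    .count()
)
```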
These advanced use cases and more are all possible in Ascend by adding transformations on top of the change stream (since Ascend makes the full change stream accessible). As in the re-materialization example above, the same strategies for aggregating changes apply to creating slowly changing dimension tables, snapshot tables, or any other dataset that needs to be constructed.