Change Data Capture (CDC)
Change Data Capture (CDC) is a design pattern used in modern data architectures to detect and track changes in data. This process helps synchronize the changes from the source system to a target system, such as data warehouses, data lakes, or other databases.
There are two primary strategies for capturing changes: push-based and pull-based. Each category includes several mechanisms.
Below, we explore both, offering examples and considering the challenges that might be encountered.
Pull-Based Mechanisms
Row Versioning
Row versioning detects changes to individual records by storing a version number on each row of a database table.
How It Works:
- Version Column: A dedicated column, usually named something like “version”, is added to the database table. For each record, it stores an integer that represents the version of that record.
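- Incrementing the Version: Every time a record is changed, its version number is incremented by 1. This covers any modification to the record's data. A physically deleted row no longer carries a version, so deletions are visible to this scheme only if they are recorded as soft deletes (for example, by setting a flag on the row).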
- Target System Processing: The target system maintains a reference table that stores the last known version for each record. It periodically queries the source for rows whose version number is greater than the stored value, captures these records, and applies the changes to its own copy of the data.
- Update Reference Table: After processing, the reference table is updated with the new version numbers so that the same changes are not captured again on the next poll.
Example:
Let’s consider a customer database with the following record:
If the customer updates their email, the database record would change to:
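As a minimal sketch of such a before-and-after pair, the snippet below models a customer record and the effect of an email update on it. The field names and all values (`customer_id`, `name`, the email addresses) are made-up placeholders chosen for illustration; only the presence of a version number that increases by 1 comes from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CustomerRecord:
    # Hypothetical fields for illustration only.
    customer_id: int
    name: str
    email: str
    version: int


def with_new_email(record, new_email):
    # The updated record keeps its identity, carries the new email,
    # and has its version incremented by 1.
    return replace(record, email=new_email, version=record.version + 1)


before = CustomerRecord(customer_id=1, name="Jane Doe",
                        email="jane@example.com", version=1)
after = with_new_email(before, "jane.doe@example.com")

print(before)
print(after)
```

A target system that last saw version 1 of this record would pick it up on its next poll, because the stored version is now 2.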
Challenges:
- Complexity: Requires careful tracking of each record’s version number.
- Overhead: Maintaining the reference table adds extra storage and processing overhead.
- Concurrency: Concurrent updates to the same record must be handled to prevent conflicts in version numbers.
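One common way to address the concurrency challenge is optimistic locking: a writer updates a row only if the version it originally read is still current. The sketch below is an assumption about how this could be applied here, not a method prescribed by the text above; the `customers` table, its columns, and both helper functions are hypothetical.

```python
import sqlite3


def make_demo_db():
    # Hypothetical source table with a single customer at version 1.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO customers (id, email, version) "
        "VALUES (1, 'old@example.com', 1)"
    )
    return conn


def update_email_if_unchanged(conn, customer_id, expected_version, new_email):
    # The WHERE clause only matches if nobody else has bumped the version
    # since this writer read the row. The version check and the increment
    # happen in one statement, so two writers cannot both succeed from the
    # same starting version.
    cur = conn.execute(
        "UPDATE customers SET email = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_email, customer_id, expected_version),
    )
    return cur.rowcount == 1  # False means the caller's copy was stale


if __name__ == "__main__":
    conn = make_demo_db()
    # Two writers both read the row at version 1; only the first update applies.
    print(update_email_if_unchanged(conn, 1, 1, "first@example.com"))
    print(update_email_if_unchanged(conn, 1, 1, "second@example.com"))
```

A writer whose update is rejected would typically re-read the row, obtain the current version, and retry or report the conflict.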