Change Data Capture (CDC)


Change Data Capture (CDC) is a design pattern used in modern data architectures to detect and track changes in data. This process helps synchronize the changes from the source system to a target system, such as data warehouses, data lakes, or other databases.

CDC lets downstream systems detect and respond to changes in a data source without reloading the full dataset. There are two primary strategies for capturing changes, push-based and pull-based, and each offers several mechanisms.

Below, we explore both, offering examples and considering the challenges that might be encountered.

Pull-Based Mechanisms

Row Versioning

Row versioning detects changes to individual records in a database table by keeping a version number on every row.

How It Works:

Version Column: A dedicated column, usually named something like “version”, is added to the database table. For each record it stores an integer that represents the record’s current version.
Incrementing the Version: A new record starts at an initial version, and every later change to the record increments the version number by 1. A delete can only be detected this way if the row is kept and flagged as deleted (a soft delete), because a physically removed row has no version left to read.
Target System Processing: The target system maintains a reference table that stores the last known version for each record. It periodically checks for rows with a version number greater than the stored value, captures these records, and reflects the changes in its system.
Update Reference Table: The reference table must be updated to reflect the new version numbers for the records processed.

Example:

Let’s consider a customer database with the following record:

If the customer updates their email, the database record would change to:
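
As a minimal sketch of such a before/after pair, assume a hypothetical customer record. The field names and values below are illustrative only and are not taken from a real system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Customer:
    # Hypothetical fields chosen for illustration.
    customer_id: int
    name: str
    email: str
    version: int


def with_new_email(record, email):
    # Changing the email also increments the version by 1.
    return replace(record, email=email, version=record.version + 1)


before = Customer(customer_id=101, name="Jane Doe",
                  email="jane@example.com", version=1)
after = with_new_email(before, "jane.doe@example.com")

print(before)
print(after)
```

The only signal the target needs is that the version moved from 1 to 2. It does not have to compare the other columns to know the record changed.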

Challenges:

Complexity: Requires careful tracking of each record’s version number.
Overhead: Maintaining the reference table adds extra storage and processing overhead.
Concurrency: Concurrent updates to the same record must be handled to prevent conflicts in version numbers.
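
One common way to address the concurrency challenge is an optimistic version check: an update succeeds only if the row still carries the version the writer originally read. The sketch below assumes the same hypothetical customers table and uses SQLite. It illustrates the general technique and is not necessarily how any particular CDC tool handles conflicts.

```python
import sqlite3


def update_email_if_current(conn, customer_id, expected_version, email):
    # The WHERE clause matches only if nobody else has bumped the version
    # since this writer read the row.
    cur = conn.execute(
        "UPDATE customers SET email = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (email, customer_id, expected_version),
    )
    return cur.rowcount == 1  # False means another writer got there first


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, email TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com', 1)")

# Two writers both read the row at version 1 and then try to update it.
first_ok = update_email_if_current(conn, 1, 1, "first@example.com")
second_ok = update_email_if_current(conn, 1, 1, "second@example.com")

final_row = conn.execute(
    "SELECT email, version FROM customers WHERE id = 1"
).fetchone()
print(first_ok, second_ok, final_row)
```

The losing writer can re-read the row and retry, so each successful change still produces exactly one version increment.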


Written by Venkatakrishnan
