Understanding Change Data Capture (CDC) w/ John from Striim
Part 2 of the series on real-time analytics infrastructure and use cases.
In part 1 of this series, we covered how streaming data analytics is going mainstream.
So it’s useful to understand the role of change data capture (CDC) in real-time analytics, since CDC is a key component of streaming data infrastructure.
In the episode, John Kutay from Striim, a real-time data integration solution, answers some fundamental questions about CDC like:
What is the role of CDC in analytics?
Why should companies care about capturing changes to data in real time?
How is CDC-enabled real-time ETL different from batch ETL?
And real-time analytics isn't just for data folks to understand.
John also explains how data-adjacent teams like Product and Growth can leverage real-time data to supercharge their day-to-day workflows.
Let’s dive in:
Q. What is the simplest definition of Change Data Capture or CDC?
In broad terms, change data capture is simply tracking changes in a system.
You can compare it to a change log — if you go into Google Sheets or Google Docs, you'll see a log of changes that have taken place (version history).
Data systems have similar concepts — databases have write-ahead logs which journal all the changes that take place inside a database, and change data capture is the process of tracking and collecting those changes in a usable manner.
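To make the idea concrete, here is a toy sketch of what "tracking and collecting changes in a usable manner" looks like: a stream of insert/update/delete events replayed against a replica table. The event format below is invented for illustration and isn't any particular database's log format.

```python
# Toy illustration: replaying change events (as a CDC pipeline might
# read them from a database's write-ahead log) against a replica table.
# The event shape here is made up for the example.

def apply_changes(replica: dict, events: list) -> dict:
    """Apply insert/update/delete change events to a key->row replica."""
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            replica[key] = ev["row"]
        elif op == "delete":
            replica.pop(key, None)
    return replica

log = [
    {"op": "insert", "key": 1, "row": {"user": "ada", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"user": "ada", "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"user": "bob", "plan": "free"}},
    {"op": "delete", "key": 2},
]
print(apply_changes({}, log))
# {1: {'user': 'ada', 'plan': 'pro'}}
```

Replaying the same log from the beginning always reconstructs the same state, which is why the same mechanism works for both disaster recovery and analytics replication.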
Okay, that sounds simple!
Q. So what is the role of CDC in analytics?
There's a long history of CDC — it was initially built for the recovery processes of databases.
When databases were first being built and rolled out into enterprises, the write-ahead logs from the databases were used as a disaster recovery mechanism. If a database was shut off in the middle of an operation, enterprises could use the logs to take the database back to a normal state.
However, the role of CDC is also very applicable to analytics because you can use that same process of mining the changes from the write-ahead log to feed the data into operational analytics systems such as data warehouses (like Snowflake and BigQuery) where you’d run your analytics and reporting.
Q. What are the prerequisites in terms of the data stack to enable change data capture?
Initially, change data capture projects will start where there's some use case for the analytics team to pull data out of the operational database.
Let's say your dev team uses MongoDB or Postgres as your backend database where you're tracking customer payments or signups, and your analytics team wants to build reports on top of that data.
In order to do that, you need to make sure that it's a cross-functional effort where engineering and analytics teams are working together to say, "Hey this is how we're gonna get the data, make sure it's secure, make sure it's efficient" because you don't want to create a client that's running more queries on top of a production engineering database.
You want to enable change data capture which is efficiently mining the logs. For instance, if you're running AWS RDS Postgres or Aurora, you have to enable write-ahead logging and think about the file rollover timeframes, etc.
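On stock Postgres, for instance, log-based CDC typically means setting `wal_level = logical` and creating a replication slot so the database retains WAL until the CDC consumer has read it (on RDS, logical replication is enabled through the `rds.logical_replication` parameter group setting rather than `postgresql.conf`). A minimal sketch, using Postgres's built-in `test_decoding` output plugin:

```sql
-- postgresql.conf (on RDS: set rds.logical_replication = 1 instead):
--   wal_level = logical

-- Create a logical replication slot; Postgres will retain WAL
-- segments until this slot's consumer has read them.
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- Peek at pending change records without consuming them.
SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);
```

Note John's point about rollover: an unread slot retains WAL indefinitely, so an abandoned slot can fill the disk — another reason this needs to be a joint effort with the team that owns the database.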
CDC is definitely cross-functional with the engineering teams that own the database and the analytics teams that wish to leverage that data.
Q. Why should companies care about capturing changes in real time?
So there's always the classic batch ELT, which is just taking changes from one system and applying them to another.
However, capturing changes in real time can be a real competitive advantage in terms of building real-time customer experiences.
Think about your everyday apps such as Uber — it basically connects you to a driver who's in your area now, not one who was in your area 30 minutes ago, and Uber uses real-time data infrastructure to do that.
Q. And typically, how big or small are the data teams at companies that successfully implement a CDC infrastructure?
It's across the board. Bigger teams might invest in rolling their own CDC infrastructure, but since they have more responsibilities, they might end up with an out-of-the-box product like Striim.
That said, small data teams can also use change data capture because ultimately, it's about collecting data from the cloud database and pushing it to the analytics system — essentially, teams of all sizes can implement a CDC infrastructure.
Q. Can you briefly explain how CDC-enabled real-time ETL is different from Batch ETL?
Batch ETL is inherently built on batch processing systems. Whenever a tool has terminology in it like "transform jobs" or "sync jobs", it's essentially collecting and processing data in batches which will always introduce latency somewhere in the pipe.
Even if I'm doing change data capture from the database, if the transform job and load job are on a batch frequency, that's going to add at least 15 to 30 minutes or an hour of latency in the process.
CDC with streaming will actually enable streaming ETL where you can capture and load the data as soon as it's available.
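The latency difference John describes can be shown with a toy simulation (the 15-minute batch interval and half-second streaming delay below are illustrative assumptions, not measurements of any product):

```python
# Toy contrast between batch and streaming delivery of the same change
# events. Timestamps are in seconds; the batch job flushes every 900 s
# (15 min), while the streaming path applies each event on arrival.

BATCH_INTERVAL = 900  # an assumed 15-minute sync frequency

def batch_latencies(event_times: list) -> list:
    """Each event waits until the next scheduled batch flush."""
    out = []
    for t in event_times:
        next_flush = ((int(t) // BATCH_INTERVAL) + 1) * BATCH_INTERVAL
        out.append(next_flush - t)
    return out

def streaming_latencies(event_times, processing_delay=0.5):
    """Streaming ETL applies each change as soon as it is captured."""
    return [processing_delay for _ in event_times]

events = [10.0, 450.0, 899.0]
print(max(batch_latencies(events)))      # worst case: 890.0 s
print(max(streaming_latencies(events)))  # 0.5 s
```

An event that lands just after a flush waits nearly the full interval, which is where the "at least 15 to 30 minutes" of latency comes from regardless of how fast the capture step itself is.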
Q. Can you describe the top two use cases for real-time data streaming?
Depending on the industry there are tons of popular use cases for CDC.
A major airline I work with leverages CDC to send maintenance data from aircraft to the ops teams in real time, cutting down the cycles where people spend time waiting on the airplane for maintenance.
A healthcare company I work with takes health records and puts them into a smart analytics system, centralizing them for their care teams in real time as well.
But a very generic horizontal use case is simply moving data from operational systems to analytical systems in an event-driven format, without repeatedly copying full tables.
Essentially, CDC eliminates waste and optimizes performance and costs in the process of moving the data.
As you know, the goal here with this show is to enable less technical people or even non-data people to learn more about this stuff.
Q. So how can Product and Growth people — folks who work in data-adjacent teams — use real-time data in their day-to-day?
I run the growth team here at Striim and I'm very familiar with taking data from databases and actioning it for RevOps, Sales, and Marketing use cases.
For example, activating customer data during instances such as, "Okay this customer's usage has spiked in the last 30 minutes, so we should assign a support engineer right now to make sure that the customer isn’t running into any issues or incurring unforeseen costs which may upset them".
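A trigger like that can be sketched in a few lines: flag a customer whose event count in the trailing 30-minute window far exceeds their usual rate. The window size, threshold factor, and baselines below are made-up illustrative values.

```python
# Toy sketch of the usage-spike trigger described above: alert when a
# customer's event count in the last 30 minutes exceeds a multiple of
# their baseline. Thresholds and window sizes are assumptions.

WINDOW = 30 * 60       # 30-minute window, in seconds
SPIKE_FACTOR = 3.0     # alert when usage is 3x the baseline per window

def is_usage_spike(event_times: list, now: float,
                   baseline_per_window: float) -> bool:
    """Return True if events in the trailing window exceed the baseline."""
    recent = [t for t in event_times if now - t <= WINDOW]
    return len(recent) > SPIKE_FACTOR * baseline_per_window

now = 10_000.0
quiet = [now - 3600, now - 3000]            # old events only
busy = [now - i * 10 for i in range(100)]   # 100 events in ~17 minutes
print(is_usage_spike(quiet, now, baseline_per_window=5))  # False
print(is_usage_spike(busy, now, baseline_per_window=5))   # True
```

The point is less the arithmetic than the freshness of the input: this check is only useful if the usage events arrive within minutes, which is what CDC-fed streaming pipelines provide.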
Or building real-time customer experiences by ensuring that the inventory that customers see when shopping on an e-commerce site is real-time and not stale.
Imagine this: You visit an e-commerce site, an item you wanted is in stock and ships tomorrow, you buy it, and then suddenly you find out that the item is actually out of stock (the inventory data was stale). This is not an experience you’d want your customers to have.
These are a couple of common examples but there are many more use cases for data-adjacent teams to leverage real-time data in their workflows.
Q. Last question — what should companies look for when evaluating CDC vendors?
There's been so much innovation in the modern data stack and cloud products that have made it easy for people of all skill levels to do analytics — I believe people should double down on that strategy when looking for CDC vendors.
Look for a product that runs fully in the cloud and handles all the edge cases out of the box.
Especially if you have non-technical users who wish to leverage real-time data, and you want very low maintenance in terms of CDC.
So I’d definitely recommend an out-of-the-box solution for teams that wish to do very good operational analytics with both non-technical and technical people working together.
If your team is purely technical and you have a very large engineering organization, you can consider stitching together a bunch of tools, especially if you must build things in-house.
Whether you go with a build-your-own or a fully managed solution like Striim for change data capture, you must guarantee to your business users that you're meeting the data SLAs and SLOs.
The product should deliver the data within the timeframe your stakeholders expect, and no matter what, it should make it very easy for you to have visibility into your CDC workflows.
Prefer watching the interview?
If you found this useful, check out part 1 of this series on real-time analytics infrastructure: