Data observability is an established category, but the tools that fall under it don't necessarily have the same capabilities or even solve the same problems.
There are infrastructure monitoring tools, pipeline monitoring tools, and tools to monitor the actual data that rests in a database/warehouse/lake. And then there are data testing tools and tools to understand data lineage.
In this episode, Kevin Hu makes it sound all too simple, and he does it with a big smile.
But that's not all: Kevin is a brilliant mind, so we also got him to share some advice for companies looking to invest in data observability efforts.
Let’s dive in:
Q. Please tell us: what exactly is data observability?
A. Data observability is the degree of visibility you have into your data systems. That visibility supports many use cases, from detecting data issues to understanding their impact and diagnosing their root cause.
There's a fair bit of confusion in the data observability space as there are many tools with varying capabilities. So let's try to address that.
Q. Can you first describe what data infrastructure monitoring is?
A. Infrastructure monitoring is a space that emerged decades ago but really came to the fore around 10 years ago with the rise of cloud platforms like Amazon Web Services. Tools like Datadog, Splunk, and New Relic help you understand whether your infrastructure is healthy: how much free storage you have in your database, the median response times of your API, the RAM usage of an EC2 instance. This is really critical for software teams, especially as they deploy more and more of their resources into the cloud.
Q. And can you explain what data pipeline monitoring is?
A. Pipelines, to put it simply, take A, turn it into B, and put it into C. This pattern shows up across the data system, whether it's using Airflow to pull data from a first-party system into your data warehouse, transforming data within the warehouse, or preparing features for machine learning. Data pipeline monitoring, at the first level, tries to understand: are my jobs running? That's a surprisingly hard question to answer sometimes. The level-two question is: are my jobs running correctly? As I take A, turn it into B, and put it into C, is A what I expect, is B what I expect, and was it loaded into C correctly?
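To make that concrete, here's a minimal Python sketch of those two levels of checks, with hypothetical extract, transform, and load functions standing in for a real pipeline; none of the names come from any specific tool.

```python
# A sketch of the two levels of pipeline monitoring described above.
# The functions and data are illustrative stand-ins, not a real tool's API.

def extract_a() -> list[dict]:
    """Pull raw rows from a source system (stand-in for a real extract)."""
    return [{"order_id": 1, "amount": 42.0}]

def transform_b(rows: list[dict]) -> list[dict]:
    """Turn A into B: derive a cents column from the amount."""
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

def load_c(rows: list[dict]) -> int:
    """Put B into C; return the number of rows written (stand-in for a load)."""
    return len(rows)

def run_pipeline() -> None:
    a = extract_a()
    # Level two: is A what I expect?
    assert len(a) > 0, "extract returned no rows"

    b = transform_b(a)
    # Level two: is B what I expect?
    assert all(r["amount_cents"] >= 0 for r in b), "negative amounts after transform"

    written = load_c(b)
    # Level two: was it loaded into C correctly?
    assert written == len(b), f"loaded {written} rows, expected {len(b)}"

if __name__ == "__main__":
    # Level one: did the job run at all? A scheduler would alert on a nonzero exit.
    run_pipeline()
    print("pipeline ran and all checks passed")
```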
You make it sound so simple!
Q. What about monitoring the actual data in the warehouse? How would you describe that?
A. Cloud data warehouses like Snowflake, Redshift, and BigQuery are increasingly the center of gravity of data within companies. To put it more simply, it's where you put everything. And a lot of applications, whether it's a BI tool like Looker, a reverse-ETL tool, or a machine learning model, are mounted on top of the warehouse. So data warehouse monitoring tries to understand whether the data within the warehouse that all of these systems depend on is correct.
Q. Some observability tools also offer data cataloging and data lineage capabilities. Can you explain those briefly?
A. Data cataloging addresses the question: what does this data mean? There is a gap between how data is represented in a technical system and the business objects it represents. A data catalog is an easy way to attach semantic meaning to the objects within your data system: here's how a metric is derived, here's how a table is derived. So when the VP of Data asks you about a revenue metric, you can point them to the data catalog instead of having to type out the answer.
Data lineage solves the problem of understanding how the data within your system relate to one another. If you trace data all the way back to its source, either a machine created it or a human entered it, but the end users of data rarely work with that raw form. In some ways, the job of a data team is to turn raw data into an analytics-ready form that can serve many different purposes. Data lineage tracks data all the way from the source down to where it's used, and sometimes beyond.
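As an illustration, lineage can be thought of as a directed graph over tables and dashboards. Here's a small Python sketch with hypothetical table names; real lineage tools typically derive these edges by parsing SQL rather than hard-coding them.

```python
# A toy lineage graph: each node maps to its upstream parents.
# Table and dashboard names are hypothetical.
UPSTREAM: dict[str, list[str]] = {
    "revenue_dashboard": ["analytics.revenue"],
    "analytics.revenue": ["staging.orders", "staging.refunds"],
    "staging.orders": ["raw.app_db_orders"],       # machine-created source
    "staging.refunds": ["raw.support_tickets"],    # human-entered source
}

def trace_to_sources(node: str) -> set[str]:
    """Walk upstream edges until we hit nodes with no parents (the raw sources)."""
    parents = UPSTREAM.get(node, [])
    if not parents:
        return {node}
    sources: set[str] = set()
    for parent in parents:
        sources |= trace_to_sources(parent)
    return sources

print(trace_to_sources("revenue_dashboard"))
# e.g. {'raw.app_db_orders', 'raw.support_tickets'}
```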
Q. So where does data testing fit into the observability stack? We've talked about all of these different capabilities, so tell us about data testing.
A. Just to be clear, data quality and data observability are two different things. Specifically, data quality is a problem: people wake up saying, shoot, I have to fix this data quality problem. Data observability is a technology that can help address that problem, but it isn't a silver bullet. It's similar to software: if any tool says it's going to fix all of your software bugs, it's lying. The same is true of data observability tools.
We can help you build better processes to measure your data quality and to identify, prevent, and resolve issues. But we can't do everything for you. Testing fits into the picture because data quality is one of the core use cases that data observability addresses, and testing is one particularly good way to catch issues.
Q. Do you think there should be a single data observability solution that addresses all of these different issues? Or do you think a purpose-built best-of-breed solution is a better approach for companies?
A. The classic bundling-versus-unbundling question!
The most important thing I care about is that we solve real problems for people. The best way we've seen teams introduce data observability into their stack is to start small and simple. Typically that means introducing a very focused set of features with a well-defined goal, and when it works, expanding from there.
So I don't have much of an opinion on whether you should bring on one all-in-one tool or several best-of-breed tools. What I care about is that you bring on observability correctly and that you focus on very specific problems.
Q. Why is now a good time for companies to invest in a data observability solution and what are the downsides of not doing so?
A. At Metaplane, our customers fall into two buckets. Half of them say, okay, I'm building a data team from scratch, and I want to get ahead of data quality with observability. For the other half, through no fault of theirs, something has happened internally, and now they're reacting to the issue. Now is the best time to bring on a data observability tool, in the same way that now is the right time to bring on a software observability tool.
Ten years ago, you might have waited for your API to go down and your customers to complain before stepping back, holding a postmortem, and saying, okay, we should bring on Datadog. Nowadays, Datadog is one of the first things a backend team installs. The reason is that when an issue occurs, you have maximum historical context to detect and resolve it, and you don't have two problems at once: fixing the issue and bringing on a tool.
Q. Who is the typical user of a data observability solution and who are the beneficiaries?
A. A typical user is a data engineer or an analytics engineer, though it depends on the size of the company. We have some customers with 20-person teams where the first data hire brings on Metaplane, and others that are 10,000-person enterprises where the head of data governance brings us on. The user of Metaplane is usually the person held responsible for data quality issues: the one who gets pinged when a dashboard is delayed, but who also has some power to do something about it. That's a very different question, though, from who benefits from data quality.
Ultimately, data is not created or consumed by the data team. It's created by upstream teams: the engineering team migrating a schema in their transactional database, product and growth teams using Segment to track usage analytics events, a go-to-market (GTM) team entering order forms. And it's used by those same teams. So when data quality improves, the entire company benefits, even though they might not call it a data quality issue. They might say, hey, this number looks wrong, or this dashboard is delayed, which is why it's so important to speak the language of your users when you're on a data team.
Q. Yeah, that resonates a lot because I say this all the time: good data infrastructure should ultimately benefit the entire business, not just make data teams more productive.
So moving on, what are the prerequisites in terms of the data stack for a company to derive value from a data observability solution?
A. If you have a database, you can get value from data observability today. It doesn't have to be a cloud analytics store; it could be Postgres or another transactional database, anything you want greater awareness of, for example when a schema changes, a row count drops, or a constraint is violated. You don't necessarily have to be on a data team, either. If you have a database, you can get value today.
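As a concrete illustration of that kind of awareness, here's a minimal Python sketch that detects schema drift on a Postgres table by comparing information_schema against an expected set of columns. The connection string, table name, and expected columns are all hypothetical, and it assumes psycopg2 is installed.

```python
# Detect schema drift on a hypothetical "orders" table in Postgres.
import psycopg2

EXPECTED_COLUMNS = {("order_id", "integer"), ("amount", "numeric")}

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = %s
        """,
        ("orders",),
    )
    actual = set(cur.fetchall())

added = actual - EXPECTED_COLUMNS
removed = EXPECTED_COLUMNS - actual
if added or removed:
    print(f"schema drift detected: added={added}, removed={removed}")
```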
Q. Last question — what is the one piece of advice you have for companies that are evaluating a data observability solution today?
A. We mentioned it before: start simple. You're busy. Everyone is busy. Data teams are especially busy, because people are knocking down their doors and you're having to grow your team. So data observability shouldn't be a big headache to bring on. You have many other things to do.
So start simple: bring on a tool in, say, 10 minutes, and create some simple tests, like row count, freshness, and schema tests, across a broad swath of tables. If you have the time, dig a little deeper into the most important tables based on usage and lineage, but just see how it works. Sometimes that might be enough.
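For illustration, here's a minimal Python sketch of the row count and freshness tests mentioned above, again against a hypothetical "orders" table with hand-picked thresholds; an observability tool would typically learn thresholds from historical metadata rather than hard-coding them.

```python
# Row count and freshness tests against a hypothetical warehouse table.
# Assumes psycopg2 and that "updated_at" is a timestamptz column.
from datetime import datetime, timedelta, timezone
import psycopg2

MIN_ROWS = 1_000                     # alert if the table shrinks below this
MAX_STALENESS = timedelta(hours=6)   # alert if no row has landed recently

conn = psycopg2.connect("dbname=warehouse user=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Row count test: did the table unexpectedly shrink?
    cur.execute("SELECT count(*) FROM orders")
    (row_count,) = cur.fetchone()
    if row_count < MIN_ROWS:
        print(f"row count test failed: {row_count} < {MIN_ROWS}")

    # Freshness test: when did data last arrive?
    cur.execute("SELECT max(updated_at) FROM orders")
    (last_update,) = cur.fetchone()
    if last_update is None or datetime.now(timezone.utc) - last_update > MAX_STALENESS:
        print(f"freshness test failed: last update at {last_update}")
```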
Sometimes you might want to 80/20 it and go a little deeper: okay, I know that data observability works, so let's dedicate a bit more time to instrumenting more tests or adding more lineage. But don't try to eat a watermelon all in one bite. Take it bite by bite!
🥁🥁
You can also tune in on Spotify or Apple Podcasts.
If you’d like to hear other perspectives, check out the other parts of the collection on Data Observability: