Sep 2 • 17M

Demystifying Data Observability w/ Kyle from Bigeye

In the third installment of the series on data observability, the CEO of Bigeye helps us understand the current landscape of observability solutions.

The Data Beats community aims to beat the gap between data people and non-data people via words, conversations, and beats. On the data_beats show, practitioners and founders of data companies candidly answer hard questions in an attempt to demystify the data landscape for folks working in data-adjacent roles. But it's not all talk, you also get to groove to some kickass beats!

As the former PM of data tooling at Uber, Kyle Kirwan has a lot of good insights to share about the data observability stack and how it's evolving.

I certainly learned some new stuff from Kyle’s answers — especially about the role of lineage in data observability — hope you do too!

Let’s dive in:

Q. What is the simplest definition of data observability?

I think the simplest way to describe it is that you understand what's happening inside your data pipelines. Not what's happening with the pipeline infrastructure, which I think maybe we'll talk about later, but can you understand what's happening inside your pipelines?

Do you understand the state of your data? In particular, for most people: do you understand when something is wrong with your data, and can you pinpoint where and why something might be wrong with it?

Q. Are the terms data observability and data monitoring interchangeable?

Not quite. They're definitely closely related. So the difference is that data observability is a state, which we want to get to, right? So if I say, for example, across all my pipelines, I have data observability, or I have achieved data observability. Then what it means is that I have all of the monitoring in place so that I understand what is happening inside my pipelines.

Monitoring is a step or a tool on the path to data observability. So if you have the ability to observe what's going on inside all your pipelines, monitoring is the interface between you and being able to consume that observability information. So if we have, for example, a whole bunch of metrics and logs about what's going on inside our data pipeline, but we don't have any monitoring in place, it's very difficult for a human being to consume any information and benefit from that observability.

So data observability and data monitoring are not exactly the same thing, they're not quite interchangeable, but they're definitely very closely linked.

There's a fair bit of confusion in the data observability space. Let's change that by defining some common terminology.

Q. What is data infrastructure monitoring?

Yes, so data infrastructure monitoring would be understanding what's happening with the machinery that's processing the data in each step of your pipeline.

For example, is my data warehouse up right now? What is the response time to a given query? Are my jobs in Airflow running? Did a job have a failure? If a job has a repeated failure, that might cause a problem for the data that's flowing through the pipeline, because we're not processing it, right?

So data infrastructure monitoring would tell you: is Airflow working? Are the jobs running? Was there an error? How is this sort of physical machinery working?

If we were to think about an oil pipeline, I know oil's a common analogy for data, we'd be thinking about the pipe itself, motors that are pumping and moving the oil down the pipe, not about the oil, which is what data observability or data monitoring would be focused on.
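To make the distinction concrete, here is a minimal Python sketch of an infrastructure-level check. It looks only at whether jobs ran, never at the data they produced. The `JobRun` record and the `repeated_failures` helper are hypothetical names for illustration, not the API of Airflow or any other scheduler.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One execution of a scheduled job (hypothetical record, not a scheduler API)."""
    job_name: str
    succeeded: bool


def repeated_failures(runs: list[JobRun], threshold: int = 2) -> set[str]:
    """Return jobs whose most recent `threshold` runs all failed.

    `runs` is assumed to be ordered oldest to newest.
    """
    history: dict[str, list[bool]] = {}
    for run in runs:
        history.setdefault(run.job_name, []).append(run.succeeded)
    return {
        name
        for name, results in history.items()
        if len(results) >= threshold and not any(results[-threshold:])
    }


runs = [
    JobRun("load_orders", True),
    JobRun("load_orders", False),
    JobRun("load_orders", False),
    JobRun("load_users", False),
    JobRun("load_users", True),
]
failing = repeated_failures(runs)
print(failing)
```

A check like this can say that `load_orders` keeps failing, but it says nothing about whether the rows that did land are correct. That second question is the data observability side.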

Q. That's a nice analogy. So what is data pipeline monitoring?

So “data pipeline” is a really interesting term. I was just talking to somebody about this recently. I think when most people say a pipeline, they have an idea of what they mean, but it's actually a very vague term, right?

It generally means the series of steps that the data goes through from its origin to wherever it ends up.

So maybe it's going into an analytics dashboard or maybe it's data that's being served by a predictive model. But when we look at an overall system, 'where is the pipeline' is actually a difficult question, because a lot of times we have data from a whole bunch of different places.

It's getting crisscrossed, it's getting merged. And at least in any company that's been around for a while, the graph of your "pipelines" is really, really messy or interconnected, right? And that's a good thing, 'cause you have data coming from a lot of places, going to a lot of places. It's not necessarily bad.

But if we were to talk about monitoring a pipeline, what we'd need to be able to do is trace the flow of data up and down from one specific point of interest.

So if I'm looking at a particular table somewhere in my data environment, if I wanna look at the "pipeline" of data that's related to that table, we need to look at all the upstream dependencies where it comes from, and then we'd also need to look at all the downstream children where that table eventually flows to.

And that might ultimately go into user queries, SQL queries, might go into a reverse ETL tool and out into HubSpot or Salesforce. It might end up in a Tableau dashboard. It might end up in DynamoDB where it's getting served by a model.

So when we talk about "pipeline" monitoring, I think what most people are describing is some form of the ability to isolate a particular trace in their lineage graph and then overlay on that trace various attributes of importance like what is the freshness, and do we have a particular step in that pipeline where that step is not running on time or the data that we see is not as fresh as it should be? Or do we see the volume of rows moving through a pipeline suddenly drop off in an unexpected way?

So when I think pipeline monitoring, I think about all those questions about freshness, volume, row count, nulls, and duplicates, but layered on top of the lineage graph and then isolated to a specific path that is important to the user.
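As a rough illustration of "attributes layered on top of a lineage path", the Python sketch below walks one hand-picked path and flags stale or low-volume steps. The table names, the `TableStats` fields, and both thresholds are assumptions made up for this example, not how Bigeye or any specific tool works.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    hours_since_update: float  # freshness
    row_count: int             # volume seen in the latest load
    expected_rows: int         # typical volume for a load


def check_path(path, stats, max_age_hours=6.0, min_volume_ratio=0.5):
    """Overlay freshness and volume checks on one path through the lineage graph."""
    issues = []
    for table in path:
        s = stats[table]
        if s.hours_since_update > max_age_hours:
            issues.append((table, "stale"))
        if s.row_count < min_volume_ratio * s.expected_rows:
            issues.append((table, "volume drop"))
    return issues


# One path of interest, ordered upstream to downstream (hypothetical tables).
path = ["raw.orders", "staging.orders", "analytics.daily_orders"]
stats = {
    "raw.orders": TableStats(1.0, 10_000, 10_000),
    "staging.orders": TableStats(9.5, 10_000, 10_000),
    "analytics.daily_orders": TableStats(9.0, 30, 100),
}
issues = check_path(path, stats)
for table, problem in issues:
    print(f"{table}: {problem}")
```

Because the path is ordered, the first table that reports a problem (`staging.orders` here) is a natural place to start investigating.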

Q. You mentioned ‘lineage graph’ twice. Can you briefly describe what is data lineage?

Yeah so lineage is probably a word everybody's heard plenty of times at this point, but it is that path from some upstream point. You could trace lineage starting from raw data that lands in the warehouse — obviously, that comes from somewhere else. So we could go further upstream outside of the warehouse and trace it up to a transactional database, could be Kafka topics that are getting loaded into Snowflake. It could be all the way up to what is emitting messages into those Kafka topics. So what are the message producers that are on the other end of that Pub/Sub?

Lineage is the map of where the data is emitted from, where it flows through, what steps it goes through, and then all the way down into some final destination, if you will.

As I mentioned with reverse ETL, that might be pushing it down into some target system like Salesforce, which might in turn also be the other end of the start of the lineage graph. So it may often be cyclical, right?

But it is that map of what are all the different interconnects, what tables get joined with what other tables. Ideally it tells you, "Hey, here the data is messages in a Kafka topic; here the data is rows in a table in the raw layer in Snowflake; here it is data that's been aggregated by a dbt job." It's being able to trace the data through each of those steps.
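One simple way to picture a lineage graph is as a directed graph that can be walked in both directions. The Python sketch below is an assumption-laden toy: the `LineageGraph` class and all node names are invented for illustration, and it includes a reverse ETL edge so that the graph contains a cycle, as described above.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy directed graph of data assets (hypothetical, for illustration only)."""

    def __init__(self):
        self._children = defaultdict(set)
        self._parents = defaultdict(set)

    def add_edge(self, source, target):
        self._children[source].add(target)
        self._parents[target].add(source)

    def _walk(self, start, neighbors):
        # Breadth-first walk; the `seen` set keeps cycles from looping forever.
        # The start node itself is never included in the result.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen and nxt != start:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, node):
        return self._walk(node, self._parents)

    def downstream(self, node):
        return self._walk(node, self._children)


graph = LineageGraph()
graph.add_edge("kafka.orders_topic", "raw.orders")
graph.add_edge("raw.orders", "analytics.daily_orders")
graph.add_edge("analytics.daily_orders", "salesforce.accounts")  # reverse ETL
graph.add_edge("salesforce.accounts", "raw.accounts")            # loaded back in
graph.add_edge("raw.accounts", "analytics.daily_orders")         # closes the cycle

print(sorted(graph.upstream("analytics.daily_orders")))
print(sorted(graph.downstream("kafka.orders_topic")))
```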

Q. And where does lineage fit into the observability stack? Is it like an inbuilt feature of observability tools or do companies use an external tool for data lineage to view the graph?

I was the PM of the data operations tooling team at Uber, and back there lineage was not a product, or it wasn't really part of data observability in an explicit way, it was a metadata service that would collect and crawl the lineage graph by parsing queries.

And then it would expose that graph via API to other products and services that needed to consume the lineage graph. So the data catalog could consume the lineage graph to display the relationships when you were looking at a table in the catalog. The data quality system could look up what the upstream dependencies were.

So if we flagged a problem on a particular table, we could say, "There is a problem in this table, but there are also problems in the parent table." So the problem may actually reside further upstream than the table you're looking at. In that case, the lineage was not necessarily part of the catalog, it wasn't part of quality, it was this distinct metadata service with no interface on top, no UI or anything, but then other products could consume the graph out of it.
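The "problem may reside further upstream" lookup can be sketched in a few lines of Python. This is not Uber's service; the `parents` mapping, the table names, and the `likely_origins` helper are hypothetical, and the sketch assumes the lineage graph has no cycles.

```python
def ancestors(table, parents):
    """All tables upstream of `table`, given a child -> list-of-parents mapping."""
    seen = set()
    stack = list(parents.get(table, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def likely_origins(table, parents, flagged):
    """Flagged tables at or above `table` that have no flagged ancestor themselves.

    Assumes an acyclic graph: in a cycle, a table would count as its own ancestor.
    """
    candidates = (ancestors(table, parents) | {table}) & flagged
    return {t for t in candidates if not (ancestors(t, parents) & flagged)}


# Hypothetical lineage, as a quality system might read it from a metadata service.
parents = {
    "analytics.daily_orders": ["staging.orders", "staging.users"],
    "staging.orders": ["raw.orders"],
    "staging.users": ["raw.users"],
}
flagged = {"analytics.daily_orders", "staging.orders", "raw.orders"}

origins = likely_origins("analytics.daily_orders", parents, flagged)
print(origins)
```

Here the dashboard table is flagged, but so are its parent and grandparent, so the sketch points at `raw.orders` as the place to look first.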

What's interesting in the modern data stack is each individual product sort of needs some form of lineage built into it.

So someone may have lineage being collected from their catalog, and then they may also be using Bigeye, for example, for their observability, and they may be getting a second lineage graph from that tool.

So I think there's actually an interesting challenge here where lineage could be part of observability. It could definitely be part of your catalog as well. And if you're using multiple tools that each include lineage, you're gonna have to figure out how to reconcile those.

So there are some observability tools that also have some lightweight cataloging features.

Q. Do you think eventually there will be a single data quality solution that caters to all of these different use cases, or do you think a best-of-breed purpose-built solution is the way to go?

I think that a catalog in its most basic form is kind of a design pattern that you're probably gonna end up finding in many different tools in this data operations space, right? What tables do I have? What schemas are they in? What columns are there? What are their types? These are common challenges for any interface where you need to understand what's going on inside a data environment.

Now a fully purpose-built catalog with documentation with, for example, user comments or things like that, that is a product in and of itself. But the catalog, for example, we have a "catalog" in Bigeye, but it is really not designed to be a governance tool or a discovery tool. It is a catalog insofar as it assists with navigation and understanding of what's going on inside the system.

To your question about best-of-breed, I'm a big believer in the Unix toolchain approach to things, which is that I've got a bunch of different pieces, each one has a specific function, but they interconnect so I can arrange them as needed for a particular workflow or for a particular environment.

I would argue that a catalog, an observability system, lineage as an underlying metadata service, access controls, et cetera really ought to be distinct components.

Now, whether that means that those components must come from different vendors or not, I think is a different question. I think in an ideal world, there's a vendor who you can do business with, and that vendor provides each of these components and you can pick and choose the ones that make sense for your particular situation and combine them. That way, you're not managing 15 different SaaS vendor contracts. I think that would be the ideal scenario. But to need to procure one giant heavyweight tool that tries to do all of it, I think is also not something that a lot of teams are interested in.

Q. What about data testing? Where does that fit in?

Testing is actually where we started at Uber, right? So we said, "Hey, something goes wrong. Some person sees data that's clearly not correct in an analytics dashboard, or the data is just missing from the dashboard, what do we do about it?"

The first place we went was a test harness. And so the idea was that the data engineer or a data scientist can write some conditions about the data — must have the same number of rows as the parent table, it needs to be reloaded every six hours, things like that. We'd write these bodies of rules and then we'd run those on a schedule, and you would get a pass/fail.
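A test harness of this kind can be sketched in Python as a list of named rules that each return true or false. The two rules mirror the examples above; the table metadata, the fixed clock, and the `run_tests` helper are assumptions for illustration, and in practice a scheduler would call the harness on an interval.

```python
from datetime import datetime, timedelta

# Fixed clock so the example is reproducible; a real harness would use the current time.
NOW = datetime(2022, 9, 2, 12, 0)

# Hypothetical metadata about a table and its parent.
parent = {"row_count": 1000, "last_loaded": NOW - timedelta(hours=1)}
child = {"row_count": 1000, "last_loaded": NOW - timedelta(hours=8)}

# Each rule is a (name, zero-argument check) pair written by an engineer up front.
rules = [
    ("same row count as parent",
     lambda: child["row_count"] == parent["row_count"]),
    ("reloaded within 6 hours",
     lambda: NOW - child["last_loaded"] <= timedelta(hours=6)),
]


def run_tests(rules):
    """Evaluate every rule and report pass/fail by rule name."""
    results = {}
    for name, check in rules:
        results[name] = "pass" if check() else "fail"
    return results


results = run_tests(rules)
for name, outcome in results.items():
    print(f"{outcome}: {name}")
```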

Now, I think that those are very powerful, especially when you wanna do certain things like, if I made a mistake and now I have some explosion in the number of rows in a table — because I did a full join or a cross join, or whatever it is, in those conditions, where now I've got a ton of duplicate primary keys — stop the pipeline.

I know that that is a bad condition, it's a condition I can anticipate upfront. And I know I wanna put a very hard rule in place that tells the system what I want to happen.

That's a great place to use a test, or as a warning during development for things that you can anticipate and know would go wrong. So tests are useful, and I recommend them to practically every company that Bigeye works with.
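The "duplicate primary keys, stop the pipeline" rule might look like the Python sketch below. It uses an in-memory SQLite database as a stand-in for a warehouse; the tables, the `PipelineHalted` exception, and the `assert_unique_key` helper are all hypothetical.

```python
import sqlite3


class PipelineHalted(Exception):
    """Raised to stop downstream steps when a hard rule is violated."""


def assert_unique_key(conn, table, key):
    # Table and column names are interpolated directly into the SQL,
    # so they must come from trusted code, never from user input.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    if total != distinct:
        raise PipelineHalted(f"{table}.{key}: {total - distinct} duplicate rows")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.execute("CREATE TABLE regions (region TEXT)")
conn.executemany("INSERT INTO regions VALUES (?)", [("us",), ("eu",)])

# An accidental cross join repeats every order once per region.
conn.execute(
    "CREATE TABLE orders_enriched AS "
    "SELECT o.*, r.region FROM orders o CROSS JOIN regions r"
)

assert_unique_key(conn, "orders", "order_id")  # clean table: no exception

try:
    assert_unique_key(conn, "orders_enriched", "order_id")
    halted = False
except PipelineHalted as exc:
    halted = True
    print(f"pipeline stopped: {exc}")
```

Raising an exception is one way to express "stop the pipeline": an orchestrator that treats an uncaught error as a failed task will not run the steps that depend on it.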

Now, where we ran into a challenge with tests, and where I think a lot of other folks are running into it or will, is that you can't write a rule proactively for every possible thing that's gonna go wrong in most of the data environments that most people find themselves working in, right?

I've run into tables that are 850 columns wide. I've run into environments where there are 10,000 plus tables in Snowflake. It's just not practical to ask a data engineering team to sit down and think about every single rule that they would wanna construct, right? It's not time efficient. It's not what people wanna do.

So observability helps with that long tail. If we just harvest metrics from every single column in every single table, and if we know the relationships between the tables, etc., and we can just crawl all this information with a machine, then we can do signal processing on that and identify things that look interesting.

I think that some of those conditions might then be good candidates to put a rule or a test in place for, but it allows you to have this sort of long tail safety blanket or dragnet, whatever you wanna call it, for all those conditions that you would not think to test for, and which really aren't efficient to have human beings spend their time trying to anticipate.
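The interview does not say which signal processing techniques are used, so the Python sketch below swaps in a plain z-score as a stand-in: a metric is flagged when its latest value sits far from its own recent history. The metric names, the sample values, and the threshold of 3 are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Metrics harvested automatically per table and column: (history, latest value).
metrics = {
    ("orders", "row_count"): ([1000, 1020, 990, 1010, 1005], 400),
    ("orders", "null_rate_email"): ([0.010, 0.012, 0.011, 0.009, 0.010], 0.011),
}

flagged = [
    key for key, (history, latest) in metrics.items()
    if is_anomalous(history, latest)
]
print(flagged)
```

Nobody wrote a rule saying "row count must be near 1,000". The check is the same for every metric, which is what makes it practical across thousands of tables.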

Q. Do data observability tools also monitor the data at rest — data that is stored in the warehouse?

Yeah. I think that the majority of the tools today primarily function on data at rest, right? So as opposed to, "Hey, let's validate the data that's in a data frame while it's being processed, before we write the results from a data frame back into the warehouse." That, at least as far as I can tell, is a very uncommon, if not completely absent, technique.

Most of the tools that you'll find in the market today, Bigeye included, query the data at rest. Now, that could be in a staging layer before it gets promoted into production, but it is still materialized at some point and stored inside the warehouse.

And so what that means is that we can query it. And from that query, we can produce those aggregated statistics, and then we can do our anomaly detection on top of that.
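Producing aggregated statistics from data at rest usually comes down to one aggregate query per table or column. The Python sketch below uses an in-memory SQLite table as a stand-in for a warehouse; the `users` table and the `profile_column` helper are hypothetical, and real tools will collect their own set of metrics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (3, "c@example.com")],
)


def profile_column(conn, table, column):
    """Return row count, null count, and distinct count for one column.

    Identifiers are interpolated into the SQL, so they must come from trusted code.
    """
    row_count, null_count, distinct_count = conn.execute(
        f"""
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), 0),
               COUNT(DISTINCT {column})
        FROM {table}
        """
    ).fetchone()
    return {
        "row_count": row_count,
        "null_count": null_count,
        "distinct_count": distinct_count,
    }


email_stats = profile_column(conn, "users", "email")
id_stats = profile_column(conn, "users", "id")
print(email_stats)
print(id_stats)
```

Stored over time, numbers like these become the history that anomaly detection runs against.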

Now, if you are able to speak to a streaming source, for example, if you have a Kafka consumer, you could potentially do reads and things like that and do this anomaly detection on the Kafka stream before the data lands in the warehouse. So that's definitely something I think a lot of folks are interested in. But most tools today do query the data at rest.
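For the streaming case, the same idea can be applied to messages as they arrive instead of to a stored table. The Python sketch below is only a shape, not a Kafka integration: a plain list stands in for whatever a consumer would yield, and the `monitor_stream` helper, window size, and null-rate limit are all assumptions.

```python
from collections import deque


def monitor_stream(messages, field, window=5, max_null_rate=0.4):
    """Yield (index, null_rate) whenever the rolling null rate exceeds the limit."""
    recent = deque(maxlen=window)  # True where `field` was missing or null
    for index, message in enumerate(messages):
        recent.append(message.get(field) is None)
        if len(recent) == window:
            null_rate = sum(recent) / window
            if null_rate > max_null_rate:
                yield index, null_rate


# Stand-in for messages read from a stream before they land in the warehouse.
messages = [
    {"user_id": 1},
    {"user_id": 2},
    {"user_id": None},
    {"user_id": 3},
    {"user_id": None},
    {"user_id": None},
    {},
]
alerts = list(monitor_stream(messages, "user_id"))
for index, rate in alerts:
    print(f"message {index}: null rate {rate:.0%} over the last 5 messages")
```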

Q. Last question — what’s the one piece of advice you have for companies looking to get started with data observability?

I think the main thing is that there's no magic wand in pretty much anything in data. A great tool can do a lot. It can create a lot of leverage. It can make it easier for people to work together. It can automate a ton of manual tasks, right?

So these are the things that we work on building every day at Bigeye. What it can't do is the organizational process bits. So in particular, I think what's super important, to my comment just a minute ago about the impact to the business, is that the data team, or whoever it is that's thinking about observability, needs to understand where data sits in the health of the business.

Are we talking about analytics dashboards for internal stakeholders, execs, VPs, whoever? Are we talking about data that's flowing to operational processes that are used by the sales team, for example? Or are we talking about data that's used in an actual model in production 24x7 that's in the app, which is customer facing?

It's about understanding where data is being used in ways that are valuable to the business, but could also be a risk if they're broken.

Having an inventory or a deep understanding of those is the most important step in being successful with applying data observability because the whole point is to make those applications reliable so that the business can use data in these high-leverage scenarios without worrying that it's gonna break.


You can also tune in on Spotify or Apple Podcasts.

Thanks for reading — let’s beat the gap! 🤝
