Building and Using Data Infrastructure w/ David from Metaplane
Building data infrastructure is one thing — and a fun thing for those building it — but getting teams across an org to use and derive value from data is an arduous journey.
David Jayatillake has a ton of experience building data infra as well as figuring out how to get folks to use data in their day-to-day. In this episode, David answers some fundamental questions like:
What are the core components of a well-executed data infrastructure?
What are the prerequisites in terms of the tech stack to set up a basic data infrastructure?
How do data-adjacent teams like Product and Growth make use of and derive value from good data infra?
And he also offers some advice for companies getting started on their data journey.
🤔 Have questions?
Let’s dive in:
Q. Please tell us what it means to build data infrastructure in the context of the modern data landscape.
I think there's a typical stack that's understood as infrastructure: ELT, a data warehouse, and a BI tool on top. And now, increasingly, there are additional non-core pieces like reverse ETL, observability, CDPs, and streaming tools that are being added to this infrastructure.
Q. So what are the core components of a well-executed data infrastructure?
I think what makes it well-executed is the things that ensure quality and reliability, especially around the development process. Being able to use version control and CI/CD in your development process is really going to enable what I think most people would consider well-executed.
And that's where frameworks like dbt come in, which have enabled that kind of development on data pipelines flowing into the data warehouse and on the data warehouse itself. We're looking for more tools like that to spread further out, and for dbt to take a bigger footprint as well, to push that quality outwards.
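As an editorial illustration of the kind of quality check that CI/CD can run on warehouse tables, here is a minimal sketch in Python. It mimics the idea behind dbt's `not_null` and `unique` column tests but is not dbt itself: the `failing_rows` and `run_checks` helpers, the `orders` table, and the use of an in-memory SQLite database as a stand-in for a warehouse are all assumptions made for the example.

```python
import sqlite3


def failing_rows(conn, table, column, check):
    """Count violations of a simple column check (hypothetical helper).

    Table and column names are interpolated directly, so they must come
    from trusted project config, never from user input.
    """
    if check == "not_null":
        sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    elif check == "unique":
        # Counts the distinct non-null values that appear more than once.
        sql = (
            f"SELECT COUNT(*) FROM ("
            f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
            f"GROUP BY {column} HAVING COUNT(*) > 1)"
        )
    else:
        raise ValueError(f"unsupported check: {check}")
    return conn.execute(sql).fetchone()[0]


def run_checks(conn, checks):
    """Run every (table, column, check) triple and return failure counts."""
    return {c: failing_rows(conn, *c) for c in checks}


# Stand-in "warehouse" with deliberately messy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@x"), (2, "b@x"), (2, "c@x"), (None, "d@x")],
)

results = run_checks(
    conn,
    [
        ("orders", "order_id", "not_null"),
        ("orders", "order_id", "unique"),
        ("orders", "email", "unique"),
    ],
)
# A CI job would fail the build if any count is non-zero.
ci_passed = all(count == 0 for count in results.values())
```

In a real project these checks would more likely be declared in the transformation framework's own config and run on every pull request; the point of the sketch is only that each check reduces to a query whose result should be zero.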
Q. If we talk about building a basic data infrastructure — a minimum viable data stack — what would be the two or three tools that would comprise a minimum viable data stack?
Sure, so it depends on your context. For some companies, if you have to work with a lot of third-party tools, you definitely need an ELT tool like Fivetran, Airbyte, or Gravity Data. You need one of those kinds of tools to help you get data from those third-party systems into your data warehouse. Obviously then, you need a data warehouse.
I personally think that for a minimum viable data stack you want a data warehouse that's quite easy to use, and that scales without needing much thought and planning. You don't want to need a DBA. So I think Snowflake and BigQuery are the two easiest to use. BigQuery is possibly even easier for a smaller startup. It's basically "use it and forget about it," with nothing to do.
Q. Why has there been an explosion of data infrastructure tooling over the last couple of years?
I think it's because of a hangover from the big data era. In the big data era, in order to do any amount of data engineering, you'd have to hire a huge number of very expensive people. I've been at a company where the analytics stack was on SQL Server when I joined. They planned to do a big migration to Hadoop on Hortonworks, and it took years. They hired a data engineering team of 50 people, paying them a huge amount of money, and it actually failed. They didn't even succeed in this data project.
So what we've realized, and what venture capitalists have realized, is that the data engineering space is ripe for automation and for SaaS tooling, and that's more or less been achieved now. Especially from a batch point of view, you've got big companies out there now, like Fivetran and Airbyte, who've built most of the connectors you'd need in the space. And you've even got some VCs who identify as Hadoop refugees. They know about the pain from that era, and that's why they've invested in removing some of that pain.
Q. Building infrastructure is one thing and probably a fun thing for those building it, but how do organizations get various teams to actually use the infrastructure and benefit from it?
This is about communication and access. In the past, because of the inelasticity of how compute scaled on those systems, access to them was closely guarded. Only data teams would have access to data warehouses, BI tools, and things like that. With the proliferation of cloud and cloud data warehouses, that access got pushed outwards, and at the last company I worked at, Lyst, everyone had access to Looker.
And Looker, which was sitting on Snowflake, never had any capacity issues. So that's how this data infra becomes accessible to everyone, not just those building it, but the whole organization. I think that separation of compute from storage was a really key piece of how that became possible.
Q. Do you have any specific thoughts on how data-adjacent teams like product teams and growth teams can actually derive value from a good data infrastructure?
Yeah, and I think this is really interesting because those teams are often what I call data domain owners. If you've got a product team that's building a feature, let's say some kind of feed of products on an e-commerce site, they're generating potentially many data points from their product feature, and then every time they iterate on it, it generates more, often of a different kind.
So actually planning how that tracking is done, and planning to do it well to avoid future analytics engineering work, or work that's not even possible, means they are almost self-enabled. If they make good decisions when they're planning their engineering work, test it well, and make sure the tracking is good, that enables them to actually get value from the data.
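To make the idea of planned tracking concrete, here is a small editorial sketch in Python of validating events against a tracking plan before they are sent on. The plan contents, the event and property names, and the `validate_event` helper are all hypothetical; dedicated tracking-plan tools do this far more thoroughly.

```python
# Hypothetical tracking plan: event name -> required properties and types.
TRACKING_PLAN = {
    "product_feed_viewed": {"user_id": str, "feed_id": str},
    "product_feed_item_clicked": {"user_id": str, "feed_id": str, "position": int},
}


def validate_event(name, properties, plan=TRACKING_PLAN):
    """Return a list of problems with an event; an empty list means it conforms."""
    schema = plan.get(name)
    if schema is None:
        return [f"unknown event: {name}"]

    errors = []
    for prop, expected_type in schema.items():
        if prop not in properties:
            errors.append(f"missing property: {prop}")
        elif not isinstance(properties[prop], expected_type):
            errors.append(
                f"wrong type for {prop}: expected {expected_type.__name__}"
            )
    # Properties nobody planned for tend to become unusable data later.
    for prop in properties:
        if prop not in schema:
            errors.append(f"unplanned property: {prop}")
    return errors


good = validate_event(
    "product_feed_item_clicked",
    {"user_id": "u1", "feed_id": "home", "position": 3},
)
bad = validate_event(
    "product_feed_item_clicked",
    {"user_id": "u1", "position": "3", "colour": "red"},
)
unknown = validate_event("feed_scrolled", {"user_id": "u1"})
```

Running a check like this in the feature's test suite is one way a product team could catch tracking regressions at the same time as they iterate on the feature.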
Q. How do you suggest data teams find a balance between building infrastructure and supporting the day-to-day needs of the organization?
Yeah, this is something I've had to manage, and sometimes you can do it structurally. I ran a hub-and-spoke model. I had a central team of analytics engineers and a handful of analysts, and then we had distributed analysts and analytics engineers as well.
The central team would often do the infrastructure work, building new ways of doing things or core pieces of the stack that other teams could then build on top of. And then the day-to-day needs were often met by the team members on the spokes, who sat with the commercial parts of the organization that needed support.
Now that helps, but apart from the structural side, you also need to think about how to do this as a work philosophy. We believe in dealing with tech debt, and we believe in spending time on infrastructure to make it good and to be a force multiplier for future execution. That has got to be in there as well.
Q. There are folks who believe that the modern data stack is a fad, and that organizations should refrain from stitching together half a dozen tools. What are your thoughts here?
The thing is, I come from a background of having done that stitching and got value from it. With a handful of people and that stitched-together stack, I've managed to deliver more than I've seen big data era or previous-era teams able to achieve. So the modern data stack is not a fad.
Could it be more interoperable? Yes, absolutely. Could there be some bundling? Sure. Half a dozen tools sounds like a lot, but I don't think it's that much. But you're now seeing fragmentation of up to a dozen tools, and I think, yes, we're getting to a point where who wants to manage that many vendors? And I've seen new startups whose pure focus is stitching the tools together so that any customer has one Okta entry point and has access to, and a choice of, many of these tools, which get automatically stitched together. So you can see that there is a need, but I do believe that the modern data stack is valuable.
Q. What are the biggest pros of infrastructure comprising best-of-breed purpose-built tools over an all-in-one does-it-all solution?
If you think about the most typical all-in-one solution, you'd probably think of Microsoft Power BI with a Synapse and SQL Server type of setup. Now, the interoperability on that system is very good. That's one thing that users like about it. The vendor management is very easy. You get all of your tools bundled together at one price, and all of your organization has access to it as part of their typical licensing. So in terms of cost and managing the tools, it's easier. Those are the pros of that all-in-one stack.
Cons are, fundamentally, they're not focused on any specific area of that tooling. And you've seen this, how Snowflake's beaten Redshift. They've just invested the time and the thought to just focus on this one thing and do it extremely well, and they're reaping the rewards of that. And you've seen Fivetran do the same in ELT. Microsoft's ELT is not as easy to use, and not as comprehensive.
And so what you end up with, when you have that all-in-one solution, is that yes, it's cheaper and it's stuck together nicely, but no single piece of it is the very best in its class, whereas when you have the best-in-breed solutions, you get things that are really good.
So how do you get the best-in-breed to work together? That's interoperability, and I think with frameworks like dbt in existence, those tools can interoperate better than before. I think in the next couple of years you'll see interoperability as good as in a bundled, all-in-one stack.
Q. Last question — what is the one piece of advice you have for companies that are just getting started on the data journey?
I think, especially if they're a B2C company, get your tracking right. Just start with your tracking. Put in a CDP; there are many cheap or open-source ones out there, like Rudderstack and Snowplow. Put that in right away, put something in for governance, like Avo, and then get that piping to a data warehouse. Don't worry about what you have to do after that; that's fine.
Don't worry about making some complex data-lake-to-data-warehouse stack. Just get it from your app to your data warehouse reliably, consistently, and completely. That's just a fantastic starting point. And then even if you don't have the people or the time to analyze it or get value from it, when you do finally get around to it, you've got this wealth of free, good data to use later.
So I think that's one piece of advice I'd give to a company starting on their data journey. And also, if you are struggling, maybe it's worth getting consultants in to help you. If you don't feel like you've got the time or the focus to build a data team or to set things up properly, but you know you need to, spend the equivalent money on consultants who will just get it done for you. You'd probably be surprised at how much you'd get for your money, given how much it costs to build an internal team and how long it takes to scale one.