This is Part 3 of the series titled Deconstructing Data Models
{{line}}
Analytical data modeling is the process of defining what data needs to be made available in what shape for analysis or activation purposes.
But what does this process look like in practice?
Before I answer that, let me specify in the simplest possible terms what an analytical data model is:
An analytical data model represents a set of rules, that once applied to one or more tables in a database, creates a modified view of an existing table or a new table altogether.
In other words, an analytical data model can be used repeatedly to transform raw data that is often messy into clean, usable data.
Moreover, the purpose of an analytical data model is to transform data without modifying the source data. When a model is executed, the resulting table – whether a view (a virtual table) or a materialized view (a new table) – is easier for humans to read and analyze, as well as for machines to interpret and process.
Building models
To create an analytical data model, one has to define a set of rules on top of some available data. This is typically done using SQL which is easy to read and write and is the most widely used language for data modeling.
I recommend learning the basics of SQL if you’re interested in working with data (you wouldn’t be reading this if you weren’t). Knowing some SQL helps understand what happens under the hood in a database, enabling better conversations with data folks.
An SQL query references tables in a database and when it is run – either manually or as per a schedule – it looks for rows of data in the referenced tables, executes the rules, and if there are no errors, outputs a table with the expected data. The output can be analyzed directly, visualized using a BI tool, or synced to activation tools.
Queries vs. Models
An analytical data model is a SQL query but a SQL query doesn’t have to be a data model.
Going back to the example in Part 1 of this series, the hypothetical SaaS company serves 10,000 users across 1000 workspaces out of which 200 are paid. To present data about customers, you need to write a simple SQL query that checks for workspaces that are on one of the paid plans at a given point in time. Assuming that the Workspace table contains a boolean column called is_paid, the query would look like this:
Select * FROM Workspace WHERE is_paid = TRUE;
In SQL, the asterisk represents “all” – the rest of the query is self-explanatory.
The query above can be saved as a data model and can either be executed each time against the original table (Workspace) or a new table called Customer can be created once, which only updates when the model is executed again. The key difference is that the new table (the materialized view) uses additional storage but is also faster to load since the model only needs to be executed if there’s an expected change in the data set. If you only wish to view all the rows in the Customer table, you need not execute the model – you just need to run the following query:
Select * FROM Customer;
It’s also worth highlighting that even though SQL is often associated with data transformation or data modeling, a query doesn’t always transform data – it can simply fetch an entire table (like the one above) or specific columns in a table. It doesn’t make sense to refer to such a query as a data model as it simply fetches data – it doesn’t model it for repeated use.
Now let’s say you want to take a look at data about customers who are on the Pro plan. Assuming that the Customer table contains a column called plan_name (whose data type is enum or string), the query would look like this:
Select * FROM Customer WHERE plan_name = ‘Pro’;
This query is the equivalent of applying a filter in a spreadsheet or a BI tool or creating a simple segment in an email marketing tool.
Whether it makes sense to save a query as a model or not is a matter of personal preference – there’s no right or wrong answer here. The whole idea behind creating data models is to save time by eliminating repetitive work. That said, saving every query as a data model can lead to model bloat which is a painful thing to manage. Remember, more models, more problems.
Managing models
That brings us to the importance of managing analytical data models.
Until a few years ago, a system to manage data models didn’t exist because there wasn’t a need for one. Data modeling was done before data was loaded into a data warehouse – a practice that led to the creation of fewer models and a lower volume of data being stored.
Why?
Because storing and processing large amounts of data for analytical purposes was expensive.
Instead of jumping straight into a history lesson here, I figured it might be more fun to first share a couple of verses from a song I wrote called The Modern Data Track – here you go:
The separation of Compute and Storage in the cloud,
Made warehousing cheaper, more accessible, much faster.
Queries take seconds to process in parallel,
Pay as you go, storage is infinitely scalable.
ETL became ELT
Because Extracting and Loading without Transforming is easy,
Once in the warehouse, whenever the need be,
Data can be transformed by writing a SQL query.
A brief history lesson
Historically, analytical data modeling was performed before data was loaded into an analytical database for reporting purposes (data activation was less common and was largely disconnected from analytical workloads).
Back then, the ETL paradigm (Extract-Transform-Load) was the norm where data was extracted from multiple sources (third-party tools and production databases), transformed or modeled to serve predefined reporting use cases, and only then loaded into an OLAP database (more on that in a bit).
However, the commodification of the cloud over the last decade coupled with a modern architecture has enabled cloud providers to charge separately for data storage and compute – the cost of processing a query and returning an output. As a result, storing large amounts of data in OLAP databases has become cheap, leading to an increased appetite for data inside organizations.
This shift has led to ETL, the old paradigm, being replaced by the new paradigm, ELT (Extract-Load-Transform) under which data is extracted and loaded into a warehouse first, and only later, transformed or modeled to serve reporting and activation use cases – as and when they came up.
ELT is the norm
Since data no longer has to be modeled in transit, when use cases arise, data analysts can quickly get to building models as long as the data they need has been loaded in the data warehouse. Even if it isn’t, data engineers can quickly extract and load the requisite data in the warehouse using readymade integrations offered by ELT tools like Fivetran or Airbyte.
However, there’s a downside of this approach that I’ve talked about earlier – contextless collection where data is extracted from various sources and stored in the warehouse without any predefined context or use case. This in turn leads to more data being stored and more models being created – hence the need for a system to manage data models.
The good news is that today, there are a plethora of tools available to store, manage, and execute large numbers of data models. Started as an open-source project, dbt is a popular model repository, along with others like LookML, created by Looker. Many other BI tools also offer similar capabilities as part of their offering.
Besides helping manage models, these tools also store metadata about data models such as description, dependencies, and execution history. Lastly, model repositories help version-control data models – one of the most useful capabilities because as a business evolves, the data models powering analytics and activation workflows too continue to evolve.
Additional notes
It’s useful to keep in mind that you can build data models using visual (no-code) data transformation tools which essentially generate SQL queries based on your input. These tools were created for reporting use cases and have existed for a while, enabling one to wrangle and transform data by specifying predetermined rules in a visual interface.
One such tool was Trifacta (where I worked briefly) which was heavily used by analysts to transform and clean large data sets, without the need to write SQL. It worked well but was ultimately acquired by its biggest competitor, Alteryx. However, visual modeling tools like Alteryx have struggled to keep pace with the innovation in the data space and are only becoming less relevant with the rise of Generative AI which has made it even easier to generate SQL and create data models using natural language. I expect to see a lot more innovation in this area in the coming years.
Before proceeding, I’d like to touch upon something good to know without going into technicalities or comparisons: SQL is not the only language to create data models; Python is often used and is better suited for more complex use cases. In fact, it’s becoming common for data analysts to use Python alongside SQL, and in some cases, a relatively new language called Rust.
First OLTP, then OLAP
Before concluding this series, I’d like to offer a quick overview of the two main types of databases that data-forward companies rely on.
An OLAP database is purpose-built for analytical workloads, whereas an OLTP database is purpose-built for transactional workloads.
OLAP stands for online analytical processing; a database of this type is commonly referred to as a data warehouse (or data lake, data lakehouse, data cloud, and so on). Examples include Snowflake, BigQuery, Redshift, and ClickHouse.
OLTP stands for online transaction processing; a database of this type is meant to process a large number of transactions that take place concurrently in an application – transactions like creating accounts, running workflows, placing orders, making bookings, sending messages, transferring money, and every other action that involves interacting with a transactional database.
OLTP databases can be either relational like MySQL and PostgreSQL or non-relational like MongoDB and Apache Cassandra.
Closing notes
You need not remember these terms but understanding the difference between them certainly helps understand the difference between application data modeling and analytical data modeling.
Now that you know what OLTP and OLAP databases do, it should make more sense why data is replicated from a source – internal or external – into an analytical database.
Finally and most importantly, it should now be clear that the purpose of analytical data models is to power workflows like:
- The analysis of data to derive and present insights to fuel decision-making and action-taking
- The activation of data to drive experimentation to increase usage (of software or services), purchase frequency, or content consumption
- The creation of machine learning algorithms for recommendation systems, fraud protection, credit approval, identity resolution, and so on.
I hope I’ve managed to deconstruct data modeling in a digestible manner. In the next series, I will discuss the role of analytical data modeling in designing metrics.
Get Yourself an Upgrade!
- A calm, member-only Slack community
- Jam sessions via Zoom
- Expert practitioners who love helping other learn