A Complete Guide to Behavioral Data Collection
How to collect event data or behavioral data from first-party and third-party data sources
As some of you might know, Data Beats is an evolution of what started as Data-led Academy, built to enable folks working in tech, irrespective of their role, to gain fundamental knowledge about data technologies and processes.
I’m not a data engineer, analyst, or scientist, and I strongly believe that you don’t need a background in data to understand the data lifecycle or to improve the outcomes of your everyday efforts using data.
Society benefits from complexity; in reality though, everything is simple.
Understanding what data is collected, and how, is a prerequisite for anyone who works with data in their day-to-day.
However, with a plethora of tooling options and the complexities of adhering to privacy regulations, companies need to be very intentional about what data they collect, and how they make that data available in the tools used by various teams.
Note: How companies collect and use different types of customer data and what privacy controls they offer end-users will be a key topic for 2023 on Data Beats.
Keeping that in mind, I wanted to share this guide (originally published on the Airbyte blog) that offers an in-depth overview of the technologies available to collect behavioral data from primary and secondary data sources.
Ready to take notes? Cool, let’s dive in!
So your company is launching a new product and you’ve been tasked with setting up the behavioral data infrastructure? Or maybe you need to revamp the existing setup using modern tools?
There are a few different technologies (CDI, CDP, ELT/ETL) that can be used to collect behavioral data, and at the same time, there are several tools with capabilities that span multiple technologies.
Navigating this maze and making an informed decision is daunting and time-consuming — this guide aims to change that.
Before going into tools and technologies though, I’d like to shed some light on why collecting behavioral data is important and where exactly behavioral data comes from.
Why collect behavioral data?
Behavioral data is collected when users perform actions or events while interacting with a product.
Behavioral data, also referred to as event data or product-usage data, serves two main purposes for teams — understanding how the product is being used or not used (user behavior) and building personalized customer experiences across various touchpoints to influence user behavior.
Understanding product usage requires prior instrumentation of the features whose usage you’d like to measure — tracking the events a user performs and sending those events to third-party tools for analysis. Additionally, events also help trigger campaigns and experiences via downstream activation tools.
Launching new features without instrumenting them beforehand is a classic mistake — it takes away the opportunity to analyze how those features are used (if at all) and to trigger in-app experiences or messages when relevant events take place (or don’t).
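To make instrumentation a little more concrete, here is a minimal sketch of what a single tracked event might look like. The field names (`type`, `message_id`, `user_id`, `event`, `properties`, `timestamp`) and the `build_track_event` helper are my own assumptions for illustration; every data collection tool defines its own schema, so treat this as a shape to compare against your vendor's docs rather than a spec.

```python
from datetime import datetime, timezone
import uuid


def build_track_event(user_id, event_name, properties=None):
    """Return a dict describing one user action (hypothetical schema)."""
    if not event_name:
        raise ValueError("event_name is required")
    return {
        "type": "track",
        "message_id": str(uuid.uuid4()),  # unique ID so duplicates can be dropped downstream
        "user_id": user_id,
        "event": event_name,
        "properties": dict(properties or {}),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Instrumenting a hypothetical "export report" feature.
event = build_track_event("user_123", "Report Exported", {"format": "csv"})
```

The event name says what happened, and the properties carry the context you will want later for analysis or for triggering a campaign.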
Where does behavioral data come from?
Although the events I’m referring to take place within your product, the actual source of behavioral data can be an external tool or service that’s embedded within your product.
For the love of simplicity, I like to categorize behavioral data sources as primary and secondary.
Primary data sources
Your core product — web app, mobile apps, a smart device, or a combination — powered by proprietary code is a primary or first-party behavioral data source.
If your product is built using no-code tools, you won’t have a primary source for your behavioral data — you’d rely on the no-code tools to make behavioral data available to you (either via webhooks or integrations with data collection tools).
To collect data from your primary sources, you can use the client and server-side SDKs or the APIs provided by data collection tools.
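If you go the API route from your server, the call usually boils down to an authenticated HTTP POST with a JSON body. The sketch below only prepares such a request using Python's standard library. The endpoint URL, the write key, and the bearer-token header are placeholders I made up; your data collection tool will document its own endpoint and authentication scheme.

```python
import json
import urllib.request

# Hypothetical collection endpoint and write key; substitute your vendor's values.
COLLECT_URL = "https://collect.example.com/v1/track"
WRITE_KEY = "YOUR_WRITE_KEY"


def build_request(event):
    """Prepare (but do not send) an HTTP POST carrying one event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        COLLECT_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {WRITE_KEY}",
        },
    )


request = build_request({"user_id": "user_123", "event": "Signed Up"})
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

In practice the vendor's SDK wraps this call and adds batching and retries, which is the main reason to prefer an SDK over raw API calls.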
Secondary data sources
Third-party tools that your customers interact with directly or indirectly (tools used for authentication, payments, in-app experiences, support, feedback, engagement, and advertising) are secondary data sources.
Customers interact with third-party tools indirectly or unknowingly when they are embedded within your core product experiences.
Examples include Auth0 for authentication, Stripe for payments, and Userflow for in-app experiences — from a user’s point of view, they are using your product even when interacting with these external tools.
Customers also interact with external tools that are evidently not part of the core product experience but are integral touchpoints.
Creating a support ticket via Zendesk, leaving feedback via Typeform, opening an email sent via Intercom, or engaging with an ad on Facebook — these are all interactions that help understand the customer journey.
It’s also helpful to keep in mind that third-party tools generate a lot of data but not all of it is event data. What exactly you can collect in terms of events and objects depends on the integrations offered by the data collection tool you use.
To collect data from secondary sources, you can either use source integrations offered by data collection tools or write your own code.
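If you do write your own code, a common pattern is to receive a webhook from the third-party tool and reshape it into the same event format your primary sources use. Here is a rough sketch of that mapping step. The payload shape (`type`, `created_at`, `source`, `data.customer_id`) is invented for illustration and does not match any particular vendor's webhook format.

```python
def webhook_to_event(payload):
    """Map a third-party webhook payload (hypothetical shape) to a track event."""
    data = payload.get("data", {})
    return {
        "type": "track",
        "user_id": data.get("customer_id"),
        # "payment.succeeded" becomes "Payment Succeeded"
        "event": payload["type"].replace(".", " ").title(),
        "properties": {k: v for k, v in data.items() if k != "customer_id"},
        "timestamp": payload.get("created_at"),
        "source": payload.get("source", "unknown"),
    }


# A made-up payload from a hypothetical payments tool.
sample = {
    "type": "payment.succeeded",
    "created_at": "2023-01-15T10:30:00Z",
    "source": "payments_tool",
    "data": {"customer_id": "user_123", "amount": 4900, "currency": "usd"},
}
event = webhook_to_event(sample)
```

The mapping itself is short. The effort goes into keeping it up to date for every tool and every payload change, which is what source integrations save you from.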
Technologies and tools to collect behavioral data
Just like all the layers of the modern data landscape, the data collection layer has experienced a lot of activity in the last couple of years, with the launch of several open-source products that have become popular very quickly.
The overlap between products is also increasing as core capabilities are being extended to cover adjacent use cases.
Customer Data Infrastructure or CDI
CDI is a less common term that’s often confused with CDP (Customer Data Platform).
A platform cannot exist without infrastructure, and a CDP is essentially a layer on top of CDI: an additional component that offers a visual interface for working with the data collected via the CDI.
CDI is a standalone solution that can exist without a CDP, whereas a CDP is sold as an add-on by some CDI vendors.
Key aspects of a CDI are as follows:
CDI is purpose-built to collect behavioral data from primary or first-party data sources but some solutions also support a handful of secondary data sources (third-party tools).
Data is typically synced to a cloud data warehouse like Snowflake, BigQuery, or Redshift, but most CDI solutions have the ability to sync data to third-party tools as well.
All CDI vendors offer a variety of data collection SDKs and APIs.
Some CDI solutions store a copy of the data, some make it optional, and some don’t.
CDI solutions that store a copy of the data also offer out-of-the-box identity resolution.
The core capabilities of a CDP, on the other hand, include identity resolution and the ability for users to build and sync audiences to external tools using a drag-and-drop UI (without writing SQL).
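Identity resolution sounds abstract, so here is a toy sketch of the core idea: linking identifiers (say, an anonymous browser ID and a logged-in user ID) that belong to the same person. The `IdentityGraph` class and the ID values are hypothetical, and I am using a simple union-find structure for illustration. Real products apply their own, far more elaborate rules.

```python
class IdentityGraph:
    """Toy identity resolution: link identifiers that belong to one person."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Every identifier starts out as its own group.
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps lookups fast as chains grow.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def link(self, id_a, id_b):
        """Record that two identifiers were seen together (e.g. at login)."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_person(self, id_a, id_b):
        return self._find(id_a) == self._find(id_b)


graph = IdentityGraph()
graph.link("anon_web_1", "user_42")     # user logs in on the web app
graph.link("anon_mobile_7", "user_42")  # same user logs in on mobile
```

Once the two logins are linked, events recorded under either anonymous ID can be attributed to the same profile, which is what makes audience building possible.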
CDI and CDP solutions
Segment offers multiple products — Connections is their CDI offering, Profiles is an identity resolution add-on, and Twilio Engage includes CDP capabilities. Segment also offers Protocols, a data governance tool.
mParticle takes a slightly different approach: it offers CDI capabilities along with identity resolution in its Standard edition, whereas audience building is available on the Premium plan. mParticle also offers additional products for product analytics (Indicative) and predictive analytics (Cortex).
Both Segment and mParticle support data warehouses and a host of third-party tools as destinations, and both store a copy of your data that can be accessed later if needed.
RudderStack (Event Stream) and Jitsu are open-source CDI solutions positioned as alternatives to Segment Connections. Both products support warehouses and third-party tools but RudderStack offers a more extensive catalog of destinations.
Snowplow is the only CDI solution that literally calls itself a behavioral data platform. It is also open-source and, unlike the others, doesn’t support third-party tools as destinations; it is focused on warehouses and a few open-source projects.
Other CDI solutions worth looking into are Freshpaint, which offers codeless tracking, and MetaRouter, a server-side CDI that only runs in a private cloud instance.
The links below will take you to the integration catalogs of the respective tools:
Segment: Warehouses and third-party destinations
mParticle: All integrations (note that Feed refers to data sources)
RudderStack: Warehouses and third-party destinations
Jitsu: All destinations
Snowplow: All integrations
Freshpaint: All integrations
MetaRouter: All destinations
Extract, Load, Transform or ELT
ELT solutions are purpose-built to extract all types of data from a large number of third-party tools (secondary sources) and load the data into cloud data warehouses. That said, not all integrations offered by ELT tools support behavioral data or event data.
ELT tools don’t store any data and don’t support third-party tools as destinations.
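To show what "extract and load" means mechanically, here is a small sketch of an incremental sync: pull only the records that changed since the last run, then load them untransformed into the warehouse. Everything in it is a stand-in: `FAKE_API_RECORDS` replaces a real third-party API, an in-memory SQLite database replaces the cloud warehouse, and the `raw_tickets` table name is made up. Real ELT connectors handle pagination, rate limits, and schema changes on top of this.

```python
import json
import sqlite3

# Stand-in for a third-party API; a real connector would page through HTTP responses.
FAKE_API_RECORDS = [
    {"id": 1, "updated_at": "2023-01-01T00:00:00Z", "status": "open"},
    {"id": 2, "updated_at": "2023-01-02T00:00:00Z", "status": "solved"},
    {"id": 3, "updated_at": "2023-01-03T00:00:00Z", "status": "open"},
]


def extract(cursor_value):
    """Return records changed after the cursor (incremental sync)."""
    return [r for r in FAKE_API_RECORDS if r["updated_at"] > cursor_value]


def load(conn, records):
    """Load raw records as-is; transformation happens later, inside the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_tickets "
        "(id INTEGER PRIMARY KEY, updated_at TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_tickets VALUES (?, ?, ?)",
        [(r["id"], r["updated_at"], json.dumps(r)) for r in records],
    )
    conn.commit()


def sync(conn, cursor_value=""):
    """Run one extract-and-load pass and return the cursor for the next run."""
    records = extract(cursor_value)
    load(conn, records)
    # New cursor: the latest timestamp seen, or the old one if nothing changed.
    return max((r["updated_at"] for r in records), default=cursor_value)


warehouse = sqlite3.connect(":memory:")  # stands in for a cloud data warehouse
cursor_value = sync(warehouse)
```

Calling `sync(warehouse, cursor_value)` again picks up only what changed since the previous run, which keeps repeated syncs cheap.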
Airbyte is an open-source ELT tool that offers source connectors for 150+ tools like Zendesk, Intercom, Stripe, Typeform, and Facebook Ads, many of which generate event data. Airbyte also offers a Connector Development Kit (CDK) that you can use to build integrations that are maintained by Airbyte’s community members.
Other ELT vendors include Fivetran, Stitch, and Meltano (also open-source).
As mentioned earlier, CDI solutions also offer source integrations with a few third-party tools but those are not as comprehensive and deep as the integrations offered by ELT tools.
When contemplating whether to use an ELT tool or a source integration of a CDI tool to extract data from a third-party tool, consider the following:
CDI is best-in-class to collect behavioral data from primary or first-party data sources: web and mobile apps, and IoT devices.
ELT is best-in-class to collect all types of data, including behavioral data, from secondary data sources: third-party tools that power various customer experiences.
Product analytics tools
Amplitude, Mixpanel, Indicative (by mParticle), Heap, and PostHog (open-source) are product analytics tools purpose-built for behavioral data analysis. At the same time though, all of these offer SDKs and APIs to collect data from your primary data sources.
Product analytics tools by nature store a copy of your data and allow you to export the data via APIs. Additionally, if you’d like to export data from these tools to your data warehouse, you can either use native integrations that some of these tools offer, or leverage Airbyte’s integrations with Amplitude, Mixpanel, or PostHog.
However, it’s important to keep in mind that beyond analysis, there are plenty of activation use cases for behavioral data.
Custom tracking solutions
If readymade solutions are not for you, you can always build a custom tracking service that collects data from your apps and syncs it to your warehouse and downstream applications. That said, having first-hand experience with such a solution, I can tell you that maintenance and troubleshooting are not trivial and the frustration is real.
More importantly, with so many different flavors of CDI and ELT solutions available, building one’s own is just not the best use of engineering resources. In fact, engineers generally hate building integrations — if you’re one, let me know if I’m wrong.
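For a sense of what even the simplest custom tracking service has to handle, here is a toy sketch that buffers events, flushes them in batches to every destination, and sets failed batches aside for a later retry. The `EventCollector` class, the destination names, and the batch size are all hypothetical. A production service would also need persistence, backoff, schema validation, and monitoring.

```python
class EventCollector:
    """Toy tracking service: buffer events, then flush each batch to every destination."""

    def __init__(self, destinations, batch_size=3):
        self.destinations = destinations  # name -> callable that accepts a list of events
        self.batch_size = batch_size
        self.buffer = []
        self.failed = []  # (destination_name, batch) pairs to retry later

    def track(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        for name, deliver in self.destinations.items():
            try:
                deliver(batch)
            except Exception:
                # One broken destination must not block the others.
                self.failed.append((name, batch))


warehouse_rows = []  # stands in for a warehouse table


def broken_tool(batch):
    """Simulates a downstream tool that is currently unreachable."""
    raise ConnectionError("destination unavailable")


collector = EventCollector(
    {"warehouse": warehouse_rows.extend, "email_tool": broken_tool}
)
for name in ["Signed Up", "Project Created", "Invite Sent", "Report Exported"]:
    collector.track({"event": name})
collector.flush()  # push whatever is left in the buffer
```

Every destination you add brings its own failure modes, and the list of failed batches only grows until someone writes and maintains the retry logic.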
Using purpose-built data collection tools (CDI and ELT) is more efficient, prevents vendor lock-in, and just makes more sense.
I recommend adopting a CDI to collect data from primary or first-party data sources, and sticking to your ELT tool to collect data from secondary or third-party sources.
Now that you have a better picture of the tools needed to collect behavioral data for analysis and activation, don’t forget to collaborate with stakeholders from various teams when it comes to deciding which events to track and what data to send to which destination.