As some of you might know, databeats is an evolution of what started as Data-led Academy — to enable folks working in tech, irrespective of their role, to gain fundamental knowledge about data technologies and processes.
You don’t need a background in data to understand the data lifecycle or to improve the outcomes of your everyday efforts using data.
Understanding what data is collected, and how, is a prerequisite for anyone who works with data in their day-to-day.
However, with a plethora of tooling options and the complexities of adhering to privacy regulations, companies need to be very intentional about what data they collect, and how they make that data available in the tools used by various teams.
Keeping that in mind, I wanted to share this guide (originally published on the Airbyte blog) that offers an in-depth overview of the technologies available to collect event data from primary and secondary data sources.
Ready to take notes? Cool, let’s dive in!
So your company is launching a new product and you’ve been tasked with setting up the event data infrastructure? Or maybe you need to revamp the existing setup using modern tools?
There are a few different technologies (CDI, CDP, ELT/ETL) that can be used to collect event data, and at the same time, there are several tools with capabilities that span multiple technologies.
Navigating this maze and making an informed decision is daunting and time-consuming — this guide aims to change that.
Before going into tools and technologies though, I’d like to shed some light on why collecting event data is important and where exactly event data comes from.
Why collect event data?
Event data is collected when users perform actions or events while interacting with a product.
Event data, also referred to as behavioral data or product-usage data, serves two main purposes for teams — understanding how the product is being used or not used (user behavior) and building personalized customer experiences across various touchpoints to influence user behavior.
Understanding product usage requires prior instrumentation of the features whose usage you’d like to measure — tracking the events a user performs and sending those events to third-party tools for analysis. Additionally, events also help trigger campaigns and experiences via downstream activation tools.
Launching new features without instrumenting them beforehand is a classic mistake — it takes away the opportunity to analyze how those features are used (if at all) and to trigger in-app experiences or messages when relevant events take place (or don’t).
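To make instrumentation a little more concrete, here is a minimal sketch of what a tracked event might look like before it is sent anywhere. The `build_event` helper and its field names are hypothetical and chosen for illustration; they are not the schema of any particular data collection tool, although most tools use a broadly similar shape (who did what, when, and with which properties).

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(user_id, name, properties=None):
    """Build a generic event payload (hypothetical schema, not a vendor's)."""
    if not name:
        raise ValueError("event name is required")
    return {
        "message_id": str(uuid.uuid4()),      # lets downstream tools de-duplicate
        "user_id": user_id,                   # who performed the action
        "event": name,                        # what they did
        "properties": dict(properties or {}), # context about the action
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
    }


# Instrumenting a hypothetical "export report" feature
event = build_event("user_123", "Report Exported", {"format": "csv"})
print(json.dumps(event, indent=2))
```

In practice, the SDK of your data collection tool would build and send this payload for you; the point is simply that the call has to exist in your code before the feature ships.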
Where does event data come from?
Although the events I’m referring to take place within your product, the actual source of the data can be an external tool or service that’s embedded within the product experience.
For the sake of simplicity, I like to categorize data sources as primary (includes all internal sources) and secondary (includes all external sources).
Primary data sources
Your core product — web app, mobile apps, a smart device, or a combination — powered by proprietary code is a primary source for event data.
If your product is built using no-code tools, you won’t have a primary source for event data — you’d rely on the no-code tools to make event data available to you (either via webhooks or integrations with data collection tools).
To collect data from your primary sources, you can use the client and server-side SDKs or the APIs provided by data collection tools.
Secondary data sources
External or third-party tools that your customers interact with directly or indirectly are secondary data sources. These include tools used for authentication, payments, in-app experiences, support, feedback, engagement, and advertising.
Customers interact with external tools indirectly or unknowingly when they are embedded within your core product experiences.
Examples include Auth0 for authentication, Stripe for payments, and Userflow for in-app experiences — from a user’s point of view, they are using your product even when interacting with these external tools.
Customers also interact with external tools that are evidently not part of the core product experience but are integral touchpoints.
Creating a support ticket via Zendesk, leaving feedback via Typeform, opening an email sent via Intercom, or engaging with an ad on Facebook — these are all interactions that help understand the customer journey.
It’s also helpful to keep in mind that external tools generate a lot of data but not all of it is event data. What exactly you can collect in terms of events and objects depends on the integrations offered by the data collection tool you use.
To collect data from secondary sources, you can either use source integrations offered by data collection tools or write your own code.
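If you go the write-your-own-code route, a large part of the work is mapping each tool’s payload to a common event shape. The sketch below assumes a made-up webhook payload with `type`, `customer_id`, `created_at`, and `data` keys; real tools each have their own payload formats, so treat both the input and the `normalize_webhook` function as illustrative only.

```python
def normalize_webhook(source, payload):
    """Map a hypothetical webhook payload to a common event shape."""
    return {
        "source": source,                        # which external tool sent it
        "event": payload["type"],                # required: the event name
        "user_id": payload.get("customer_id"),   # may be missing for some events
        "timestamp": payload["created_at"],      # required: when it happened
        "properties": payload.get("data", {}),   # everything else, untouched
    }


# Example payload, invented for illustration
incoming = {
    "type": "ticket.created",
    "customer_id": "user_123",
    "created_at": "2022-05-01T10:00:00Z",
    "data": {"priority": "high"},
}
print(normalize_webhook("support_tool", incoming))
```

Every additional tool means another mapping like this one to write and maintain, which is the main reason source integrations exist.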
Moreover, data from external or third-party tools is still first-party data – not to be confused with third-party data that is acquired from an external vendor. To dive deeper into the nuances of first-party data and how it differs from zero, second, or third-party data, check out this guide.
Technologies and tools to collect event data
Just like all the layers of the modern data landscape, the data collection layer has experienced a lot of activity in the last few years, with the launch of several open-source products that have become popular very quickly.
The overlap between products is also increasing as core capabilities are being extended to cover adjacent use cases.
Customer Data Infrastructure or CDI
CDI is a less common term that’s often confused with CDP (Customer Data Platform). In simple terms though, a CDI is one of the many components of a CDP.
That said, a CDI can exist without a CDP, whereas a CDP usually includes CDI capabilities.
Key aspects of a CDI are as follows:
- CDI is purpose-built to collect event data from primary or first-party data sources but some solutions also support a handful of secondary data sources (third-party tools).
- Data is typically synced to a cloud data warehouse like Snowflake, BigQuery, or Redshift, but most CDI solutions have the ability to sync data to third-party tools as well.
- All CDI vendors offer a variety of data collection SDKs and APIs.
- Some CDI solutions store a copy of the data, some make it optional, and some don’t.
- CDI solutions that store a copy of the data also offer out-of-the-box identity resolution.
The core capabilities of a CDP, on the other hand, include identity resolution and the ability for users to build and sync audiences to external tools using a drag-and-drop UI (without writing SQL).
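Identity resolution can sound abstract, so here is a deliberately simplified sketch of the underlying idea: identifiers that are observed together (say, an anonymous ID and a user ID at login) get linked, and anything linked, directly or transitively, is treated as the same person. The `IdentityGraph` class is my own illustration using a union-find structure; it is an assumption about how one could model this, not a description of how any vendor implements it.

```python
class IdentityGraph:
    """Toy identity resolution: linked identifiers resolve to one user."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own group
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root
        while self._parent[identifier] != root:
            next_id = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_id
        return root

    def link(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_user(self, a, b):
        return self._find(a) == self._find(b)


graph = IdentityGraph()
graph.link("anon_1", "user_42")           # anonymous visitor logs in
graph.link("user_42", "email:a@x.com")    # user adds an email address
print(graph.same_user("anon_1", "email:a@x.com"))
```

Real implementations also have to deal with shared devices, conflicting identifiers, and merge rules, which is where most of the complexity lives.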
CDI and CDP solutions
Segment offers multiple products — Connections is their CDI offering, Profiles is an identity resolution add-on, and Twilio Engage includes CDP capabilities. Segment also offers Protocols, a data governance tool.
mParticle takes a slightly different approach: it offers CDI capabilities along with identity resolution in its Standard edition, whereas audience building is available on the Premium plan. mParticle also offers additional products for product analytics (Indicative) and predictive analytics (Cortex).
Both Segment and mParticle support data warehouses and a host of third-party tools as destinations, and both store a copy of your data that can be accessed later if needed.
RudderStack (Event Stream) and Jitsu are open-source CDI solutions positioned as alternatives to Segment Connections. Both products support warehouses and third-party tools but RudderStack offers a more extensive catalog of destinations.
Snowplow is the only CDI solution that literally calls itself a behavioral data platform. It is also open-source and, unlike the others, Snowplow doesn’t support third-party tools as destinations; it is focused on warehouses and a few open-source projects instead.
Other CDI solutions worth looking into are Freshpaint, which offers codeless tracking, and MetaRouter, which is a server-side CDI that only runs in a private cloud instance.
The links below will take you to the integration catalogs of the respective tools:
- Segment: Warehouses and third-party destinations
- mParticle: All integrations (note that Feed refers to data sources)
- RudderStack: Warehouses and third-party destinations
- Jitsu: All destinations
- Snowplow: All integrations
- Freshpaint: All integrations
- MetaRouter: All destinations
ELT solutions
ELT solutions are purpose-built to extract all types of data from a large number of third-party tools (secondary sources) and load the data into cloud data warehouses. That said, not all integrations offered by ELT tools support behavioral data or event data.
ELT tools don’t store any data and don’t support third-party tools as destinations.
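The extract-load-transform order described above can be sketched in a few lines. This example uses an in-memory SQLite database as a stand-in for a cloud data warehouse, and `extract_tickets` returns hard-coded records where a real source connector would call a tool’s API. All function and table names here are hypothetical.

```python
import sqlite3


def extract_tickets():
    """Stand-in for a source connector (hard-coded records for illustration)."""
    return [
        {"id": 1, "status": "open", "requester": "user_123"},
        {"id": 2, "status": "solved", "requester": "user_456"},
    ]


def load_raw(conn, records):
    """Load records as-is; re-running replaces rows with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_tickets "
        "(id INTEGER PRIMARY KEY, status TEXT, requester TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_tickets VALUES (:id, :status, :requester)",
        records,
    )
    conn.commit()


def transform(conn):
    """The 'T' runs last, inside the warehouse, as plain SQL."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM raw_tickets GROUP BY status ORDER BY status"
    ).fetchall()


warehouse = sqlite3.connect(":memory:")  # stand-in for Snowflake, BigQuery, etc.
load_raw(warehouse, extract_tickets())
print(transform(warehouse))
```

Because the raw data lands in the warehouse untouched, you can change or add transformations later without re-extracting anything from the source.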
Airbyte is an open-source ELT tool that offers source connectors for 150+ tools like Zendesk, Intercom, Stripe, Typeform, and Facebook Ads, many of which generate event data. Airbyte also offers a Connector Development Kit (CDK) that you can use to build integrations that are maintained by Airbyte’s community members.
Other ELT vendors include Fivetran, Stitch, and Meltano (also open-source).
As mentioned earlier, CDI solutions also offer source integrations with a few third-party tools but those are not as comprehensive and deep as the integrations offered by ELT tools.
When contemplating whether to use an ELT tool or a source integration of a CDI tool to extract data from a third-party tool, consider the following:
- CDI is best-in-class to collect event data from primary or first-party data sources: web and mobile apps, and IoT devices.
- ELT is best-in-class to collect all types of data, including event data, from secondary data sources: third-party tools that power various customer experiences.
Product analytics tools
Amplitude, Mixpanel, Indicative (by mParticle), Heap, and PostHog (open-source) are product analytics tools purpose-built for event data analysis (product analytics). At the same time though, all of these offer SDKs and APIs to collect data from your primary data sources.
Product analytics tools by nature store a copy of your data and allow you to export the data via APIs. Additionally, if you’d like to export data from these tools to your data warehouse, you can either use native integrations that some of these tools offer, or leverage the integrations offered by a tool like Airbyte.
However, it’s important to keep in mind that beyond analysis, there are plenty of activation use cases for event data.
Custom tracking solutions
If readymade solutions are not for you, you can always build a custom tracking service that collects data from your apps and syncs it to your warehouse and downstream applications. That said, having first-hand experience with such a solution, I can tell you that maintenance and troubleshooting are not trivial and the frustration is real.
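To give a sense of where the maintenance burden comes from, here is a minimal sketch of just one piece of a custom tracking service: buffering events and delivering them in batches. The `EventBuffer` class and the `send` callable are hypothetical; in a real service `send` would write to your warehouse or call a downstream API, and you would also need retries with backoff, persistence across restarts, and de-duplication.

```python
class EventBuffer:
    """Collect events in memory and deliver them in batches."""

    def __init__(self, send, batch_size=3):
        self._send = send              # callable that delivers a list of events
        self._batch_size = batch_size
        self._pending = []

    def track(self, event):
        self._pending.append(event)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Try to deliver pending events; return True if nothing is left."""
        if not self._pending:
            return True
        batch, self._pending = self._pending, []
        try:
            self._send(batch)
        except Exception:
            # Delivery failed: put the batch back so it is not lost
            self._pending = batch + self._pending
            return False
        return True


delivered = []
buffer = EventBuffer(delivered.append, batch_size=2)
buffer.track({"event": "Signed Up"})
buffer.track({"event": "Report Exported"})  # reaches batch_size, triggers a flush
print(len(delivered))
```

Even this small piece raises questions (what happens if the process crashes with events still in memory?) that readymade tools have already answered.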
More importantly, with so many different flavors of CDI and ELT solutions available, building one’s own is just not the best use of engineering resources. In fact, engineers generally hate building integrations — if you’re one, let me know if I’m wrong.
Conclusion
Using purpose-built data collection tools (CDI and ELT) is more efficient, prevents vendor lock-in, and just makes more sense.
I recommend adopting a CDI to collect data from primary or first-party data sources, and sticking to your ELT tool to collect data from secondary or third-party sources.
Now that you have a better picture of the tools needed to collect event data for analysis and activation, don’t forget to collaborate with stakeholders from various teams when it comes to deciding which events to track and what data to send to which destination.