Where Does the Data Originate? Internal Sources Explained

Burning Questions, Answered: Part 3

Arpit Choudhury

Created :

June 5, 2024

Created :

May 22, 2024

Updated :

June 12, 2024

This is part 3 of a 5-part series titled Burning Questions, Answered. Make sure to read part 2 before proceeding.

The events needed to identify the points of friction on the path to activation originated in Integromat’s web app which, like most B2B SaaS products, was built using proprietary code. Today, it’s possible to build lightweight SaaS products without proprietary code, using a combination of no-code tools and APIs – a process that’s being aided by the general availability of Generative AI. Therefore, once we know what data is needed to answer a burning question, the next step is to figure out where the data originates. Does it come from an internal (or primary) data source or an external (or secondary) data source?

Products powered by proprietary code – web and mobile apps or smart devices – are internal data sources.

In contrast, external or third-party tools that users interact with as part of the product experience – tools and APIs used for messaging, support, feedback, authentication, payments, and so on – are external data sources.

From a data collection standpoint, the distinction between internal and external data sources matters because when a product, or even a part of a product, is built using external tools (or APIs), product-usage data originates in an external environment. In that case, the data that can be collected is limited, as one has to rely on an external vendor to make the usage data available. Moreover, the data can be made available in different formats: as events (which can be fetched in real time using webhooks or ready-made integrations) or as properties (or attributes) of the entities (or objects) specific to the external app, which need to be fetched using APIs or data integration tools.

Knowing where a piece of data originates expedites the data collection process and enables teams to figure out the most efficient way to send the data to the destination(s) where the data is supposed to be consumed.

Here’s a common scenario depicting what happens when data originates in an external source:

Let’s assume that your organization uses Stripe (the payment processing giant) to power the payment flows in your app. From a user’s perspective, they’re still interacting with your app while upgrading or canceling a subscription – even though they’re interacting with Stripe (often unknowingly). Therefore, the data you can collect (pertaining to a payment flow) is limited to the data collected and made available by Stripe, an external data source.

This scenario applies to every third-party tool used to power an experience across the customer lifecycle – payment processing being a common example. In Stripe’s case, important events that take place during a transaction can be fetched in real time by configuring webhooks, or at a later time using their Events API. It’s also useful to note that Stripe guarantees the availability of event data for 30 days (from when the event takes place), which means they don’t store the data on your behalf indefinitely.
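To make the webhook path concrete, here is a minimal Python sketch of how a webhook body from Stripe might be turned into an internal event. The two Stripe event types and the top-level fields (`id`, `type`, `created`, `data.object`) follow Stripe's event format; the internal event names, the `normalize_event` and `still_retrievable` helpers, and the hard-coded 30-day window are assumptions made for illustration, not a reference implementation.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the 30-day availability window mentioned above.
RETENTION = timedelta(days=30)

# Stripe event type -> hypothetical internal event name.
HANDLED_TYPES = {
    "customer.subscription.updated": "subscription_changed",
    "customer.subscription.deleted": "subscription_canceled",
}


def normalize_event(raw_body: str) -> Optional[dict]:
    """Turn a webhook body into an internal event, or None if it is ignored.

    A production handler should also verify the webhook signature before
    trusting the body; that step is omitted here.
    """
    event = json.loads(raw_body)
    internal_name = HANDLED_TYPES.get(event["type"])
    if internal_name is None:
        return None
    return {
        "event": internal_name,
        "source": "stripe",
        "source_event_id": event["id"],
        # `created` is a Unix timestamp in seconds.
        "occurred_at": datetime.fromtimestamp(event["created"], tz=timezone.utc),
        "customer_id": event["data"]["object"].get("customer"),
    }


def still_retrievable(occurred_at: datetime, now: datetime) -> bool:
    """True if the event is still inside the assumed retention window."""
    return now - occurred_at <= RETENTION
```

A check like `still_retrievable` is only useful when backfilling from the Events API after a missed webhook; anything older than the window has to come from your own storage.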

Since every third-party service has idiosyncrasies, going over the API docs and understanding the scope of the integrations offered by an external service helps figure out what data can be collected, in what format, and for how long that service stores data on its servers on behalf of customers.

Confusion alert: First-party data vs. Third-party data

As Stripe’s customer, your organization owns the first-party data generated when your customers interact with Stripe’s services within your product.

Therefore, even though the data originates in an external or third-party tool such as Stripe, that data is first-party data for your org.

In contrast, had Stripe been a data broker that sold data collected from its customers to your organization, that would have been third-party data for your org.

Data from internal sources

Now back to my original burning question:

“We’re acquiring a ton of users every day but very few end up hitting the activation milestone; what’s preventing the rest from performing the actions leading to activation?”

As mentioned earlier, the events needed to answer this question originated in our web app built using proprietary code; therefore the data also needed to be tracked (or collected) using code.

This is a good time to highlight that, unlike implicit or codeless data collection where the idea is to collect whatever data can be collected automatically on the client-side, we opted for explicit data collection and tracked most of our events server-side because we didn’t want to compromise on data quality.

More importantly though, we certainly didn’t want to collect a piece of data without a predefined purpose. Since each event had to be tracked explicitly using code and then synced to multiple destinations, it was paramount that everything was documented in extreme detail for seamless collaboration between me and my engineering counterparts.
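As a rough illustration of what explicit, server-side tracking against a documented plan can look like, here is a small Python sketch. The event names, the required properties, and the `build_event` helper are hypothetical and do not reflect Integromat's actual tracking plan; the point is only that an event missing from the plan is rejected rather than silently collected.

```python
from datetime import datetime, timezone

# Hypothetical tracking plan: event name -> required property names.
TRACKING_PLAN = {
    "signed_up": {"signup_method"},
    "scenario_created": {"scenario_id"},
    "scenario_activated": {"scenario_id"},
}


def build_event(user_id: str, name: str, properties: dict) -> dict:
    """Build an event payload server-side, rejecting anything not in the plan."""
    if name not in TRACKING_PLAN:
        raise ValueError(f"undocumented event: {name}")
    missing = TRACKING_PLAN[name] - set(properties)
    if missing:
        raise ValueError(f"{name} is missing properties: {sorted(missing)}")
    return {
        "user_id": user_id,
        "event": name,
        "properties": dict(properties),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the plan as data means the same structure can serve as the documentation shared with engineering.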

Keeping that in mind, I began the documentation process early. Also, I wanted to give myself enough time to think through the events, the associated event properties, and the user and account properties we needed to collect.

*Data from internal sources that's used for analysis and activation purposes*

Going through this process brings up a lot of questions, especially as you think about how to utilize the events for analytics and experimentation purposes. It also helps you figure out if it makes more sense to collect a piece of data as a property of an entity like User or Account (user property or account property) rather than as an event.

In simple terms, a user property is used to store a piece of data about individual users (traits, demographics, PII, and so on) whereas an account property is used to store data about groups (of users), which, in the context of B2B SaaS, are referred to as workspaces, teams, or organizations.

Depending on the business, there can be many other entities such as Product, Supplier, Location, and so on. Think logistics or retail where a data point is associated with entities other than User.

In Integromat’s case, besides the obvious user properties such as name, email, and country, we also stored a user’s profile info (industry and role) and preferences – what they wish to achieve using our product, whether or not they wanted to be treated as someone new to the product, and the types of emails they’d like to receive – as user properties.

Similarly, data points pertaining to an account (and not a specific user in an account) were stored as account properties; examples include organization_name, subscription_plan_name (free, pro, or business), subscription_plan_type (monthly or annual), active_scenarios_count, and organization_users_count.
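The split between user and account properties can be sketched as two small data structures. The field names come from the examples above; the `identify_payload` helper, the ID arguments, and the shape of the payload it returns are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass


@dataclass
class UserProperties:
    """Data about an individual user."""
    name: str
    email: str
    country: str
    industry: str
    role: str


@dataclass
class AccountProperties:
    """Data about the account (organization) a user belongs to."""
    organization_name: str
    subscription_plan_name: str  # free, pro, or business
    subscription_plan_type: str  # monthly or annual
    active_scenarios_count: int
    organization_users_count: int


def identify_payload(user_id: str, account_id: str,
                     user: UserProperties, account: AccountProperties) -> dict:
    """Hypothetical payload that keeps user and account data separate,
    linking the user to the account by ID."""
    return {
        "user": {"id": user_id, "account_id": account_id, **asdict(user)},
        "account": {"id": account_id, **asdict(account)},
    }
```

Storing a value such as `active_scenarios_count` on the account rather than on each user means it is updated in one place when any member of the organization changes it.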

Many of these data points were collected keeping in mind the course of action after identifying the points of friction on the path to activation. We had chosen Mixpanel for event analytics (I prefer event analytics over product analytics) and I knew that once the first set of events landed in Mixpanel, the next step would be to run data-powered experiments – using the same events and properties in external activation tools – to get users past those friction points as quickly as possible.

Alongside Mixpanel, we were setting up Customer.io for emails and Userflow for interactive product tours (in-app guides); therefore, these activation tools were additional destinations for the events and properties we were collecting. Since each destination behaved differently in terms of how it ingested data (more on that next week), things began to get complicated when it was time to send data to these external destinations – where the data would be consumed for experimentation purposes, or in other words, where the data would be activated.
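One way to picture the complication is a fan-out step in which a single internal event is reshaped once per destination. The sketch below is purely illustrative: the destination names are generic, and the payload shapes are invented and are not the real Mixpanel, Customer.io, or Userflow formats.

```python
from typing import Callable, Dict

Event = dict
Adapter = Callable[[Event], dict]


def to_analytics(event: Event) -> dict:
    # Invented shape for an analytics destination.
    return {
        "event": event["event"],
        "distinct_id": event["user_id"],
        "props": event["properties"],
    }


def to_email_tool(event: Event) -> dict:
    # Invented shape for an email destination.
    return {
        "name": event["event"],
        "customer": event["user_id"],
        "data": event["properties"],
    }


# Hypothetical registry: destination name -> adapter.
DESTINATIONS: Dict[str, Adapter] = {
    "analytics": to_analytics,
    "email": to_email_tool,
}


def fan_out(event: Event, destinations: Dict[str, Adapter] = DESTINATIONS) -> dict:
    """Return one destination-specific payload per registered destination."""
    return {name: adapter(event) for name, adapter in destinations.items()}
```

Each new destination adds one adapter, which is also one more mapping that has to be documented and kept in sync.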

Now, once the events from our app began to land in Mixpanel, it took very little time to create funnel reports and identify where significant drop-offs were taking place on the path to activation; I finally had some answers to my burning – or should I say “blazing” – question:

“We’re acquiring a ton of users every day but very few end up hitting the activation milestone; what’s preventing the rest from performing the actions leading to activation?”

I still remember how gratifying it was to finally see (and not keep guessing) exactly what was going on – the insights led to a flurry of hypotheses and it was time to start experimenting.
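For readers who haven't built one, a funnel report boils down to counting how many users reach each step in order and comparing adjacent steps. The following Python sketch shows that arithmetic under stated assumptions: the step names are hypothetical, events are plain `(user_id, event_name, timestamp)` tuples, and this is not how Mixpanel computes its funnels internally.

```python
# Hypothetical activation funnel, in the order users are expected to proceed.
FUNNEL_STEPS = ["signed_up", "scenario_created", "scenario_activated"]


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Count users who reached each step, requiring steps to occur in order.

    `events` is an iterable of (user_id, event_name, timestamp) tuples.
    """
    progress = {}  # user_id -> number of funnel steps completed so far
    for user_id, name, _ts in sorted(events, key=lambda e: e[2]):
        done = progress.get(user_id, 0)
        if done < len(steps) and name == steps[done]:
            progress[user_id] = done + 1
    return [sum(1 for d in progress.values() if d > i) for i in range(len(steps))]


def step_conversion(counts):
    """Conversion rate from each step to the next (None if the step is empty)."""
    return [None if prev == 0 else cur / prev
            for prev, cur in zip(counts, counts[1:])]
```

The step with the lowest conversion rate is the first candidate for a point of friction, which is where the hypotheses and experiments start.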

At this stage, the goal was to increase the activation rate and later figure out how to get activated users to increase the consumption of operations (equivalent to tasks on Zapier) to hit the limits of our generous free plan.

I had certainly felt the urge to do it all at once – collect all the data and run all possible experiments to improve all the metrics altogether – but we didn’t have the resources for that, and in retrospect, I’m very glad that was the case. After all, no matter how hungry one is, overeating is always a bad idea.

Move on to part 4, which covers the external sources where data originates.