Simplifying Streaming Data Infrastructure w/ Dunith from Redpanda Data
Part 1 of the series on real-time analytics infrastructure and use cases.
Streaming data infrastructure is going mainstream.
Technologies built to cater to the needs of large-scale organizations like LinkedIn and Uber can now be utilized by startups to deliver hyper-personalized product experiences in real time.
While you don’t need to be a data engineer to understand the benefits of streaming data infrastructure or real-time data pipelines, making the case for it and figuring out where to get started is not trivial.
So we asked one of the best minds in the streaming data space, Dunith Dhanushka (previously at StarTree), to answer some fundamental questions, describe relatable use cases, and share tips on how to get started.
P.S. The folks behind StarTree built Apache Pinot, an open source real-time OLAP data store, at LinkedIn. Every time you click "who's viewed your profile" on LinkedIn, the data is presented in real time thanks to Pinot’s capabilities.
Let’s dive in:
Q. In simple terms, what is streaming data infrastructure?
First, I’d like to introduce you to events. Events come before streaming and represent facts about what has happened in the past. When we sequence these events into a stream, we call it streaming data. Streaming data infrastructure is what you need to capture, process, and make sense of those events in real time.
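To make the idea of events and streams concrete, here is a minimal Python sketch. The `Event` fields (`timestamp`, `user_id`, `action`) and the `as_stream` helper are hypothetical names chosen for illustration; they do not come from any particular streaming platform.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)  # frozen: an event is a fact about the past and never changes
class Event:
    timestamp: int  # seconds since epoch (hypothetical field)
    user_id: str    # hypothetical field
    action: str     # hypothetical field, e.g. "page_view"


def as_stream(events: Iterable[Event]) -> Iterator[Event]:
    """Yield events one at a time, ordered by when they happened."""
    yield from sorted(events, key=lambda e: e.timestamp)


# Three facts, recorded out of order
facts = [
    Event(1700000060, "u2", "click"),
    Event(1700000000, "u1", "page_view"),
    Event(1700000030, "u1", "click"),
]

# Sequencing the facts by time turns them into a stream
ordered_actions = [e.action for e in as_stream(facts)]
```

In a real system the stream is unbounded and events keep arriving, so the platform maintains the ordering for you instead of sorting a finished list.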
Q. What are the prerequisites in terms of the data stack to set up streaming or real-time data pipelines?
Looking at the streaming architecture or the landscape from a high level, we can identify many components and categorize them based on their roles as follows:
The first step is to produce events from your existing applications or operational systems, which is done via SDKs or middleware tools.
Secondly, you need a scalable medium to store these events and this is where real-time streaming platforms like Kafka and Pulsar come in.
The third step is what we call “massaging” the data.
So, we have event producers, and the events they emit are ingested into an event streaming platform. The events landing there then need to go through some sort of transformation because this is raw data we’re talking about.
It could contain unwanted information, and you often need to mask PII, map JSON into XML, or join two streams together to produce an enriched view. This is what we mean by data massaging.
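As a rough illustration of data massaging, the Python sketch below drops an unwanted field, masks an email address, and enriches the event with data from a second source. Everything here is an assumption made for the example: the field names (`debug_info`, `email`, `plan`), the `mask_email` and `massage` helpers, and the use of a plain dictionary lookup in place of a true stream-to-stream join. A truncated, unsalted hash is also only a stand-in for whatever masking policy your organization requires.

```python
import hashlib


def mask_email(email: str) -> str:
    """Replace an email with a short one-way hash so the raw value never moves downstream."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def massage(event: dict, plans_by_user: dict) -> dict:
    """Drop unwanted fields, mask PII, and enrich with data from a second source."""
    # 1. Remove information the downstream consumers do not need
    cleaned = {k: v for k, v in event.items() if k != "debug_info"}
    # 2. Mask personally identifiable information
    cleaned["email"] = mask_email(cleaned["email"])
    # 3. Enrich: look up the user's plan (a simplified stand-in for a stream join)
    cleaned["plan"] = plans_by_user.get(cleaned["user_id"], "unknown")
    return cleaned


raw = {"user_id": "u1", "email": "ada@example.com", "action": "click", "debug_info": "x"}
plans = {"u1": "pro"}
result = massage(raw, plans)
```

In practice this logic would typically run inside a stream processor that applies it to every event as it arrives.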
In the fourth step, you need a serving layer to present or serve this aggregated or processed information to your end users — internal customers like analysts or decision makers inside your organization, or external users of your product.
In the case of external users, you need a real-time OLAP database or a read-optimized store to deliver real-time experiences.
So those are the four critical components that you need to set up your streaming infrastructure, but this could certainly vary based on the complexity of your use cases.
Q. So what exactly is Apache Pinot and what does it do?
Apache Pinot is a real-time OLAP database.
There are two things to note here — Real-time and OLAP (Online Analytical Processing).
Real-time indicates that Pinot can ingest data from streaming data sources like Apache Kafka, Kinesis, and Pulsar, and make that data queryable within a few seconds.
On top of that, Pinot is very fast at running complex OLAP queries that scan multiple batches of data, consistently performing aggregations and filtering with sub-second latency. It is tuned for user-facing analytics.
Q. Can you give us a common example of user-facing analytics?
Yes! On LinkedIn, you'll get a notification in real time when someone views your profile, right? That feature is called "Who viewed your profile," and Pinot is what powers it.
It might look like a simple feature, but a lot of complicated things go on in the background to make it possible: Pinot has to ingest real-time clicks or profile visits from all the front-end processes, store them in a scalable manner, and then run queries in real time to answer lots of concurrent questions. We’re talking about hundreds of thousands of queries executing on the database concurrently.
This is one of the best examples of user-facing analytics as Pinot was originally built at LinkedIn for this use case.
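To give a feel for the kind of question such a feature asks, here is a small Python sketch of a filter-and-aggregate query. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; it is not how LinkedIn or Pinot implements the feature, and the `profile_views` table, its columns, and the `viewers_of` helper are all hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for a real-time OLAP store (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_views (viewer_id TEXT, viewed_id TEXT, viewed_at INTEGER)"
)
conn.executemany(
    "INSERT INTO profile_views VALUES (?, ?, ?)",
    [
        ("u2", "u1", 100),
        ("u3", "u1", 200),
        ("u2", "u1", 300),
        ("u1", "u2", 400),
    ],
)


def viewers_of(viewed_id: str, since: int) -> list:
    """Count views per viewer for one profile, starting from a given time."""
    rows = conn.execute(
        "SELECT viewer_id, COUNT(*) AS views FROM profile_views "
        "WHERE viewed_id = ? AND viewed_at >= ? "
        "GROUP BY viewer_id ORDER BY views DESC, viewer_id",
        (viewed_id, since),
    )
    return rows.fetchall()


summary = viewers_of("u1", since=0)
```

The query itself is simple. The hard part described above is answering it for a very large number of users at once while new events keep arriving.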
Q. Can you briefly explain the benefits of a real-time OLAP Data Store like Pinot over a regular cloud data warehouse?
There are two main factors that differentiate Pinot from a data warehouse — Latency and Freshness of data.
Pinot can consistently answer queries with sub-second latency, usually within milliseconds. Data warehouses, on the other hand, are tuned for internal use cases like exploratory analysis and BI, so their query latencies are typically in the single-digit seconds.
When it comes to freshness, Pinot can ingest data from sources in real time, whereas with regular data warehousing solutions, one needs to employ ETL scripts or tools to batch and load data into the warehouse periodically (on a schedule).
Q. Typically how big or small are data teams at companies that successfully implement streaming or real-time data infrastructure?
Well, that's a bit of a difficult question to answer because it depends on the complexity and the velocity of your data infrastructure.
How fast do you want to process your data and how complex is your ecosystem? How many components are there in your system?
Let's say you simply want to build a real-time dashboard. In that case, assuming all the components are available as managed cloud services, you can start with a single data engineer on your team.
But then as your data velocity grows and your requirements grow, you can horizontally scale your team by allocating team members based on data sources, stream processing, or based on capability — some people can work on the serving layer, some can work on the stream processing, while others can work on the data ingestion.
Q. How do data-adjacent teams like Product and Growth utilize streaming data that is stored in Pinot?
We actually come across many use cases related to growth and product analytics, especially for SaaS companies.
They use Pinot to capture and store their product or engagement metrics — by instrumenting their products with an SDK, emitting data points as events into a streaming data platform like Kafka, and then configuring Pinot to ingest from that data platform.
These metrics can range from simple button clicks and page views to more complex data from advertising platforms, and then this vast data set can be utilized to understand user behavior and engagement.
Product teams can run analyses to derive metrics like DAUs (daily active users), or perform funnel analysis to understand points of friction or calculate conversion rates. They can plug this data into a BI tool to build real-time reports while Pinot ensures that data is fresh and relevant.
Running such analyses really fast and enabling Product and Growth teams to derive insights in real time is what Pinot excels at.
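The two metrics mentioned above can be sketched in a few lines of Python. This is a simplified illustration over a small in-memory list, not a Pinot query: the event fields (`day`, `user_id`, `action`), the action names, and both helper functions are hypothetical, and the funnel calculation ignores the order in which a user performed the two steps.

```python
from collections import defaultdict

# Hypothetical product events, as they might arrive from an instrumented app
events = [
    {"day": "2024-01-01", "user_id": "u1", "action": "signup_page_view"},
    {"day": "2024-01-01", "user_id": "u1", "action": "signup_submit"},
    {"day": "2024-01-01", "user_id": "u2", "action": "signup_page_view"},
    {"day": "2024-01-02", "user_id": "u1", "action": "signup_page_view"},
    {"day": "2024-01-02", "user_id": "u3", "action": "signup_page_view"},
    {"day": "2024-01-02", "user_id": "u3", "action": "signup_submit"},
]


def daily_active_users(events):
    """Count distinct users per day."""
    users_by_day = defaultdict(set)
    for e in events:
        users_by_day[e["day"]].add(e["user_id"])
    return {day: len(users) for day, users in users_by_day.items()}


def funnel_conversion(events, first_step, second_step):
    """Share of users who did first_step and also did second_step."""
    reached_first = {e["user_id"] for e in events if e["action"] == first_step}
    reached_second = {e["user_id"] for e in events if e["action"] == second_step}
    converted = reached_first & reached_second
    return len(converted) / len(reached_first) if reached_first else 0.0


dau = daily_active_users(events)
rate = funnel_conversion(events, "signup_page_view", "signup_submit")
```

Here two of the three users who viewed the signup page went on to submit it, so the conversion rate is two thirds. A real-time store lets a team ask the same questions over live data instead of a fixed list.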
Q. Last question — what’s your one piece of advice for companies just getting started or looking to get started on their real-time data journey?
Real-time analytics is about processing data as soon as it's available, which means you need to take several things into consideration.
For instance, if your data pipelines handle millions of events per second, you need scalable and reliable computing and storage infrastructure to process those events and, eventually, make sense of them.
When it comes to real-time analytics, there will be certain complexities involved, but at the same time, technologies today are getting very cheap, and there are lots of managed services out there to make streaming infrastructure viable for simpler use cases too.
But you do need to identify those use cases first.
If all one needs is to populate a dashboard on a daily basis, it could easily be done with a simple ETL job and a data warehouse/lake.
However, if the use cases involve anomaly detection, real-time recommendations, and real-time dashboards, one needs to carefully plan the storage, computing, and analytics infrastructure needed.
So, to summarize, know your use cases well, measure the complexity, and have a budget in place if you really want to benefit from real-time analytics.
If you'd like to dig deeper into streaming and real-time analytics, check out Dunith’s posts on Medium.