An introduction to Druid, your Interactive Analytics at (big) Scale
This is an introduction to Apache Druid, a real-time database to power modern analytics applications. It is probably somewhat outdated now.
Originally published on Medium in 2018.
I discovered Druid (http://druid.io) approximately two years ago.
I was working at SuperAwesome at the time, and we needed a solution to replace our existing reporting system, based on MongoDB, which was showing its age. Our MongoDB implementation didn’t scale well due to the high cardinality of the data, and the storage cost made us think it wasn’t the best tool for the job.
At the time, we were handling approximately 100 million events per day, and some of our reports were taking 30 seconds to generate. We currently handle billions of events per day, and the reporting takes less than a second most of the time.
We migrated the data we had in MongoDB. It was using approximately 60GB of disk space, and once indexed inside Druid, the same data represented only 600MB. Yep, 100x less storage!
This post will explain what Druid is, why you should care, give a high-level overview of how it works, and share some information on how to get started and achieve sub-second query times!
Druid? Is that a video-game?
No. But it doesn’t help that it’s a class in WoW; it makes it somewhat difficult to find resources from time to time.
To describe Druid, I’m just going to quote their website:
Apache Druid (incubating) is a high performance analytics data store for event-driven data. Druid’s core design combines ideas from OLAP/analytic databases, timeseries databases, and search systems to create a unified system for operational analytics.
If I had to describe it in my own words:
Druid is a real-time columnar timeseries database on steroids that scales veryyyyy well.
There are obviously other databases available to store timeseries data, but comparing them is not the goal of this post. I only want to introduce you to this one, as I have experience with it, unlike the others. (I tried them quickly at the time, but I’m definitely not comfortable comparing or discussing them.)
Who is using Druid? This is the first time I’ve heard of it.
Druid is used by quite a few big players in the tech market. The Druid website maintains a list of companies using Druid in their architecture.
A non-exhaustive list of them:
- Metamarkets: They created Druid to power their programmatic reporting.
- Netflix: To monitor their infrastructure. They ingest ~2TB/hour.
- Airbnb: To get rapid and interactive insights about their users.
- Optimizely: They use Druid to power the results dashboard for Optimizely Personalization.
- Walmart: They wrote a good post about their Event Analytics Stream.
- A lot of others…
When should I consider using Druid?
You should consider using Druid if:
- You have timeseries data to store
- Your data has somewhat high cardinality
- You need to be able to query this data fast
- You want to support streaming data
A few examples of good use cases:
- Digital marketing (ads data)
- User analytics and behaviour in your products
- APM (application performance management)
- OLAP and business intelligence
- IoT and devices metrics
A quick note about how your data is stored
Your data is stored in segments. Segments are immutable: once they are created, you cannot update them. (You can create a new version of a segment, but that implies re-indexing all the data for the period.)
You can configure how you want those segments to be created (one per day, one per hour, one per month, …). You can also define the granularity of the data inside the segments. If you know that you need the data per hour, you can configure your segments to roll up the data automatically.
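As an illustration, here is a minimal sketch of the granularity portion of an ingestion spec, expressed as a Python dict (the daily segments and hourly roll-up are just example values, not a recommendation):

```python
# Sketch of the granularity portion of a Druid ingestion spec.
# Daily segments with hourly roll-up are only an example.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # one segment created per day
    "queryGranularity": "HOUR",    # rows are rolled up to hourly buckets
    "rollup": True,                # pre-aggregate rows sharing the same timestamp + dimensions
}
```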
Inside a segment, the data is organized into a timestamp, dimensions, and metrics:
- Timestamp: the timestamp (rolled-up or not)
- Dimension: A dimension is used to slice or filter the data. A few examples of common dimensions are city, state, country, deviceId, campaignId, …
- Metric: A metric is a counter/aggregation that is computed. A few examples of metrics are clicks, impressions, responseTime, …
Druid supports a variety of aggregations by default, such as first, last, doubleSum, longMax, … There are also custom/experimental aggregations available, such as Approximate Histogram, DataSketch, or your own! You can easily implement your own aggregations as a plugin to Druid.
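To make this more concrete, here is a hedged sketch of how the dimensions and metrics of a hypothetical ad-events datasource could be declared in an ingestion spec, expressed as Python dicts (the column names are invented for the example):

```python
# Rough sketch of the columns of a hypothetical "ad-events" datasource.
# Dimension and metric names (campaignId, clicks, ...) are invented for the example.
dimensions_spec = {
    "dimensions": ["country", "city", "deviceId", "campaignId"],
}

metrics_spec = [
    {"type": "count", "name": "rows"},  # number of raw rows behind each rolled-up row
    {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
    {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
    {"type": "doubleSum", "name": "responseTime", "fieldName": "responseTime"},
]
```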
You can read more about how the data is stored inside segments in the Druid documentation: http://druid.io/docs/latest/design/segments.html
How does it work under the hood?
Every Druid installation is a cluster that requires multiple components to run. The Druid cluster can run on a single machine (great for development), or be fully distributed across a few to hundreds of machines.
Let’s first start with the external dependencies required by Druid:
- Metadata storage: A SQL-powered database, such as PostgreSQL or MySQL. It is used to store the information about the segments, some loading rules, and some task information. Derby can be used for development.
- Zookeeper: ZooKeeper is required for communication between the different components of the Druid architecture. It is used by certain types of nodes to transmit their state and other information to the others.
- Deep storage: The deep storage is used to save all the segment files for long-term storage. Multiple storage backends are supported, such as S3, HDFS, a local mount, … Some of them are available natively whilst others require the installation of an extension.
Let’s now have a look at the different node types running in a Druid cluster:
- Historical: These nodes load part or all of the segments available in your cluster. They are then responsible for responding to any query made to those segments. They do not accept any writes.
- Middle manager: They are responsible for indexing your data, whether it is streamed or batch ingested. While a segment is being indexed, they are also able to respond to queries on those segments until the hand-off to a historical node is done.
- Broker: This is the query interface. It processes queries from clients and dispatches them to the historical and middle manager nodes hosting the relevant segments. In the end, it merges the results back before sending them to the clients.
- Coordinator: Coordinators manage the state of the cluster. They notify historical nodes, via ZooKeeper, when segments need to be loaded, or when segments should be rebalanced across the cluster.
- Overlord: It is responsible for managing all the indexing tasks. It coordinates the middle managers and ensures the publishing of the data. Note: there is now a way to run the overlord in the same process as the coordinator.
- Router (optional): a kind of API gateway in front of the overlord, broker, and coordinator. As you can query those directly, I don’t really see much need for it.
The real-time indexing on the middle managers most often runs with Kafka, but other firehoses (RabbitMQ, RocketMQ, …) are available as extensions.
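To give an idea of how streaming ingestion is wired up, here is a hedged sketch of submitting a Kafka supervisor spec to the overlord with Python. The hostnames, ports, topic, and column names are assumptions for illustration; the spec combines the granularity, dimensions, and metrics pieces sketched earlier.

```python
import requests  # third-party HTTP client: pip install requests

# Hedged sketch of a Kafka ingestion (supervisor) spec for a 2018-era Druid.
# Hostnames, ports, topic, and column names are assumptions for illustration.
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "ad-events",
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["country", "deviceId", "campaignId"]},
            },
        },
        "metricsSpec": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        ],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "rollup": True,
        },
    },
    "tuningConfig": {"type": "kafka"},
    "ioConfig": {
        "topic": "ad-events",
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        "taskCount": 1,
    },
}

# The overlord exposes an HTTP API to create or update a supervisor.
resp = requests.post("http://overlord-host:8090/druid/indexer/v1/supervisor",
                     json=supervisor_spec)
resp.raise_for_status()
print(resp.json())  # should echo back the supervisor id
```

Once the supervisor is running, the middle managers consume the Kafka topic, build segments, and hand them off to the historical nodes as described above.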
What happens when you run a query?
Let’s now have a look at what happens when a query is sent to the broker.
The query will contain information about the interval (period of time), the dimensions and the metrics required.
1) The query hits the broker. The broker knows where the relevant segments for the requested interval are (e.g. two segments are needed from historical node A, two from historical node B, and it also needs the data from the segments currently being indexed on middle manager A).
2) The query is sent to all the required nodes (in our case historical A, historical B, and middle manager A).
3) Each of those nodes performs the requested aggregations, slices the data according to the query, and sends the result back to the broker.
4) The data is then merged in the broker, depending on the query, and returned to the client.
As the query interface is the same for the broker, the middle managers, and the historical nodes (you can send a query directly to a historical node if you want; you may just not get all the data you expect back), it is really easy to debug your segments or test a single historical node. The broker sends the same queries, simply changing the requested interval to get only the data it needs from each of the other nodes.
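For illustration, here is a hedged sketch of a native timeseries query sent to the broker over HTTP with Python (the broker host, datasource, and column names are assumptions):

```python
import requests  # pip install requests

# Hedged sketch of a native Druid timeseries query sent to the broker.
# Broker host/port, datasource, and column names are assumptions.
query = {
    "queryType": "timeseries",
    "dataSource": "ad-events",
    "intervals": ["2018-01-01/2018-02-01"],  # the period of time to query
    "granularity": "day",                    # bucket the results per day
    "filter": {"type": "selector", "dimension": "country", "value": "GB"},
    "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
    ],
}

resp = requests.post("http://broker-host:8082/druid/v2", json=query)
resp.raise_for_status()
for row in resp.json():
    # each row looks like {"timestamp": "...", "result": {"clicks": ..., "impressions": ...}}
    print(row["timestamp"], row["result"])
```

Pointing the same request at a historical node instead of the broker is exactly the debugging trick mentioned above.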
Limitations
Even the best databases have limitations. It’s life.
A few that I discovered while working with Druid over the last two years:
- No window functions, such as rolling averages. You will have to implement them yourself within your API (see the sketch after this list).
- It is not possible to join data. But if you really have this use case, you’re probably doing something wrong.
- You will probably need some kind of API in front of it, just to remap your IDs to user-readable information. As the database is mostly append-only, I would not store the value of something, only a reference (campaign ID instead of campaign name, unless that data is also read-only in your database). There are ways to do this directly in Druid, but I haven’t tried them yet.
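As an example of the first limitation, a rolling average has to be computed client-side on top of the timeseries results. A minimal sketch, assuming rows shaped like the native timeseries response shown earlier:

```python
from collections import deque

def rolling_average(rows, metric, window=7):
    """Client-side rolling average over Druid timeseries rows.

    `rows` is assumed to be the JSON result of a timeseries query:
    [{"timestamp": "...", "result": {metric: value, ...}}, ...]
    """
    last = deque(maxlen=window)
    out = []
    for row in rows:
        last.append(row["result"][metric])
        out.append({"timestamp": row["timestamp"],
                    metric + "_avg": sum(last) / len(last)})
    return out
```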
A note about performance and infrastructure management
Let’s be honest, Druid is “quite a beast”. Depending on the amount of data you have, it may require a pretty big infrastructure to maintain a sub-second query time.
You will also need to play with the configuration of each process (heap, CPU, caching, threads, …) once you start having more data.
And that’s where Druid falls a bit short in my mind. There is no easy tooling available yet (well, there is https://imply.io/product with their cloud offering, but I haven’t tried it yet) to configure and maintain your different servers. You will probably need to set up your own tooling to automate everything, with Chef, Ansible, Puppet, Kubernetes, …
At SuperAwesome, we decided to use Kubernetes in combination with Helm to automate as much as possible of our Druid deployment. If you like solving that kind of problems, SuperAwesome is hiring Full-Stack Engineers and DevOps Engineers!
Where can I try and learn about Druid?
I am currently writing and editing a video course about Druid. It will be available here. Make sure to check out the plan and sign up to be notified when it goes live!
You can also subscribe on my blog to be notified about my future posts, as I will probably write more about Druid beyond this post, and also share updates regarding the Druid course.
In the meantime, some other interesting links to get started:
- Druid QuickStart: http://druid.io/docs/latest/tutorials/index.html
- List of talks/papers about Druid: http://druid.io/docs/latest/misc/papers-and-talks.html
- Imply distribution of Druid: https://docs.imply.io/on-prem/quickstart