Why we built SlicingDice?

Posted by Rodrigo Mestres on Dec 11, 2018 10:35:40 AM

In Analytics, Big Data, Data Warehouse, Database, Uncategorized, Serverless

Disclaimer: This is a long post containing the detailed story of why and how we decided to build SlicingDice. If you are looking for something shorter, read our team and history page, but if you have time for a brave and crazy story, join us for this read.

We all know that developing a database is one of the craziest things to do nowadays (although highly talented developers who have done it before recommend the experience). Not only are there plenty of well-established solutions on the market, but it is also really challenging, technically speaking.

Anyway, as we mention on our history page, it wasn't our intention to build our own database in the first place, particularly knowing that Brazil offers neither the necessary resources nor the environment to tackle such a task. However, as the saying goes: necessity is the mother of invention.

The mother called DMP

In 2012, SlicingDice’s founding company, Simbiose Ventures, was building and investing in a DMP (Data Management Platform) to compete against well-known venture-backed companies, such as BlueKai (now Oracle), Krux (now Salesforce), Aggregate Knowledge (now Neustar) and others.

A DMP is used by companies (either advertisers or publishers, in the digital marketing scenario) to collect and aggregate data about their users from multiple sources. This data can be used to understand users better and segment them into “personas”, creating engagement and generating results. Whether the data originates from “online sources”, such as web page accesses and JavaScript snippets, from “offline sources”, such as CRM and billing systems, or from 3rd party data providers, like AddThis and Experian, all of it must be transparently ready for analysis on the DMP.

The concept seems cool and simple, but in fact the technical challenge is huge, because the DMP needs to constantly collect, process and update huge amounts of data from hundreds of sources while allowing its customers to slice and dice the data in order to create user segments. The main problem is that DMP customers want to find out very quickly how many users exist in the segment they are trying to create. Since a segment is composed of dozens of queries, each one must be executed in less than a second, otherwise creating segments would take hours!

In summary, our task was to collect billions of data points daily from hundreds of sources, process them in near real time and store this data in a way that let DMP customers explore it quickly. Customers had to be able to run sub-second queries on top of hundreds of billions of rows, using thousands of columns, and gain insights that would be used immediately in their marketing campaigns.

Did you say billions? You bet we did. By year-end 2015, our DMP customers and partners, including companies like FIAT, Terra Networks (Telefonica’s Hispanic web portal), Ambev (part of AB InBev) and AddThis, had collected more than 200 billion data points (such as web page accesses and billing history) from more than 550 million unique users globally, almost 20% of the internet population at that time.

The challenges

Although difficult to solve, our technical challenges were simple to describe:

— The time-series and non-time-series query challenge

A fact we soon learned was that most queries performed on a DMP combine time-series and non-time-series data in a single query. For example, a typical query issued by a client would look like this: “How many of my male subscribers (non-time-series data) accessed the page “/news” more than 2 times in the past 30 days (time-series data)?”
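To make the shape of such a query concrete, here is a minimal Python sketch over toy in-memory data. The names (`profiles`, `events`, `count_matching_users`) and the sample records are hypothetical, made up for illustration only; this is not how the DMP or S1Search actually executes queries.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical toy data: non-time-series attributes, one record per user.
profiles = {
    "u1": {"gender": "male", "subscriber": True},
    "u2": {"gender": "female", "subscriber": True},
    "u3": {"gender": "male", "subscriber": True},
}

now = datetime(2018, 12, 11)

# Hypothetical toy data: time-series events as (user_id, page, timestamp).
events = [
    ("u1", "/news", now - timedelta(days=1)),
    ("u1", "/news", now - timedelta(days=5)),
    ("u1", "/news", now - timedelta(days=20)),
    ("u2", "/news", now - timedelta(days=2)),
    ("u3", "/news", now - timedelta(days=3)),
    ("u3", "/news", now - timedelta(days=45)),  # outside the 30-day window
    ("u3", "/news", now - timedelta(days=50)),  # outside the 30-day window
]


def count_matching_users(profiles, events, page, min_hits, days, now):
    """Count male subscribers with more than `min_hits` hits on `page` in the window."""
    window_start = now - timedelta(days=days)
    # Time-series part: per-user hit counts inside the window.
    hits = Counter(
        user for user, p, ts in events if p == page and ts >= window_start
    )
    # Non-time-series part: filter by profile, then merge with the counts.
    return sum(
        1
        for user, attrs in profiles.items()
        if attrs["gender"] == "male"
        and attrs["subscriber"]
        and hits[user] > min_hits
    )


result = count_matching_users(profiles, events, "/news", 2, 30, now)
```

Even in this toy form, both datasets have to be touched and merged per user to answer a single question.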

This made our data modeling job really difficult, as we needed a way to merge non-time-series and time-series data in order to compute the final query result. To understand why this is a problem, consider that most databases are optimized to hold either time-series or non-time-series data, and that these optimizations lose their effect when both types of data are mixed. For instance, in our experience with ElasticSearch, we found that one way of combining non-time-series and time-series data was to replicate the former into the latter while creating one table (or ElasticSearch index) per day. Although this allowed us to run the queries we needed, it dramatically increased storage costs and became infeasible as the number of fields and users grew.

— The unknown query challenge

Due to the nature of a DMP (it is, to some extent, very similar to a Business Intelligence system), you never know in advance what kind of queries your customers are going to make. After all, its purpose is to allow customers to find the hidden patterns in their data and use them to their advantage.

How can we support all kinds of queries without having hundreds of servers and still expect sub-second responses?

It’s not feasible to compute all the possible combinations of columns and values in advance (like Business Intelligence systems normally do when creating cubes) because this would require the power of supercomputers given the huge amount of data we were dealing with.
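A rough back-of-the-envelope sketch of why precomputation explodes: the snippet below only counts how many distinct column combinations a cube would have to cover, ignoring the values inside each column. The function name and the column counts (10 and 1,000) and the limit of 3 columns per combination are illustrative assumptions, not figures from our DMP.

```python
from math import comb


def column_subsets(num_columns, max_dims):
    """Number of ways to pick between 1 and max_dims columns to aggregate on."""
    return sum(comb(num_columns, k) for k in range(1, max_dims + 1))


# Illustrative assumption: combinations of at most 3 columns.
small = column_subsets(10, 3)     # 10 columns
large = column_subsets(1000, 3)   # 1,000 columns

print(small, large)
```

Going from 10 to 1,000 columns takes the count from 175 to more than 166 million combinations, before even multiplying by the number of distinct values per column.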

— The fast query challenge

So if we can’t know beforehand what will be queried, then how can we query billions of rows and make lots of JOINs, even using boolean conditions on the query, and still get sub-second responses?

Query example: How many users accessed the pages (“/sports” AND “/cars”) OR (“/sports” AND “/football”) more than 2 times in the past 30 days AND are NOT existing customers, but have a “good” Experian credit score?

If you consider that “/sports”, “/cars” and “/football” correspond to just a few of the 1,337,000 pages from the same customer, which were accessed more than 3.2 billion times per month in total by more than 105 million different users, you realize that you would need to query a “table” containing more than 3.2 billion entries, then JOIN the result with the customer’s CRM table, containing more than 5 million entries, before finally JOINing the result with the Experian data table, which contains more than 250 million entries. Don’t forget that this operation must run with sub-second response time on a few dozen commodity hardware servers for a better cost-benefit ratio. An in-memory SAP HANA infrastructure, for instance, would probably cost millions of dollars in this setup.
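The logic of the query example above can be sketched with plain Python sets. Everything here is a hypothetical toy: the data, the names (`page_hits`, `frequent_visitors`), the assumption that AND binds tighter than OR, and the assumption that “more than 2 times” applies to each page individually. It shows the boolean structure only, not the way S1Search evaluates it at scale.

```python
# Hypothetical toy data: page -> {user_id: hits in the past 30 days}.
page_hits = {
    "/sports": {"u1": 3, "u2": 5, "u3": 1, "u4": 4},
    "/cars": {"u1": 4, "u3": 6},
    "/football": {"u2": 3, "u4": 7},
}
# Stands in for the customer's CRM table.
existing_customers = {"u2"}
# Stands in for the Experian data table.
credit_score = {"u1": "good", "u2": "good", "u3": "good", "u4": "poor"}


def frequent_visitors(page, min_hits=2):
    """Users who accessed `page` more than `min_hits` times."""
    return {user for user, hits in page_hits.get(page, {}).items() if hits > min_hits}


sports = frequent_visitors("/sports")
cars = frequent_visitors("/cars")
football = frequent_visitors("/football")
good_credit = {user for user, score in credit_score.items() if score == "good"}

# ("/sports" AND "/cars") OR ("/sports" AND "/football"),
# AND NOT an existing customer, AND "good" credit score.
segment = ((sports & cars) | (sports & football)) - existing_customers
segment &= good_credit
segment_size = len(segment)
```

With three tiny dictionaries this is trivial; the challenge described above is getting the same answer in under a second when each of these sets holds millions of users.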

— The bootstrap startup challenge

Unlike all our DMP competitors, which together received more than $150 million in funding, we were a fully bootstrapped startup from Brazil. That means we couldn’t afford to buy hundreds of servers to process our queries or pay hundreds of thousands of dollars in licensing fees for well-known high-performance solutions like HP Vertica. Our infrastructure costs had to be really low from day one, as we didn’t have investors’ money to burn until we reached profitability.

These are some of the main challenges, and they are why we say we didn’t create the SlicingDice core database technology (S1Search) because we thought it would be a great market to attack, competing against really well-funded venture-backed companies. We built it simply because we needed it to store our own data and run our queries.

We tried (almost) everything

When we started building the DMP, it was crystal clear to us that we would not be able to use “normal” SQL databases or standard data warehouse solutions, even with clustering and sharding techniques. We knew they would deliver neither fast enough query results nor great data insertion throughput.

Unfortunately, NoSQL databases were not an option either, since you have to know in advance what your data looks like and how you want to query it in order to get the promised performance. As we discussed above, we had no idea what kind of queries our customers would be making on top of their data.

Our biggest insight during that time was to approach this problem the same way search engines do: by indexing everything. We started using Solr for the task, but soon realized that ElasticSearch was more mature in terms of parent-child support.

Basically, our data model was a Parent/Child relation. Parent documents were where we stored the user’s profile and non-time-series data, such as gender or age. In Children documents we stored all the user’s time-series data, such as web page accesses, purchase history, etc. This model was suggested to us by great ElasticSearch committers, like Clinton Gormley.
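As a rough illustration of this Parent/Child shape, here is a sketch using plain Python dataclasses. It is not ElasticSearch mapping syntax: in ElasticSearch, parents and children are separate documents linked by a parent ID, while here the children are simply nested for brevity. The class and field names (`ParentUser`, `ChildEvent`, `kind`, `value`) and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChildEvent:
    """Time-series data point attached to a user (e.g. a page access)."""
    kind: str
    value: str
    day: date


@dataclass
class ParentUser:
    """User profile holding non-time-series attributes such as gender or age."""
    user_id: str
    attributes: dict
    children: list = field(default_factory=list)


# Hypothetical sample user with two time-series events.
user = ParentUser("u1", {"gender": "male", "age": 34})
user.children.append(ChildEvent("page_access", "/news", date(2015, 3, 1)))
user.children.append(ChildEvent("purchase", "order-1001", date(2015, 3, 2)))

# Time-series lookups go through the children; profile filters use the parent.
page_accesses = [c for c in user.children if c.kind == "page_access"]
```

The appeal of the model is that profile data is stored once per user instead of being copied into every event.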

This Parent/Child data model worked really well for us for a long time, until we reached more than 300 million users (Parent docs) and more than 100 billion user data points (Children docs). At this stage we had almost 100 powerful ElasticSearch servers (all SSD backed) and a few thousand shards. Even so, our count queries were taking more than 10 seconds to complete and simple aggregations frequently crashed the entire cluster due to out-of-memory errors.

Because of this constant instability, we ended up having some members of our small team completely dedicated to monitoring and managing ElasticSearch, something that we clearly weren’t comfortable doing. Our DMP customers paid us to create more DMP features and advanced capabilities, not to assign more people to simply keep the underlying technology live and stable.

At this point we started wondering if “search engines” were really the best approach to our needs, so we began testing all sorts of databases and technologies.

We tested Google BigQuery and it was great; however, response times could range from a few seconds to minutes, and its pricing was proportional to the storage size and the number of queries executed. We didn’t want to use a technology that would make us think about how to limit our customers’ ability to make more queries because of the costs. Also, our internal commitment was that every query must have sub-second response times, not seconds or minutes.

We tested Amazon Redshift and it seemed promising for some amount of data, but costs became increasingly high as the volume of data grew. On top of that, it was basically like managing a cluster of PostgreSQL on steroids in the cloud, still requiring dedicated members of our team to manage it like we did with ElasticSearch. It turned out that managing a Redshift cluster could be even more complicated than managing ElasticSearch, due to the Redshift vacuum requirements and many other things.

We tested MemSQL, but due to the amount of data we collected and stored, we would have needed a few dozen MemSQL nodes just to start, and its licensing price is not cheap. Also, being a relational database meant we would need a lot of workarounds every time we attached a new data source or ran periodic jobs that would clearly overload the cluster.

Side note: Although we ended up not using MemSQL, the level of support provided by their team was simply amazing (especially from Geddes Munson and Nikita Shamgunov, MemSQL founder) and inspired us to do the same at SlicingDice.

We tested HP Vertica and it was incredibly perfect for our needs, but once we received their commercial terms, we understood how much perfection would cost.

After the HP Vertica engagement, for obvious reasons we didn’t even consider testing SAP HANA or Oracle Exadata. To this day we wonder what their commercial terms would have looked like… maybe 80% of our company shares.

They did not know it was impossible so they did it

It was July 2015, and after spending more than 3 months trying to find the silver bullet for our needs, Eduardo Manrique, one of our top (and craziest) engineers at that time, suggested that we should try to create our own database, from the ground up, specifically to support our needs and nothing else.

Our entire technical team had 15 people, managing more than 350 servers, 15 billion new data points per month and dozens of feature requests from our big brand customers. Looking back, this was a clearly impossible scenario in which to build a database from scratch. Not enough resources, not enough time, not enough knowledge, just a strong necessity.

Even so, Eduardo took up the challenge and, along with one more senior engineer, built the first version of S1Search in less than 90 days. To be clear, it was not a prototype: it was a version that we used to replace a 100-node ElasticSearch production cluster from one week to the next.

When developing S1Search, we agreed on just two basic rules:

  1. S1Search had to run on low memory and commodity hardware servers using SSDs;
  2. Because of the high SSD costs, S1Search should use at least 50% less storage space than ElasticSearch, while still maintaining all query capabilities we needed.

Once we finished migrating everything from ElasticSearch to S1Search, our cluster went from 100 ElasticSearch servers (around 136 TB of data) to 15 S1Search servers (20 TB of data), without data loss.
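For a quick sanity check of these figures against rule 2, the arithmetic can be written out as below. The server counts and storage sizes are the ones quoted above; the variable names are just for illustration.

```python
# Figures quoted in the text.
es_servers, es_storage_tb = 100, 136   # ElasticSearch cluster
s1_servers, s1_storage_tb = 15, 20     # S1Search cluster

storage_reduction = 1 - s1_storage_tb / es_storage_tb
server_reduction = 1 - s1_servers / es_servers

# Rule 2 asked for at least 50% less storage than ElasticSearch.
meets_storage_rule = storage_reduction >= 0.50

print(f"storage: {storage_reduction:.0%} less, servers: {server_reduction:.0%} less")
```

Both reductions come out at roughly 85%, well beyond the 50% storage target.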

In summary, the infrastructure cost reduction paid for the S1Search development in the first month.

Of course we had many bugs and some system instability that made Eduardo and other members of the team work late nights, but at that point we had full control of the system. We were the committers and we knew what we had to do, a totally different situation from using ElasticSearch, even with the amazing ElasticSearch Gold Support we paid for at that time in order to have direct access to their committers.

Moving forward and expanding

Everybody internally was so impressed with our achievement that we decided we would actually double down on S1Search, not only to maintain it but to really expand it.

In November 2015 we decided to hire a bunch of amazing young developers straight out of college to help us refactor and enhance S1Search, and also to work on a project called SlicingDice, a platform where we could offer S1Search as a service.

Again we were a bit crazy: how could we hire young and inexperienced developers for their first job and expect them to review and enhance a project that really senior developers had a hard time building? But this kind of decision is totally aligned with the company culture we have, as we love to build our own people from the ground up.

Turns out this was one of the best business and technical decisions we ever made! The new team didn’t have a clue about database development, so they were totally open to thinking about and approaching problems differently, without any preconceptions. For new folks like Juarez Aires and Rafael Telles, it was totally normal and cool to read more than 20 scientific papers about data compression over a weekend in order to decide which technique we would implement in S1Search, something senior developers would almost beg not to do.

Once we had this new amazing and excited team in house, our idea was simple: why not provide other companies and developers with something we had always searched for and never found?

As we did with S1Search in the beginning, we established two basic rules for SlicingDice development:

  1. The platform would need to simply work, like magic. Other developers would not need to face the challenges we faced before building S1Search;
  2. We know what it is like to be bootstrapped and have limited money, so it would have to be seriously cheap and the pricing model would have to be simple and predictable.

Looking at SlicingDice now, a Serverless Data Warehouse and Analytics Database as a Service, we believe it satisfies these two rules we established initially, as the platform is totally and truly “as a service” and the pricing is really cheap compared to all competitors, besides being fully predictable.

And the first slice of cake goes to…

At the time of this writing, although we already have several private-beta customers, Simbiose Ventures companies are still the biggest SlicingDice customers, storing more than 500 million entities (users) and indexing around 600 million data points per day.

We think this situation is about to change, as we believe SlicingDice has a future that might look brighter than that of the DMP itself, which uses it under the hood.

Final remarks

That’s the true story behind the S1Search development and how SlicingDice came into existence.

Although the DMP business continues to be significant inside Simbiose Ventures, all the early feedback received about SlicingDice indicates that we are on the right track and that this business has the potential to be huge. Hopefully other companies will enjoy and leverage it the same way we do internally.

We also plan to describe in depth, in future blog posts, how S1Search works under the hood and all the data compression techniques we are using to support the unlimited storage offering we have on SlicingDice.

Please feel free to contact us if you have any other questions about the creation of SlicingDice or the technical challenges we faced when evaluating all the databases we mention in this post. We still have all the details on each one of them.

Have you tried SlicingDice?

You don’t need to create an account, input a credit card or pay beforehand just to get the feeling of how SlicingDice works.

Play with SlicingDice using the online demo on our website, or go deeper using demo API keys, with the 15-minute quickstart guide as a starting point.

Feedback and advice are ALWAYS welcome! We still have a lot to learn and improve!