SlicingDice Uncovered — Part 1 (Infrastructure)

Posted by rodrigomestres on Dec 13, 2018 5:53:26 AM

This is the first post of the SlicingDice Uncovered series, where we intend to give an in-depth overview of everything that powers SlicingDice: our infrastructure details (covered in this post), our in-house developed database (S1Search), how we handle integrity and scalability tests, and much more. (Update: there is also a third, unexpected post describing S1Search internals in depth.)

If at this point you don’t have a clue what SlicingDice is or how and why it was created, this blog post provides a good introduction.

Okay, let’s start.

First of all, yes, we are developers like you. When we find a new, unknown company like ours creating and managing a notoriously painful service such as a Database as a Service, while at the same time making bold promises of unlimited storage like we do, we also feel like this GIF.

That being said, this post aims to provide clarity on our infrastructure and processing flows, hoping that, by knowing how we handle your data and queries, you will become more comfortable with SlicingDice.


Not IF it will fail, but WHEN it will fail

We all know that technology fails at some point for many reasons; however, there are two things that are totally unacceptable for us: losing our customers’ data or allowing unauthorized access to it.

With this in mind, since SlicingDice’s inception we have worked hard to have no single point of failure that could affect our ability to store and protect the data.

Let's get into the details of everything we do to protect your data and also provide our data warehouse service.

Datacenter redundancy

We currently have 3 completely independent data centers from different providers in different countries that operate simultaneously in a high-availability configuration. That means that two data centers can fail and our service will continue to support data insertion and querying.

We know that cloud is cool, but for the kind of service we provide and the performance and stability we expect from servers, cloud ends up being expensive and, most of the time, delivering unpredictable performance and stability. With these limitations, cloud computing does not fit our needs and we don’t use it. Everything on SlicingDice runs on bare metal dedicated servers provided by well-known infrastructure providers such as OVH, Online.net and Hetzner.

Data security and confidentiality

As we developed our own database technology from the ground up, all data is stored in our own encoded format, making it harder for an attacker to get access to the original format of the stored data.

The only way for us to face a data breach would be a full compromise of our infrastructure, one that gave the attacker access to both the data and the source code of our processing engine in order to reverse engineer the data encoding logic. Other than that, as we store the data in a hashed and encoded binary format, no one would be able to guess what is inside our files.

From a server perspective, we strictly follow recommended approaches to infrastructure hardening and sensitive information management. One example of such an approach is that we don’t allow any SSH access to our servers; all servers are accessed exclusively using KVM over IP provided by our infrastructure partners. This is a radical approach, we know, but we take security very seriously.

Although a data breach would not have a meaningful effect on our customers, as our data store is fully encoded, we clearly understand that security is the foundation of a Database as a Service offering.

Data redundancy and availability

As we use bare metal dedicated servers for cost and performance reasons, S1Search server nodes can fail at any time, so it’s absolutely necessary for us to have data redundancy and availability.

We currently achieve a high level of redundancy and availability by:

  1. Replicating our customers’ data at least 3 times across datacenters;
  2. Making hourly backups and storing them on our local backup servers;
  3. Storing a full daily copy of our backed-up data on a remote backup service.

Besides having all these redundancy measures, we also constantly perform unexpected actions and shutdowns on our production environment, similar to the Netflix Chaos Monkey approach, in order to test the resiliency of our services.


Components and services that power SlicingDice

The SlicingDice service is made possible by the combination of all the services listed below:

  • S1Search is our in-house developed analytics database technology. More details about the S1Search motivation and history can be found here.
  • SlicingDice API is a Python-based API developed to handle all SlicingDice service orchestration, such as database and column creation, data insertion, querying, etc.
  • Apache Kafka plays a really important role in the SlicingDice infrastructure, acting as a message broker and buffer for all data received by the API.
  • Aerospike is an amazing key-value NoSQL store and is used internally to store much of the information the API needs with blazing fast response times. Examples of such information include valid databases and API tokens, created columns and much more (a minimal lookup sketch follows this list).

Side Note: This is kind of ironic, a database developer using another database under the hood, but in fact our experience with Aerospike from the DMP era has always been amazing, so we believe there is no reason to reinvent the wheel just to satisfy our ego.

  • Redis is used to cache all queries performed on SlicingDice and the results of queries created on our saved queries API endpoint.
  • MySQL is where we store our customers’ basic information, the databases and columns they created, and also all of SlicingDice’s user permissions and access groups.
  • Apache ZooKeeper is a distributed tool used to reliably manage shared configuration. ZooKeeper is used extensively by Kafka and also by S1Search to store metadata about important configurations, such as columns.
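To make the role of Aerospike and MySQL more concrete, here is a minimal, hypothetical sketch of an API-token validation served from an Aerospike hot cache that a background job keeps warm from MySQL. The namespace, set and bin names are illustrative assumptions, not our actual schema.

```python
# Hypothetical sketch: answering API-token validations from an Aerospike hot
# cache, refreshed by a background job. Namespace/set/bin names are made up.
import aerospike
from aerospike import exception as aero_ex

client = aerospike.client({"hosts": [("aerospike-dc1.internal", 3000)]}).connect()

def is_token_valid(token: str) -> bool:
    """Request-path check: only the key-value store is consulted here."""
    key = ("api_cache", "tokens", token)  # (namespace, set, primary key)
    try:
        _, _, bins = client.get(key)
        return bool(bins.get("valid"))
    except aero_ex.RecordNotFound:
        return False

def refresh_token_cache(tokens_from_mysql) -> None:
    """Background job: push the currently valid tokens (read from MySQL,
    the source of truth) into Aerospike so request-path lookups stay fast."""
    for token in tokens_from_mysql:
        client.put(("api_cache", "tokens", token), {"valid": True}, meta={"ttl": 3600})
```

The point is simply that request-path validations never wait on MySQL; they hit the key-value store, while MySQL remains the source of truth.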

Services redundancy and availability

All the services above run with redundancy in our three datacenters, and some of them are kept in sync all the time (like MySQL and Aerospike, for example).

This means, for example, that we have three independent Kafka clusters, each supporting requests from the API servers within the same datacenter, but also from remote datacenters in the event of a service failure in one of them.

That is it. By carefully orchestrating the services above we are able to provide our customers a full Data Warehouse and Analytics Database as a Service. Without diminishing the importance of the other services, the real star here is S1Search, without which it would not have been possible to create SlicingDice.


Other external services we use on SlicingDice

We also use a few external services and tools to keep SlicingDice up and running:

  • CloudFlare is used for DNS, load balancing and DDoS protection.
  • Pingdom is focused on API and internal services availability monitoring.
  • Runscope has an important role in API regression testing and performance monitoring.
  • Librato is used for server and service monitoring.
  • Backblaze B2 is the solution we chose for the remote backup of all S1Search customer data. It’s much cheaper than AWS S3 and works like a charm (see the sketch after this list).
  • Stripe is the platform we use for billing management. Again, for security reasons we don’t store any payment information from our customers locally.
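As an illustration of that remote backup step, below is a minimal sketch of shipping a daily backup archive to Backblaze B2 with the b2sdk Python library. The bucket name, file path and credential handling are hypothetical placeholders, and the real pipeline obviously involves more than a single upload call.

```python
# Hypothetical sketch: shipping a daily backup archive to Backblaze B2.
# Bucket name, paths and credentials below are placeholders, not our setup.
from b2sdk.v2 import InMemoryAccountInfo, B2Api

def upload_daily_backup(key_id: str, app_key: str, archive_path: str) -> None:
    info = InMemoryAccountInfo()
    api = B2Api(info)
    api.authorize_account("production", key_id, app_key)
    bucket = api.get_bucket_by_name("s1search-remote-backups")  # hypothetical bucket
    # Mirror the local archive into the bucket, keeping the date in the name.
    bucket.upload_local_file(
        local_file=archive_path,
        file_name="daily/" + archive_path.rsplit("/", 1)[-1],
    )

# Example usage (credentials would come from a secrets store, not source code):
# upload_daily_backup("KEY_ID", "APP_KEY", "/backups/s1search-2018-12-13.tar.gz")
```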

Fun fact: although we are a customer of the companies above, all of them have the potential to become SlicingDice customers too, as they need to store time-series data for analytics purposes in order to provide their services.


Putting it all together — Data Insertion and Querying

Now let’s make sense of all the services we’ve presented separately and show how our data insertion and querying flows work:

Data Insertion Flow

  1. All data insertion requests first go through the CloudFlare DNS service, where we have a load balancer distributing requests among API servers in our three datacenters. The CloudFlare service also filters most of the malicious traffic and DDoS attacks, so most of the requests our API servers receive are legitimate and correct.
  2. Once an insertion request reaches an API server, it is validated in aspects such as API key permission and validity, column and Entity ID existence, and much more.

    In order to perform all these validations, the API server needs to securely connect to other services we use, such as Aerospike and MySQL.

    API connections to MySQL for the purpose of updating the API hot cache are handled by a background process, so they don't affect API performance.
  3. If the API key used in the request is valid, the API service proceeds to the next verification steps, which are basically the identification or creation of an Entity ID, the validation of the columns' existence and of the values used in those columns.

    If the Entity ID used in the insertion request already exists on SlicingDice, the API service, using the S1Search client, will identify the shard number based on the S1Search Entity ID and send the message to the Kafka topic corresponding to that shard (see the sketch after this list). Later, one of the S1Search servers reads the messages from the topic and finally stores the data. On the other hand, if the Entity ID has never been inserted on SlicingDice, the S1Search client will request a new Entity ID from one of the S1Search servers, so the API service can add it to the insertion message and the process continues as described above.

    Notice that, from an S1Search perspective, the Entity ID used to insert the data on the S1Search server is the Entity ID generated by S1Search, not the Entity ID sent by the customer in the API request. You will understand the reason we do this in our next blog post, when we discuss S1Search in more detail.

    The final step before sending a message to a Kafka topic is to include a record identification in that message, so the S1Search server can verify whether that message was already processed in the event of a server failure. Don’t worry, this record identification will also be discussed in detail in our post about S1Search.
  4. Once the insertion message has been sent by the API server to one of the Kafka topics corresponding to an S1Search shard, it is the responsibility of one of the S1Search servers in the cluster to read the message from the Kafka topic and store it.

    As S1Search uses a sharding technique to distribute its data, each S1Search node is responsible for a group of shards. Each node therefore reads the messages from the Kafka topics corresponding to the shards it is hosting at that time, since shards can also move between S1Search nodes when necessary.
  5. Finally, an S1Search node receives the message to be processed and stored in one of the shards it hosts. This S1Search node then replicates the same message to another node for data redundancy and availability, as explained earlier in this post.

    Exactly as with the first insertion message, instead of sending the replica message directly to the node that hosts the replica shard, the primary S1Search node sends the message to the Kafka topic of the replica shard. This is later consumed by the S1Search node hosting that shard, along with all the other insertion messages sent directly by the API.
  6. Although this process may seem slow, it usually takes just a few milliseconds from the API receiving the request until the data is actually processed and stored on an S1Search node.
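To make steps 3 to 5 more concrete, here is a minimal, hypothetical sketch of the routing and idempotency ideas described above, using the kafka-python client. The topic naming scheme, the modulo-based shard function, the record identification format and the storage helper are illustrative assumptions, not the actual S1Search client code.

```python
# Hypothetical sketch of the insertion path: routing a message to the Kafka
# topic of its shard (API side) and consuming it idempotently (S1Search side).
# Topic names, the shard function and the record id format are assumptions.
import json
import uuid

from kafka import KafkaConsumer, KafkaProducer

NUM_SHARDS = 256  # illustrative shard count

producer = KafkaProducer(
    bootstrap_servers=["kafka-dc1.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def shard_for(s1search_entity_id: int) -> int:
    # Simple modulo routing; the real mapping lives in the S1Search client.
    return s1search_entity_id % NUM_SHARDS

def send_insertion(s1search_entity_id: int, columns: dict) -> None:
    message = {
        "record_id": str(uuid.uuid4()),   # lets S1Search skip duplicates
        "entity_id": s1search_entity_id,  # the S1Search-generated id
        "columns": columns,
    }
    producer.send("s1search-shard-%d" % shard_for(s1search_entity_id), message)

# --- S1Search node side: consume only the shards this node currently hosts ---
def consume_hosted_shards(hosted_shards, already_processed) -> None:
    consumer = KafkaConsumer(
        *["s1search-shard-%d" % s for s in hosted_shards],
        bootstrap_servers=["kafka-dc1.internal:9092"],
        group_id="s1search-node-1",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        record = msg.value
        if record["record_id"] in already_processed:
            continue  # message replayed after a failure; safe to skip
        store_in_shard(record)  # hypothetical placeholder for the storage logic
        already_processed.add(record["record_id"])

def store_in_shard(record: dict) -> None:
    pass  # the actual S1Search storage logic is out of scope for this sketch
```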

Important considerations:

  • As you may have noticed, except for the S1Search Entity ID generation (which only takes place when the API service is trying to insert an Entity ID that was never inserted before), all other steps of the insertion flow are totally asynchronous. This allows the API service to handle huge amounts of requests without problems.
  • Every service used by the API in one data center can be handed over to the same service hosted in another data center, leaving the API with no single point of failure. For example, consider that an API request was forwarded by CloudFlare to an API server in DC1, but the entire Kafka cluster in DC1 is down at that moment. In this case the API server is prepared to use Kafka from another datacenter as a fallback, sending the insertion message to the Kafka cluster in DC2, for example (see the sketch below).
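A minimal sketch of that kind of fallback, assuming the kafka-python client and one producer per datacenter, could look like the following. The addresses, timeout and cluster ordering are illustrative, not our actual failover logic.

```python
# Hypothetical sketch: send to the local Kafka cluster first and fall back to
# a remote datacenter's cluster if the local send fails. Values are made up.
import json

from kafka import KafkaProducer
from kafka.errors import KafkaError

CLUSTERS = [
    ["kafka-dc1.internal:9092"],  # local cluster, tried first
    ["kafka-dc2.internal:9092"],  # remote fallbacks
    ["kafka-dc3.internal:9092"],
]

producers = [
    KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for servers in CLUSTERS
]

def send_with_fallback(topic: str, message: dict) -> None:
    last_error = None
    for producer in producers:
        try:
            # Block briefly to confirm the broker acknowledged the message.
            producer.send(topic, message).get(timeout=5)
            return
        except KafkaError as exc:
            last_error = exc  # try the next datacenter
    raise RuntimeError("all Kafka clusters unavailable") from last_error
```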

Data Querying Flow

Now that you have the data stored, it’s time to see the fast query magic.

  1. The initial step is basically the same as for data insertion. All API requests go through the CloudFlare DNS service and are load balanced among the API servers spread across our three data centers.

    Once a query request reaches an API server, it is validated in aspects such as API key permission and validity and column existence.
  2. If the API key used in the request is valid, the API service proceeds to the next verification steps, such as validating whether the key has permission to query the referred databases and the columns used in the query. Data permission is very important to us, and SlicingDice allows its customers to easily define fine-grained querying permissions.
  3. Assuming everything in the query request was valid and all permissions were granted, the next step is where the magic really happens.

    The SlicingDice API server issues query requests in parallel to each of the S1Search nodes and asynchronously starts collecting the responses with the individual results (see the sketch after this list).
     
    As you will better understand in our in-depth post about S1Search, each S1Search server is completely independent from the others. Since the data is distributed among all servers, for each query request received the API must query all S1Search nodes independently and in parallel.

    The flow described above assumes a query type that doesn’t require pagination; for a query that does, like a data extraction query, the process would be a bit different.
  4. In the event that one of the S1Search nodes is unavailable, unresponsive, or responding slowly for some reason, the API server has two fallback options:
     
    a) Reissue the query to the S1Search node in another datacenter that holds the replica shards of the affected node;
     
    b) Perform an approximation and infer the result of the missing node based on the results received from the other nodes. In fact, this approach works really well on SlicingDice because the data distribution among S1Search nodes is fairly balanced, so the approximation has a very small margin of error.
  5. Finally, once the API server has consolidated the responses from all S1Search servers, it prepares the response to be sent back to the user that made the request.

    Simultaneously, the result of the query is automatically cached in Redis for the next 5 seconds. So unless the user includes a bypass-cache parameter in subsequent requests for the same query, the result will be served from the cache during those 5 seconds.
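To tie steps 3 to 5 together, here is a minimal, hypothetical sketch: fan the query out to all S1Search nodes in parallel, scale the partial results up when a node doesn't answer in time, and cache the consolidated result in Redis for 5 seconds. The node list, the per-node query call, the scaling rule and the cache key format are illustrative assumptions; the approximation shown only makes sense for additive results such as counts.

```python
# Hypothetical sketch: parallel fan-out to all S1Search nodes, approximation
# when a node fails to answer, and a 5-second Redis cache for the result.
# Node addresses, the per-node query call and the cache key format are made up.
import json
from concurrent.futures import ThreadPoolExecutor

import redis

S1SEARCH_NODES = ["s1-node-1.internal", "s1-node-2.internal", "s1-node-3.internal"]
cache = redis.Redis(host="redis-dc1.internal", port=6379)

def query_node(node: str, query: dict) -> int:
    # Placeholder for the real per-node S1Search query (e.g. a count query).
    raise NotImplementedError

def run_count_query(query: dict) -> int:
    cache_key = "query:" + json.dumps(query, sort_keys=True)
    cached = cache.get(cache_key)
    if cached is not None:
        return int(cached)  # served from the cache (less than 5 seconds old)

    partial_results = []
    with ThreadPoolExecutor(max_workers=len(S1SEARCH_NODES)) as pool:
        futures = [pool.submit(query_node, node, query) for node in S1SEARCH_NODES]
        for future in futures:
            try:
                partial_results.append(future.result(timeout=2))
            except Exception:
                pass  # node down or too slow; handled below

    answered = len(partial_results)
    if answered == 0:
        raise RuntimeError("no S1Search node answered")

    total = sum(partial_results)
    if answered < len(S1SEARCH_NODES):
        # Approximation: data is evenly spread, so scale up for missing nodes.
        total = int(total * len(S1SEARCH_NODES) / answered)

    cache.setex(cache_key, 5, total)  # keep the consolidated result for 5 seconds
    return total
```

In this sketch, a request with the bypass-cache parameter would simply skip the initial cache lookup and go straight to the fan-out.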

Of course, the data insertion and querying processes could be broken down into a more detailed, step-by-step description, but we believe this high-level overview is a great starting point for the next posts in the series about how S1Search works under the hood.


Being prepared for a huge demand

As explained in the previous post about why we built SlicingDice, at the time of this writing Simbiose Ventures companies are still the biggest individual SlicingDice customer, storing more than 500 million entities (users) and inserting around 600 million data points per day.

Our private-beta customers together have already stored around 650 million entities and are sending at least 500 million insertion requests to our API per day.

Even so, as unlimited data storage is one of SlicingDice's main competitive advantages, we have to be prepared to handle data insertion volumes in the tens of billions per day. Because of that, our currently deployed infrastructure is ready to support up to 90,000,000,000 (90 billion) insertion requests per day, which is roughly equivalent to 1,000,000 (1 million) requests per second (90 billion ÷ 86,400 seconds ≈ 1.04 million).

The way we implemented and deployed our infrastructure allows us to increase our capacity in a few minutes by simply adding more API servers, Kafka and S1Search nodes to the existing clusters.


Conclusion

This post was the first step in a journey of getting to know how SlicingDice works under the hood and how our main data insertion and querying flows work.

We hope that after reading this post you’ve come to the conclusion that the SlicingDice infrastructure is ready to support your data insertion and querying needs, no matter the volume you might have.

In the next blog post we will dig deeper into what S1Search is and how it works under the hood.

Please feel free to contact us if you have any specific questions about the SlicingDice infrastructure or about how we use the services mentioned in this post to provide the data warehouse service.


Have you tried SlicingDice?

Do you feel more confident now? Give us a chance… You don’t need to create an account, enter a credit card or pay beforehand just to get a feel for how SlicingDice works.

Play with SlicingDice in our website's online demo or go deeper with the demo API keys by following the 15-minute quickstart guide as a starting point.

Feedback and advice are ALWAYS welcome! We still have a lot to learn and improve!