This is the final part of the SlicingDice Uncovered series. The first post was about the SlicingDice infrastructure, the second covered how S1Search works, and that one ended up spawning an unexpected third post with in-depth details about the S1Search implementation.
In this final post we will cover all the processes and tools we use to make sure SlicingDice and S1Search maintain strong data integrity and scale as much as necessary, while preserving all the expected consistency.
Challenges when testing a database
There are many well-known challenges when building a consistent database, and testing is certainly one of the main ones. Below we present some of the challenges we faced.
Integrity and Consistency
Although speed is really important for a database dealing with a huge amount of data like ours, nothing is more important than data integrity and consistency. It doesn’t really matter how fast the database can answer queries if data is missing or corrupted and queries are answered with the wrong numbers.
In fact, for a service like SlicingDice and an analytics database like S1Search, where the main purpose of queries is to analyze the data, wrong or incomplete answers can lead to wrong business decisions, which can end up being really expensive and damaging.
Speed and Scalability
On the other hand, if you have an incredible, battle-tested, consistent database that can’t handle a huge volume of insertions or queries, or can’t increase its capacity from one moment to the next, it doesn’t make much sense either. You will end up with big infrastructure costs and people dedicated to managing it, not to mention software licensing fees, depending on the solution you choose.
Our case was even more critical, as we had no clue what kind of volume to expect, and we have bold public commitments, like giving a 10-cent discount for each query that takes more than 10 seconds to complete.
After launching, SlicingDice’s growth could be slow or incredibly fast. Thus we had to know in advance, before the platform went public, how it would behave under a huge volume and workload.
Creating a database to test a database
SlicingDice’s and S1Search’s code coverage is higher than 98%, and we take that very seriously in our development process. However, we all know that unit tests and code coverage alone won’t get rid of integrity and consistency issues, especially in a complex, parallel, concurrent, multithreaded system like a database.
We had to have a way to feel confident about what we were building, so we decided to take a radical approach: build a database testing framework to be used as the source of truth when validating our system.
Remember that S1Search was built to perform analytical queries, so we don’t know in advance what our customers’ queries will look like. For example:
- How many columns they would use in a query.
- What combination of column types they would use in the same query.
- How the system would behave if they made multiple boolean operations on top of multiple time-series columns, while also combining non-time-series columns.
So we decided to build a database testing framework: basically a simpler, lighter version of the S1Search database that could generate testing data and also store it for comparison purposes.
This database testing framework works like this:
- You define the types of columns you want to test, how many different values should be inserted (whether they will actually be used in queries or just be there to stress the system) and, finally, how many Entity IDs the generated data should be inserted for.
- For each column type you defined, the database testing framework will first generate all the data and send it to be inserted into S1Search, also storing a copy of the generated data for later comparison.
- Once all the data has been inserted into S1Search, the framework automatically generates all the possible combinations of supported queries based on the columns you declared to be tested.
- These queries are then issued to S1Search and the obtained results are compared to the expected results computed from the data stored in the test database.
- For an S1Search version to be declared ready for production, it must be tested with all the existing column types and supported query operations. If a single query fails with a difference of even a single ID, we reject the version until we fix the problem.
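The loop above can be sketched in a few lines. This is a minimal illustration, not the real S1Search framework: the helper names (`generate_rows`, `expected_result`, `run_test`) and the query shape (single equality filters) are assumptions for the sketch, while the real framework covers many column types and query operations.

```python
def generate_rows(column_types, matched, garbage, entity_ids):
    """Generate deterministic test data: one mapping of column -> value per Entity ID."""
    rows = {}
    for entity_id in range(entity_ids):
        rows[entity_id] = {
            col: f"{col}-value-{entity_id % (matched + garbage)}"
            for col in column_types
        }
    return rows

def expected_result(rows, column, value):
    """The framework's own copy of the data is the source of truth."""
    return {eid for eid, cols in rows.items() if cols[column] == value}

def run_test(rows, query_fn):
    """Issue every generated (column, value) query and compare results.

    query_fn stands in for the call to the database under test; in the
    real framework it would hit S1Search.
    """
    failures = []
    columns = next(iter(rows.values())).keys()
    for column in columns:
        values = {cols[column] for cols in rows.values()}
        for value in sorted(values):
            expected = expected_result(rows, column, value)
            obtained = set(query_fn(column, value))
            if obtained != expected:  # even a single differing ID fails the run
                failures.append((column, value))
    return failures
```

A version is approved only when `run_test` comes back with no failures; a single mismatched ID is enough to reject it.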
Some numbers from this database testing framework:
Consider this test configuration below:
Entity IDs: 1,000
Matched Values: 1,000
Garbage Values: 1,000
Column Types: All (we currently have 11 column types)
Query Types: All (we currently have 11 query types)
Days: 7 (distributing the generated data in 7 different days, as this affects the time-series queries)
The test configuration above results in:
3,646,986 unique insertion messages sent to S1Search (520,998 messages per day)
45,696 unique queries, each expecting a different result (6,528 queries per day)
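As a quick sanity check, the 7-day totals above are just the per-day figures multiplied out:

```python
# Per-day figures quoted above, spread across the 7 test days.
days = 7
insertions_per_day = 520_998
queries_per_day = 6_528

total_insertions = days * insertions_per_day
total_queries = days * queries_per_day

print(total_insertions)  # 3646986
print(total_queries)     # 45696
```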
Here is an example of the test output after running insertion for 1 day:
========== Insertion Statistics ==========
INFO: Quantity of insertion commands: 520998
INFO: Quantity of columns inserted: 4164994
INFO: Quantity of columns per type:
string_test_column: 440000
time_series_decimal_test_column: 494998
time_series_string_test_column_2: 16000
boolean_test_column: 456000
decimal_not_overwrite_test_column: 4000
time_series_decimal_test_column_2: 16000
time_series_numeric_test_column: 494998
bitmap_test_column: 120000
numeric_not_overwrite_test_column: 4000
numeric_test_column: 482000
string_not_overwrite_test_column: 4000
time_series_string_test_column: 464998
decimal_test_column: 258000
range_test_column: 456000
uniqueid_test_column: 208000
date_not_overwrite_test_column: 4000
date_test_column: 222000
time_series_numeric_test_column_2: 16000
bitmap_not_overwrite_test_column: 4000
We insert data and run queries for multiple days because between them we also test other things that could affect consistency, such as: restarting the server, moving shards between nodes, killing the process unsafely (kill -9) and so on.
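A rough sketch of how such a fault-injecting campaign could be orchestrated. The disruptive actions mirror the ones described above (server restart, shard moves, unsafe kills), but the command names are hypothetical stand-ins, not the real S1Search tooling:

```python
import random

# Hypothetical fault-injection commands; in a real run each would be
# executed with subprocess.run(action, check=False).
FAULT_ACTIONS = [
    ["systemctl", "restart", "s1search"],          # restart the server
    ["s1search-admin", "move-shard", "--random"],  # move a shard between nodes
    ["pkill", "-9", "-f", "s1search"],             # kill the process unsafely
]

def pick_fault(rng=random):
    """Choose one disruptive action to run between two test days."""
    return rng.choice(FAULT_ACTIONS)

def run_campaign(days, insert_day, query_day, run_action):
    """Insert and query for each day, injecting a fault in between,
    so consistency is checked after every disruption."""
    for day in range(days):
        insert_day(day)
        run_action(pick_fault())
        query_day(day)
```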
We believe creating the database testing framework was one of the best technical decisions we ever made. Although it took a long time to build it and make it stable and trustworthy, it saved us hundreds of hours during development and, more importantly, gave us the confidence we needed to create a platform like SlicingDice.
Making it scale
Cool, now we had a consistent database on our hands, but testing with 1,000 Entity IDs is a completely different thing from testing with 100,000,000 Entity IDs.
In fact, we learned that some problems only happened when we had a huge amount of data stored. This kind of learning is indispensable when building a platform like SlicingDice, where we expect each of our customers to have millions of entities and billions of data points stored.
The first version of the database testing framework was perfect for a developer running it locally, but for scalability testing purposes we needed more. We again did our homework and added all the capabilities necessary for the database testing framework to run in a distributed environment with multiple servers.
It took us a few weeks to accomplish, but once it was done we could test S1Search with really heavy workloads. For instance, our tests comprised 16 billion insertion requests, distributed across 90 days and more than 90,000,000 Entity IDs, running on a few test servers.
This enhancement was very important not just for the consistency tests; we also ended up using it to generate load while profiling the code.
Replicating what we have learned to SlicingDice
Having learned the benefits of building the database testing framework for S1Search, we decided to apply the same concept while developing SlicingDice. Hence we developed an entire testing framework to replicate all the kinds of tests we had implemented for S1Search, but now pointing them directly at the SlicingDice API.
Besides all the unit tests and great code coverage, SlicingDice nowadays also has a test framework that can perform, at large scale, all the types of data insertion and querying supported by the platform.
This is the framework we use to check the SlicingDice API’s consistency and integrity, and also to run load tests to make sure our systems can handle our customers’ load.
We have also added part of the insertion data and query examples used to check the API to all our SlicingDice clients, as you can see in their respective GitHub repositories.
Improving the development process
Our final step to make integrity and consistency a high standard across all of SlicingDice’s processes was to add both testing tools, S1Search’s database testing framework and SlicingDice’s testing framework, to CircleCI, our continuous integration tool.
Nowadays every single developer commit gets automatically tested by these frameworks. That allows us to be confident and move faster to support our customers’ needs, even though our team is small compared to the big players.
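As an illustration, a CI setup along these lines might look like the excerpt below. This is a hypothetical sketch: the job names, make targets and structure are assumptions, not SlicingDice’s actual CircleCI configuration.

```yaml
# Hypothetical .circleci/config.yml excerpt (illustrative names only)
version: 2
jobs:
  build:
    steps:
      - checkout
      - run: make unit-tests            # unit tests + coverage check
      - run: make s1search-db-test      # S1Search database testing framework
      - run: make slicingdice-api-test  # SlicingDice API testing framework
```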
We learned to love tests and we keep adding more as we develop new features, data types and so on. Beyond that, one of the main things we wish to do in the near future is to integrate S1Search, our database testing framework and a command-line code profiling tool all together.
This would allow all our code to be constantly and automatically profiled on every commit, so we can find bottlenecks or optimization opportunities faster and easier.
No test is better than a great developer
Although we do our best to find all possible problems and bugs before they reach our customers, sometimes we fail.
If you find a bug in any part of SlicingDice, please let us know and we will do our best to solve it quickly and compensate you somehow.
Have you tried SlicingDice?
Do you feel more confident about SlicingDice’s consistency? Give us a chance… You don’t need to create an account, enter a credit card or pay beforehand just to get a feel for how SlicingDice works.
Feedback and advice are ALWAYS welcome! We still have a lot to learn and improve!