ScyllaDB vs Cassandra: towards a new myth?

Disclaimer: all the tests described in this article were performed on ScyllaDB 0.10 and might not be relevant for recent versions. For a more up-to-date description, go to the official website http://www.scylladb.com/

On September 22nd, 2015, a community of developers announced that they had designed and released a new database management system described as the fastest in the world. This system, named ScyllaDB, is part of the NoSQL world, whose ambitions are:

  • Design scalable systems by distributing the workload and the storage over multiple machines.
  • Design fault-tolerant systems.
  • Provide higher throughput and larger storage with lower latencies.

In this very competitive environment, ScyllaDB has an interesting characteristic: all its structures and mechanisms are copied from a very popular database, Cassandra. The main announced difference: ScyllaDB is written in C++ whereas Cassandra is written in Java.

Using the same structures lets users use either of these two database managers interchangeably in their cluster. Cassandra developers don’t need to rethink their data structures and can apply them directly to ScyllaDB. The learning curve is reduced to a minimum for regular Cassandra users.

The first tests made by the ScyllaDB team report throughput 10 to 20 times higher than Cassandra’s. They thus present ScyllaDB as a direct competitor.

All this just by moving from Java to C++?

Not only. The advantage of writing the application in C++ is that it reduces CPU usage by avoiding the overhead of running inside a JVM. Scylla also provides a custom network stack that minimizes resource usage by bypassing the Linux kernel. No system call is required to complete a network request: everything happens in userspace, limiting the number of expensive switches to kernel space.

Another advantage of C++ is the ability to have finer, though more complex, memory management. In Java, the garbage collector regularly scans the allocated memory to release unused space. This step is extremely costly in processor cycles and can stop the application for up to several seconds on large heaps, a pause also known as “stop the world”.

Although garbage collection times tend to decrease with new, more efficient algorithms, it is currently recommended not to allocate more than 8 GB of memory to the JVM when running Cassandra, even if your machine has over 100 GB of memory (something more and more common as the cost of RAM decreases). ScyllaDB claims to make full use of the hardware resources of your machines. In particular, whereas Cassandra relies primarily on the caching offered by the operating system, Scylla has its own cache system that compresses and stores recently read data in memory.

Our tests

Following these attractive claims on paper, we wanted to get to the bottom of things and check for ourselves whether the facts would match the theory. Scylla’s ambitions may sound too good to be true: a 10-fold faster database without changing anything in hardware or code, as claimed on the homepage of scylladb.com (see image below).

scylla black friday

scylladb.com: according to their website, it would be possible to stop Cassandra and start Scylla without any problem on any machine of the cluster, and thereby multiply our throughput by 10

We ran some tests on Amazon Web Services EC2, following the guidance we found on the ScyllaDB website. It provides an image (AMI) to load on an EC2 instance, on which Scylla is already fully installed on Fedora 22.

First test

We set up three clusters:

  • A 3-node ScyllaDB cluster
  • A 3-node Cassandra cluster
  • 12 shooter nodes that stress the two clusters by sending requests for a short period of time

All Scylla and Cassandra nodes run on the same instance type: m3.xlarge.

All shooter nodes also share a single instance type: c3.8xlarge.

The ScyllaDB cassandra-stress tool is used to benchmark the clusters. We got the following results, described in more detail in the table below:

  • 215,000 writes/second on ScyllaDB
  • 113,000 writes/second on Cassandra

scylla vs cassandra

First comment: Scylla clearly outperforms Cassandra on write throughput and write latency.

Some cautions, however:

  • Scylla runs on Fedora and Cassandra runs on Ubuntu
  • We used pre-configured AWS AMIs without modifying them
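To put the first-test figures in perspective, the ratio between the two write throughputs can be computed directly. The sketch below uses only the two numbers reported above; the helper name `speedup` is ours and purely illustrative.

```python
# Write throughput reported in the first test (operations per second).
scylla_writes_per_s = 215_000
cassandra_writes_per_s = 113_000


def speedup(candidate: float, baseline: float) -> float:
    """Return how many times higher `candidate` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return candidate / baseline


ratio = speedup(scylla_writes_per_s, cassandra_writes_per_s)
print(f"Scylla / Cassandra write throughput: {ratio:.2f}x")  # prints 1.90x
```

On these numbers the gain is a little under a factor of 2, which is well below the factor of 10 announced on the Scylla homepage.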

Second test

We decided to set these results aside and run new tests. This time:

  • Scylla and Cassandra run on the same machines, on Fedora and with a 10 Gbit network.
  • We now have 17 shooters
  • We used default configurations

The goal of this second test is to push Scylla to its limit to check the maximum throughput it can handle.

And the result was surprising: Scylla could not handle the throughput and only returned timeouts to the shooters, which means that the load was not absorbed. When we reduced the number of shooters, we got back a high throughput without any timeout. Cassandra, by contrast, has no problem under high load: when Cassandra is overloaded, the shooters’ throughput automatically decreases (because queries are synchronous) without any timeout. In other words, Cassandra successfully applies back-pressure to its clients. In Scylla’s case, we risk losing tuples when we query it at a throughput above its limit.
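The self-throttling effect of synchronous queries can be illustrated with a toy model. This is only a sketch under stated assumptions: the capacity, latency and client counts are hypothetical, the two functions are ours, and nothing here reproduces the internals of Scylla or Cassandra. The first function models synchronous clients with Little's law (clients in flight = throughput × latency); the second models a system that keeps accepting requests faster than it can complete them.

```python
def closed_loop(clients: int, capacity: float, base_latency: float):
    """Synchronous clients: each one waits for its reply before sending again.

    Returns (throughput in ops/s, latency in seconds). The offered load can
    never exceed what the server completes, so latency grows instead.
    """
    unloaded = clients / base_latency    # throughput if the server never saturates
    throughput = min(unloaded, capacity)
    latency = clients / throughput       # Little's law: N = X * R
    return throughput, latency


def open_loop(offered: float, capacity: float):
    """Requests arrive at a fixed rate, whatever the server manages to finish.

    Returns (served ops/s, excess ops/s that will eventually time out).
    """
    served = min(offered, capacity)
    return served, offered - served


# Hypothetical numbers, for illustration only.
CAPACITY = 100_000.0     # ops/s the cluster can sustain
BASE_LATENCY = 0.001     # seconds per request when the cluster is idle

for n in (50, 400):
    x, r = closed_loop(n, CAPACITY, BASE_LATENCY)
    print(f"{n} synchronous clients: {x:.0f} ops/s, {r * 1000:.1f} ms latency")

served, excess = open_loop(150_000.0, CAPACITY)
print(f"without back-pressure: {served:.0f} ops/s served, {excess:.0f} ops/s in excess")
```

In the closed-loop case, adding clients beyond saturation only raises latency, which matches the behaviour we observed with Cassandra. In the second case the excess has nowhere to go, which is one plausible way to end up with timeouts.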

In retrospect, when having a closer look at the results of the first test (see table above), we also note higher performance disparities on Scylla than on Cassandra:

  • On write latency, there is an 11% standard deviation on Scylla, against 1% on Cassandra
  • On write throughput, there is again an 11% standard deviation on Scylla, against 1% on Cassandra

The standard deviation allows us to characterize the stability of a database management system. In our case, Scylla seems much less stable than Cassandra. This means that, besides the doubts about its ability to cope with high loads, it can be tricky to precisely predict the performance of a Scylla cluster.
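A standard deviation expressed as a percentage is usually the standard deviation divided by the mean (the relative standard deviation). We assume that reading here. The sketch below shows the computation on made-up samples: the two lists are hypothetical and are not the raw measurements behind the table.

```python
from statistics import mean, pstdev


def relative_std(samples):
    """Population standard deviation expressed as a fraction of the mean."""
    m = mean(samples)
    if m == 0:
        raise ValueError("mean is zero, relative deviation is undefined")
    return pstdev(samples) / m


# Hypothetical per-run write throughputs (writes/s), for illustration only.
steady_runs = [112_000, 113_000, 114_000]
spread_runs = [190_000, 215_000, 240_000]

for name, runs in (("steady", steady_runs), ("spread", spread_runs)):
    print(f"{name}: mean {mean(runs):.0f} writes/s, "
          f"relative std {relative_std(runs):.1%}")
```

The larger the relative deviation, the less a single run tells you about what the next run will deliver.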

We should obviously investigate these instabilities further to establish their causes. Moreover, the tests should be run several times on correctly configured machines to qualify as proper benchmarks. Nevertheless, they give us a preview of Scylla’s potential and status.

Beyond those experimental notes, we can notice several things about Scylla:

  • Scylla uses CQL 3.0 (against CQL 3.2 for Cassandra 2.1 and CQL 3.3 for Cassandra 2.2) as its query language. The CQL ALTER statement is not supported yet, and counters have not been implemented either. Because of this, Scylla’s cassandra-stress is slightly different from Cassandra’s cassandra-stress, and the Cassandra version of the tool doesn’t work with Scylla.
  • Scylla uses its own gossip protocol between nodes, which makes it impossible to run a cluster containing both Cassandra and Scylla nodes.
  • Scylla doesn’t support SSL secured communication yet (node to node and node to client).

Conclusion

The conclusion we draw from these experiments is that it is possible to improve on Cassandra’s performance with finer memory management and by saving some processor cycles. Scylla seems to handle a throughput roughly 2 times higher than Cassandra can (215,000 against 113,000 writes/second in our first test; recall that Scylla’s team announces a factor of 10). However, the program is far from Cassandra’s state of maturity, and Scylla currently lacks many essential features. Today, Scylla is in beta, and further tests should be made after its first stable release, planned for January 2016, so that we can assess how serious this new competitor is.

Will Scylla keep its lead in terms of latency once it is more stable and has implemented all of the missing features? Moreover, Cassandra is a proven technology, running on clusters containing more than 1000 nodes (see this presentation). Will Scylla be able to reach such a level of maturity?

Thank you OCTO guys for having contributed to this article.

6 comments on “ScyllaDB vs Cassandra: towards a new myth?”

  • Thanks for kicking the Scylla tires! I'm glad that you find Scylla 2X faster in throughput and latency than Cassandra. Usually we test on much larger machines than m3.xl which only has 4 vcpus and capped networking (no enhanced networking) and disks. In general, we get awesome out-of-the-box performance on physical machines on Softlayer and Rackspace. On Amazon, we recommend the i2 family with good ssd:vcpu ratio. Even there the storage is relatively slow and Scylla doesn't handle it that well. There is an active development around it by Glauber Costa which should be ready for merge in 1-2 weeks. One comment about the latency - It's advised not to test latency any device under 100% load. Better test it under the desired normal operation mode. In your test, take for instance half of the load that Cassandra can handle, around 50k OPS and measure the latency for both databases. About the timeouts with the larger machine, it probably was caused by lack of back-pressure (for slow disks and also for RF=3, CL=1). We have fixes in place, some are part of 0.13 and the rest will follow in 0.14. Alter table and encryption/authentication are being developed while we speak. We're still in beta but do try to move as fast as we can both in terms of closing the gap and in terms of stabilization. Do continue to provide feedback and we on our side will try to be as transparent as possible. Cheers, Dor
  • Hi Dor, Thanks for your interesting comment. We made those tests just to make sure ScyllaDB was a real Cassandra’s competitor in terms of performance. And it actually is. I believe the configurations we used were not suitable to proclaim that this is a serious benchmark : we just aimed to get a feel of Scylla’s position. We keep watching Scylla’s status. Once a stable version will be released, we will probably run new tests on a typical production configuration and try not to be biased by a weak hardware. We’ll also consider your recommendation and maybe run these tests under different workloads. Good luck for Scylla’s development, Thomas
  • Thanks Thomas, the tests are definitely helpful and the second test exposes an issue we have and is partially fixed (on the way for a full fix). We'll release another 1-2 beta versions and then a 1.0, stay tuned :) Thanks for the translation to English, the post receives more audience and better than google translate's version of 'AMI' (friend according to them..)
  • Hi Thomas, Scylla 1.0 is out for some time and we are navigating 1.2. Some serious works has been done to alleviate the AWS SSD timeout issues before releasing 1.0. Best Benoît
  • "Average latency" in isolation is a poor metric for measuring latency; more useful is e.g. 95th percentile and 99th percentile. Thanks
  • Very useful. Hoping to see test results for a more recent version.