Jonathan Ellis from DataStax just posted a critique of one of our benchmarks. Most crucially, he argues that in one of our test scenarios we measured different things for different databases, comparing databases that keep most of their data in memory against Cassandra, which does not. This criticism is misleading, because it in no way changes the hypotheses or conclusions of the study. The methodology of the benchmark was sound, and the results were consistent with what we see in real-world production environments.
Different databases have different architectures
Ellis’s main point is that the test scenario was run using different durability models for different databases. That is true, but this is not so much about the test as about the fact that Cassandra, Couchbase, Aerospike, and MongoDB are architected very differently. This is a pretty complex topic, and we discussed the durability question explicitly in the second part of our study. Fundamentally, these databases work in different ways and are optimized for different things. The trick was to create a baseline that compares them in a useful way.
In our tests, we attempted to measure two things — the highest throughput we could get through a system under any conditions, and the highest throughput assuming a reasonable degree of fault tolerance. Different databases achieve reliability by different means, so we chose a blend of consistency and durability settings that we felt met this definition of “reasonable”.
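To make the distinction concrete, here is a minimal sketch of one such knob on the Cassandra side, using the 2.x-era DataStax Java driver. It is an illustration only, not the YCSB bindings or configuration actually used in the study, and the keyspace, table, and column names are hypothetical. A “fast” run might acknowledge a write once a single replica has it, while a “reliable” run waits for a quorum.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyKnobSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ycsb"); // hypothetical keyspace

        SimpleStatement write = new SimpleStatement(
            "INSERT INTO usertable (y_id, field0) VALUES ('user1', 'value')");

        // "Fast" scenario: acknowledge as soon as one replica has the write.
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        // "Reliable" scenario: wait for a majority of replicas before acknowledging.
        write.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        cluster.close();
    }
}
```

Each of the other databases exposes its own analogue of this setting, which is exactly why picking a comparable blend across four very different systems took so long.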
Choosing that blend is one of those things that sounds simple but isn’t. We put the better part of four months into designing this. Our choices were certainly judgment calls, and one of the reasons we were so transparent about our tests is to invite people to repeat them with different assumptions or models and publish their results. This does not mean, however, that the test design was flawed. To see why, we should look at what the benchmark was and was not trying to measure.
Let’s take a step back to why we would want to run a benchmark in the first place. A benchmark is a synthetic thing, in a controlled environment, using a specialized and artificially designed workload. The only reason to do such a thing is if we hope to learn something about the real world.
Let’s say we have an application that needs to hit the database hundreds of thousands of times per second, with some data loss or inconsistency being acceptable. This is a huge workload, but it is a business need we encounter often across many kinds of applications, including ad serving, profile matching, and session stores. The business may have additional requirements about what happens to this data in the event a node fails, how up to date the data must be, or what kind of hardware strategy to use in scaling (RAM, SSD, rotational disk, etc.). But it does not have explicit low-level technical requirements about how often that data is synced to disk. We’re interested in how to serve the workload, not necessarily what the database does to serve it.
If we create a purely artificial test that attempts to have every database sync to disk at the same intervals, we’ll be able to determine things like which database can write to disk the fastest. But it tells us nothing about how we might solve a business problem in the real world. Indeed, it means forcing the databases to run in non-recommended configurations that would never be deployed in production.
For example, we can tell Couchbase to flush data to disk more frequently than it normally does – as often as every write. And yes, that causes Couchbase to run much slower than Cassandra. But it also completely ignores how Couchbase is meant to be deployed. Couchbase achieves its maximum speed by sacrificing durability, and Cassandra by sacrificing consistency. Our “fast” scenario makes the appropriate assumption for each database. We also test a “reliable” scenario that adjusts for these assumptions and runs with durability and consistency set much higher. (For what it’s worth, Couchbase could make the converse claim: that we were measuring Couchbase in a consistent mode against Cassandra in an inconsistent mode. Also true, and also not particularly relevant to our conclusions.)
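For the curious, here is a similarly hypothetical sketch of that durability knob on the Couchbase side, using the 1.x Couchbase Java SDK. The node address, bucket, and keys are made up; the point is only to show the difference between the recommended mode (acknowledge once the write is in the managed cache) and the forced-persistence mode described above.

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;

import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class DurabilityKnobSketch {
    public static void main(String[] args) throws Exception {
        List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

        // How Couchbase is normally run: the write is acknowledged once it is in
        // the managed cache; persistence to disk happens asynchronously.
        client.set("user1", 0, "value").get();

        // Forcing durability on every write: block until the active node has
        // persisted the item to disk. This is the non-recommended mode that
        // drags Couchbase's throughput down.
        client.set("user1", 0, "value", PersistTo.MASTER).get();

        client.shutdown();
    }
}
```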
It’s absolutely worth stating that the benchmark in question was requested by Aerospike. It’s also worth stating that Thumbtack designed the test and conducted the measurements independently. Aerospike’s strong numbers reflect not a flaw in the methodology but the fact that scaling K/V storage on SSD is the very thing their database was designed to do. We have seen similar numbers moving highly tuned Cassandra clusters to Aerospike in production. In other words, the benchmark accurately reflects what we see in the real world. Exactly what we would hope for in a benchmark.
Does this mean Cassandra is slow?
Heck no! We are avid users of Cassandra, and have deployed it with great success in production. It is fast, reliable, easy to understand, and has a set of properties that make it incredibly useful for a wide range of problems. We’ve had particular success with time series data and aggregated statistics.
It’s disappointing that these benchmarks can be interpreted to mean far more than they actually test. For serving key-value data out of RAM or SSD at the highest speed, under the kinds of reliability constraints we describe, our numbers are accurate. But we claim nothing more than this. It’s unfortunate that these graphs take on a life of their own.
We’re looking to create other kinds of benchmarks that measure more complex application scenarios. For example, Cassandra becomes more attractive when the size of data is much larger than the amount of RAM. That’s absolutely worth measuring. Not to mention how we might measure all the power that a columnar database provides. We’re big fans of Cassandra, and the last thing we want to do is produce reports that imply otherwise.
Where we could improve
Ellis makes a couple of lesser points that are worth addressing. One is that we didn’t properly isolate the benchmarks with warm-up periods and didn’t manage compactions, thus distorting the results. This is actually false, though we did a poor job documenting it. We ran the workloads in different orders and with different warm-up periods, and took care to capture only steady-state numbers. We also ran with compactions enabled and disabled and saw little difference. However, we didn’t mention any of this explicitly in the paper, and the automated scripts don’t reflect this substantial work. This is something we should have made transparent and reproducible.
The other objection is that for the Cassandra version we used, we should have set populate_io_cache_on_flush to fully optimize for the “fast” (i.e. predominantly in-RAM) scenario. Fair enough. The reason we are transparent in our methodology is to solicit feedback exactly like this. However, there are two reasons to believe that this does not alter our conclusions in any significant way. The first is that Ellis himself describes this as a negligible factor. The second is that we’ve repeated similar tests with newer versions of Cassandra and gotten results that were substantially unchanged.
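For readers who want to experiment with it, the option Ellis refers to is a per-table setting in the Cassandra 1.2 / 2.0 line that can be enabled through CQL3. The snippet below is a sketch using the same DataStax Java driver as above, with a hypothetical keyspace and table name; it is not part of our published scripts.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class IoCacheOptionSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Per-table option in the Cassandra 1.2 / 2.0 line: pre-populate the OS
        // page cache when SSTables are flushed, which helps mostly-in-RAM reads.
        session.execute(
            "ALTER TABLE ycsb.usertable WITH populate_io_cache_on_flush = true");

        cluster.close();
    }
}
```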
As Ellis writes, there are a lot of really bad benchmarks out there. Controlling for speed versus reliability across databases that are architected completely differently is a very difficult task. By releasing the source code and implementation details of our studies, we invite the world to jump in, repeat the work with their own assumptions, and publish their own results. There are a ton of things that can be measured here, and we welcome refinements and other kinds of tests that measure different things. It’s certainly no surprise that Ellis was able to produce different numbers using the same basic framework. We’d love to understand his assumptions and methodology better and see what additional conclusions we can draw from them.
It is also perfectly fair to say that the benchmark being discussed does not reflect the situations where Cassandra shines the most. We completely agree. There are a number of studies we have planned that do more than test how fast a database can serve values out of RAM or off SSD. Doing them correctly takes a large amount of time and money, so the backlog is long. One of the reasons we make the test plan, scripts, and YCSB code public is to encourage others to explore further.
We absolutely welcome vendor feedback on our work. While we reached out to all the vendors for their feedback on our configurations, it’s inevitable that there were specific additional optimizations that could be made to each database. Indeed, other vendors have come forth with recommendations as well. It’s worth mentioning that we’re now working with DataStax much more closely on our current benchmarks than we have in the past. Our goal is to give every vendor the fairest possible shake.
We stand by our benchmark methodology, because we believe it measures exactly what we claim it measures. It also matches our experience in real-world production use, which is after all the whole point.