I interviewed Thumbtack’s CEO, Ben Engber, about the what and why of conducting benchmarks – the discussion was recorded, transcribed, and edited down a little for length.
E: Why does Thumbtack do benchmark testing?
B: First, there’s a whole new class of databases emerging, all designed differently and meant for very different things, and the only real way to understand how they’d actually behave, short of putting them in production, is benchmarking, which gives you a realistic view of what to expect. Second, because we’re focused on highly scalable development, it’s in our interest to have as deep a knowledge as we possibly can of all the products in the marketplace.
E: What are the key challenges in benchmarking these NoSQL databases?
B: The challenge in benchmarking any database is to make what is essentially an artificial configuration of machines, under very artificially constrained circumstances, behave the way it would in the wild. It’s obviously going to behave differently, but there are assumptions and configurations you can make to generate meaningful approximations. Second, NoSQL is such a broad term, and in our recent benchmark in particular we have three databases architected in three completely different ways, so a lot of judgment calls have to come into play in order to create a meaningful baseline. We discuss what we’re doing and the assumptions we’re making with each vendor, but there are always going to be questions about how each database handles consistency, how each handles durability, how they replicate, and so forth. We try to pick something that reflects not so much technical parity, because they simply operate in different ways, but parity along a concrete business use case.
E: How do sponsorships come into play?
B: These benchmarks sound easy to do, but doing them properly is very time-consuming and very expensive. For this reason we do some of our own benchmarking, but we also run benchmarks that are sponsored by database vendors. When that happens, we are very explicit about maintaining full control over the design and execution of the tests, and we do everything we can to make the tests fair. However, you have to be realistic: a vendor is not going to ask us to benchmark something they’re not good at. So there is a selection bias in the tests. The tests are fair and accurate, but they reflect use cases that a particular vendor wants to showcase.
E: Thumbtack works closely with all these vendors – how do these benchmarks impact these relationships?
B: Ha. It’s hard, but listen: we publish these benchmarks to bring useful, actionable knowledge to the community, and I believe they do. Everyone acknowledges that if Thumbtack does not remain neutral and objective, these tests serve no purpose. However, it’s important to remember that each benchmark measures very specific things. The benchmark we just announced measures how quickly you can conceivably store keys and values in a database when you have enough RAM to hold the vast majority of your data. That’s a perfectly reasonable assumption to make, because there are plenty of use cases that do this; a session store is a perfect example of where you might want something like this. However, it’s not necessarily the best use case for Cassandra or for MongoDB, and, of course, they want to be represented on the things they are best at. It’s incredibly important that we be precise about exactly what we’re measuring and what we’re not, because not only do we have close relationships with these companies, we sincerely believe each has great value to offer.
E: MongoDB did not come out strongly in this latest benchmark. If you were to design a test that played to MongoDB’s strengths what would that test be?
B: I wouldn’t go that far. We had MongoDB generating approximately 60,000 operations per second on a small cluster with modest hardware. That’s a lot of traffic, just not as much as Couchbase. It’s worth repeating what this benchmark tests: how fast you can write pieces of data to the system when reliability and durability are not your highest concerns. That is a use case that Couchbase is, frankly, very, very good at; it’s designed to handle these things in really enormous volumes. MongoDB, honestly, doesn’t claim to be the fastest database at this sort of thing. We use MongoDB often, and where it really shines is as a very flexible database with a very rich set of features that makes developing powerful applications very easy. One thing we’ve had great success with in MongoDB is designing applications that store a wide variety of objects that share certain characteristics and don’t share others. Examples we’ve built include art collections, which contain things ranging from historical letters to statues; metadata about publishers’ content, including writing and video; and real-time analytics in a whole variety of use cases. It’s a powerful database that makes it easy to quickly develop very powerful applications.
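(An aside for readers: Ben’s point about storing objects that share some characteristics and not others is easy to picture in code. The sketch below uses plain Python dicts to stand in for MongoDB documents; the field names and sample data are invented for illustration and are not taken from Thumbtack’s projects.)

```python
# Documents in a MongoDB collection need not share a schema: a letter
# and a statue can live side by side, each carrying only the fields
# that make sense for it. Plain dicts stand in for stored documents.
art_collection = [
    {"title": "Letter to a Patron", "kind": "letter",
     "year": 1612, "transcription": "..."},
    {"title": "Marble Faun", "kind": "statue",
     "material": "marble", "height_cm": 140},
]

def titles_by_kind(docs, kind):
    """Filter on a shared field; unshared fields simply come along."""
    return [d["title"] for d in docs if d["kind"] == kind]
```

With a real MongoDB deployment you would insert these dicts unchanged (e.g. `collection.insert_one(doc)` in pymongo) and query the shared field with `collection.find({"kind": "letter"})`; the flexibility is the same, just server-side.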
E: And Cassandra?
B: There are a number of points to make about Cassandra. One is that Cassandra is designed to scale well when the amount of data exceeds RAM by a large amount, and that is not covered by this test. Another interesting thing about Cassandra is that it’s unusual in being optimized for writes. You can see this in our results, but there are other tests that can show it even more clearly; for writing time-series data, for instance, we found it to be very effective. The other thing that’s very appealing about Cassandra is that you can choose the level of consistency or durability you want on essentially a query-by-query basis, which gives you a very rich model for your applications. And, unlike the other two databases we’re talking about here, Cassandra does not have any specific node that is a master for anything, so it’s a very resilient kind of architecture.
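(For readers unfamiliar with Cassandra’s tunable consistency, here is a minimal sketch of the acknowledgement-counting arithmetic behind the common levels. The level names mirror Cassandra’s terminology, but the function itself is pure illustration; no driver or cluster is involved.)

```python
# How many replica acknowledgements a request waits for at each of
# Cassandra's common consistency levels, given the replication factor.
def acks_required(level, replication_factor):
    if level == "ONE":
        return 1                              # fastest, least durable
    if level == "QUORUM":
        return replication_factor // 2 + 1    # majority of replicas
    if level == "ALL":
        return replication_factor             # slowest, most durable
    raise ValueError(f"unknown consistency level: {level}")

# With 3 replicas, a QUORUM write waits for 2 acknowledgements, so a
# fast-and-risky ONE and a slow-and-safe ALL can coexist per query.
```

In the real DataStax Python driver this choice is attached per statement, e.g. `SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)`, which is what makes the query-by-query tradeoff Ben describes possible.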
E: If you were to do this benchmark without any sponsorship, would you design it differently?
B: The answer to that is no, but I would also say this is not necessarily the benchmark we would be most interested in running. There are a few things we’re really interested in exploring further, like what happens when the data set is much, much larger than RAM; of course, you can push a lot of operations through an in-memory setup, but it’s an expensive scaling strategy.
The second part of the current study, which we haven’t released yet, looks at the scaling properties of these databases as we add more nodes. I think that’s very interesting, too. Another thing we’ll be exploring is secondary index support. All these databases do more than just put a piece of data in and take it out, and a database like MongoDB does a whole lot more; that’s where the richness of the database shows. Designing a test that explores the more sophisticated things you can do with different kinds of databases is extremely difficult, but very interesting to do.
E: How can the community distinguish between all the bad information that’s out there and the reports that are reputable?
B: One good rule of thumb is: does the study provide everything you need to reproduce the tests? If not, the results are probably meaningless. But it’s more complex than that, going back to the fact that these databases are designed very differently. One characteristic of bad benchmarks is that they ignore this discussion altogether and end up running the databases in completely non-comparable configurations. As a case in point, Couchbase is designed to be highly consistent, but with the potential of losing some data if a node goes down; it lets you trade some speed for increased durability. Cassandra, on the other hand, starts out eventually consistent but somewhat more durable than Couchbase, and it allows you to give up a bit of speed to gain consistency on a case-by-case basis. So the tradeoffs being made here are along different dimensions, often more complex than the CAP theorem indicates. Every time we set up a test, we need to decide the right place to be on that speed-versus-reliability spectrum, and there’s a lot of legitimate debate you can have about that.
E: How do these benchmarks fit in with Thumbtack’s overall business?
B: These benchmarks are a very small piece of our business. We don’t do sponsored benchmarks to earn profit; in fact, I don’t think we’ve ever earned a profit on any of them. We view these benchmarks as central to our strategy of where we want to take our business, and central to being a thought leader in NoSQL and other scalable storage platforms. Our clients look to us for deep knowledge of emerging technologies, so these studies are part and parcel of the research we’re always involved in anyway.
E: Right. Good. Thanks Ben for taking the time to do this.