Anton Yazovskiy, Solutions Architect
July 9, 2014. This post has been edited from its original version to better articulate Thumbtack’s position: The recent YCSB benchmark test we performed provides critical data for selecting a NoSQL database. We absolutely stand by the preliminary results that were released and by the value the benchmark provides to the community. We assume and expect our audience to be sophisticated enough to understand that when we say other criteria come into play when evaluating solutions, it does not diminish the obvious value of the benchmark.
Here at Thumbtack, we recently released preliminary results for a benchmark which tests how a few major NoSQL databases scale workloads that fit primarily in memory. This is a critical piece of information that’s broadly applicable. However, there are other kinds of value, not easily measured in benchmarks, that are worth considering for particular use cases.
My colleagues and I are strong supporters of MongoDB, its community, and the infrastructure around it. The main reason is that it is difficult to beat in time to market.
MongoDB allows us to bring production-quality applications to market very quickly and inexpensively. We are believers in lean innovation and work in a completely agile environment. The core lean principles of producing a minimum viable product and frequent iteration generate serious competitive advantage for our clients whether it’s being first to market, adapting to change faster, incorporating customer learning back into an app easily and often, or pivoting an entire product in a new direction. For me, MongoDB’s well-documented and extensive feature set makes it an excellent choice for lean development for a wide array of cases.
Here are a few features that we’ve found enormously valuable and that MongoDB handles exceptionally well, allowing us to shift our focus away from managing data storage and toward building business functionality.
Document Data Model with Rich Data Types
The mismatch between what business entities are and how they’re stored is one of the common reasons to choose a NoSQL database over a relational one. NoSQL solutions provide various data models that developers can apply in order to reflect business entities and processes. While anything could theoretically be modeled as a pure key/value problem, there is a lot of application overhead in writing a system this way. And while all the major databases offer a variety of secondary indexing options on top of their platform, MongoDB’s options are the most extensive.
MongoDB, like other document databases, provides the ability to create deeply nested (and indexed) subfields within a document. This allows virtually anything to be modeled quickly and efficiently. Having array fields in the data model itself is particularly useful. They can be used for things like embedding comment threads directly within an object, or for quickly implementing a reasonably efficient search engine in scenarios where it doesn’t make sense to incorporate a whole new platform like Lucene. MongoDB is not unique in document functionality, but its support is quite mature, which allows for things like editing those fields in place rather than round-tripping the document between the database and your app.
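As a rough illustration of the points above, here is a minimal Python sketch of a document with nested subfields and an embedded comment array, plus the filter and update documents that address them. The document shape, field names, and helper functions are hypothetical, invented for this example. The `$push` operator and dot notation are standard MongoDB features; with a driver such as pymongo, the resulting dictionaries would be passed to calls like `collection.update_one(...)` and `collection.find(...)`.

```python
# Hypothetical document: an article with a nested author record
# and a comment thread embedded directly as an array field.
post = {
    "_id": "post-1",
    "title": "Why document models help",
    "author": {"name": "alice", "location": {"city": "New York"}},
    "tags": ["nosql", "modeling"],
    "comments": [
        {"user": "bob", "text": "Nice write-up"},
    ],
}


def add_comment_update(user, text):
    """Build an update document that appends one comment to the array.

    The server applies $push in place, so the application does not
    need to fetch the document, modify it, and write it back.
    """
    return {"$push": {"comments": {"user": user, "text": text}}}


def by_city_filter(city):
    """Build a filter on a deeply nested subfield using dot notation."""
    return {"author.location.city": city}


update = add_comment_update("carol", "Agreed")
query = by_city_filter("New York")

# With pymongo (assumed, not shown here), these would be used as:
#   collection.update_one({"_id": "post-1"}, update)
#   collection.find(query)
```

An index on a nested field or on an array field such as `tags` uses the same dot-notation path, which is what makes the simple tag-style search mentioned above practical.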
Natural Support for Polymorphism
Schemaless storage is especially valuable when the business logic involves manipulating many different types of objects with shared characteristics. While most NoSQL databases offer various kinds of schemaless functionality, we’ve found the way MongoDB lets us add indexes to new fields on the fly to be a real boon for rapid software development.
For example, we had a use case where we had over 100 different types of art to model, each with shared characteristics but its own unique set of attributes and behaviors. This is something that object-oriented programming handles very naturally by using polymorphism, but anyone who’s ever tried to map such a thing to a relational database (directly or through an ORM) can tell you how painful this can be to manage. With MongoDB this task was trivial. Not only was there virtually no database overhead in adding new types to the system, but those types could also be queried by indexed custom fields with effectively no new code. We estimate that using MongoDB to code this application reduced our development expense by at least 50%.
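To make the idea concrete, here is a small sketch of how polymorphic objects can map onto documents in a single collection. It is not the code from the project described above: the class names, fields, and the `type` discriminator are assumptions for illustration, and a plain Python list stands in for the collection so the example runs without a database.

```python
from dataclasses import dataclass, asdict


@dataclass
class Artwork:
    """Shared characteristics of every art type (hypothetical fields)."""
    title: str
    artist: str

    def to_document(self):
        # Each subclass serializes its own extra fields automatically;
        # a "type" discriminator records which class the document came from.
        doc = asdict(self)
        doc["type"] = type(self).__name__.lower()
        return doc


@dataclass
class Painting(Artwork):
    medium: str = "oil"


@dataclass
class Sculpture(Artwork):
    material: str = "bronze"
    height_cm: float = 0.0


works = [
    Painting("Sunrise", "A. Example", medium="watercolor"),
    Sculpture("Figure", "B. Example", material="marble", height_cm=180.0),
]

# In-memory stand-in for a collection: documents of different shapes
# sit side by side with no schema change required.
documents = [w.to_document() for w in works]


def find(docs, **criteria):
    """Toy equality query; documents lacking a field simply do not match."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


marble = find(documents, material="marble")
```

Adding a new art type here means adding one subclass. The storage layer needs no change, and a type-specific field such as `material` can be queried the same way as a shared one.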
Seeing it happen in production
As a case in point, we had a client that was building a mobile imaging service somewhat similar to Instagram. They came to us because they wanted to have a large scale rollout, and needed to be sure that it would be able to handle heavy traffic at launch. Based on their requirements, we suggested MongoDB for the backend. But the client expressed reservations based on rumors they’d heard, so a sharded RDBMS solution was proposed instead.
The pace of prototyping was very fast, and in order to keep all the components in line we agreed to do a quick MongoDB-backed reference implementation for each new feature, while the fully architected production version would follow. New features tended to add new fields or restructure data, but because of the schemaless architecture, the reference implementation could be changed to support them in hours, while the backlog on the final backend continued to grow.
After a point, however, it became clear that the “full” implementation was badly lagging the reference implementation and falling further and further behind. Moreover, MongoDB still performed excellently under load. So we simply dropped the “production implementation” and routed all traffic through the reference implementation.
We’ve had great success using MongoDB for real-time analytics for various products. The workflow looks something like this:
accept a write-heavy workload
perform basic aggregation queries on the fly
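The two steps above can be sketched roughly as follows. The event shape and field names are hypothetical. The `pipeline` uses MongoDB's standard `$match` and `$group` aggregation stages and, with a driver such as pymongo, would be passed to `collection.aggregate(pipeline)`. The plain-Python function computes the same counts in memory so the example runs without a database.

```python
from collections import Counter

# Hypothetical event documents, as they might arrive in a write-heavy stream.
events = [
    {"product": "app", "event": "view"},
    {"product": "app", "event": "click"},
    {"product": "site", "event": "view"},
    {"product": "app", "event": "view"},
]

# Aggregation pipeline: keep only "view" events, then count them per product.
pipeline = [
    {"$match": {"event": "view"}},
    {"$group": {"_id": "$product", "views": {"$sum": 1}}},
]


def views_per_product(docs):
    """In-memory equivalent of the pipeline above, for illustration only."""
    return Counter(d["product"] for d in docs if d["event"] == "view")


counts = views_per_product(events)
```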
There are lots of ways to build these kinds of systems, and we’ve done so with many different NoSQL databases, often in conjunction with Hadoop. However, MongoDB has an especially rich set of tools out of the box that enables significant functionality to be moved to production quickly.
How to measure?
YCSB benchmarks give important information, and we use the results to inform development of many different kinds of systems. It’s critical for us to know raw K/V performance, but it’s not the only criterion we use in choosing a database. It’s worth trying to measure these other criteria; a future post will discuss some of the issues involved.