Polyglot Persistence vs. Multi-API/Multi-Model: Which One For Multi-Cloud?
Modern application architectures rely on data with different models and access patterns. Polyglot persistence, a term coined in 2011, holds that each such data model should be powered by an independent database purpose-built for that model. The original intent was to look beyond relational/SQL databases to the then-emerging world of NoSQL.
The Messy Reality of Polyglot Persistence
Polyglot persistence is not free of costs: it increases complexity across the board. Using multiple databases during development, testing, release and production can overwhelm many organizations. App developers must learn efficient data modeling with various database APIs, yet they cannot simply ignore the underlying storage engine and replication architecture of each database. Operations teams are forced to understand scaling, fault tolerance, backup/restore, software upgrades and hardware portability for multiple databases. And then comes the onerous task of operationalizing all of these databases across multiple cloud platforms.
AWS To The Rescue, But Buyer Beware
AWS CTO Werner Vogels describes how AWS views the polyglot persistence problem in his post from last week titled A One Size Fits All Database Doesn't Fit Anyone.
As the world's leading IaaS platform, AWS fully embraces Polyglot-Persistence-as-a-Service (no surprise!). App developers get the database API/model of their choice, and operations teams don't have to manage the multitude of databases the developers pick. However, this point of view is self-serving to say the least. AWS charges a premium for its managed database services and gains the most effective form of lock-in there is: operational data. If Google Cloud, Microsoft Azure or a private cloud were to offer a lower-cost, higher-performance alternative, then good luck moving out of the AWS database services. At small scale these issues may be immaterial, but mid-to-large enterprises running their entire business on the cloud would be short-sighted to ignore the economic benefits of the multi-cloud era we live in.
Learning From Other Internet-Scale Companies
Microsoft, Facebook and Apple ran internet-scale data services earlier than most enterprises and were quick to realize the true cost of the complexity resulting from polyglot persistence. They have since moved towards a multi-API, multi-model approach which involves supporting multiple APIs on a common storage engine.
Microsoft Azure Cosmos DB
Cosmos DB started as an internal Microsoft project called DocumentDB and became generally available as a managed service in 2017. It is Azure's multi-API/multi-model, globally distributed proprietary database. Five APIs are supported on a common storage engine: SQL (read-only, with writes going through a custom Document API), MongoDB, Cassandra, Gremlin (graph) and Table (key-value).
Facebook MyRocks & Rocksandra
Facebook open sourced MyRocks in 2015 as a RocksDB-based storage engine for MySQL. In March 2018, it open sourced Rocksandra, a RocksDB-based storage engine for Apache Cassandra. In both cases, Facebook improved read performance (higher throughput and lower latency) compared with the default engines. Even though MyRocks and Rocksandra are separate projects, Facebook showed that a popular SQL database (MySQL) and a popular NoSQL database (Cassandra) can indeed run on the same core Log-Structured Merge (LSM) tree storage engine (RocksDB).
Apple FoundationDB
FoundationDB, open sourced by Apple in April 2018, takes a very different approach to the multi-API/multi-model problem through its Layers concept. Applications interact either directly with FoundationDB or with a layer, a user-written module that provides a new data model. In all cases, data is stored in a single engine via an ordered, transactional key-value API. In other words, FoundationDB is an engine for building multi-model databases rather than a multi-model database itself.
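The layer idea can be sketched in a few lines of Python. This is purely illustrative: the `OrderedKV` and `DocumentLayer` classes below are hypothetical stand-ins, not FoundationDB's actual API, and a real layer would also inherit the core engine's transactions.

```python
# Toy sketch of the "layer" concept: a document model built on top of an
# ordered key-value core. Class and method names are invented for this post.

class OrderedKV:
    """Stand-in for the core engine: an ordered KV store with range reads."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get_range(self, prefix):
        # Return all (key, value) pairs whose key starts with prefix, in key order.
        return sorted((k, v) for k, v in self._data.items()
                      if k.startswith(prefix))

class DocumentLayer:
    """A 'layer': maps a document model onto the ordered KV core by
    flattening each document field into a key path."""
    def __init__(self, kv):
        self.kv = kv

    def put(self, doc_id, doc):
        for field, value in doc.items():
            self.kv.set(f"doc/{doc_id}/{field}", value)

    def get(self, doc_id):
        prefix = f"doc/{doc_id}/"
        return {k[len(prefix):]: v for k, v in self.kv.get_range(prefix)}

kv = OrderedKV()
docs = DocumentLayer(kv)
docs.put("u1", {"name": "Ada", "city": "London"})
restored = docs.get("u1")
```

A graph layer or a table layer would follow the same pattern, which is why all data models end up in the one storage engine.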
Common Storage Engine Is Good, But What About Performance?
The conventional wisdom in database design is that a common storage engine serving multiple data models/workloads equals poor performance. Deeper analysis shows that this wisdom is too simplistic: it is possible to design a common storage engine that serves multiple data models without compromising performance. The Microsoft, Facebook and Apple examples from the previous section are early proofs. Let's look at the three most desirable characteristics of such a common storage engine.
1. Storage Engine Type: LSM Tree Based
The table below maps common data models to the most popular database choice for each (per the DB-Engines ranking). Note that the top spots are still held by open source databases (with the exception of Oracle); AWS managed service offerings haven't reached them yet.
LSM engines are more general-purpose than B-Tree engines, given their ability to power multiple data models while retaining high read performance. LSM databases such as Cassandra can serve persistent key-value and graph (via JanusGraph) use cases very efficiently. Even MongoDB's WiredTiger can run in an LSM configuration as opposed to its default B-Tree configuration. DocDB, YugaByte DB's storage engine, is also built on a customized version of the LSM-based RocksDB.
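To make the LSM idea concrete, here is a minimal, illustrative Python sketch (the `TinyLSM` class is invented for this post): writes land in an in-memory memtable that is periodically flushed as an immutable sorted run, and reads consult the memtable first, then the runs newest-first. Real engines such as RocksDB and Cassandra add write-ahead logs, bloom filters and compaction on top of this skeleton.

```python
# Minimal sketch of a Log-Structured Merge (LSM) tree. Illustrative only.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []              # newest first; each is a sorted list of pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # All writes are in-memory updates: no random disk I/O on the write path.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run ("SSTable").
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check memtable first, then runs newest-first, so the
        # latest write for a key always shadows older flushed copies.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # second write triggers a flush
db.put("a", 3)   # newer value shadows the flushed one
```

The sequential, append-only flush pattern is what makes the same engine reasonable for key-value, document and graph layouts alike.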
2. Consistency Model: Strong
Another database design myth is that eventual consistency delivers better performance than strong consistency. As shown in the YugaByte DB vs. Apache Cassandra YCSB benchmark comparison, eventual consistency puts an enormous burden on the database in the form of quorum reads, inefficient read-modify-writes and background repairs, with the net result of a significantly slower system. Google Cloud Spanner's post titled Why You Should Pick Strong Consistency, Whenever Possible lays out excellent arguments in the additional context of developer productivity.
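The quorum-read overhead has a simple combinatorial root. With N replicas, a write acknowledged by W of them and a read contacting R of them are guaranteed to overlap on at least one replica only when R + W > N. A hypothetical brute-force check in Python (function name invented for this post):

```python
# Brute-force verification of the quorum overlap condition R + W > N.
from itertools import combinations

def read_always_sees_write(n, w, r):
    """True if every possible R-replica read set overlaps every possible
    W-replica write set, i.e. a read can never miss the latest write."""
    replicas = range(n)
    return all(set(ws) & set(rs)                 # empty intersection is falsy
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# Quorum config on 3 replicas: reads pay for 2 round trips but never go stale.
quorum_safe = read_always_sees_write(3, 2, 2)
# Eventual-consistency config: cheap single-replica reads can return stale data.
eventual_safe = read_always_sees_write(3, 1, 1)
```

This is why an eventually consistent store that still wants up-to-date reads must pay for quorum reads on every request, while a strongly consistent design can serve reads from a single up-to-date leader.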
3. Transactions & Concurrency: ACID & MVCC
Single-row operations dominate enterprise workloads, as Werner Vogels highlights in his October 2017 DynamoDB 10-year-anniversary post.
A deep dive on how we were using our existing databases revealed that they were frequently not used for their relational capabilities. About 70 percent of operations were of the key-value kind, where only a primary key was used and a single row would be returned. About 20 percent would return a set of rows, but still operate on only a single table.
However, the remaining ~30% of operations need multi-row ACID (aka distributed) transactions that touch multiple rows of a single table or rows across multiple tables. First-generation NoSQL databases such as AWS DynamoDB avoided strong consistency and distributed transactions altogether, arguing that an eventually consistent solution to the "70% workload problem" was the higher priority. That argument in turn forced organizations to run a transactional RDBMS alongside their NoSQL databases.
Next-generation databases such as YugaByte DB and FoundationDB are breaking the mold by bringing strong consistency, ACID transactions and MVCC to the NoSQL world. Our Yes We Can! Distributed ACID Transactions with High Performance post highlights YugaByte DB's approach: some tables serve the "70% workload problem" like a high-performing NoSQL database but with strong consistency, while other tables serve the remaining 30% like a transactional SQL database but with linear write scalability.
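The MVCC part can be sketched in a few lines (the `MVCCStore` class is invented for illustration): each write appends a new (version, value) pair instead of overwriting in place, so a reader pinned to an older snapshot keeps a consistent view while later transactions commit. Real engines additionally garbage-collect versions no active transaction can see.

```python
# Toy sketch of Multi-Version Concurrency Control (MVCC). Illustrative only.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), append-only
        self.clock = 0       # logical commit timestamp

    def write(self, key, value):
        # Append a new version rather than overwriting: readers are never blocked.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest value committed at or before the reader's snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("balance", 100)
snapshot = store.clock        # a long-running reader starts here
store.write("balance", 40)    # a later transaction updates the row
old_view = store.read("balance", snapshot)
new_view = store.read("balance", store.clock)
```

Readers on the old snapshot still see 100 while the latest read sees 40, which is how ACID transactions can coexist with high-throughput single-row operations on one engine.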
Comparing Multi-Model Databases
The following table compares eight multi-model databases that use a single storage engine across multiple APIs. Four are proprietary: Microsoft Azure Cosmos DB, AWS DynamoDB, DataStax Enterprise (DSE) and MarkLogic. Four are open source: ArangoDB, OrientDB, Couchbase and YugaByte DB.
Cosmos DB and YugaByte DB stand out as the two with APIs compatible with existing popular databases.
The Smart Enterprise Choice — Multi-API/Multi-Model Meets Multi-Cloud
Over the last 10+ years, enterprises have had little choice but to adopt the Polyglot-Persistence-as-a-Service approach, either directly with AWS or by running multiple databases themselves. AWS is now widely considered the fourth-largest database vendor by revenue, after the big three of Oracle, Microsoft (SQL Server) and IBM. Today Google Cloud and Microsoft Azure have become credible alternatives to AWS for running mission-critical services. Google Cloud and Azure also offer proprietary database services of their own, but enterprises are well advised not to fall into the lock-in trap again.
So how does one run truly cloud-independent data services while also avoiding the complexity of operating multiple databases? The answer lies in open source, cloud native, multi-API/multi-model databases. The underlying storage engine of such a database would ideally have the characteristics previously highlighted.
- Multi-API/Multi-model ensures high developer productivity — get the same benefits as polyglot persistence without any of the costs.
- Common high-performance storage engine ensures simplified operations — enables operations engineers to run a single database service.
- Open source ensures no vendor lock-in — future-proofs migration to infrastructure with better cost-performance tradeoffs.
- Cloud native ensures multi-cloud portability — allows on-demand, completely online migration to new cloud(s) using orchestration tools such as Kubernetes.
Enterprises are adopting multi-model databases at a rapid pace. Those that also support multiple APIs compatible with existing popular databases preserve the polyglot part of polyglot persistence while removing the operationally complex multiple persistence part. Simplified operations combined with open source and cloud native principles make multi-API/multi-model databases a perfect fit for today's multi-cloud era, where the underlying cloud infrastructure needs to be ever-ready for change. Time to bid goodbye to proprietary services of a single cloud platform!