Practical Tradeoffs in Google Cloud Spanner, Azure Cosmos DB and YugaByte DB
The famed CAP Theorem has been a source of much debate among distributed systems engineers. Those of us building distributed databases are often asked how we deal with it. In this post, we dive deeper into the consistency-availability tradeoff imposed by CAP which is only applicable during failure conditions. We also highlight the lesser-known-but-equally-important consistency-latency tradeoff imposed by the PACELC Theorem that extends CAP to normal operations. We then analyze how modern cloud native databases such as Google Cloud Spanner, Azure Cosmos DB and YugaByte DB deal with these tradeoffs.
Consistency vs. Availability
Distributed systems must choose between Consistency and 100% Availability in the presence of network Partitions.
A deeper dive into the definitions is in order.
Every read receives either the most recent write or an error. In other words, all members of the distributed system have a shared understanding of the value of a data element from a read standpoint. This guarantee is also known as linearizability or linearizable consistency.
Every read or update receives a non-error response. Since every operation succeeds without errors, the availability of the system is 100%.
These are not necessarily restricted to the network per se but refer to the general class of hardware and software failures (including message losses) that force one or more members of a distributed system to deviate from other members of the system.
Given that tolerance to network partitions (i.e. P) is a must-have in any desirable distributed system, such a system can be either a CP system (that chooses Consistency over Availability) or can be an AP system (that chooses Availability over Consistency). The most important caveat here is that no such choice is needed in absence of network partitions — a system can be simultaneously Consistent and Available in the absence of failures. As discussed in the next section, the tradeoff between consistency and latency becomes more important during such normal operations.
Since AP systems always provide a non-error response to reads and writes, they seem like the obvious choice in user-facing applications. However, there are 2 key concerns to be aware of.
- Application development complexity can be onerous in AP systems. The loss of the ability to clearly reason about the current state of the system makes developers add compensating application logic for safe handling of stale reads (resulting from recent writes not yet available on the serving node) and dirty reads (resulting from the lack of rollback of a failed write on the serving node). Hence the recommendation to choose a CP system whenever possible.
- The operational cost of achieving true 100% availability is high in terms of the infrastructure and labor effort necessary. More importantly, user-facing applications have failure points outside of the database that can lead to the application tier getting partitioned away from the database tier. In such cases, having a 100% database availability doesn’t help in achieving 100% application availability. As we will see below, giving up on a mere 0.01% availability leads to more practical and productive ways of building applications.
Consistency vs. Latency
if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?
PACELC extends CAP to normal conditions (i.e. when the system has no partitions) where the tradeoff is between latency and consistency. Systems that allow lower latency along with relaxed consistency are classified as EL while systems that have higher latency along with linearizable consistency are classfied as EC.
PACELC gives distributed database designers a more complete framework to reason about the essential tradeoffs and therefore avoid building a more limiting system than necessary in practice.
From CAP standpoint, Google Cloud Spanner, Azure Cosmos DB and YugaByte DB are all CP databases that provide very high availability. From PACELC standpoint, all are EL databases that allow lower latency operations by tuning consistency down. While both Google Spanner and Azure Cosmos DB are proprietary managed services, YugaByte DB is an open source, cloud native DB that is distributed under Apache 2.0 license.
Google Cloud Spanner
Google Cloud Spanner divides data into chunks called splits, where individual splits can move independently from each other and get assigned to different nodes. It then creates replicas of each split and distributes them among the nodes using Paxos distributed consensus protocol. Within each Paxos replica set, one replica is elected to act as the leader. Leader replicas are responsible for handling writes, while any read-write or read-only replica can serve timeline-consistent (i.e. no-out-of-order) read requests without communicating with the leader.
Since it runs on Google’s proprietary network and hardware infrastructure (including the globally synchronized TrueTime clock), Spanner is able to limit network partitions significantly and even guarantee availability SLAs of 99.999% even in multi-region clusters. In the event of network partitions, the replicas of the impacted splits (whose leaders got cut off from the remaining replicas) form two groups: a majority partition that can still establish a Paxos consensus and one or more minority partitions that cannot establish such a consensus (given the lack of quorum). Read-write replicas in the majority partition elect a new leader while all replicas in the minority partitions become read-only. This leads to high write availability on the majority partition with minimal disruption (which is essentially leader election time) and uninterrupted read availability.
Azure Cosmos DB
Azure Cosmos DB auto partitions data into multiple physical partitions based on the configured throughput. Each partition is then replicated for high availability. Developers now choose between five consistency models along the consistency spectrum — strong (linearizable), bounded staleness, session, consistent prefix, and eventual (out-of-order). The exact replication protocol used is not publicly documented. Since strongly consistent reads always involve a majority quorum of replicas and since there is support for an eventual consistency option, we can infer that a Paxos-like distributed consensus protocol (with the notion of a leader replica in a replica set) is not used.
For writes at strong consistency level, network partitions will indeed lead to unavailability and hence the CP classification. However, given the leader-less approach to replicas, the system may not auto heal till new replicas are bootstrapped from existing ones. This means more prolonged write unavailability than the likes of Google Spanner and YugaByte DB (see below). Cosmos DB aims to avoid such scenarios as far as possible by limiting accounts with strong consistency only to a single Azure region. For the other 4 consistency levels, multi-region deployments are allowed with lower latency and relaxed consistency levels.
YugaByte DB’s sharding and replication architecture is very similar to that of Google Spanner. It auto shards data into a configurable number of tablets and distributes those tablets evenly among the nodes of the cluster. Additionally, each tablet is replicated to other nodes for fault-tolerance using the Raft distributed consensus protocol which is widely considered to be more understandable and practical than Paxos. One replica in every tablet’s replica set is elected as the leader and the other replicas become followers. Writes have to go through the tablet leader while reads can go through on any member of the replica set depending on the read consistency level (default is strong). During network partitions (including node failures), a majority partition and a minority partition similar to that of Google Spanner get created. This same approach is used for both single region as well as globally consistent multi-region deployments.
As a resilient and self-healing system, YugaByte DB ensures that the replicas in the majority partition elect a new leader among themselves in a few seconds and accept new writes immediately thereafter. The design choice here is to give away availability in favor of consistency on failure occurrence while also limiting the loss of availability to only a few seconds till the new tablet leaders get elected. The leader replicas in the minority partition lose their leadership in a few seconds and hence become followers. At this point, the majority partition is available for both reads and writes while minority partitions are available only for timeline-consistent (i.e. no-out-of-order) reads but not for writes. After the failure is corrected, the majority partition and minority partition heal back into a single Raft consensus group and continue normal operations. Similar to Cosmos DB, YugaByte DB offers multiple consistency levels but only for read operations. However, it doesn’t support an eventual consistency level for reads. Write operations are always strongly consistent.
The consistency and latency tradeoffs for the 3 cloud native CP databases can be summarized as below.
The write consistency levels highlighted above are defined in our previous post on the CAP Theorem. For Cosmos DB, write operations with Bounded-staleness, Session and Consistent-prefix consistency level seem to be using strongly consistent replication in the originating region along with timeline-consistent replication across multi-regions.
We at YugaByte are excited about the work that the Spanner and Cosmos DB teams are doing with respect to building truly global-scale operational databases. However, as enterprises move to an increasingly multi-cloud and hybrid cloud era, we do not believe they should get locked into expensive, proprietary managed services. As can be seen from the comparisons above, YugaByte DB offers strong write consistency (with optional timeline-consistent, read-only replicas) and tunable read consistency in both single region and multi-region deployments. It does so on an open source core with open key-value and flexible schema APIs (compatible with Redis and Apache Cassandra). We believe these are exactly the right ingredients to power cloud native business-critical applications in the enterprise.
In a future post, we will explore the depth of ACID transaction support in these databases. Meanwhile, get started with a local cluster using Docker or our macOS/Linux binaries.