How a Fortune 500 Retailer Weathered a Regional Cloud Outage with YugabyteDB
On the heels of an incredibly difficult 2020, my fellow Texans and I had the misfortune of experiencing a once-in-a-lifetime snowstorm. To be honest, at the time, it felt like much more than a snowstorm. It wasn’t just snowfall in Texas. It was a snow-pocalypse.
However, once the storm settled, I explored how the rest of Texas had fared. I discovered that some of the companies that run applications in production on YugabyteDB were also affected. But their experience was much less painful.
A Fortune 500 retailer was able to take advantage of YugabyteDB’s continuous availability. As a global retailer that operates department and grocery stores, it avoided application downtime by geo-distributing its YugabyteDB cluster. In this blog post, we’ll take a deeper dive into this company’s deployment. We’ll also learn how it sustained application availability during a regional failure.
Building a Multi-Region Architecture
One of the less talked about benefits of YugabyteDB is its deployment flexibility. There are numerous ways to set up a geo-distributed topology with YugabyteDB, as discussed in this blog post. You have the choice of running on bare metal servers, virtual machines, or container technologies like Kubernetes. You can also stand up clusters in your own data center or in the cloud. These clusters can span availability zones, regions, or even cloud providers. By default, our clusters replicate data synchronously, but you can also set up asynchronous read replicas or xCluster replication to additional clusters. Depending on your workload, you can use the SQL API (YSQL), which is feature-compatible with PostgreSQL, or our NoSQL API (YCQL), which is wire-compatible with Cassandra.
These options provide users with the flexibility to choose their deployment options based on the requirements of their use case. In this scenario there were two clusters, both of which had stringent requirements for the database to serve low latencies and be highly redundant across regions.
Displayed above is Cluster X, which has a total of 24 nodes spread across three regions. The setup has a replication factor (RF) of 3, with 8 nodes in each region. This use case calls for low-latency batch inserts of data overnight (e.g., product updates). The core requirements for this use case are zero downtime, handling massive concurrency and parallelism, and the ability to scale linearly. We used the concept of preferred leader regions to move all of the main reads and writes to US-South. The remaining regions, US-West and US-East, are used for high availability.
A second production cluster—Cluster Y—is in a similar configuration. The same regions are used with a total of 18 nodes, 6 in each region. Each node has 16 cores for a total of 288 cores. Combined, these two clusters run 42 nodes comprising 672 total cores. US-South was again used as the preferred leader region, serving reads and writes.
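The sizing figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this post. The per-node core count for Cluster X is not stated directly, so it is derived here from the combined total and should be read as an inference. The `Cluster` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical helper describing a cluster's regional layout."""
    name: str
    regions: int
    nodes_per_region: int

    @property
    def nodes(self) -> int:
        return self.regions * self.nodes_per_region


cluster_x = Cluster("X", regions=3, nodes_per_region=8)
cluster_y = Cluster("Y", regions=3, nodes_per_region=6)

CORES_PER_NODE_Y = 16   # stated for Cluster Y
TOTAL_CORES = 672       # stated combined total for both clusters

total_nodes = cluster_x.nodes + cluster_y.nodes
cores_y = cluster_y.nodes * CORES_PER_NODE_Y
cores_x = TOTAL_CORES - cores_y                 # inferred, not stated
cores_per_node_x = cores_x // cluster_x.nodes   # inferred, not stated

print(total_nodes, cores_y, cores_x, cores_per_node_x)  # 42 288 384 16
```

The derived value of 16 cores per node for Cluster X matches Cluster Y, which suggests both clusters run on the same node size.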
This use case serves downstream systems in the retail systems portfolio, with a requirement to serve low, single-digit latency in each region. We implemented follower reads to meet this requirement. These allow you to read from the local nodes without having to go to the preferred leader region. This provides more read IOPS with low latency, but the data might be slightly stale. It is still timeline-consistent (i.e., no out-of-order data is possible).
Given these configurations, both workloads are true multi-region, cloud native deployments with the ability to scale up and down based on consumer demand. When the retailer scales its app tier, it can scale the database tier along with it. The retailer also stays protected from a natural disaster that would cut its application off from the preferred leader region.
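To illustrate what "stale yet timeline-consistent" means, here is a minimal sketch in plain Python. It is a toy model, not YugabyteDB code: the log format, the `apply` function, and the sample keys are all assumptions made for illustration. A follower is modeled as having applied only a prefix of the leader's ordered write log, so it can lag behind but can never show writes out of order.

```python
from typing import Dict, List, Tuple

Log = List[Tuple[str, str]]  # ordered (key, value) writes


def apply(log: Log, upto: int) -> Dict[str, str]:
    """Materialize the state produced by the first `upto` log entries."""
    state: Dict[str, str] = {}
    for key, value in log[:upto]:
        state[key] = value
    return state


# Writes in the order the leader accepted them (hypothetical data).
leader_log: Log = [("sku-1", "v1"), ("sku-2", "v1"), ("sku-1", "v2")]

leader_view = apply(leader_log, len(leader_log))
follower_view = apply(leader_log, 2)  # follower lags by one entry

print(leader_view["sku-1"])    # v2
print(follower_view["sku-1"])  # v1
```

The follower returns an older value for `sku-1`, but its view is always a state the leader actually passed through.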
Under the Hood: A Story of Resilience and High Availability
If this retailer had instead used a multi-AZ setup within the US-South region, which is a very common deployment, it would have been in big trouble. The two workloads mentioned above are crucial microservices in its application landscape. Cluster X acts as the store of record for its product catalog of 1.6 billion products. Cluster Y handles product mapping for 250k transactions per second at very low latency.
As the snow began to fall and Texas started to freeze over, companies that had data centers in Central Texas experienced connectivity issues as electrical blackouts became frequent. How could these companies reconfigure their systems to make their data available to users?
Thankfully for our retailer, when the US-South region went down, its YugabyteDB cluster stayed up. High availability and auto-sharding are core to YugabyteDB’s design.
Let’s take a closer look at how YugabyteDB was able to handle this disaster.
Tables in YugabyteDB are split into multiple sets of rows according to a specific sharding strategy. These shards, called tablets in YugabyteDB, are automatically distributed across the nodes in the cluster. Each tablet is replicated across the cluster a number of times equal to the replication factor that you set.
This replication factor directly determines the fault tolerance of the system. In the scenario visualized below, a cluster with 3 nodes and an RF of 3 can sustain losing a single node, because a majority of the replicas (2 of 3) remains available.
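The idea of hash sharding and replica placement can be sketched as follows. This is a simplified illustration and not YugabyteDB's actual implementation: the SHA-256 hash, the 16-bit hash space, the round-robin placement, and the node names are all assumptions chosen to keep the example short.

```python
import hashlib
from typing import List

HASH_SPACE = 1 << 16  # assumed 16-bit hash space for this sketch


def tablet_for_key(key: str, num_tablets: int) -> int:
    """Map a row key to a tablet index by hashing it into HASH_SPACE."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hash_code = int.from_bytes(digest[:2], "big")  # value in [0, 65535]
    return hash_code * num_tablets // HASH_SPACE


def replica_nodes(tablet: int, nodes: List[str], rf: int) -> List[str]:
    """Choose rf distinct nodes for a tablet, walking the node list round-robin."""
    if rf > len(nodes):
        raise ValueError("replication factor cannot exceed node count")
    return [nodes[(tablet + i) % len(nodes)] for i in range(rf)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
for key in ("sku-1001", "sku-1002", "sku-1003"):
    tablet = tablet_for_key(key, num_tablets=6)
    print(key, "-> tablet", tablet, "on", replica_nodes(tablet, nodes, rf=3))
```

Each key always hashes to the same tablet, and each tablet ends up with `rf` copies on distinct nodes.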
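The relationship between replication factor and fault tolerance follows from majority quorums: a group of RF replicas stays available as long as more than half of them are reachable. The helper names below are hypothetical, and the sketch assumes one replica per fault domain (node, zone, or region).

```python
def quorum(rf: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def fault_tolerance(rf: int) -> int:
    """Number of replicas that can fail while a majority survives."""
    return rf - quorum(rf)


for rf in (1, 3, 5, 7):
    print(rf, quorum(rf), fault_tolerance(rf))
# 1 1 0
# 3 2 1
# 5 3 2
# 7 4 3
```

With RF=3 and one replica in each of three regions, the same arithmetic says the cluster can lose one whole region and keep a majority.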
Maintaining data consistency
The Raft protocol is the consensus protocol that YugabyteDB uses to ensure ACID compliance across the cluster. The replicas of a tablet are organized into a set of tablet-peers. Together they form a strongly consistent Raft group composed of as many tablet-peers as the replication factor (i.e., if you have set RF=3, then your Raft group will have 3 tablet-peers). The tablet-peers are hosted on different YB-TServers and, through the Raft protocol, perform actions such as leader election, failure detection, and replication of write-ahead logs.
On start-up, one of the tablet-peers is automatically elected as the tablet leader. This tablet leader is the main tablet in the tablet-peer group. It serves all reads and writes for those specific rows.
Tablet followers are the remaining tablet-peers of the Raft group. Tablet followers replicate data and are available as hot standbys that can take over quickly if the leader fails. By default, only the tablet leader processes reads and writes. However, YugabyteDB can also serve reads from tablet followers for use cases that accept relaxed consistency guarantees in exchange for lower latencies.
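The leader and follower roles described above can be sketched with a heavily simplified election. Real Raft uses terms, randomized timeouts, and vote requests. This toy version keeps only two rules: a leader needs a majority of the group to be alive, and the winner must have an up-to-date log. The `TabletPeer` class and `elect_leader` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TabletPeer:
    region: str
    last_log_index: int
    alive: bool = True


def elect_leader(peers: List[TabletPeer]) -> Optional[TabletPeer]:
    """Pick a leader among live peers, or None if no majority is alive."""
    majority = len(peers) // 2 + 1
    live = [p for p in peers if p.alive]
    if len(live) < majority:
        return None  # no quorum, so the tablet cannot elect a leader
    # The live peer with the longest log is a valid Raft winner.
    return max(live, key=lambda p: p.last_log_index)


group = [
    TabletPeer("us-south", last_log_index=10),
    TabletPeer("us-west", last_log_index=10),
    TabletPeer("us-east", last_log_index=9),
]

group[0].alive = False  # the current leader's region goes down
new_leader = elect_leader(group)
print(new_leader.region if new_leader else None)  # us-west
```

With RF=3, losing one peer still leaves a majority of two, so a follower takes over. Losing two peers would leave the tablet without a leader.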
Tying it all together
Both of these workloads required the tablet leaders to be placed in the US-South region. This means US-South served both reads and writes across all tablets. However, Cluster Y also required low latency local reads. As a result, both US-West and US-East serve follower reads for local users.
YugabyteDB automatically handled tablet leader re-election using the Raft protocol when the US-South region went down. Tablet-leader failover takes approximately 3 seconds. During this time, YugabyteDB does all of the hard work, using the YCQL API’s smart capability to connect to nodes in the remaining regions without any manual intervention.
This retailer had set a user-specified data placement policy for both the preferred leader region and the follower regions. As a result, US-West became the new preferred leader region, and the tablet-peers in that region became the new tablet leaders. There was also no data loss in the failover to US-West, because the data existed in all three regions at the time of the outage (i.e., the cluster had an RF of 3). Once the downed nodes were back online, YugabyteDB automatically detected the stale data on them. The database then performed a remote bootstrap to bring those nodes up to date from the tablet leaders.
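The failover and recovery sequence can be sketched as a small simulation. This is a toy model under stated assumptions, not YugabyteDB's implementation: the priority order (with US-East as the third choice), the function names, and the sample data are all hypothetical. Remote bootstrap is modeled simply as replacing a stale replica with a fresh copy of the leader's data.

```python
from typing import Dict, List, Set


def pick_leader_region(priority: List[str], down: Set[str]) -> str:
    """Return the highest-priority region that is still reachable."""
    for region in priority:
        if region not in down:
            return region
    raise RuntimeError("no region in the placement policy is available")


def remote_bootstrap(leader_data: Dict[str, str]) -> Dict[str, str]:
    """Rebuild a stale replica as a fresh copy of the leader's data."""
    return dict(leader_data)


# Assumed placement priority; only the first two entries come from the post.
priority = ["us-south", "us-west", "us-east"]

print(pick_leader_region(priority, down=set()))         # us-south
print(pick_leader_region(priority, down={"us-south"}))  # us-west

# US-South comes back with data that missed writes made during the outage.
leader_data = {"sku-1": "v2", "sku-2": "v1"}
stale_replica = {"sku-1": "v1"}
stale_replica = remote_bootstrap(leader_data)
print(stale_replica == leader_data)  # True
```

The recovered replica is an independent copy, so later writes on the leader still have to be replicated to it through the normal path.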
This natural disaster was a worst-case scenario for a global retailer. However, being proactive left it prepared. With YugabyteDB, its service remained resilient and available, even though all tablet-leaders were in a single region. This database handles sharding, replication, and failover transparently and performs all of the heavy lifting on your behalf.
The retail industry, especially e-commerce, faces brutal repercussions for downtime. Estimates show that revenue loss can run into the millions of dollars per hour of downtime. YugabyteDB’s continuous availability with immediate hands-free failover keeps these numbers to a minimum. The loss of this retailer’s preferred leader region did not cause an application outage. And the YugabyteDB cluster healed itself when the nodes came back online.