Low Latency Reads in Geo-Distributed SQL with Raft Leader Leases

Founder & CTO

August 20, 2019

Note: This post contains interactive animations that explain how some of these complex algorithms work. Please view this post in a suitable media (at least 1000px by 600px screen resolution) for best results.

In this blog post, we are going to dive deep into the read performance of Raft – why read performance can take a hit and how it can be improved using leader leases. Additionally, we will also look at how to make the correctness guarantees around leader leases stronger. Note that this post assumes some basic familiarity with the Raft algorithm and builds on top of this excellent, animated explanation of the Raft algorithm.

This is the first of a multi-part series about ways to improve the performance of a distributed SQL database using Raft consensus algorithm. Many of these techniques to improve Raft performance are implemented in YugabyteDB – an open-source, distributed SQL database with high performance and scalability.

Raft for Distributed Consensus

Correctness cannot be compromised in a distributed SQL database. There are different levels of this intuitive correctness, which are represented by different consistency models. Linearizability is one of the strongest single-key consistency models, and implies that every operation appears to take place atomically and in some total linear order that is consistent with the real-time ordering of those operations. In other words, the following should be true of operations on a single key:

Operations can execute concurrently, but the state of the database at any point in time must appear to be the result of some totally ordered, sequential execution of operations.
If operation A completes before operation B begins, then B should logically take effect after A.

The Raft algorithm solves the problem described above. Raft is a consensus algorithm that allows the members of a distributed system to agree on a sequence of values in the presence of failures and is formally proven safe. Raft is similar to Paxos, but much easier to understand and offers some additional features such as the ability to dynamically change membership.

YugabyteDB, designed for full ACID compliance in cloud-native environments and geo-distributed deployments, uses Raft consensus to achieve single-key linearizability and implements leader leases to achieve high read performance.

For the remainder of this post, we are going to look at just this one snippet of the Raft paper!

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed en-tries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader com-mit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been de-posed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heart-beat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).

To make things simple, we will focus on one Raft group with a replication factor of 3 (which represents a tablet of a auto-sharded table in YugabyteDB). We will examine the scenario where different members of the Raft groups are located in multiple geographies, as would be the case in a geo-distributed deployment.

Default Read Latency is High

Below is a snippet from the Raft paper that highlights how linearizable reads are served.

Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.

Raft requires the leader to exchange a heartbeat with a majority of peers before responding to a read request. By requiring these heartbeats, Raft introduces a network hop between peers in a read operation, which can result in high read latencies. The latencies get much worse in the case of a multi-region, geo-distributed SQL database where the nodes are located physically far apart. The actual read operation would look similar to what is shown in the animation below.

Let us look at why this extra heartbeat round is required. Once again, our snippet from the Raft paper says the following:

Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected).

Let us see what would go wrong if the leader serves read requests without exchanging heartbeats. Consider the following sequence of events:

Assume all nodes A, B and C of the cluster have a key k whose value is set to V1. C is the leader to start with.
Now imagine C gets network partitioned from the other nodes A and B (but not from the client). This results in A and B electing a new leader, say A while C still thinks it is the leader.
The client connects to A to perform an operation to update the value of key k to V2. This succeeds because the operation gets replicated to a majority of nodes, A and B.
The client now tries to read the value of key k from node C, which continues to think it is the leader and responds with the value V1 which is a stale read from the client’s point of view.

This sequence is shown in the animation below.

Therefore, without exchanging heartbeats before a read operation, a Raft leader that is partitioned away from the rest of the peers might serve stale reads, which would violate linearizability.

To summarize, Raft does provides linearizability on reads without any reliance on clocks, but this requires an additional round-trip to the majority of replicas on every read operation, which would result in an unacceptably low performance, especially for geo-distributed SQL database clusters.

Leader Leases to the Rescue

The Raft paper snippet mentions that it is possible to rely on a form of time-based lease for Raft leadership that gets propagated through the heartbeat mechanism. And if we have sufficiently well behaved clocks, it is possible to obtain linearizable reads without paying a round-trip latency penalty.

Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).

In YugabyteDB, a newly elected leader cannot serve reads (or initiate writing a no-op Raft operation which is a prerequisite to accepting writes) until it has acquired a leader lease. During a leader election, a voter must propagate the longest remaining duration time of an old leader’s lease known to that voter to the new candidate it is voting for. Upon receiving a majority of votes, the new leader must wait out the old leader’s lease duration before considers itself as having the lease. The old leader, upon the expiry of its leader lease, steps down and no longer functions as a leader. The new leader continuously extends its leader lease as a part of Raft replication. Typically, leader leases have a short duration, for example the default in YugabyteDB is 2 seconds.

The sequence of steps is as follows:

A leader computes a leader lease time interval
The timer for this time interval starts counting down on the existing leader
The leader sends the lease time interval to the followers using an RPC as a part of Raft replication. The RPC message itself incurs a time delay.
Only after the RPC delay do the followers receive the lease time interval and start their countdown.
Each node performs a countdown based on its monotonic clock.
It is easy to adjust the time delta by the max rate of clock drift on each node to take care of the variation between monotonic clock rates on different nodes.

This is shown in the animation sequence below.

Overall, this sequence creates a time window where the old leader steps down and the new leader does not step up, causing an unavailability window. In practice, however, this may not be hugely impactful since the unavailability window occurs only during failure scenarios (which are comparatively rare events) and the time window itself is quite “small” as observed by the end user. The unavailability window is bounded by the following equation:

max-unavailability = max-drift-between-nodes * leader-lease-interval + rpc-message-delay

Making Leader Leases Safe

This post covered the performance aspect of read operations in Raft algorithm, which relies on bounded clock-skew. But given that clocks can instantaneously jump, a clock-skew based mechanism may not be robust enough in practice. To get around this issue, YugabyteDB relies instead on bounded clock drift (as opposed to clock skew) as applied to a time interval (as opposed to a time deadline) and uses the monotonic clock (as opposed to the real-time clock) to measure the interval.

The broader question here is how to make mechanisms that rely on the notion of time in distributed systems reliable. We will dive into these aspects in a follow-up post.