Yugabyte DB’s Distributed SQL API Jepsen Test Results
Note: You can join the discussion on Hacker News here.
We are very excited to announce that the SQL API of Yugabyte DB v22.214.171.124 passed Jepsen testing performed by Kyle Kingsbury [Edit] (with the exception of transactional DDL support, which almost no other distributed SQL database vendor supports, and we plan to support soon. The real-world impact of this open issue is really small as it is limited to cases where DML happens before DDL has fully finished). YSQL will become generally available for production usage in the much anticipated Yugabyte DB v2.0 release, which is just around the corner.
We can now safely say (pun intended) that Yugabyte DB supports serializable and snapshot isolation for transactions. The v1.2 release from earlier this year shipped with the Jepsen verification of Yugabyte’s Cassandra-inspired, semi-relational YCQL API. Given that DocDB, Yugabyte DB’s underlying distributed document store, is common across both the YCQL and YSQL APIs, it was no surprise that YSQL passed official Jepsen run safety tests relatively easily [Edit] (with the exception of transactional DDL support, which almost no other distributed SQL database vendor supports, and we plan to support soon. The real-world impact of this open issue is really small as it is limited to cases where DML happens before DDL has fully finished).
The primary focus of the Jepsen testing in this go around was to test the new serializable isolation level for distributed transactions where isolation stands for the “I” in ACID. As a fully-relational SQL API, YSQL supports both serializable and snapshot isolation while the semi-relational YCQL API supports only the snapshot isolation level.
Passing this Jepsen test gives Yugabyte DB the distinction of being the first database to pass Jepsen twice for two separate APIs! Kyle’s official report can be found here. In this post we’ll summarize the highlights from his report.
Accelerated Failure Testing
Jepsen tests accelerate the failures that would be observed in production systems by constantly and frequently introducing faults. The Jepsen testing methodology notes that bugs reproduced in Jepsen are observable in production, not theoretical. Jepsen tests construct random operations, apply them to the system, and construct a concurrent history of their results. That history is checked against a model to establish its correctness.
The report outlines a variety of faults that are introduced while performing operations on a Yugabyte DB cluster, including:
- Crashes of the various processes (yb-master and yb-tserver)
- Network partitions (single-node, majority-minority and non-transitive)
- Process pauses
- Instantaneous and stroboscopic changes to clocks, up to hundreds of seconds
- Combinations of the above events
Jepsen tests have proven to be very effective in catching issues. In fact, since correctness is paramount to us as an OLTP database, we run Jepsen tests as a part of our nightly CI/CD pipeline on the Yugabyte DB release builds!
Safety Issues Identified
The array of tests that Kyle Kingsbury ran uncovered two safety issues:
- G2-item (anti-dependency) cycles in transactions
- Improperly initialized DEFAULT NOW() columns
Let us look at both of these briefly.
Item cycles in transactions
The append test caught a serializability violation under a rare situation. It was discovered after almost 100 hours of testing by inducing yb-master process crashes. The yb-master process is responsible for storing system metadata (such as shard-to-node mappings) and coordinating system-wide operations, such as automatic load balancing, table creation, or schema changes. It does not handle queries issued by application clients. The scenario under which the failure occurs is rather obscure, so hats off to Kyle for catching this!
Here is the relevant sequence of events under which the bug shows up:
- Let’s say a new master gets elected as Raft leader. We will call this the master leader.
- Right after the master gets elected as a new Raft leader, it has an empty capabilities set for the tablet servers in the cluster. The capabilities set describes the set of features supported by the tablet servers, and is used instead of version numbers in order to preserve backward compatibility for rolling upgrades.
- The tablet servers start to send heartbeats to the master leader. This happens relatively soon (less than 500ms with default settings).
- The capabilities set in the master leader gets updated as soon as it receives heartbeats from the tablet servers.
The bug occurs if the YSQL layer queries the master leader and receives an empty tablet server capabilities set in the window between steps #2 and #3 above. This empty set causes any subsequent transaction RPC to include a read time field. This read time should be ignored by the tablet server in the case of serializable transactions (it is an optimization that is valid only for snapshot isolation levels). However, if the field was set, the serializable transaction would end up using the read timestamp, eventually resulting a serializability violation. You can read more details about this issue, which was promptly fixed by this commit.
Improperly initialized DEFAULT NOW()
YSQL does not support transactional DDL statements yet. In simple terms, this means that the multiple steps required to perform an operation such as creating a table with indexes are not executed in an atomic manner. This test failure, where columns with a default value of NOW() would sometimes be initialized to NULL, is a symptom of that underlying issue. Let’s dive right into how this happens.
The underlying table was defined as shown below.
CREATE TABLE foo (
k TIMESTAMP DEFAULT NOW(),
) IF NOT EXISTS;
The default column value NOW() is described in the official PostgreSQL docs for date and time functions as follows:
In order to create the above table foo in YSQL, a number of discrete steps need to be performed. Some of the steps relevant to this issue are shown below:
- Write the table schema into the YSQL system catalog without the DEFAULT column value.
- Add the
pg_class / pg_attributeentries.
- Modify the entries to set the DEFAULT column value.
Assume the above steps are being performed by a client. The table becomes operational after step #2 and an independent client can successfully insert data before step #3 is complete where the default column value is set. Any inserts that occur before step #3 would see NULL values instead of NOW() for the column k. In summary, this issue turned out to be not related to the core design/implementation of Yugabyte’s transaction layer that supports serializable and snapshot isolation levels, but rather due to the fact that the implementation of the create table sequence is not yet atomic. A simple, short-term workaround is to wait for the table creation to succeed before starting the workload against the table.
Kudos from the Jepsen Report
We have worked very hard to make distributed transactions in Yugabyte DB robust to all kinds of failures, including clock skews. We are very proud to see Kyle recognize this in his report. Below are a couple of observations worth calling out.
Robust to Clock Skews
Yugabyte DB relies on the CLOCK_MONOTONIC_RAW for leader leases instead of CLOCK_REALTIME. In simple terms, this means that Yugabyte DB uses the underlying hardware clock and not the clock that displays the current time on a node. The upshot is that Yugabyte DB is pretty resistant to clock skews, which is great news for the users. Look for an upcoming post that will describe a lot of these in detail.
Rare Occurrence of Causal Reversal
Causal reverse is the phenomenon where the order of writes to independent keys is reversed in the serial order. For example, let’s say there is an update to key X in a user’s application which results in an update to an unrelated key Y. Clearly, from the application’s point of view, X precedes Y in time. However, from the point of view of Yugabyte DB, these operations are unrelated to each other and act on non-overlapping set of keys, and hence their order could get reversed in time. Causal reverse in itself is not a violation of serializable isolation, and applications typically do not rely on the ordering of unrelated operations. That said, it is encouraging to see that this phenomenon was rare to reproduce experimentally, as this is another indicator of the fact that Yugabyte DB is somewhat robust to clock skew.
- We would love to work on a formal model for Yugabyte DB covering aspects such as the transaction algorithms and their time-dependence. If you are interested in working with us on this, just holler!
- Download the official Jepsen report.
- Join us on Oct 16 for the “Reviewing Jepsen Test Results for Correctness in Yugabyte DB 2.0” webinar with Kyle Kingsbury.
- Attend the Distributed SQL Summit, where both builders and practitioners alike will be talk about cloud-native, distributed databases.