The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

YugabyteDB’s Distributed SQL API Jepsen Test Results

Founder & CTO

Note: You can join the discussion on Hacker News here.

We are very excited to announce that the SQL API of YugabyteDB v1.3.1.2 passed Jepsen testing performed by Kyle Kingsbury [Edit] (with the exception of transactional DDL support, which almost no other distributed SQL database vendor supports, and we plan to support soon. The real-world impact of this open issue is really small as it is limited to cases where DML happens before DDL has fully finished). YSQL will become generally available for production usage in the much anticipated YugabyteDB v2.0 release, which is just around the corner.

YugabyteDB is a high-performance, distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) supports most of PostgreSQL’s functionality and is wire-protocol compatible with PostgreSQL drivers.

We can now safely say (pun intended) that YugabyteDB supports serializable and snapshot isolation for transactions. The v1.2 release from earlier this year shipped with the Jepsen verification of Yugabyte’s Cassandra-inspired, semi-relational YCQL API. Given that DocDB, YugabyteDB’s underlying distributed document store, is common across both the YCQL and YSQL APIs, it was no surprise that YSQL passed official Jepsen run safety tests relatively easily [Edit] (with the exception of transactional DDL support, which almost no other distributed SQL database vendor supports, and we plan to support soon. The real-world impact of this open issue is really small as it is limited to cases where DML happens before DDL has fully finished).

The primary focus of the Jepsen testing in this go around was to test the new serializable isolation level for distributed transactions where isolation stands for the “I” in ACID. As a fully-relational SQL API, YSQL supports both serializable and snapshot isolation while the semi-relational YCQL API supports only the snapshot isolation level.

Passing this Jepsen test gives YugabyteDB the distinction of being the first database to pass Jepsen twice for two separate APIs! Kyle’s official report can be found here. In this post we’ll summarize the highlights from his report.

Accelerated Failure Testing

Jepsen tests accelerate the failures that would be observed in production systems by constantly and frequently introducing faults. The Jepsen testing methodology notes that bugs reproduced in Jepsen are observable in production, not theoretical. Jepsen tests construct random operations, apply them to the system, and construct a concurrent history of their results. That history is checked against a model to establish its correctness.

The report outlines a variety of faults that are introduced while performing operations on a YugabyteDB cluster, including:

  • Crashes of the various processes (yb-master and yb-tserver)
  • Network partitions (single-node, majority-minority and non-transitive)
  • Process pauses
  • Instantaneous and stroboscopic changes to clocks, up to hundreds of seconds
  • Combinations of the above events

Jepsen tests have proven to be very effective in catching issues. In fact, since correctness is paramount to us as an OLTP database, we run Jepsen tests as a part of our nightly CI/CD pipeline on the YugabyteDB release builds!

Safety Issues Identified

The array of tests that Kyle Kingsbury ran uncovered two safety issues:

  • G2-item (anti-dependency) cycles in transactions
  • Improperly initialized DEFAULT NOW() columns

Let us look at both of these briefly.

Item cycles in transactions

The append test caught a serializability violation under a rare situation. It was discovered after almost 100 hours of testing by inducing yb-master process crashes. The yb-master process is responsible for storing system metadata (such as shard-to-node mappings) and coordinating system-wide operations, such as automatic load balancing, table creation, or schema changes. It does not handle queries issued by application clients. The scenario under which the failure occurs is rather obscure, so hats off to Kyle for catching this!

Here is the relevant sequence of events under which the bug shows up:

  • Let’s say a new master gets elected as Raft leader. We will call this the master leader.
  • Right after the master gets elected as a new Raft leader, it has an empty capabilities set for the tablet servers in the cluster. The capabilities set describes the set of features supported by the tablet servers, and is used instead of version numbers in order to preserve backward compatibility for rolling upgrades.
  • The tablet servers start to send heartbeats to the master leader. This happens relatively soon (less than 500ms with default settings).
  • The capabilities set in the master leader gets updated as soon as it receives heartbeats from the tablet servers.

The bug occurs if the YSQL layer queries the master leader and receives an empty tablet server capabilities set in the window between steps #2 and #3 above. This empty set causes any subsequent transaction RPC to include a read time field. This read time should be ignored by the tablet server in the case of serializable transactions (it is an optimization that is valid only for snapshot isolation levels). However, if the field was set, the serializable transaction would end up using the read timestamp, eventually resulting a serializability violation. You can read more details about this issue, which was promptly fixed by this commit.

Improperly initialized DEFAULT NOW()

YSQL does not support transactional DDL statements yet. In simple terms, this means that the multiple steps required to perform an operation such as creating a table with indexes are not executed in an atomic manner. This test failure, where columns with a default value of NOW() would sometimes be initialized to NULL, is a symptom of that underlying issue. Let’s dive right into how this happens.

The underlying table was defined as shown below.

The default column value NOW() is described in the official PostgreSQL docs for date and time functions as follows:

now() is a traditional PostgreSQL equivalent to transaction_timestamp().

In order to create the above table foo in YSQL, a number of discrete steps need to be performed. Some of the steps relevant to this issue are shown below:

  • Write the table schema into the YSQL system catalog without the DEFAULT column value.
  • Add the pg_class / pg_attribute entries.
  • Modify the entries to set the DEFAULT column value.

Assume the above steps are being performed by a client. The table becomes operational after step #2 and an independent client can successfully insert data before step #3 is complete where the default column value is set. Any inserts that occur before step #3 would see NULL values instead of NOW() for the column k. In summary, this issue turned out to be not related to the core design/implementation of Yugabyte’s transaction layer that supports serializable and snapshot isolation levels, but rather due to the fact that the implementation of the create table sequence is not yet atomic. A simple, short-term workaround is to wait for the table creation to succeed before starting the workload against the table.

Kudos from the Jepsen Report

We have worked very hard to make distributed transactions in YugabyteDB robust to all kinds of failures, including clock skews. We are very proud to see Kyle recognize this in his report. Below are a couple of observations worth calling out.

Robust to Clock Skews

Whatever the case, this is a good thing for operators: nobody wants to worry about clock safety unless they have to, and YugabyteDB appears to be mostly robust to clock skew. Keep in mind that we cannot robustly test YugabyteDB’s use of CLOCK_MONOTONIC_RAW for leader leases, but we suspect skew there is less of an issue than CLOCK_REALTIME synchronization.

YugabyteDB relies on the CLOCK_MONOTONIC_RAW for leader leases instead of CLOCK_REALTIME. In simple terms, this means that YugabyteDB uses the underlying hardware clock and not the clock that displays the current time on a node. The upshot is that YugabyteDB is pretty resistant to clock skews, which is great news for the users. Look for an upcoming post that will describe a lot of these in detail.

Rare Occurrence of Causal Reversal

YugabyteDB was relatively robust in our transactional tests. Although it claims to provide only serializability, and theoretically allows non-linearizable phenomena like causal reverse, these anomalies were rare in our testing.

Causal reverse is the phenomenon where the order of writes to independent keys is reversed in the serial order. For example, let’s say there is an update to key X in a user’s application which results in an update to an unrelated key Y. Clearly, from the application’s point of view, X precedes Y in time. However, from the point of view of YugabyteDB, these operations are unrelated to each other and act on non-overlapping set of keys, and hence their order could get reversed in time. Causal reverse in itself is not a violation of serializable isolation, and applications typically do not rely on the ordering of unrelated operations. That said, it is encouraging to see that this phenomenon was rare to reproduce experimentally, as this is another indicator of the fact that YugabyteDB is somewhat robust to clock skew.

What’s Next?

Related Posts

Founder & CTO