The YugaByte Database Blog

Thoughts on open source, cloud native and distributed databases

YugaByte DB 1.2 Passes Jepsen Testing

Founder & CTO

You can join the discussion about the results on Hacker News.

Last year we published our DIY Jepsen testing results – including the tests and failure modes implemented as well as the bugs found. We recently engaged Kyle Kingsbury, the creator of the Jepsen test suite, for an official analysis and are happy to report that YugaByte DB 1.2 formally passes Jepsen tests using the YCQL API. To quote from the report:

YugaByte DB now passes tests for snapshot isolation, linearizable counters, sets, registers, and systems of registers, as long as clocks are well-synchronized.

Kyle found three safety issues, all of which are fixed as of YugaByte DB 1.2:

One issue caused a violation of snapshot isolation under large clock fluctuations. The underlying problem was that our locking subsystem, used to ensure transactions do not execute concurrently, did not propagate the status of the lock acquisition attempt. As a result, the higher layer assumed the lock had been successfully acquired within the deadline, even when the acquisition had timed out. The fix was simply to propagate the status of the lock operation.
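This failure mode generalizes well beyond YugaByte DB. Below is a minimal Python sketch of the bug class (the names and structure are illustrative, not YugaByte DB's actual C++ code): dropping the status of a deadline-bounded lock acquisition lets the caller proceed as if the lock were held.

```python
import threading

class SharedLockManager:
    """Toy lock manager: the caller must check the status of a
    deadline-bounded lock acquisition."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self, timeout_sec):
        # Returns True on success, False if the deadline expired.
        return self._lock.acquire(timeout=timeout_sec)

    def release(self):
        self._lock.release()

def run_transaction_buggy(mgr, timeout_sec, body):
    # BUG class: the acquisition status is dropped, so the transaction
    # body runs even when the lock acquisition timed out, and may
    # execute concurrently with a conflicting transaction.
    mgr.try_acquire(timeout_sec)
    body()

def run_transaction_fixed(mgr, timeout_sec, body):
    # FIX: propagate the status; abort (or retry) on timeout.
    if not mgr.try_acquire(timeout_sec):
        raise TimeoutError("lock acquisition timed out; must retry")
    try:
        body()
    finally:
        mgr.release()
```

With another holder of the lock, the fixed variant aborts with a timeout while the buggy variant silently runs the transaction body anyway.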

A second issue revealed that, in rare conditions involving multiple network partitions and membership changes, successfully acknowledged writes could be discarded. The underlying issue was that the Raft implementation in YugaByte DB would accept votes from any peer (even non-voting ones) in the active config, causing a write to be acknowledged to the client with insufficient replication. This can occur when a failure (such as a network partition) coincides with a Raft membership change (for example, during a load-balancing operation that redistributes queries evenly across nodes). This has been fixed by accepting votes only from voting members in the active config.
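To make the fix concrete, here is a toy Python model of the vote-counting logic (the peer roles and function names are hypothetical, chosen only to illustrate the bug class, not YugaByte DB's actual code). With three voters and one non-voting learner, counting the learner's vote produces a false majority:

```python
def majority_fixed(votes, active_config):
    """Count only votes granted by VOTER peers in the active Raft config."""
    voters = {p for p, role in active_config.items() if role == "VOTER"}
    granted = votes & voters          # ignore non-voting peers entirely
    return len(granted) * 2 > len(voters)

def majority_buggy(votes, active_config):
    """BUG sketch: accepts a vote from *any* peer in the active config,
    so a learner's vote can make a minority look like a majority."""
    granted = votes & set(active_config)
    voters = {p for p, role in active_config.items() if role == "VOTER"}
    return len(granted) * 2 > len(voters)

# During a membership change: three voters plus one non-voting learner.
config = {"a": "VOTER", "b": "VOTER", "c": "VOTER", "d": "NON_VOTER"}
```

With votes from only `a` and the learner `d`, the buggy count sees two grants against three voters and declares the entry committed, while the fixed count correctly sees only one grant and keeps waiting for real replication.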

A third issue (which had already been identified and fixed) caused reads to violate snapshot isolation, ultimately leading to data corruption. It happened because a conflicting transaction’s committed status could be missed by the conflict resolver at the moment that transaction’s provisional records were being converted to permanent records. This resulted in an undetected conflict and a write based on a stale read. The bug was fixed in version 1.1.10. If you’re wondering why an already-fixed issue made its way into this list, read on to the end!

Download the official Jepsen report and sign up for the “Reviewing Jepsen Test Results for Correctness in YugaByte DB 1.2” webinar on April 30, hosted by Kyle Kingsbury and Karthik Ranganathan.

The rest of this blog post adds some behind-the-scenes color in terms of the good, the bad and the ugly (or in other words, bloopers) parts of this effort.

The Good

YugaByte DB passed Jepsen testing, which is of course the headline result. YugaByte DB uses a unique design to achieve linearizability with high performance while remaining resilient to errors under clock skew. We are stoked to see that Kyle really liked this design.

Below are some of the positive things Kyle had to say (note that the quotes below are slightly paraphrased for readability).

Pushing the boundary on tolerating clock skew

YugaByte DB uses a combination of the monotonic clock and the Hybrid Logical Clock (HLC) to make it much more resilient to clock skew. Note that the --max_clock_skew_usec config flag controls the maximum clock skew allowed between any two nodes in the cluster and has a default value of 50ms. Kyle calls this out in his report (emphasis in these quotes is mine):

YugaByte DB uses Raft, which ensures linearizability for all (write) operations which go through Raft’s log. For performance reasons, YugaByte DB reads return the local state from any Raft leader immediately, using leader leases to ensure safety. Using CLOCK_MONOTONIC for leases (instead of CLOCK_REALTIME) insulates YugaByte DB from some classes of clock error, such as leap seconds.

Between shards, YugaByte DB uses a complex scheme involving Hybrid Logical Clocks (HLCs). YugaByte couples those clocks to the Raft log, writing HLC timestamps to log entries, and using those timestamps to advance the HLC on new leaders. This technique eliminates several places where poorly synchronized clocks could allow consistency violations.
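For readers unfamiliar with HLCs, the sketch below shows the standard Hybrid Logical Clock update rules (after Kulkarni et al.) in Python; YugaByte DB's actual implementation differs in encoding and detail. The key property is that processing a received timestamp, such as one read from a Raft log entry during leader election, advances the clock past every timestamp it has seen, even if the local physical clock lags:

```python
class HybridLogicalClock:
    """Minimal HLC sketch: a (physical, logical) timestamp pair."""

    def __init__(self, physical_clock):
        self.now_physical = physical_clock  # callable -> int (e.g. microseconds)
        self.l = 0  # physical component (largest physical time seen)
        self.c = 0  # logical component (tie-breaker within one physical tick)

    def send(self):
        # Local or send event: advance to the wall clock if it moved,
        # otherwise bump the logical counter.
        pt = self.now_physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, ml, mc):
        # Receive event (e.g. an HLC timestamp from a Raft log entry):
        # move past max(local clock, message clock, wall clock).
        pt = self.now_physical()
        new_l = max(self.l, ml, pt)
        if new_l == self.l == ml:
            new_c = max(self.c, mc) + 1
        elif new_l == self.l:
            new_c = self.c + 1
        elif new_l == ml:
            new_c = mc + 1
        else:
            new_c = 0
        self.l, self.c = new_l, new_c
        return (self.l, self.c)
```

A new leader can therefore replay the HLC timestamps of its log entries through `receive()` and be guaranteed to issue only larger timestamps afterwards, even on a node whose physical clock is behind.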

High-performance multi-region transactions

One of the core principles in designing YugaByte DB was to offer high performance without sacrificing correctness. Kyle calls this out as well.

In the general case, where the leaders of each Raft group might be in different datacenters, YugaByte DB requires at least three cross-datacenter round-trips per cross-shard update transaction.

In the happy case, read transactions can be significantly faster. Thanks to YugaByte DB’s use of leader leases for reads, they can bypass normal Raft consensus in each shard, and return the state from a Raft leader without a round trip to its followers. In the uncontested case, reads can complete in as few as one cross-datacenter round trip.

Move fast with stable infrastructure

For me personally, the thing that stands out the most is Kyle’s comment about the YugaByte engineering team. Work is a lot more fun and satisfying when an incredibly talented yet humble team works in an open, fast-paced environment.

YugaByte’s engineers were quick to address safety and availability issues, and we’re pleased to see the results of their hard work.

The Bad

Nothing in this world is perfect, and there is no such thing as a bug-free application. Correctness bugs are of course the worst kind, but those, along with their fixes, were discussed above. So what else was bad, and specifically, what lessons are we taking away from this effort?

New user experience

We have seen our users get started with YugaByte DB on a local cluster in minutes. However, when working with Kyle, we realized that setting up a multi-node distributed cluster was not very intuitive, because it requires users to understand the underlying architecture first. To make matters worse, there are a number of advanced flags available to tune the system, not all of which are documented clearly, and discovering these flags from the software itself is next to impossible.

This is clearly an area of improvement for YugaByte DB, and we are taking these lessons to heart. Stay tuned for a post that will talk about this aspect in detail.

Improving partition tolerance

YugaByte DB is designed to be a highly available system. However, Kyle pointed out that after a network partition between database servers heals, YugaByte DB could sometimes take over 60 seconds to recover. Recovery now takes about 25 seconds.

However, there is room for improvement between the client and the server. While existing clients cache information about how the data is partitioned and replicated across cluster nodes, new application clients must read this information from one node (the YB-Master leader). While this has not been an issue in practice, if that one node is partitioned away, the application perceives the cluster as unavailable. Kyle calls this out in his report as well.

If enough masters are down, or partitioned from one another, or partitioned from a tablet server, clients will be unable to connect to that tablet server until conditions improve. This reduces YugaByte DB’s availability: that tablet could be writable, or at least readable, but if it can’t reach the masters, only clients with an existing connection will be able to perform work. As clients fail over from other nodes, or rotate connections for performance reasons, more and more clients can find themselves stuck.

Kyle also suggests a fix for this issue, which we are working on.

Since clients cache the peers table locally, stale reads of the system.peers table are just fine for this use case. We recommended that YugaByte DB cache this table on every node to improve availability.
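Kyle's suggestion can be sketched as a thin client-side cache. The Python below is a hypothetical illustration, not a real YugaByte DB client API (`fetch_peers` stands in for whatever RPC returns the system.peers table): the client keeps the last known peer list and refreshes it from any reachable node, so a single partitioned master no longer makes the whole cluster look unavailable.

```python
import random

class PeerAwareClient:
    """Sketch of the suggested mitigation: cache the peers list locally
    and refresh it from *any* reachable node, not just one master."""

    def __init__(self, seed_nodes, fetch_peers):
        self.peers = list(seed_nodes)   # stale reads of this list are fine
        self.fetch_peers = fetch_peers  # node -> peer list; raises on partition

    def refresh(self):
        # Try every known node in random order; the first answer wins.
        for node in random.sample(self.peers, len(self.peers)):
            try:
                self.peers = self.fetch_peers(node)
                return True
            except ConnectionError:
                continue  # node partitioned away; fall back to another peer
        return False      # all peers unreachable; keep serving the stale cache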

Taking documentation (even) more seriously

We take documentation very seriously, but apparently that is not quite enough. Kyle brought a whole new level of thoroughness to this process: he went through the website and our docs and found a number of issues. Of course, they all seem pretty obvious once pointed out. To name a few in his own words:

YugaByte DB has updated their documentation and marketing material. The deployment checklist now offers a comprehensive discussion of clock skew and drift tolerance, and the relevant command line parameters. Clock skew options are now a part of the standard CLI documentation. The SQL page no longer claims that YSQL is verified by Jepsen; that claim should only have applied to the YCQL interface.

The Ugly

During the Jepsen testing effort, there were situations where Murphy’s law kicked in and we were not sure whether to laugh or cry.

YCQL API – end-user safety vs Jepsen testability

During the testing of transactional guarantees, Kyle wanted to test cases where he could query keys across tables in one transaction, or perform reads and writes as part of the same transaction. While these were features we could implement, the question was whether we should. Kyle writes about this in the report:

The tests we wrote for YugaByte DB are less complete than those we’ve performed against similar systems in other Jepsen analyses. We have no way to transactionally query multiple tables in the bank test, so we measure multiple keys in the same table. We cannot combine read and write transactions in multi-key linearizability tests, so we stick to strictly read-only or write-only queries. We cannot express queries which would let us observe predicate anti-dependency cycles.

We created the YCQL API to operate at the snapshot isolation level, with the primary goal of handling massive scale (say, hundreds of nodes) at very low latencies. In the spirit of keeping the API foolproof for the end user, we decided against adding these query types, since they could cause confusion. Kyle notes this as well.

Since YugaByte DB’s YCQL interface proscribes these kinds of queries, there is no safety risk if they don’t work correctly. However, we cannot speak to YugaByte DB’s transactional isolation mechanisms in general—we can only describe how they handle the limited queries we can express in YCQL.

Full serializable isolation is now supported in the YSQL API. We plan to run Jepsen tests against the YSQL API this summer as part of its general availability release.

Do NOT hand over that release!

We knew the 1.1.9 release had an issue that had already been fixed in 1.1.10, so we wanted Kyle to start Jepsen testing on the 1.1.10 release. But we had not yet updated the download and install documentation on the site, and Kyle (rightfully so) ended up starting with the 1.1.9 release!

Our first test run with 1.1.9 revealed a significant problem: under normal operating conditions, transactional reads in YugaByte DB frequently exhibited read skew: seeing part, but not all, of other transactions’ effects. As it turns out, YugaByte had independently discovered this problem: when transactions included read-modify-write operations, they failed to inform the conflict checker that those transactions read data, and consequently acquired only write locks, rather than read-write locks.

That library? Never used, only to debug failures!

To help debug certain failure scenarios, we used the libbacktrace library to print stack traces of code paths. When adding this feature, we expected the library to be exercised rarely: only when a failure scenario occurs (which is rare) and needs to be debugged by turning on very verbose logging (which is even rarer). But Kyle hit precisely this issue, which he writes about:

In our tests of YugaByte DB 1.1.10 through 1.1.13.0-b2, with verbose logging enabled, roughly one master per day encountered a catastrophic memory leak, allocating several gigabytes per second, until the OOM killer (or operators) intervened.

This was easy enough to fix by switching to a different library.

Valgrind identified libbacktrace as the cause of this leak, and YugaByte believes the problem stems from a thread safety bug in libbacktrace. As of 1.1.15.0-b16, YugaByte DB now uses Google’s glog for stack traces instead, and we have not observed memory leaks since.

Future Work

Continuous improvement of YCQL features is an ongoing effort. Additionally, we plan to run Jepsen tests on the PostgreSQL-compatible YSQL API in preparation for its general availability in the 2.0 release (Summer 2019). YSQL is built on the same underlying distributed document store as YCQL. Kyle notes this in his report:

Once YugaByte DB’s SQL layer is ready, we would like to return and test more generalized transactions. Both YSQL and YCQL are built on the same underlying transactional mechanism, so the YCQL tests give us some degree of confidence in the behavior of YSQL as well.

What’s Next?

  • Download the official Jepsen report and sign up for the “Reviewing Jepsen Test Results for Correctness in YugaByte DB 1.2” webinar on April 30, hosted by Kyle Kingsbury and Karthik Ranganathan.
  • Compare YugaByte DB in depth to databases like CockroachDB, Google Cloud Spanner and MongoDB.
  • Get started with YugaByte DB on macOS, Linux, Docker, and Kubernetes.
