Yes We Can! Distributed ACID Transactions with High Performance
ACID transactions are a fundamental building block when developing business-critical, user-facing applications. They simplify the complex task of ensuring data integrity while supporting highly concurrent operations. While they are taken for granted in monolithic SQL/relational DBs, distributed NoSQL/non-relational DBs either forsake them completely or support only a highly restrictive single-row flavor (see sections below). This loss of ACID properties is usually justified with a gain in performance (measured in terms of low latency and/or high throughput).
YugaByte DB is a high-performance distributed database that supports fully distributed ACID transactions across multiple rows, multiple shards and multiple nodes at any scale. Just as an inspirational leader once said, Yes We Can! Additionally, YugaByte DB’s open API layer supports both NoSQL (Cassandra QL & Redis) and SQL (PostgreSQL as Beta) APIs. This post highlights the state of ACID transactions in distributed databases today and also explains how YugaByte DB makes distributed ACID transactions work without compromising on high performance.
Defining ACID Transactions
A transaction is a sequence of operations performed as a single logical unit of work. A transaction has four key properties — Atomicity, Consistency, Isolation and Durability — commonly abbreviated as ACID.
- Atomicity refers to all the work in a transaction being treated as one atomic unit — either all of it is performed or none of it is.
- Consistency ensures that the database is always in a consistent internal state. For example, in the case of tables with secondary indexes, the primary table and all the index tables should be consistent after an update.
- Isolation determines how/when changes made by one transaction become visible to the other. Serializable and Snapshot Isolation are the top 2 isolation levels from a strictness standpoint.
- Durability ensures that the results of the transaction are permanently stored in the system. The modifications must persist even in case of power loss or system failures.
In the context of distributed databases, ACID transactions can be internally classified into the following 3 flavors.
Transactions where all the operations impact only a single row (aka key) are called as single-row ACID transactions. Since the data for a single row typically doesn’t cross the boundary of a single node in most distributed DBs, single-row ACID transactions are easier to achieve in the distributed DB world. However, most NoSQL DBs cannot support even this flavor of transactions because of their eventually-consistent storage engines that has no inherent guarantee on the correctness of the data read. Notable exception is MongoDB, which claims support for single-document transactions.
A marginal improvement over single-row ACID is single-shard ACID where all the rows involved in the operations of a transaction are located in a single shard of a distributed database. Since a single shard is always located inside a single node, this flavor too doesn’t involve coordinating transaction operations across multiple nodes and hence is easier to implement in a distributed DB. E.g. MongoDB recently announced the intent to support single-shard transactions in a future release.
In auto-sharded distributed DBs like YugaByte DB, Google Cloud Spanner and Azure Cosmos DB, shards are spread across multiple nodes by design on cluster creation. A transaction that impacts a set of rows distributed across shards on multiple nodes is called a distributed ACID transaction. Implementing distributed ACID transactions in scale-out DBs requires the use of a transaction manager that can coordinate the various operations and then commit/rollback the transaction as needed. Popular NoSQL databases are designed to avoid this additional complexity in the fear that they will have to compromise on performance (in the form of increase in write latency and decrease in linear scalability). As we see below, this fear is not grounded in reality and it is indeed possible to increase application dev agility through a single distributed DB that has both distributed ACID transactions and high performance. Apple’s recently open sourced FoundationDB follows a similar design philosophy where transactions are handled using a transactional authority.
Distributed ACID Transactions in YugaByte DB
YugaByte DB has a sophisticated design powering distributed transactions to achieve high performance without compromising correctness. It efficiently detects and optimally handles different scenarios involving single-row, single-shard and distributed ACID transactions. But for the sake of simplicity, we are going to look at the general case of how distributed transactions operate on multiple keys spread across many nodes.
The following steps outline the lifecycle of a transaction in YugaByte DB.
Step 1. Pick a transaction manager
YugaByte DB has a built-in transaction manager to manage the lifecycle of transactions. The transaction manager runs on every node (as a part of the tablet server process) in the cluster and automatically scales out as nodes are added.
Since the transaction manager is stateless, incoming transactions can be routed to any node. Any transaction manager in the cluster can manage incoming transaction. In order to optimize the performance of the transaction, the YugaByte Query Layer (YQL) tries to schedule the transaction on the tablet server that owns most of the data accessed by the transaction.
Step 2. Create an entry in transaction status table
Transactions need to be tracked in a reliable and fault-tolerant manner. This is essential since a transaction might attempt to update multiple keys, but could fail or get aborted at any intermediate step.
In order to track the transaction, a new entry is created for the transaction in a transaction status table. The following information is stored as a part of this transaction entry.
- Transaction ID which is a UUID that uniquely identifies the transaction.
- Status which can be one of pending, committed or aborted. All transactions start out in the pending status, and progress to the committed or aborted status, in which they remain permanently until cleaned up.
- Commit hybrid timestamp which is the commit timestamp for the transaction. This is the final timestamp used for multi-version concurrency control (MVCC) to serve the various updates made by the transaction if it gets committed.
- List of ids of participating tablets which are the final set of tablets that the transaction has written to. Read more about this in step #6 which describes how provisional writes are cleaned up.
Step 3. Write provisional records
Even at this point in the transaction’s lifecycle, it is not possible to predict if any of the updates will conflict with the updates of another transaction. Therefore, YugaByte DB writes provisional records to all tablets responsible for the keys the transaction is trying to modify. These records are provisional because they are invisible to readers until the transaction commits.
In addition to storing the data being updated by the transaction, the provisional records also act as a persistent revocable lock. These locks, represented by provisional records, can be revoked by another conflicting transaction. The conflict resolution subsystem makes sure that for any two conflicting transactions, at least one of them is aborted.
Furthermore, transaction metadata records are also stored in order to efficiently find the following information for a given transaction:
- the transaction coordinator
- the priority, to decide which of the conflicting transactions to abort
- all provisional records written as a part of the transaction in this tablet
This is done in order to increase the performance of the commit step of the transaction discussed in Step 4b below.
Step 4a. Handle conflicts
When multiple transactions run concurrently, they may try to update the same set of keys. When this happens, there is a likelihood that these updates might violate the isolation guarantees of the running transaction unless conflicts are detected and handled. The set of conflicting updates depend on the operation (read vs. write) and the isolation level guarantee (Serializable vs. Snapshot Isolation).
The table below lays out a simple view of the conflicts that must be detected and resolved. The resolution is generally to internally restart one of the conflicting transactions at a later point of time when the conflict can be resolved.
YugaByte DB currently detects and handles write-write conflicts at Snapshot Isolation level automatically. The other conflicting operations will be handled and resolved automatically as part of the upcoming Serializable isolation level support.
Step 4b. Commit the transaction
Once the transaction manager has successfully written all the provisional records, it proceeds to commit the transaction by sending an RPC request to the transaction status tablet. The commit operation will only succeed if the transaction has not yet been aborted due to conflicts. The atomicity and durability of the commit operation itself is guaranteed by the transaction status tablet’s Raft group. Once the commit operation is complete, all provisional records immediately become visible to clients.
The commit timestamp is chosen as the current hybrid time at the transaction status tablet at the moment of appending the transaction committed entry to its Raft log. It is then used as the final MVCC timestamp for regular records that replace the transaction’s provisional records when provisional records are being applied and cleaned up.
Step 5. Send acknowledgement back to client
Once the transaction has been committed, the transaction manager acknowledges back to the client that the transaction was successful.
Step 6. Clean up provisional records
The transaction status tablet sends cleanup requests to each of the tablets that participated in the transaction. This can be done efficiently since the list of ids of participating tablets is stored as a part of the transaction entry.
The clean up request contains a special apply record with the transaction id and commit timestamp. Upon receiving this request, the tablet removes the provisional records belonging to the transaction, and writes permanent records with the correct commit timestamp.
Once all participating tablets have successfully processed these apply records, the status tablet can delete the transaction status record. The deletion of the status record happens by writing a special applied everywhere entry to the Raft log of the status tablet. Raft log entries belonging to this transaction will be cleaned up from the status tablet’s Raft log as part of regular garbage-collection of old Raft logs soon after this point.
Now the data is ready to be served as documented in Transactional Read path.
Achieving High-Performance in Distributed Transactions
There are a number of techniques used in YugaByte DB in order to achieve high performance while retaining distributed ACID transactional guarantees. A few of these are outlined below.
1. Aggressively cache transactions in progress
A given transaction that is in progress may want to look up its own metadata (such as the tablets it has updated). Other transactions may also want to lookup information about conflicting transactions and abort them depending on the relative priority. YugaByte DB aggressively caches information for transactions that are in progress to make the lookups very fast and efficient.
2. Fine-grained locking
YugaByte DB supports fine-grained locking in order to perform conflict resolution. This is essential to handle distributed transactions with performance in a document-oriented database (which YugaByte DB is at its core). Without fine-grained locks, transactions that update non-overlapping attributes in a document may end up contending on one another.
The image below shows the types of fine-grained locks YugaByte DB has implemented.
For example, if a transaction modifies a single column within a row in a table, the following fine-grained locks might be taken:
- A weak lock is taken on the row (document root) to ensure that it cannot be simultaneously deleted by another transaction, but allows transactions that update columns to continue in parallel.
- A strong lock is taken on the column being updated to serialize the updates to that column in order to guarantee consistency.
3. Safe time
Every read request in YugaByte DB is assigned a particular hybrid time (for MVCC) — the read hybrid timestamp. This allows write operations to the same set of keys to happen in parallel with reads, ensuring high performance.
It is crucial, however, that the view of the database as of this read hybrid timestamp is not updated by concurrently happening writes. This is done to ensure that successive reads of the data at a particular timestamp do not see different results if the read operation is retried. In order to achieve this, YugaByte DB uses the concept of hybrid time leader leases. YugaByte DB handles the intricate dependencies between the following operations to ensure correctness:
- a single row update for a key
- a provisional record written for a key before a transaction is finalized
- reading the value of the key from the DB
4. Auto-detect and optimize single-row ACID vs distributed ACID
YugaByte DB has been specifically designed to be able to detect two flows without compromising correctness:
- The single-row ACID transactions are optimized to have low latencies when there is no conflicting operation.
- In case there are conflicting operations, or the query is attempting a distributed/multi-row transactional operation where provisional records need to be written, YugaByte DB automatically switches to preserving correctness even if it means higher latencies.
Thus, YugaByte DB allows batching of write operations with single-row ACID semantics to achieve very high performance on streaming ingest, while at the same time it allows transactional updates to a set of keys with consistency.
Transactions are an essential feature when developing user-facing web/mobile applications. Typically, a large percentage of workloads in such an application requires high performance with single-row ACID guarantees, while a small percentage of the workloads require distributed ACID transactions with extreme data integrity. YugaByte DB is designed to strike a good balance in addressing both scenarios. The end goal is radical simplification of application development through a unified data platform that not only brings NoSQL & SQL together at the API layer but also brings ACID transactions & high-performance together at the core DB layer. Two areas we continue to work on are Jepsen testing for data correctness validation under failure conditions and additional performance benchmarks for workloads with distributed transactions. Stay tuned for results.
If our approach to database design sounds exciting, try out YugaByte DB — it is Apache 2.0 open source! Install it today, and check out how distributed ACID transactions work on your laptop in a few minutes. Also, don’t forget to support our GitHub project with your stars, watches and forks 🤓