YugaByte DB Architecture: Diverse Workloads with Operational Simplicity
YugaByte DB is a transactional, high performance, geo-distributed operational database that converges multiple NoSQL and SQL interfaces into an unified solution. The v0.9 public beta of YugaByte DB is compatible with Apache Cassandra (CQL) and Redis APIs, with PostgreSQL under development. A fundamental design goal for YugaByte DB has been to provide the same transactional, performance and operational simplicity guarantees irrespective of the API used. One might expect that these guarantees would come at the expense of higher underlying implementation complexity in the database. However, the modular architecture of YugaByte DB allows to limit most of the differences between client APIs to only the query layer (“YQL”), while the underlying data replication and storage layer (“YBase”) is largely agnostic of the data model. This post highlights the architectural considerations that went into the design of the YBase and YQL layers.
A YugaByte DB cluster is known as a universe. It’s is installed on a collection of nodes that may be located in one or more regions/AZs or even multiple clouds. All nodes run YB Tablet Server processes, which are responsible for the majority of data storage, replication, and querying tasks. Data is organized into tables (multiple Cassandra tables and one common table for all Redis data), and every table is automatically split into shards we call “tablets”. Each tablet is automatically assigned to multiple (usually three) tablet servers, and these nodes form a Raft group associated with that tablet, elect a leader, and replicate updates to the tablet’s data.
In addition, a small number of nodes run YB Master processes. YB Masters are responsible for storing metadata and coordinating universe-wide actions, such as automatic load balancing, table creation, or schema changes. Masters also comprise a Raft replication group, store a special “master tablet” with system metadata, and elect a leader. The master tablet contains a system table that knows about objects such as tables, tablets, tablet servers, and CQL keyspaces. The master leader runs various background operations and respond to queries about metadata, while master followers stand by ready to take over in case the leader is unavailable.
To see this architecture in action, let’s look at how a write operation travels through YugaByte DB after it hits the Redis or Cassandra API endpoint of a YB node.
- 1. Query language processing. The YQL engine interprets the statement, identifies the key involved, and finds the set of YB tablet servers responsible for storing that key. It might need to send a request to the master leader if it does not have that information cached.
- 2. Forwarding to the tablet leader. The request gets forwarded to the leader of the tablet containing the key. In many cases, this forwarding will be purely local, because both CQL and Redis Cluster clients are capable of sending requests to the right server and avoiding an additional network hop.
- 3. Preparing the batch for replication. The tablet leader performs any reads necessary for read-modify-write transactions, converts the logical write operation into a batch of underlying key/value records, and appends an update record to the tablet’s Raft log. Some more logic specific to the data model (CQL or Redis) is invoked in order to prepare this key/value batch.
- 4. Raft replication. The tablet leader writes the update to its log, sends it to its followers (that also write the update to their logs), and waits for acknowledgments from the majority of them. For example, with replication factor of three, only one node other than the leader has to acknowledge the write for it to be considered committed.
- 5. Applying the update to the storage engine on the leader. The update is then applied to the local storage engine that we call DocDB, and a successful response is sent to the client.
- 6. Responding to the client. The tablet leader sends a reply to the YQL engine, which then replies to the client.
- 7. Applying the update to the storage engine on followers. The tablet’s Raft group leader sends a heartbeat message to the followers to tell them that the record is committed. The followers also apply the update to their local DocDB engines.
As we can see, only two of the above steps are specific to Cassandra or Redis data models. In addition to the handling of reads and writes coming from the client, there are also various background operations running in a YugaByte DB universe. Here are a couple of examples:
- Flushes and compactions. Each tablet server periodically writes (flushes) accumulated in-memory updates to sorted files known as SSTables. These files are later combined into larger files via a process called “compaction” typical of LSM (Log-Structured Merge Tree) storage engines. This is needed to reduce the number of files that need to be looked up on every read operation. As every tablet server hosts replicas of multiple tablets, it is important to avoid “compaction storms”, and compactions are throttled in a way that limits the total number of compactions in progress at any given time.
- Load balancing. The YB master leader is responsible for balancing the load across the cluster. When a new node joins the cluster, the master gradually assigns more and more work to that node by selecting some tablets and copying those tablets’ data to the new node. Once the initial copying process is complete, the new node can join Raft quorums of its assigned tablets, and start getting Raft updates. At the same time, the master chooses some nodes that have extra copies of these tablets and tells them to leave the respective Raft quorums and stop hosting those tablets, thus balancing the load across the cluster in terms of CPU, memory, and disk space simultaneously. The load balancer also ensures that each tablet server has approximately the same number of tablet leaders, as leaders handle all writes and strongly consistent reads and require more resources compared to followers.
Common Storage Engine
The background operations described above are done the same way for Cassandra and Redis data. However, even within the same API / data model, there could be multiple access patterns / workloads in a client application. Let’s consider these two:
- A random-access key-value workload. Keys and values are being written and then read multiple times. This could be an authentication and user profile service of a web or mobile application.
- A time-series workload. New data keeps getting written, and recent data is accessed much more frequently than old data. There is typically a timestamp column in the user table’s schema, with frequent range queries on this timestamp. In this workload the data could be ever-growing, or it could be written with a TTL (time-to-live) parameter specifying the retention period.
The above workloads require different styles of background compactions. We are using an approach where files of approximately the same size are compacted together, similarly to Apache HBase’s default compaction algorithm. This helps avoid write amplification that would result from always compacting a little bit of new data and a large existing file. However, in the time series use case with TTL turned on, it makes sense to not compact files larger than a certain size at all, because once all keys in a given file expire as prescribed by their TTL settings, the entire file can be deleted. Also, in the time series workload, range queries on the time series column are very common. YugaByte DB optimizes these queries by maintaining per-file metadata specifying the range of Cassandra “clustering columns” (columns supporting range queries) in each file.
Tunable Read Consistency
Another dimension of workload variation is the consistency level. All writes to YugaByte DB happen with strong consistency, which means they have to be committed to Raft before being acknowledged to the client. However, a read request can choose between the following consistency levels:
- Strong consistency (latest, most up-to-date data)
- Bounded staleness (return a value that is no older than X milliseconds)
- Session (a value that is at least as fresh as the last value seen by the client)
- Consistent prefix (a potentially stale value taken from a consistent view of the database at some moment in the past. In other words, if A is updated and then B is updated, and the read request sees the new value of B, it must also see the new value of A).
Lower consistency levels can be achieved at lower latency by reading from the replica closest to the client. For example, a financial application might be writing stock quotes from a Tokyo datacenter to a YugaByte DB universe replicated across datacenters in Japan, Europe, and East Coast of the US, and a client app in New York might read quotes from the local replica with low latency. The client can specify the consistency level using the standard Cassandra client interface, or through a YugaByte DB extension to the Redis protocol and client driver.
With the ability to support multiple data models, access patterns, and consistency levels, YugaByte DB can replace multiple systems (for example, sharded Redis + sharded SQL) in an application stack. This significantly reduces the burden on the operations team, as dealing with a unified data platform is easier than handling high availability, replication, and failover for multiple data stores. Here are some aspects of YugaByte DB’s operational simplicity:
- Auto sharding. YugaByte DB automatically shards every table into tablets by a part of the primary key called a “partition key”. This eliminates the need for manual sharding, which often makes it very difficult to run apps built on top of sharded SQL or Redis in production.
- Auto load balancing. When a new YB tablet server is added, a fraction of data is automatically moved there, equalizing the load across the cluster without downtime. See our video for a detailed example. Achieving this in manually sharded environments frequently requires undivided attention of the operations team for many days or weeks. Also, because of YugaByte DB’s use of the Raft protocol, copying data to a new node is very efficient compared to eventually consistent systems like Cassandra. In those systems, a logical scan of multiple replicas has to be performed in order to recover the latest state of one shard.
- Zero downtime infrastructure portability. It is very easy to fully replace all nodes in a YugaByte DB universe with nodes of a different configuration (e.g. a different cloud instance type), while keeping the system online at all times. As with automatic load balancing, nodes can be added one at a time, with some tablet replicas assigned to them, and old nodes decommissioned in a similar way. This orchestration is performed automatically in YugaWare, the management console that ships with the Enterprise Edition of YugaByte DB. This allows to simplify upfront capacity planning and easily adapt the infrastructure footprint to the application workload.
In this blog post, we have reviewed the architecture of YugaByte DB and saw how it is applicable to a wide range of use cases and workloads that vary across multiple dimensions. For a more in-depth overview of the architecture, see YugaByte DB documentation. If you are working on an application utilizing Apache Cassandra or Redis APIs, please take a look at YugaByte DB Quick Start tutorial. Once you’ve tried to set up a local cluster and send a few queries, feel free to post your feedback and suggestions on YugaByte Community Forum. Also, YugaByte DB Community Edition is open source under Apache 2.0 license and is available on GitHub. We look forward to hearing from you!