The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

Achieving Sub-ms Latencies on Large Datasets in Public Clouds

One of our users was interested in learning more about YugaByte DB’s behavior for a random read workload where the data set does not fit in RAM and queries need to read data from disk (i.e. an uncached random read workload).

The intent was to verify that YugaByte DB handles this case well, issuing the optimal number of IOs to the disk subsystem.

This post is a sneak peek into just one aspect of YugaByte DB’s innovative storage engine, DocDB, which supports very high data densities per node, something that helps you keep your server footprint and AWS (or relevant cloud provider) bills low! If you’re interested in the internals of DocDB, you can check out our GitHub repo.

Summary

We loaded about 1.4TB of raw data (including replication) across 4 nodes, and configured the block cache on each node to be only 7.5GB. On these four 8-vCPU machines, we achieved about 77K read ops/second with an average read latency of 0.88 ms.

More importantly, as you’ll see from the details below, YugaByte DB does the optimal number of disk IOs for this workload.

This is possible because of YugaByte DB’s highly optimized storage engine (DocDB) and its ability to chunk/partition and cache the index and bloom filter blocks effectively.

Read latency during random reads from disk
Read ops/sec during random reads from disk

Details

Setup

4-node cluster in Google Cloud Platform (GCP)

Each node is a:

  • n1-standard-8
  • 8 vcpu; Intel® Xeon® CPU @ 2.60GHz
  • RAM: 30GB
  • SSD: 2 x 375GB

Replication Factor = 3
Default Block Size (db_block_size) = 32KB

Load Phase & Data Set Size

  • Number of KVs: 1.6 billion
  • KV Size: ~300 bytes
      ◦ Value size: 256 bytes (deliberately chosen to be not very compressible)
      ◦ Key size: 50 bytes
  • Logical Data Size: 1.6B * 300 bytes = 480GB
  • Raw Data Including Replication: 480GB * 3 = 1.44TB
  • Raw Data per Node: 1.44TB / 4 nodes = 360GB
  • Block Cache Size: 7.5GB
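The sizing arithmetic above can be checked with a quick back-of-the-envelope calculation (all figures taken directly from the list above):

```python
# Back-of-the-envelope check of the data-set sizing above.
num_kvs = 1.6e9   # number of key-value pairs loaded
kv_size = 300     # bytes per KV: ~256-byte value + ~50-byte key, rounded
rf = 3            # replication factor
nodes = 4         # cluster size

logical = num_kvs * kv_size   # logical data size, bytes
raw = logical * rf            # raw data including replication
per_node = raw / nodes        # raw data stored per node

print(f"logical:  {logical / 1e9:.0f} GB")   # -> 480 GB
print(f"raw:      {raw / 1e12:.2f} TB")      # -> 1.44 TB
print(f"per node: {per_node / 1e9:.0f} GB")  # -> 360 GB
```

Note that the 360GB per node is roughly 48x the 7.5GB block cache, so the vast majority of random reads must go to disk.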

Loading Data

This was run using a sample load tester bundled with YugaByte DB.

Running Random Read Workload

We used 150 concurrent readers; reads were distributed uniformly at random across the 1.6B keys loaded into the system.
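The shape of this workload can be sketched as follows. This is a simplified illustration, not the actual load tester bundled with YugaByte DB; `read_fn` stands in for whatever client driver call performs the point read:

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_KEYS = 1_600_000_000   # keys loaded in the setup above
READERS = 150              # concurrent reader threads, as in the test

def run_workload(read_fn, ops_per_reader):
    """Each reader issues uniformly random point reads across the key space.

    read_fn is a hypothetical stand-in for the database client's get() call.
    """
    def reader():
        for _ in range(ops_per_reader):
            # Uniform random key choice: with 360GB per node and a 7.5GB
            # block cache, most of these reads will miss the cache.
            read_fn(f"key:{random.randrange(NUM_KEYS)}")

    with ThreadPoolExecutor(max_workers=READERS) as pool:
        for _ in range(READERS):
            pool.submit(reader)
```

Because the key distribution is uniform rather than skewed, the block cache gets little help from locality, which is exactly what makes this an uncached random read test.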

Disk Utilization

Sample disk IO on one of the nodes during the “random read” workload is shown below. The disk stats show that the workload is evenly distributed across the 2 available data SSDs on the system. Each of the four nodes handles about 16.4K disk read ops/sec (8.2K per SSD) to serve the 77K user read ops/sec cluster-wide.

The average IO size per disk read is 230MB/s ÷ 8.2K reads/sec ≈ 29KB. This corresponds to our db_block_size of 32KB. It is slightly smaller because while the keys in this setup are somewhat compressible, the bulk of the data volume is in the value portion, which was deliberately picked to be not very compressible.
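Putting the iostat numbers together confirms that each disk read fetches roughly one data block:

```python
# Average IO size per disk read, using the per-disk iostat figures above.
throughput_mb_s = 230   # MB/s of read throughput per SSD
reads_per_s = 8200      # disk read ops/sec per SSD
db_block_size_kb = 32   # configured db_block_size

avg_io_kb = throughput_mb_s * 1024 / reads_per_s
print(f"average IO size: {avg_io_kb:.0f} KB")  # -> 29 KB, just under 32KB

# One ~32KB block read per cache miss is the optimal IO pattern here:
# no read amplification beyond the single data block needed per lookup.
```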

iostat (disk utilization) during random read workload

Bloom Filter Efficiency

The index blocks and bloom filters are cached effectively, and therefore all the cache misses, about 8.2K per disk per second (as shown above), are for data blocks. The resulting IO is about 230MB/s on each disk for those 8.2K disk read ops/sec.

In addition, the bloom filters are highly effective in minimizing the number of IOs (to SSTable files in our Log-Structured Merge (LSM) organized storage engine), as shown in the chart below.
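To see why bloom filters matter in an LSM engine: a point read may otherwise have to probe every SSTable on disk. A per-SSTable bloom filter answers “might this key be here?” from memory, so disk IO is only issued to SSTables that may actually contain the key. The sketch below is a minimal, generic bloom filter for illustration; it is not DocDB’s implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch (illustrative, not DocDB's actual code)."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from k salted hashes of the key.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent" -> the SSTable read can be skipped.
        # True means "possibly present" -> only then is a disk IO issued.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

With one such filter per SSTable, a read that would naively touch N files typically touches only the one file holding the key, which is what keeps the disk read count near one block per lookup.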

Bloom filter’s effectiveness in YugaByte DB

Conclusion

YugaByte DB performance on large data sets

This post highlights YugaByte DB’s ability to support sub-ms read latencies on large data sets in a public cloud deployment. We will cover several additional aspects of YugaByte DB’s storage engine in subsequent posts — topics ranging from optimizations for fast-data use cases such as “events” / “time-organized” data, to modeling of complex types such as collections and maps with minimal read and write amplification.

If you have questions or comments, we would love to hear from you in our community forum.

If you want to try this on your own, you can get started with YugaByte DB in just 5 mins.
