The YugaByte Database Blog

Thoughts on open source, cloud native and distributed databases

Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugaByte DB

YugaByte’s mission from day one has been to simplify operational database infrastructure. We are doing so by bringing together the best aspects of SQL and NoSQL into a single transactional, high-performance database. I am pleased to announce a key milestone in our mission with the formal introduction of YSQL, YugaByte DB’s PostgreSQL-compatible distributed SQL API, as part of the recent 1.1 release. YSQL (currently in beta) becomes the third member of our multi-API family that previously included two NoSQL APIs, specifically the Cassandra-compatible YCQL and the Redis-compatible YEDIS.

 

YugaByte DB Architecture

In this post we’ll highlight the emerging application and infrastructure needs that motivated us to build YSQL. We also discuss the use cases where YSQL co-exists with our NoSQL APIs.

Why Distributed SQL?

YSQL is a distributed SQL implementation on top of the YugaByte DB core engine that offers PostgreSQL functionality, but runs across multiple nodes.

This architecture has a number of benefits such as enabling developer agility, feature velocity and operational simplicity. Let’s review some of the key benefits of such an architecture.

Scale out on-demand

There is often a need to add resources to a database due to an increase in the number of writes/reads per second (aka throughput) or growth in dataset sizes. This is very easy to do in YugaByte DB by simply adding more nodes to an existing cluster.

For traditional SQL deployments, this is not possible. One of the following has to be done:

  • Scale Up – Single Node SQL Deployments: The only option here is to scale up, meaning increase the node size (with respect to CPU, memory and storage). This option can quickly get expensive, and also hits a max size after which scaling is not possible and needs a full re-architecture of the app.
  • Scale Out – Master-Slave SQL Deployments: A mitigation is to add slaves to serve reads to increase read scalability. It is not possible to scale out the writes in this setup (since only the SQL master handles all writes). Additionally, reads from the slaves are stale data. Once again, if the service outgrows the resource requirements of the master in terms of CPU or storage, it will result in a full re-architecture of the app.
  • Sharded SQL Deployments: The app re-architecture mentioned in the previous points often result in a sharded SQL deployment, where the app stores different shards of the data on different SQL nodes, each with their own slave replica for high availability. This is a very complex and tedious undertaking, and forsakes useful SQL features such as global secondary indexes, transactions and cross-shard joins. In this setup, re-sharding, when demand increases, tends to be even more complex.

Hence, all of these scenarios are very limiting for fast growing online services that require developer agility and feature velocity. A lot of these limitations are remedied by distributed SQL, along with bringing other desirable features into the deployment. Next, let’s look at some of the advantages.

Resilient and self-healing

YugaByte DB has been built ground-up to tolerate node/disk/region failures without any data loss or need for manual intervention while offering the full SQL functionality. Since YugaByte DB uses active-active replication, the failover of queries from the failed node to a replica happens in a matter of seconds, thereby ensuring high availability. And since the replication uses Raft consensus, there is no data loss even on a node failure or a network partition.

Let us contrast this with traditional SQL database deployments. First, they need to be setup in a master-slave replication configuration. Second, this requires constant operational monitoring to make sure the slaves are not falling behind the master too much. Upon the failure of the master, there is a need to decide if the loss of some recently written data to the slave is acceptable. Additionally, promoting the slave to a new master, and rebuilding a new slave is a manual process.

High performance at massive scale

The YugaByte DB core today supports massive write scalability and low-latency reads in order to support NoSQL workloads live at multiple production deployments. This allows it to handle millions of read/write requests and many TBs of data per node. The core has been stress tested heavily and is being used by many customers for serving their business-critical workloads. It is this same core that powers YSQL, thus bringing the performance and scale to SQL workloads as well.

Multi-zone and geographically replicated deployments

With the rapid adoption of the public clouds, there is a strong desire to have data replicated across availability zones in order to survive a zone failure. YugaByte DB offers a strongly consistent multi-zone deployment that can support apps requiring SQL. Additionally, YugaByte DB is designed to run well in a multi-region setup where the nodes in a cluster are geographically dispersed. This allows globally consistent writes and potentially low-latency access from different geographies. Such “planet-scale” architectures are increasingly becoming common given the need to move data closer to the end user for low latency access, or compliance requirements such as GDPR.

Cloud native and Kubernetes ready

Cloud native deployments are becoming the norm. Many deployments today are multi-cloud and hybrid cloud. Enterprises are increasingly leveraging Kubernetes to run even stateful workloads such as operational databases. YugaByte DB works natively in Kubernetes out of the box. For a scale out database to work well in Kubernetes, there are a number of features needed such as rapid, efficient scalability when the number of pods is increased or decreased. Another requirement is working with node names (or public IPs) and pod IPs (or private IPs) depending on the deployment.

Why PostgreSQL Compatibility?

As firm believers in open source and open APIs, building a proprietary SQL API was not even an option for us. However, which existing SQL flavor to build compatibility with became a non-trivial question given the abundance of SQL flavors in the market. In the end, we selected PostgreSQL as our SQL of choice. We were truly impressed by the openness, maturity and feature completeness of PostgreSQL. These attributes have created a highly passionate and thriving community of users. Industry analysts are also observing these same trends that have led to a big resurgence of PostgreSQL especially over the last 5 years. Matt Asay’s recent post “Why Oracle’s missteps have led to PostgreSQL’s ‘moment’ in the database market” analyzes the trends in a fair amount of detail.

We are excited at the opportunity to serve the PostgreSQL community with YSQL.

Co-existence with Transactional NoSQL

SQL has been the lingua franca of the database world for almost 4 decades. The ability to store data in a normalized schema and then ask an ever-changing set of questions without changing the schema is immensely powerful. In the mid-2000s when workload sizes started growing beyond what could be stored in a single SQL node, database designers invented the NoSQL paradigm. NoSQL runs on scale-out architectures with denormalized schema while delivering higher performance and higher availability than traditional monolithic SQL. However, this design came with compromises. Strong consistency and ACID transactions were no longer allowed. Additionally, NoSQL languages focused on single-key access patterns and did not have SQL-like JOINs that allows multiple keys to be involved in a query.

As we have previously highlighted in “A Primer on ACID Transactions,” the loss of strong consistency and ACID transactions was a design error in first generation NoSQL databases such as Apache Cassandra, Amazon DynamoDB and MongoDB. In the second generation of NoSQL, databases such as YugaByte DB are correcting this error by bringing in the full spectrum of ACID transactions into NoSQL APIs. This means that with YugaByte DB, the debate of SQL vs. NoSQL is simply reduced to the characteristics of your majority workload:

  • If the majority of the workload is multi-key operations with JOINs, then choose SQL while remaining cognizant that your keys may be distributed across multiple nodes leading to higher latency and/or lower throughput than NoSQL.
  • If the majority of the workload is single-key operations without JOINs, then choose NoSQL and get the higher performance benefits resulting from queries primarily being served from one node at a time.

In the following section we’ll highlight a few specific use cases where a transactional NoSQL API can perform better than SQL.

Event processing

There are a number of microservices or services that need to deal with event data that is ever growing in nature. There are a number of examples of such data:

  • Location tracking use-cases that track shipment locations or location of a driver
  • User recommendation apps that track the most popular products in different categories
  • Stock market ticker data that model the performance of a portfolio over time
  • IoT applications to monitor the health of components in a manufacturing pipeline

For each of the above applications, there is often the need for a write-optimized NoSQL store for the rapidly growing sets of data, while the SQL functionality is essential for the other microservices that deal with billing and reporting. Using multiple databases makes it hard to handle scaling, failures, data coherence and other Day 2 operations such as backups and restores. YugaByte DB is a single database that can serve both theses NoSQL and SQL needs.

Real-time analytics – ML and AI workloads

Many apps now internally use machine learning and artificial intelligence to enrich recommendations, alert end users, personalize experience for end users, and so on. There is a need to run these analytics in real time and report the results to an end user. Such use cases often require NoSQL databases in order to handle the scale and analyze the data with tools such as Apache Spark, Presto or Kafka KSQL. Examples of such use cases include:

  • Fraud detection in a variety of online apps involving users and payments
  • Personalization of a user’s online experience based on their preferences and shopping history
  • A number of security related use-cases such as network intrusion detection

YugaByte DB’s Cassandra-compatible YCQL API is natively integrated with analytics frameworks such as Apache Spark and Presto to address such use cases.

Accommodating alternate access patterns

There is often a need for a number of alternate access patterns and features that are optimized for NoSQL APIs.

Modeling dynamic data structures such as sets. While it is possible to model sets in SQL, it is inefficient to get the cardinality of the set (especially for large sets). The Redis-compatible YEDIS API in YugaByte DB makes this very efficient. Example use-cases of sets and sorted sets include:

  • Tracking the number of people that liked a post in a social network
  • Modeling a gaming leader board, where we want to display the top scoring players along with the total number of players

Automatic data expiry using time-to-live (TTL) feature. There are use-cases where a data record needs to be deleted automatically in a certain time period after insertion. Examples include:

  • Deleting the one minute stock ticker values older than a month, since the end user may not care about it in certain applications
  • Deleting the raw data collected from an IoT device after a week of it getting inserted into the database

Unifying SQL and NoSQL – Database Consolidation

Piecing together many different databases (SQL, NoSQL and distributed caches) is a major contributor to slow app development, and leads to operational complexity. Thus, a growing enterprise needs a distributed multi-model database that can be adopted as the primary operational data platform. But app developers quickly run into the need for both SQL and NoSQL. The need for SQL is often very clear, given the unparalleled flexibility it brings, and the deep understanding of SQL that most developers possess. But over time, NoSQL features become important as well.

YugaByte DB is a multi-model, multi-API database with a common distributed storage engine to power both SQL and NoSQL APIs. The biggest benefit of this convergence is that enterprises can now prevent database sprawl by consolidating their operational data tier in a single database. We have an exciting roadmap in store for each of our APIs including the general availability of YSQL next year. Databases won’t look the same thereafter .. we promise!

What’s Next?

Karthik Ranganathan

Founder & CTO