Distributed SQL Tips and Tricks – May 17th, 2021

May 17, 2021

Welcome back to our bi-weekly tips and tricks blog where we recap some distributed SQL questions from around the Internet. We’ll also review upcoming events, new documentation, and blogs that have been published recently. Got questions? Make sure to ask them on our YugabyteDB Slack channel, Forum, GitHub, or Stack Overflow. Let’s dive in:

Setting Firewall Rules for Cloud Deployments

When deploying a YugabyteDB cluster on a cloud, users occasionally miss a vital step: making sure the nodes in the cluster can communicate. In order for the database nodes to communicate with each other there must be firewall rules to allow for internode RPC communication. In fact, there are default ports you need to be aware of required for client APIs, the admin web servers, as well as Prometheus monitoring. At a minimum, however, the cluster will require internode RPC traffic to be allowed. You can set up your firewall rules using your cloud providers firewall configuration documentation. Here is an example of what that looks like with GCP. Additionally you can edit these rules by going to your GKE console and selecting the Clusters tab. From here you would select your cluster and scroll down to the Networking section and select your Network. Navigate to the Firewall Rules tab and click on the rule you need to change. At the top there should be an Edit option where you can add or remove any ports needed from the default ports above.

Does YugabyteDB Require Vacuuming like PostgreSQL?

YugabyteDB does not require vacuuming. YugabyteDB uses DocDB as its storage engine, which stores data differently than Postgres does. The Postgres/YSQL layer in YugabyteDB uses Postgres as an API, and changes postgres format data on the fly to DocDB format when it gets stored, and changes it back when it is read, and therefore at the Postgres/YSQL layer always uses a clean, non bloated version of the data. Therefore YugabyteDB does not require vacuuming, and does not impose the PostgreSQL problem of bloat, and the risk of running out of transaction IDs when not vacuuming.

YugabyteDB performs a seemingly equivalent process of compaction done automatically in the background, which is inherent to the LMS tree based storage. There are no operational risks for using YugabyteDB and the timing of the compaction, but obviously it is extra work done.

In order to further provide the user with flexibility YugabyteDB allows you to set flags for tuning how often these compactions run. Overall we have multiple workloads where YugabyteDB is deployed for large datasets, multiple TBs, and the issue with compactions has not surfaced. This is because our approach to collecting data is different. This does not affect foreground ops/sec or p99 latencies. Our founders had done a lot of work ensuring this at Facebook with HBase, and a lot of that is implemented in YugabyteDB as well. We continue to make improvements in order to constantly improve efficiency, such as in Github Issue #7937 and #7047 where we committed a few improvements for indexes.

Versioning Differences Between Stable and Latest Releases

Users occasionally wonder whether they need to upgrade to every new YugabyteDB release as it becomes available. To better understand this you must understand our versioning system. Yugabyte adopted a new versioning system this past year, one that we hope makes it much easier to ensure Yugabyte continues to bring new features to the market at a rapid pace, while also communicating to our users about stable releases, our maintenance policy, and when it is time to upgrade. As a part of this change we broke out the stable and latest releases. Stable releases are suitable for production level deployment. These are denoted by MAJOR.EVEN versioning (2.4, 2.6, …). Latest releases are development releases that introduce new features and changes that are suitable for testing and proof of concept. They are denoted by MAJOR.ODD versioning (2.5, 2.7, …). This is why your Yugabyte Cloud instance may be running one version while the version you just downloaded is another.

YSQL vs YCQL

A significant benefit of YugabyteDB over other distributed SQL databases is its multi-API approach. With YugabyteDB not only can you support an application that requires a fully-relational SQL API with YSQL, but you can also support workloads that require a semi-relational SQL API with YCQL. If you start playing around with either you will see that they very much have the look and feel of two open-source database technologies in PostgreSQL and Cassandra, respectively. This is because we re-used the PostgreSQL query layer for YSQL, and YCQL itself has its roots in the Cassandra Query Language. In order to make it simple for our users to get started with YugabyteDB we felt it was important to make the learning curve as easy as possible. What is easier for developers than to keep the same syntax available in YugabyteDB? However, a question that arises is when do we use one over the other. At a high level YSQL is best fit for RDBMS workloads that need horizontal write scalability, and global data distribution, while also using relational modeling features such as JOINS and foreign keys. YCQL, however, should be used when you do not require fully-relational data modeling constructs or need TTL-driven automatic data expiration. Note that strongly-consistent secondary indexes and unique constraints are supported by YCQL. Uniquely, both API query layers run on YugabyteDBs distributed document store DocDB. As such, the way transactions, sharding, and replication are handled by the storage layer remains consistent. Please be aware that today these APIs are isolated and independent of one another. Meaning that data inserted and managed by one API cannot be queried by the other API. If you are looking to access the data across APIs external frameworks such as Presto can help for simple cases. The main point to note is that you need to select an API first before undertaking detailed database schema/query design and implementation. If you are unsure which API to use, share your use case with us in our Community Slack or the Yugabyte Forum and the community can help you decide the best approach.

Usage of SparkSQL Queries Against YSQL and YCQL Tables

One of YugabyteDB’s numerous Ecosystem integrations is Apache Spark, which you can integrate with both the YCQL and YSQL APIs. We have our connector for Apache Spark for YCQL, while for YSQL you can use the PostgreSQL/JDBC connector. If you are looking to play around and learn more, check out this real-world example tutorial that is a blueprint for an IoT application built on top of YugabyteDB (using the Cassandra-compatible YCQL API) as the database, Confluent Kafka as the message broker, KSQL or Apache Spark Streaming for real-time analytics and Spring Boot as the application framework.

New Documentation, Blogs, Tutorials, and Videos

New Blogs

New Videos

Upcoming Events

May 20, 2021 – Catalysts of Change: Retail Fireside Chat May 20, 2021
May 26, 2021 – What’s New in YugabyteDB 2.7
May 27, 2021 – YugabyteDB DBA Fundamentals Training and Certification
June 24, 2021 – YugabyteDB YCQL Fundamentals Community Training and Certification
June 30, 2021 – High Performance Design & Architecture in a Distributed SQL Database

New and Updated Docs

We’re continually adding to and updating the documentation to give you the information you need to make the most out of YugabyteDB. We had so many new and updated docs for you, that we wrote a blog post this month to cover recent content added, and changes made to the YugabyteDB documentation.