The YugaByte Database Blog

Thoughts on open source, cloud native and distributed databases

Data Modeling Basics – PostgreSQL vs. Cassandra vs. MongoDB

Application developers usually spend considerable time evaluating multiple operational databases to find that one database that’s best fit for their workload needs. These needs include simplified data modeling, transactional guarantees, read/write performance, horizontal scaling and fault tolerance. Traditionally, this selection starts out with the SQL vs. NoSQL database categories because each category presents a clear set of trade-offs. High performance in terms of low latency and high throughput is usually treated as a non-compromisable requirement and hence is expected in any database chosen.

This post aims to help application developers understand the choice of SQL vs. NoSQL in the context of the data modeling needs of an application. We use one SQL database, namely PostgreSQL, and 2 NoSQL databases, namely Cassandra and MongoDB, as examples to explain data modeling basics such as creating tables, inserting data, performing scans and deleting data. In a follow-on post, we will cover advanced topics such as indexes, transactions, joins, time-to-live (TTL) directives and JSON-based document data modeling.

How NoSQL Differs from SQL in Data Modeling?

SQL databases increase application agility through ACID transactional guarantees as well as with their the ability to query data using JOINs in unforeseen ways on top of existing normalized relational data models.

Given their monolithic/single-node architecture and the use of a master-slave replication model for redundancy, traditional SQL databases are missing two important capabilities — linear write scalability (i.e. automatic sharding across multiple nodes) and automatic/zero-data loss failover. This means data volumes stored cannot exceed the capacity of a single node (1TB or less) and some temporary data loss should be expected on failover. Zero downtime upgrades are also very difficult to achieve in the SQL database world.

NoSQL DBs are usually distributed in nature where data gets partitioned or sharded across multiple nodes. They mandate denormalization which means inserted data also needs to be copied multiple times to serve the specific queries you have in mind. The overarching goal is to extract high performance by explicitly reducing the number of shards accessed during the read time. Hence the statement that NoSQL requires you to model your queries while SQL requires you to model your data.

NoSQL’s focus on achieving high performance on a distributed cluster is stated as the primary rationale for multiple data modeling compromises that include loss of ACID transactions, JOINs and consistent global secondary indexes.

The general perception is that even though NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees make them unfit for mission-critical data.

The following table details how NoSQL data modeling differs from that of SQL.

FeaturesSQLNoSQL
ExamplesPostgreSQLCassandra, MongoDB
Tables or Collections
Primary Key
Partition Key
Foreign Key
Global Secondary Indexes
Integrity Constraints
Single-Key & Multi-Key ACID Transactions
JOINs
Data Auto-Expiry
Data Volume StoredSmall
(No Linear Write Scaling)
Large
(Linear Write Scaling)
AggregationsBuilt-inExternal Frameworks
Data Types StoredLess Flexible Schema with Structured DataMore Flexible Schema with Unstructured Data
Data OrganizationNormalizedDenormalized
Fault ToleranceManual FailoverAutomatic
Use CaseComplex Relational (i.e. Majority Multi-Key) Access Simpler Non-Relational (i.e. Majority Single-Key) Access

SQL & NoSQL: Why You Need Both?

Most real-world applications with engaging user experiences such as Amazon.com, Netflix, Uber and Airbnb are internally powered by a complex mixture of multiple workloads. E.g. an e-commerce application like Amazon.com needs to store low-volume, highly-mission-critical data such as users, products, orders, invoices alongside high-volume, less-mission-critical data such as product reviews, helpdesk messages, user activity, user recommendations. Naturally, these applications rely on at least one SQL database alongside at least one NoSQL database. In multi-region and global deployments, the NoSQL database also acts as a geo-distributed cache for the data stored in the source of truth, the SQL database running in a single region.

How YugaByte DB Brings Together SQL & NoSQL on a Common Database Core?

Built on a unique combination of log-structured merge storage engine, auto-sharding, per-shard distributed consensus replication and distributed ACID transactions (inspired by Google Spanner), YugaByte DB is world’s 1st open source database that is both NoSQL (Cassandra & Redis compatible) and SQL (PostgreSQL compatible) at the same time. As shown in the table below, YCQL, YugaByte DB’s Cassandra compatible API, adds the notion of single-key and multi-key ACID transactions and global secondary indexes to NoSQL APIs thus ushering in the era of Transactional NoSQL. Additionally, YSQL, YugaByte DB’s PostgreSQL compatible API, adds the notions of linear write scaling and automatic fault-tolerance to a SQL API thus bringing forth the world of Distributed SQL. Since YugaByte DB is transactional at the core, even the NoSQL APIs can now be used in the context of mission-critical data.

FeaturesYSQL - PostgreSQL Compatible API (Beta)YCQL - Cassandra Compatible API
Tables
Primary Key
Partition Key
Foreign Key
Global Secondary Indexes
Integrity Constraints
Single-Key & Multi-Key ACID Transactions
JOINs
Data Auto-Expiry
Data Volume StoredLarge
(Linear Write Scaling)
Large
(Linear Write Scaling)
AggregationsBuilt-inExternal Frameworks
Data Types StoredLess Flexible Schema with Structured DataMore Flexible Schema with Unstructured Data
Data OrganizationNormalizedDenormalized
Fault ToleranceAutomaticAutomatic
Use CaseComplex Relational (i.e. Majority Multi-Key) AppsSimpler Non-Relational (i.e. Majority Single-Key) Apps

As previously described in Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugaByte DB, the choice of SQL vs. NoSQL in YugaByte DB depends entirely on the characteristics of the majority workload.

  • If the majority workload is multi-key operations with JOINS, then pick YSQL with the understanding that your keys may be distributed across multiple nodes leading to higher latency and/or lower throughput than NoSQL.
  • Otherwise, pick either of the two NoSQL APIs with the understanding that you will get higher performance benefits resulting from queries primarily being served from one node at a time. YugaByte DB can serve as the unified operational database for complex real-world apps that usually have multiple workloads to manage at the same time.

The data modeling lab in the next section is based on YugaByte DB’s PostgreSQL and Cassandra compatible APIs as opposed to the original databases. This approach highlights the simplicity of interacting with two different APIs (on two different ports) of the same database cluster as opposed to using completely independent clusters of two different databases.

In the next sections we’ll walk through a data modeling hands on lab to illustrate many of the differences and a few commonalities between different databases.

Data Modeling Lab

Install Databases

Given the focus on data modeling (and not on complex deployment architectures), we will install the databases in Docker containers on our local machines and then interact with them using their respective command-line shells.

YugaByte DB, a PostgreSQL & Cassandra compatible database

On the YugaByte DB cluster we created, YSQL  and YCQL are available on ports 5433 and 9042 respectively.

MongoDB

Access using Command Line Shell

Next let’s connect to the databases using the command line shells for the respective APIs.

PostgreSQL

psql is a command line shell for interacting with PostgreSQL. For ease of use, YugaByte DB ships with a version of psql in its bin directory.

Cassandra

cqlsh is a command line shell for interacting with Cassandra and its compatible databases through CQL (the Cassandra Query Language). For ease of use, YugaByte DB ships with a version of cqlsh in its bin directory.

Note that CQL is heavily inspired by SQL with similar notion of tables, rows, columns and indexes. However, as a NoSQL language, it adds a specific set of restrictions most of which we will review during our blog series.

MongoDB

mongo is a command line shell for interacting with MongoDB. It can be found in the bin directory of a MongoDB installation.

Create a Table

We can now interact with the database for various operations using the command line shell. Let’s start with creating a table that stores information about songs published by artists. These songs are sometimes part of an album. The other optional attributes of a song are year released, price, genre and critic rating. We need account for additional attributes that we may need in the future through a ‘tags’ field that can store semi-structured data as key-value pairs.

PostgreSQL

Cassandra

Create table in Cassandra is very similar to that of PostgreSQL. One big difference is the lack of integrity constraints (such as NOT NULL) which is the responsibility of the application and not the database in the NoSQL world. The primary key is comprised of the partition key (the Artist column in the example below) and a set of clustering columns (the SongTitle column in the example below). The partition key determines which partition/shard to place the row in and the clustering columns specify how the data inside a given shard should be organized.

MongoDB

MongoDB organizes data in Databases (equivalent to Cassandra Keyspace) that have Collections (equivalent to Tables) that have Documents (equivalent to a Row in a Table). As a “schemaless” database, the definition of the schema ahead of time is not necessary in MongoDB. The “use database” command shown below instantiates a database the very first time it is called along with the change of context to the newly created database. Even collections do not need to be created explicitly but are rather created automatically by simply inserting the first document into a new collection. Note that MongoDB’s default database is test so any collection-level operation done without specifying the database will be done in this default context.

Get Information About a Table

PostgreSQL

Cassandra

MongoDB

Insert Data into a Table

PostgreSQL

Cassandra

The Cassandra INSERT statements look very similar to that of PostgreSQL in general. However, there is one big difference in semantics. INSERT is actually an upsert operation in Cassandra where the row is updated with the latest values in case the row already exists.

MongoDB

Even though MongoDB is also a NoSQL database similar to Cassandra, its insert operation does not have the same semantic behavior as Cassandra. MongoDB insert() has no upsert possibility which makes it similar to PostgreSQL. The default insert behavior with no _id specified will lead to a new document added to the collection.

Query a Table

Arguably the most significant difference between SQL and NoSQL in terms of modeling queries is on the use of the FROM and WHERE clauses. SQL allows FROM clause to include multiple tables and WHERE clause to be of arbitrary complexity (including JOINs across tables). However, NoSQL tends to put a hard restriction on the FROM clause to have only one table specified and the WHERE clause to always have the primary key specified. This is because of NoSQL’s high performance focus we discussed earlier that aims to reduce any cross-table and cross-key interaction. Such interaction may introduce high latency cross-node communication into the query response time and hence is best avoided altogether. E.g. Cassandra requires that queries be restricted by operators (only =, IN, <, >, =>, <= are allowed) on partition keys except when querying a secondary index (where only = operator is allowed).

PostgreSQL

Following are 3 types of queries that can be served easily by a SQL database.

  • Return all of the songs by an artist
  • Return all of the songs by an artist, matching first part of title
  • Return all of the songs by an artist, with a particular word in the title but only if the price is less than 1.00

Cassandra

Of the PostgreSQL queries listed above, only the first one will work with Cassandra unmodified since LIKE operator is not allowed on clustering columns such as SongTitle. Only = and IN operators are allowed in this case.

MongoDB

As shown in the previous examples, the primary method for querying MongoDB is the db.collection.find() method. This method is qualified by the collection name (music in the example below) to be queried very explicitly so querying across collections is explicitly disallowed.

Read All Rows From a Table

Reading all rows is simply a special case of the generic query pattern we observed earlier.

PostgreSQL

Cassandra

MongoDB

Modify Data in a Table

PostgreSQL

PostgreSQL provides the UPDATE statement for modifying data. It does not allow any upsert possibility so the statement will fail if the row does not exist in the database already.

Cassandra

Cassandra also has an UPDATE statement similar to PostgreSQL. UPDATE also same upsert semantics as that of the INSERT statement.

MongoDB

MongoDB’s update() operation can update an existing document entirely or can update only specific fields. By default, it updates only one document with upsert semantics off. Multi-document updates and upsert behavior can be turned on by setting additional flags on the operation. E.g. the example below updates the genre of a specific artist across of the artist’s songs.

Delete Data from a Table

PostgreSQL

Cassandra

MongoDB

MongoDB has two types of operations to handle document deletions — deleteOne()/deleteMany() and remove(). Both delete document(s) but have different return results.

Remove a Table

PostgreSQL

Cassandra

MongoDB

Summary

The SQL vs. NoSQL debate has been raging over a decade now. There are 2 aspects of this debate: the database core architecture (monolithic, transactional SQL vs. distributed, non-transactional NoSQL) and the data modeling approach (model your data in SQL vs. model your queries in NoSQL).

With a distributed, transactional database such as YugaByte DB, the database core architecture part of the debate can be put to rest easily. As data volumes grow beyond what can be written into a single node, a fully distributed architecture that enables linear write scalability with automatic sharding/rebalancing becomes a must-have. Additionally, as described in this post from Google Cloud, transactional, strongly consistent architectures are now widely accepted to deliver higher developer and operations agility than non-transactional, eventually consistent architectures.

Coming to the data modeling debate, it is fair to say that both the SQL and NoSQL data modeling approaches are essential for any complex real-world application. SQL’s model-your-data approach allows the developers to cater to changing business requirements more easily while NoSQL’s model-your-queries approach enables the same developers to manage large data volumes with low latency and high throughput. This is precisely the reason YugaByte DB implements both SQL and NoSQL APIs on the common core instead of promoting that one approach is strictly better than the other. Additionally, by ensuring wire compatibility with popular database languages including PostgreSQL and Cassandra, YugaByte DB ensures that developers do not have learn another language in order to benefit from the distributed strongly-consistent database core.

This post helped us understand how data modeling basics differ between PostgreSQL, Cassandra and MongoDB. In the next posts in the series, we will dive into advanced data modeling concepts such as indexes, transactions, JOINs, TTL directives and JSON documents.

What’s Next?

Sid Choudhury

VP, Product