Data Modeling Basics – PostgreSQL vs. Cassandra vs. MongoDB

SVP, Product

November 21, 2018

Application developers spend a considerable amount of time evaluating multiple operational databases to find the one that best fits their workloads’ needs. These needs include simplified data modeling, transactional guarantees, read/write performance, horizontal scaling, and fault tolerance. Traditionally, this selection process starts off by exploring the SQL vs. NoSQL database categories because each category presents a clear set of trade-offs. High performance in terms of low latency and high throughput is usually treated as a mandatory requirement and hence is expected in any database chosen.

This post aims to help application developers understand the choice of SQL vs. NoSQL in the context of the data modeling needs of an application. We use one SQL database, namely PostgreSQL, and 2 NoSQL databases, namely Cassandra and MongoDB, as examples to explain data modeling basics such as creating tables, inserting data, performing scans and deleting data. In a follow-on post, we will cover advanced topics such as indexes, transactions, joins, time-to-live (TTL) directives and JSON-based document data modeling.

How NoSQL Differs from SQL in Data Modeling?

SQL databases increase application agility through ACID transactional guarantees as well as with their the ability to query data using JOINs in unforeseen ways on top of existing normalized relational data models.

Given their monolithic/single-node architecture and the use of a master-slave replication model for redundancy, traditional SQL databases are missing two important capabilities — linear write scalability (i.e. automatic sharding across multiple nodes) and automatic/zero-data loss failover. This means data volumes ingested cannot exceed the max write throughput of a single node. Additionally, some temporary data loss should be expected on failover (on shared nothing storage architectures) given that the recent commits would not have showed up at the slave replica yet. Zero downtime upgrades are also very difficult to achieve in the SQL database world.

NoSQL DBs are usually distributed in nature where data gets partitioned or sharded across multiple nodes. They mandate denormalization which means inserted data also needs to be copied multiple times to serve the specific queries you have in mind. The overarching goal is to extract high performance by explicitly reducing the number of shards accessed during the read time. Hence the statement that NoSQL requires you to model your queries while SQL requires you to model your data.

NoSQL’s focus on achieving high performance on a distributed cluster is stated as the primary rationale for multiple data modeling compromises that include loss of ACID transactions, JOINs and consistent global secondary indexes.

The general perception is that even though NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees make them unfit for mission-critical data.

The following table details how NoSQL data modeling differs from that of SQL.

Features	SQL	NoSQL
Examples	PostgreSQL	Cassandra, MongoDB
Tables or Collections
Primary Key
Partition Key
Foreign Key
Global Secondary Indexes
Integrity Constraints
Single-Key & Multi-Key ACID Transactions
JOINs
Data Auto-Expiry
Data Volume Stored	Small (No Linear Write Scaling)	Large (Linear Write Scaling)
Aggregations	Built-in	External Frameworks
Data Types Stored	Less Flexible Schema with Structured Data	More Flexible Schema with Unstructured Data
Data Organization	Normalized	Denormalized
Fault Tolerance	Manual Failover	Automatic
Use Case	Complex Relational (i.e. Majority Multi-Key) Access	Simpler Non-Relational (i.e. Majority Single-Key) Access

SQL & NoSQL: Why You Need Both?

Most real-world applications with engaging user experiences such as Amazon.com, Netflix, Uber and Airbnb are internally powered by a complex mixture of multiple workloads. E.g. an e-commerce application like Amazon.com needs to store low-volume, highly-mission-critical data such as users, products, orders, invoices alongside high-volume, less-mission-critical data such as product reviews, helpdesk messages, user activity, user recommendations. Naturally, these applications rely on at least one SQL database alongside at least one NoSQL database. In multi-region and global deployments, the NoSQL database also acts as a geo-distributed cache for the data stored in the source of truth, the SQL database running in a single region.

How YugabyteDB Brings Together SQL & NoSQL on a Common Database Core?

Built on a unique combination of log-structured merge storage engine, auto-sharding, per-shard distributed consensus replication and distributed ACID transactions (inspired by Google Spanner), YugabyteDB is world’s 1st open source database that is both NoSQL (Cassandra compatible) and SQL (PostgreSQL compatible) at the same time. As shown in the table below, YCQL, YugabyteDB’s Cassandra compatible API, adds the notion of single-key and multi-key ACID transactions and global secondary indexes to NoSQL APIs thus ushering in the era of Transactional NoSQL. Additionally, YSQL, YugabyteDB’s PostgreSQL compatible API, adds the notions of linear write scaling and automatic fault-tolerance to a SQL API thus bringing forth the world of Distributed SQL. Since YugabyteDB is transactional at the core, even the NoSQL APIs can now be used in the context of mission-critical data.

Features	YSQL – PostgreSQL Compatible API (Beta)	YCQL – Cassandra Compatible API
Tables
Primary Key
Partition Key
Foreign Key
Global Secondary Indexes
Integrity Constraints
Single-Key & Multi-Key ACID Transactions
JOINs
Data Auto-Expiry
Data Volume Stored	Large (Linear Write Scaling)	Large (Linear Write Scaling)
Aggregations	Built-in	External Frameworks
Data Types Stored	Less Flexible Schema with Structured Data	More Flexible Schema with Unstructured Data
Data Organization	Normalized	Denormalized
Fault Tolerance	Automatic	Automatic
Use Case	Complex Relational (i.e. Majority Multi-Key) Apps	Simpler Non-Relational (i.e. Majority Single-Key) Apps

As previously described in Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugabyteDB, the choice of SQL vs. NoSQL in YugabyteDB depends entirely on the characteristics of the majority workload.

If the majority workload is multi-key operations with JOINS, then pick YSQL with the understanding that your keys may be distributed across multiple nodes leading to higher latency and/or lower throughput than NoSQL.
Otherwise, pick YCQL with the understanding that you will get higher performance benefits resulting from queries primarily being served from one node at a time. YugabyteDB can serve as the unified operational database for complex real-world apps that usually have multiple workloads to manage at the same time.

The data modeling lab in the next section is based on YugabyteDB’s PostgreSQL and Cassandra compatible APIs as opposed to the original databases. This approach highlights the simplicity of interacting with two different APIs (on two different ports) of the same database cluster as opposed to using completely independent clusters of two different databases.

In the next sections we’ll walk through a data modeling hands on lab to illustrate many of the differences and a few commonalities between different databases.

Data Modeling Lab

Install Databases

Given the focus on data modeling (and not on complex deployment architectures), we will install the databases in Docker containers on our local machines and then interact with them using their respective command-line shells.

YugabyteDB, a PostgreSQL & Cassandra compatible database

mkdir ~/yugabyte && cd ~/yugabyte
wget https://downloads.yugabyte.com/yb-docker-ctl && chmod +x yb-docker-ctl
docker pull yugabytedb/yugabyte
./yb-docker-ctl create --enable_postgres

On the YugabyteDB cluster we created, YSQL and YCQL are available on ports 5433 and 9042 respectively.

MongoDB

docker run --name my-mongo -d mongo:latest

Access using Command Line Shell

Next let’s connect to the databases using the command line shells for the respective APIs.

PostgreSQL

psql is a command line shell for interacting with PostgreSQL. For ease of use, YugabyteDB ships with a version of psql in its bin directory.

docker exec -it yb-postgres-n1 /home/yugabyte/postgres/bin/psql -p 5433 -U postgres

Cassandra

cqlsh is a command line shell for interacting with Cassandra and its compatible databases through CQL (the Cassandra Query Language). For ease of use, YugabyteDB ships with a version of cqlsh in its bin directory.

Note that CQL is heavily inspired by SQL with similar notion of tables, rows, columns and indexes. However, as a NoSQL language, it adds a specific set of restrictions most of which we will review during our blog series.

docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh

MongoDB

mongo is a command line shell for interacting with MongoDB. It can be found in the bin directory of a MongoDB installation.

docker exec -it my-mongo bash 
cd bin
mongo

Create a Table

We can now interact with the database for various operations using the command line shell. Let’s start with creating a table that stores information about songs published by artists. These songs are sometimes part of an album. The other optional attributes of a song are year released, price, genre and critic rating. We need account for additional attributes that we may need in the future through a ‘tags’ field that can store semi-structured data as key-value pairs.

PostgreSQL

CREATE TABLE Music (
    Artist VARCHAR(20) NOT NULL, 
    SongTitle VARCHAR(30) NOT NULL,
    AlbumTitle VARCHAR(25),
    Year INT,
    Price FLOAT,
    Genre VARCHAR(10),
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);

Cassandra

Create table in Cassandra is very similar to that of PostgreSQL. One big difference is the lack of integrity constraints (such as NOT NULL) which is the responsibility of the application and not the database in the NoSQL world. The primary key is comprised of the partition key (the Artist column in the example below) and a set of clustering columns (the SongTitle column in the example below). The partition key determines which partition/shard to place the row in and the clustering columns specify how the data inside a given shard should be organized.

CREATE KEYSPACE myapp;

USE myapp;

CREATE TABLE Music (
    Artist TEXT, 
    SongTitle TEXT,
    AlbumTitle TEXT,
    Year INT,
    Price FLOAT,
    Genre TEXT,
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);

MongoDB

MongoDB organizes data in Databases (equivalent to Cassandra Keyspace) that have Collections (equivalent to Tables) that have Documents (equivalent to a Row in a Table). As a “schemaless” database, the definition of the schema ahead of time is not necessary in MongoDB. The “use database” command shown below instantiates a database the very first time it is called along with the change of context to the newly created database. Even collections do not need to be created explicitly but are rather created automatically by simply inserting the first document into a new collection. Note that MongoDB’s default database is test so any collection-level operation done without specifying the database will be done in this default context.

use myNewDatabase;

Get Information About a Table

PostgreSQL

\d Music

 Table "public.music"
    Column    |         Type          | Collation | Nullable | Default 
--------------+-----------------------+-----------+----------+--------
 artist       | character varying(20) |           | not null | 
 songtitle    | character varying(30) |           | not null | 
 albumtitle   | character varying(25) |           |          | 
 year         | integer               |           |          | 
 price        | double precision      |           |          | 
 genre        | character varying(10) |           |          | 
 criticrating | double precision      |           |          | 
 tags         | text                  |           |          | 
Indexes:
    "music_pkey" PRIMARY KEY, btree (artist, songtitle)

Cassandra

DESCRIBE TABLE MUSIC;

CREATE TABLE myapp.music (
    artist text,
    songtitle text,
    albumtitle text,
    year int,
    price float,
    genre text,
    tags text,
    PRIMARY KEY (artist, songtitle)
) WITH CLUSTERING ORDER BY (songtitle ASC)
    AND default_time_to_live = 0
    AND transactions = {'enabled': 'false'};

MongoDB

use myNewDatabase;
show collections;

Insert Data into a Table

PostgreSQL

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Year, Price, Genre, CriticRating, 
    Tags)
VALUES(
    'No One You Know', 'Call Me Today', 'Somewhat Famous',
    2015, 2.14, 'Country', 7.8,
    '{"Composers": ["Smith", "Jones", "Davis"],"LengthInSeconds": 214}'
);

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, CriticRating)
VALUES(
    'No One You Know', 'My Dog Spot', 'Hey Now',
    1.98, 'Country', 8.4
);

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre)
VALUES(
    'The Acme Band', 'Look Out, World', 'The Buck Starts Here',
    0.99, 'Rock'
);

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, 
    Tags)
VALUES(
    'The Acme Band', 'Still In Love', 'The Buck Starts Here',
    2.47, 'Rock', 
    '{"radioStationsPlaying": ["KHCR", "KBQX", "WTNR", "WJJH"], "tourDates": { "Seattle": "20150625", "Cleveland": "20150630"}, "rotation": Heavy}'
);

Cassandra

The Cassandra INSERT statements look very similar to that of PostgreSQL in general. However, there is one big difference in semantics. INSERT is actually an upsert operation in Cassandra where the row is updated with the latest values in case the row already exists.

Same as the PostgreSQL INSERT statements above.

MongoDB

Even though MongoDB is also a NoSQL database similar to Cassandra, its insert operation does not have the same semantic behavior as Cassandra. MongoDB insert() has no upsert possibility which makes it similar to PostgreSQL. The default insert behavior with no _id specified will lead to a new document added to the collection.

db.music.insert( {
artist: "No One You Know", 
  	songTitle: "Call Me Today", 
   	albumTitle: "Somewhat Famous", 
   	year: 2015,
   	price: 2.14,
   	genre: "Country",
   	tags: { 
Composers: ["Smith", "Jones", "Davis"],
LengthInSeconds: 214
} 
   }
);

db.music.insert( { 
    artist: "No One You Know",
    songTitle: "My Dog Spot",
    albumTitle: "Hey Now",
    price: 1.98,
    genre: "Country",
    criticRating: 8.4
   }
);

db.music.insert( { 
    artist: "The Acme Band",
    songTitle: "Look Out, World",
    albumTitle:"The Buck Starts Here",
    price: 0.99,
    genre: "Rock"
   }
);

db.music.insert( { 
    artist: "The Acme Band",
    songTitle: "Still In Love",
    albumTitle:"The Buck Starts Here",
    price: 2.47,
    genre: "Rock",
    tags: {
        radioStationsPlaying:["KHCR", "KBQX", "WTNR", "WJJH"],
        tourDates: {
            Seattle: "20150625",
            Cleveland: "20150630"
        },
        rotation: "Heavy" 
}
    }
);

Query a Table

Arguably the most significant difference between SQL and NoSQL in terms of modeling queries is on the use of the FROM and WHERE clauses. SQL allows FROM clause to include multiple tables and WHERE clause to be of arbitrary complexity (including JOINs across tables). However, NoSQL tends to put a hard restriction on the FROM clause to have only one table specified and the WHERE clause to always have the primary key specified. This is because of NoSQL’s high performance focus we discussed earlier that aims to reduce any cross-table and cross-key interaction. Such interaction may introduce high latency cross-node communication into the query response time and hence is best avoided altogether. E.g. Cassandra requires that queries be restricted by operators (only =, IN, <, >, =>, <= are allowed) on partition keys except when querying a secondary index (where only = operator is allowed).

PostgreSQL

Following are 3 types of queries that can be served easily by a SQL database.

Return all of the songs by an artist
Return all of the songs by an artist, matching first part of title
Return all of the songs by an artist, with a particular word in the title but only if the price is less than 1.00

SELECT * FROM Music
WHERE Artist='No One You Know';

SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';

SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price > 1.00;

Cassandra

Of the PostgreSQL queries listed above, only the first one will work with Cassandra unmodified since LIKE operator is not allowed on clustering columns such as SongTitle. Only = and IN operators are allowed in this case.

SELECT * FROM Music
WHERE Artist='No One You Know';

SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle IN ('Call Me Today', 'My Dog Spot')
AND Price > 1.00;

MongoDB

As shown in the previous examples, the primary method for querying MongoDB is the db.collection.find() method. This method is qualified by the collection name (music in the example below) to be queried very explicitly so querying across collections is explicitly disallowed.

db.music.find( {
		artist: "No One You Know"
	} 
);

db.music.find( {
		artist: "No One You Know",
		songTitle: /Call/
	} 
);

Read All Rows From a Table

Reading all rows is simply a special case of the generic query pattern we observed earlier.

PostgreSQL

SELECT * 
FROM Music;

Cassandra

Same as the PostgreSQL SELECT statement above.

MongoDB

db.music.find( {} );

Modify Data in a Table

PostgreSQL

PostgreSQL provides the UPDATE statement for modifying data. It does not allow any upsert possibility so the statement will fail if the row does not exist in the database already.

UPDATE Music
SET Genre = 'Disco'
WHERE Artist = 'The Acme Band' AND SongTitle = 'Still In Love';

Cassandra

Cassandra also has an UPDATE statement similar to PostgreSQL. UPDATE also same upsert semantics as that of the INSERT statement.

Same as the PostgreSQL UPDATE statement above.

MongoDB

MongoDB’s update() operation can update an existing document entirely or can update only specific fields. By default, it updates only one document with upsert semantics off. Multi-document updates and upsert behavior can be turned on by setting additional flags on the operation. E.g. the example below updates the genre of a specific artist across of the artist’s songs.

db.music.update(
  {"artist": "The Acme Band"},
  { 
    $set: {
      "genre": "Disco"
    }
  },
  {"multi": true, "upsert": true}
);

Delete Data from a Table

PostgreSQL

DELETE FROM Music
WHERE Artist = 'The Acme Band' AND SongTitle = 'Look Out, World';

Cassandra

Same as the PostgreSQL DELETE statement above.

MongoDB

MongoDB has two types of operations to handle document deletions — deleteOne()/deleteMany() and remove(). Both delete document(s) but have different return results.

db.music.deleteMany( {
        artist: "The Acme Band"
    }
);

Remove a Table

PostgreSQL

DROP TABLE Music;

Cassandra

Same as the PostgreSQL DROP TABLE statement above;

MongoDB

db.music.drop();

Summary

The SQL vs. NoSQL debate has been raging over a decade now. There are 2 aspects of this debate: the database core architecture (monolithic, transactional SQL vs. distributed, non-transactional NoSQL) and the data modeling approach (model your data in SQL vs. model your queries in NoSQL).

With a distributed, transactional database such as YugabyteDB, the database core architecture part of the debate can be put to rest easily. As data volumes grow beyond what can be written into a single node, a fully distributed architecture that enables linear write scalability with automatic sharding/rebalancing becomes a must-have. Additionally, as described in this post from Google Cloud, transactional, strongly consistent architectures are now widely accepted to deliver higher developer and operations agility than non-transactional, eventually consistent architectures.

Coming to the data modeling debate, it is fair to say that both the SQL and NoSQL data modeling approaches are essential for any complex real-world application. SQL’s model-your-data approach allows the developers to cater to changing business requirements more easily while NoSQL’s model-your-queries approach enables the same developers to manage large data volumes with low latency and high throughput. This is precisely the reason YugabyteDB implements both SQL and NoSQL APIs on the common core instead of promoting that one approach is strictly better than the other. Additionally, by ensuring wire compatibility with popular database languages including PostgreSQL and Cassandra, YugabyteDB ensures that developers do not have learn another language in order to benefit from the distributed strongly-consistent database core.

This post helped us understand how data modeling basics differ between PostgreSQL, Cassandra and MongoDB. In the next posts in the series, we will dive into advanced data modeling concepts such as indexes, transactions, JOINs, TTL directives and JSON documents.

What’s Next?

Compare YugabyteDB to databases like CockroachDB, MongoDB and Amazon Aurora.
Get started with YugabyteDB on macOS, Linux, Docker and Kubernetes.
Contact us to learn more about licensing, pricing or to schedule a technical overview.

November 21, 2018

Data Modeling Basics – PostgreSQL vs. Cassandra vs. MongoDB

Related Posts

Explore Distributed SQL and YugabyteDB in Depth