The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

Relational Data Modeling with Foreign Keys in a Distributed SQL Database

Developer Advocate

Primary and foreign key constraints are as essential to a relational database as are tables and columns—without these, you cannot implement a design that conforms to the relational model. Their implementation is a technical challenge. For example, when child rows are inserted, their parent rows must be efficiently restricted; and when parent rows are deleted, multi-table transactions are needed. Nevertheless, monolithic SQL databases such as PostgreSQL, MySQL, and so on have supported foreign key constraints since time immemorial. However, the implementation challenges in a distributed SQL database are very much greater because the affected rows might span many table shards spread across many different nodes in the cluster. Ensuring low latency and high throughput in such a distributed database becomes a very important design concern.

We’re excited to announce that YugaByte SQL (YSQL), currently in Beta 3, supports foreign key constraints starting from v1.2.10.

This post highlights the need for relational data modeling with foreign keys in business-critical apps as well as provides a complete working example of foreign keys in YSQL. It concludes with a note on why foreign keys have traditionally not been supported in distributed databases prior to the advent of distributed SQL databases like YugaByte DB.

Why Developers Love Foreign Keys

Foreign keys enforce referential integrity constraints that are usually tied directly to the application’s business logic. One such key defines a reference relationship from one or many rows in a table that is constrained, the so-called child table, to exactly one row in a referred-to table, the so-called parent table. The reference source is a list of one or more columns in the child table; and the reference target is a list of exactly corresponding columns in the parent table. A parent row may have one or many child rows. A child row points to its parent row by recording the value of the parent row’s primary key; and a child row cannot exist unless it points to a parent row.

The dependency of child rows upon their parent row implies that the foreign key constraint’s definition must specify the action to be taken when a parent row is deleted. When no column that participates in a data rule can be null, there are only two choices: either to delete all of the child rows together with their parent row; or to resist the attempted delete by causing an error. The same applies for the attempt to update the primary key of a parent row. If you follow the strict school of database design thought that insists that primary key values are immutable, then it’s enough to specify that an attempt to change the value of a parent row’s primary key will cause an error. As a developer, having the database do all the tracking and enforcement that comes with foreign keys, is certainly a welcome relief. The alternative situation of building all this directly into the application logic is simply maintaining a long-term complexity without any data integrity guarantees.

A Simple Use Case

I’ve chosen one of the canonical examples that are used to illustrate proper normalization and that imply the use of foreign key constraints. Picture a retailer that has an inventory of items for sale—implying a table of items—and that takes orders from customers to whom the items are to be delivered—implying a table of orders. The orders table must specify the set of ordered items. Relational database design insists that a set of facts, like an item and all of its associated information, is represented just once. Therefore, a row that records the facts about an order must not explicitly record facts about the ordered items but, rather, must point to these in the items table. Moreover, a single row in one table cannot point directly to many rows in another table because its column values (in a purist approach) must be scalar. This implies, therefore, a table of order lines, where the row for each ordered item points both to the order to which it belongs and to the item that is ordered. Similarly, the row for an order must not explicitly represent the facts about the customer that placed the it because a customer might place many different orders. Rather, there must be a table of customers, and an order row must point to the customer that created the order.

These principles lead to the table design that the picture below shows. The so-called crow’s foot is at the child table. The other end of the line that denotes the foreign key constraint is at the parent table).

  • The customers table uses a numeric surrogate primary key which is automatically generated from a sequence. Because this is not found in the real world that the table models, there must also be a not null, unique business key. The natural choice is the customer’s email address. (Incidentally, this emphasizes the value of a surrogate primary key over a natural primary key: a customer can be uniquely identified by its preferred email address, but might very well change this from time to time. If the value of a natural key is changed, then the new value must be propagated to all rows in all of its child tables.) I’ll use the customer’s name instead of the email address to make the values in the code example easier to read. The ellipsis indicates various other facts, like the customer’s “preferred” status, that the business might want to record. Some of these, like delivery address, might be multi-valued. This would imply a table of addresses. And because several customers might specify the same delivery address, this would imply another so-called intersection table. (The order_lines table is such an intersection table.) It’s easy to see how the proper table design for modern businesses can require very many tables and very many foreign key constraints.
  • The items table, too, uses a numeric surrogate primary key which is automatically generated from a sequence. I’ve chosen the item’s name to illustrate the notion of the not null, unique business key. The ellipsis indicates various other facts, like price, quantity in stock, and so on.
  • The orders table reflects common practice. Its automatically generated numeric primary key serves as the business key. It uses a foreign key constraint to point to the customer that placed the order. The ellipsis indicates various other facts, like when it was created, its status (“under construction”, “placed”, “in transit”, “fulfilled”) and so on.
  • The order_lines table has a foreign key constraint to the orders table and another one to the items table. The primary key is defined on the column list (order_pk, item_pk). The ellipsis indicates various other facts like the quantity of items ordered and the discount to be applied to the ordered item’s list price because a promotion code was used.

The retailer requires that the table design supports various business functions, constrained appropriately by rules. Here are some examples:

  • Create a new customer.
  • Create a new item.
  • Create a new empty order for a specified customer.
  • Add an item to an existing order specified by its order number, the item name, and quantity ordered.
  • Delete an item. Must fail if it is referenced by existing order lines.
  • Calculate the total cost of an order.
  • Delete a customer and consequentially delete all its orders with their order lines.

SQL with Foreign Keys in Action

The SQL code presented in this section is available in .sql files and can be run at the prompt of ysqlsh, the YSQL Shell, by using the \i command. You can download a .zip of the .sql files here.

YSQL currently controls row-level locking in DocDB only implicitly, according to the isolation level and whether the SQL statement reads or writes. So explicit locking, such as the familiar “select for update”, isn’t yet supported. Until this support is added in a future release, any attempt to make a data change that might violate a foreign key constraint must be made in a serializable transaction. This means operations such as insert and delete have to be executed inside begin isolation level serializable and commit commands. This enhancement is tracked by GitHub Issue #1199.

First, follow the first two steps in the Quick Start to install YugaByte DB and create a local cluster. Then connect to ysqlsh by using instructions from the Explore YSQL section.

Now, we are ready to create some tables.

Get a clean start.

Notice that the sequence of DROP commands is critical. You cannot drop a parent table when it has child tables. For the same reason, a parent table must be created before its child tables.

Create the customers table.

I decided to write all the primary key and foreign key constraint definitions at table level where they can be easily seen. Some people prefer column-level constraints, and YugaByte DB supports these. You can establish a foreign key constraint either at create table time or, later, using alter table. My code illustrations for this post establish them only at create table time.

Create the items table.

Create the orders table.

The match full clause says that no value in the foreign key column list may be null. My implementation enforces this rule at a higher level of granularity with not null constraints on all columns that participate in referential integrity constraints.

The on delete cascade clause reflects the business rule that a parent customer may be deleted and, that when this is done, all of its child orders and grandchild order lines must be silently deleted in the same transaction. The on update restrict clause reflects the business rule that primary key values are immutable.

The column list on which a foreign key constraint is defined should have an index to benefit the performance of the restrict and cascade variants of on delete and on update.

Create the order_lines table.

The on delete cascade for the reference to the orders table reflects the business rule that a parent order may be deleted and, that when this is done, all of its child order lines must be silently deleted in the same transaction.

In contrast, the on delete restrict for the reference to the items table reflects the business rule that you cannot delete an item when at least one order line refers to it.

With the tables in place, we can now show the implementation of some illustrative business functions.

Create some customers.

Create some items.

Create a new empty order for customer Helen.

Notice the use of the returning clause. The UI needs to let a user create a new order and immediately add order lines to it without having to note down the order number when it’s first created and then having to enter it time and again for each order line. The application code looks after this by using the returned value of the system-generated order number, and by informing the user what order number was generated when the order is submitted.

Add one order line for the specified quantity of a particular item.

Add several more order lines for the specified quantities of various items.

Create a new empty order for customer Dave and insert several order lines.

Create a second new empty order for customer Helen and insert several order lines.

Basic two-table inner join: list all the orders.

Here are the results of this query:

List everything with an inner join between all four tables.

Notice that the surrogate primary key columns, with the exception of order_pk, are not included in the select list because they implement the “plumbing” and don’t belong in the business world. Order_pk is icluded because, while it is generated, it does surface into the business world. Here are the results of this query:

Notice the duplication of the values for customer_name and order_timestamp. Usually, application code will process these results to present customer_name and order_timestamp once as a header and to list the values for item_name, price and quantity under this. This is a very common pattern, and experience has shown that better performance is delivered this way than by doing two round trips first to get the values for customer_name and order_timestamp and then next to get the order items for this order.

List the items for order #1.

This uses the same four-table inner join as the previous query, adds a restriction, and removes the columns that the restriction uses from the select list. Here are the results of this query:

Compute the total cost of all orders.

Here are the results of this query:

Delete an unreferenced item.

Delete a referenced item.

The attempt fails with this error: update or delete on table “items” violates foreign key constraint “item_pk_fk” on table “order_lines”.

Cascade delete customer Helen.

This quietly succeeds and removes not only the specified customer by also all of its child orders (with the data created earlier, there is just one) and all of its grandchild order lines (there are two of these). The obvious select * queries confirm this.

Try to create an order line for an order that doesn’t exist.

The attempt fails with this error: insert or update on table “order_lines” violates foreign key constraint “order_pk_fk”

Why Foreign Keys are Challenging in a Distributed SQL Database

Guaranteeing that all committed states of a database have to honor a foreign key constraint, in the presence of concurrent sessions, requires fine-grained locking (for serializable isolation) and optimistic concurrency control (for snapshot isolation). Additionally, the updates between the parent and the child table have to be coordinated with distributed transactions. If one session attempts to delete a parent row, then all of its child rows (cascading through the closure of all descendant rows) must be restricted because other sessions might attempt, for example, to update an existing child row of the to-be-deleted parent. And if another session attempts to insert a new child of the to-be-deleted parent, then the attempt must be resisted. Corresponding requirements must be met if the order of events is reversed. If one session attempts to update an existing child row, or to insert a new child, then the parent must be restricted.

Of course, the technical challenges are greater in a distributed database than in a monolithic one because distributed transactions and distributed concurrency control (including lock management) are needed. Notwithstanding the scale of these challenges, YugaByte DB’s DocDB distributed document store has all the mechanisms to manage this properly. Read the documentation section Distributed ACID Transactions and the blog post Distributed PostgreSQL on a Google Spanner Architecture – Storage Layer for more information.

Summary

Application developers whose transactional use cases are adequately met by a monolithic database naturally choose a SQL database because of the huge benefits these bring for implementing new business cases quickly without compromising data integrity. And application developers whose use cases demand a distributed database have, during the previous many years, simply been forced to give up SQL and to invent and implement their data model from scratch. This practice brought a consequent need to program all business uses explicitly and, quite often, to redesign and reimplement the NoSQL data model as new cases arrived. They made this painful compromise because the need for large scale data volumes and numbers of concurrently submitted transactions, and automatic fault tolerance, trumped all other concerns.

This post has shown that developers can now meet the requirements that demand a distributed database and yet not compromise to give up the core benefits of relational data modeling. The code examples illustrate a canonical example of a database design within the relational model, and show how a distributed SQL database like YugaByte DB can implement requirements that would take huge amounts of programming in the regime of a home-grown, ad hoc, data model in just a few lines of code.

What’s Next?

  • Compare YugaByte DB in depth to databases like CockroachDB, Google Cloud Spanner and Amazon Aurora.
  • Get started with YugaByte DB on the cloud or container of your choice.
  • Contact us to learn more about licensing, pricing or to schedule a technical overview.

Related Posts

Developer Advocate