The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

Document Data Modeling in YugabyteDB with the JSON Data Types

Developer Advocate

YugabyteDB has two JSON data types, json and jsonb, to let you store documents in a column in a YSQL table and to do content-based queries with index support. YSQL is PostgreSQL compatible and it therefore supports every one of the rich set of about thirty five JSON-specific operators and functions that users of PostgreSQL’s JSON functionality might already know, and rely upon. These features let you handle semi-structured data, without the need for “alter table” DDLs to add or change columns. This blog post describes a few straightforward JSON use cases to whet your appetite.

Before we dive in, we wanted to let you know that the Yugabyte team has been working from home in order to do our part with social distancing and to help with containment efforts. We have also transitioned to online meetings with our customers, partners, candidates, community, and fellow Yugabeings. We will continue with these and other measures until it’s determined to be safe otherwise. We wish everyone the best health and wellness, and our thoughts are with all those around the world who have been affected by COVID-19. If during these unprecedented circumstances, you have some time or would welcome a brief distraction, we invite you to check out this post below.

Introduction

There are about thirty-five YSQL operators and functions to let you read a JSON value from inside a document, create a document from SQL values, and update a document by making specified changes. At the other end of the spectrum from basic content-based queries, you can, for example, shred a complex document into SQL tables.

Moreover, the PL/pgSQL language supports the SQL type system, and the SQL builtin operators and functions, so you can also manipulate JSON documents in stored procedures, functions, and triggers using the same json and jsonb data types, spelling expressions in exactly the same way that you do in SQL statements.

Because much of YSQL’s functionality is built by reusing the PostgreSQL query layer, its JSON operators and functions behave exactly the same in YugabyteDB as they do in PostgreSQL.

I won’t describe all the thirty-five operators and functions in this blog post. This is done in the YSQL documentation, here, where each is illustrated with a self-contained code example that you can simply copy and paste into ysqlsh.

This blog post introduces you to the basics. I’m hoping to describe use cases that need YugabyteDB’s more advanced JSON functionality in future posts.

What is JSON and why is it interesting in a SQL database?

If you’re already well-versed in JSON, you can skip this section. JSON stands for JavaScript Object Notation, a text format for the serialization of structured data. Its syntax and semantics are defined in RFC 7159. Such a serialization is usually called a document because its representation is ordinary text (using Unicode characters).

Briefly, a JSON document can represent the values in a row from a conventional SQL table, using a compound data type called object, or it can represent a whole table, using a compound data type called array. Values of these data types can be indefinitely nested. An object is a set of key-value pairs; these values might be primitive (a number, a string, a boolean, or null); or a value might itself be an object (usually of a different kind); or it might be an array. An array is an ordered list of unnamed values; and, once again, each array value might be primitive, or an object, or an array. And so it goes on.

You can see that a complete set of tables of interest (like, for example, the Northwind sample database) can be serialized to JSON for transport. The top-level document would be an array whose values are objects, each of which would have two key-value pairs: “table name” (with a string value); and “table data” (whose value is an array of objects). In this way, JSON and SQL go hand-in-hand.

The YSQL data types json and jsonb

YSQL supports two data types to represent a JSON document: json and jsonb. Both these data types conform to the RFC 7159 standard. The json data type simply stores the text representation of a JSON document as presented. In contrast, the jsonb data type stores a parsed representation of the document hierarchy of subvalues in an appropriate internal format. Some people prefer the mnemonic “binary” for the “b” suffix; others prefer “better”. Of course, it takes more computation to store a JSON document as a jsonb value than as json value. But this cost is repaid when subvalues are operated on using the operators and functions that the YugabyteDB documentation describes.

Among the set of operators and functions, about three-quarters have both a jsonb and a json variant; just under a quarter have only a jsonb variant; and just two have only a json variant. Moreover, the two functions that have only a json variant have better alternatives that handle jsonb.

If you choose to use the json data type, then the workaround, when you find out that you need jsonb-only functionality, is trivial: you simply typecast the json value to jsonb before invocation and typecast the return value from jsonb back to json. But doing this not only clutters your code; it also squanders the time that you saved on ingestion each time you execute an operation that’s available only for jsonb values. You might argue that you’ll never need such functionality. But experience has shown that, sooner or later, new processing requirements arise that do require jsonb.

I strongly recommend, therefore, that you simply choose to use jsonb in all use cases.

Example table

Here’s an example table that stores JSON objects that represent information about books. I’ll use the “books” example throughout what follows. The {} pair brackets an object; and the [] pair brackets an array. I’m using “k serial” for simplicity. I described a better approach for implementing a self-populating primary key in a distributed SQL database in an earlier blog post.

Notice that the first two rows have some of the keys missing, as JSON allows. But the other rows each have every key. When your documents are more complex, and (as is typical) have little or no whitespace, you can understand them more easily by using “jsonb_pretty()”, documented here.

Executing this:

shows this:

Ignore the + signs. These are inherited by ysqlsh from PostgreSQL’s psql to make “newline” characters visible.

Content-based queries

Here’s an example of the simplest content-based query that JSON supports:

Notice that in JSON, key names are case-sensitive, and so you must spell it like this:

And not like this:

That’s why I spelled the key “ISBN” in upper case—just to make this point.

The ->> operator, documented here, simply says: get me the value, from the key-value pair called “k”, from the object. This is the result:

The first two rows are blank (meaning the SQL “null”) because those JSON objects don’t have the “genre” key. If you want to exclude them, just use your standard SQL knowledge:

It’s annoying that you have to spell out the JSON expression again in the “where” clause. But SQL is just like this—it’s the same for a non-JSON expression in the “select” list. You can’t use the alias “genre”. You can avoid repeating the expression by using a “with” clause:

If you want to extract a deeply nested value, you just spell out the path, step by step:

You spell the path with -> for all but the last step, which you spell with ->>. See the documentation on these operators, here, to understand why. This is the result:

Here’s an interesting requirement:

Get me the ISBN values for books that have Melisa among the editors.

The ? operator, described here, comes to the rescue:

It produces this result:

Notice the typecast ::bigint is applied to the extracted value of the “ISBN” key. That’s because it’s a JSON “number” value (not surrounded by double quotes) and you want to read it into the corresponding SQL data type.

There are also “any” and “all” variants, ?| and ?&. These are described in the same section in the documentation as the ? operator. For this example, you would supply a list of candidate editors.

Adding index support

You’re probably thinking already that the queries shown above, for a table with a realistic number of rows, would never perform adequately because each one that I show implies a table scan and an on-the-fly programmatic analysis of each document. So you’ll want to create indexes on JSON values that you typically reference in your “where” clause. YSQL lets you create an index on an expression, or a list of expressions. So indexing JSON documents is easy.

Let’s think first about “ISBN”. This is the real world’s version of a surrogate primary key—just an arbitrary, automatically assigned, unique value. So it has to be mandatory. (See the next section. I explain, there, how we can enforce this.) You’ll want to use “ISBN” in the “where” clause for single-row lookups, or (in a short list) to retrieve a few rows. But you’ll never want to do a range query because this can’t have real world significance. This is where a HASH index is good.

Notice the typecast. You’ll want to express “ISBN” in queries as a 13-digit integer, like this:

And then this:

It produces this result:

You’ll doubtless also want to do range queries on JSON values, like “year”, that are amenable to that notion. This is where you want ASC—or maybe DESC. You know already, now, that this is bound to work:

Now, this query will be index-supported:

It produces this result:

Lastly, in this section, we see from the sample data above that there can be lots of rows that don’t have the “genre” key-value pair, and so the -> extraction evaluates to the SQL null. This is a perfect case for a partial index.

Notice that not all distributed SQL databases support partial indexes. For example, CockroachDB currently does not.

Adding constraints

Developers don’t like to assume that the data that their code handles will always follow the business rules simply by programming carefully. This is why SQL databases have supported constraints since the dawn of time. YSQL lets you define constraints, as well as indexes, with reference to expressions. So you can straightforwardly create the constraints that achieve what you’d want in a classic relational table design by imposing rules on the otherwise entirely unconstrained flexible content of your JSON documents. Here are four examples.

Ensure that every “books.doc” value is a JSON object

This is your fundamental assumption about your corpus of documents. So you definitely don’t want to admit any non-objects. You need no more than the “jsonb_typeof()” function, documented here.

Ensure that ISBN is a positive 13-digit integer

This is the current standard for the form of a book’s ISBN—used across the publishing industry.

I ought to show tests for all the constraints that I define in this section. But I decided to save you from mental overload. (Trust me, I did test them all.) Here’s the test for just the “books_doc_isbn_ok” constraint:

It causes this error:

Ensure that the title and author’s first and last name are jointly unique

You’ll never find two different Amy Tans, each of whom has written a book called Joy Luck Club. So it’s sensible to enforce this rule. It turns out that, while you can create a unique constraint with reference to a plain column list, you cannot do this with reference to a list of expressions. (This is a PostgreSQL limitation that YSQL inherits.) But this is of no practical consequence because a unique index serves the purpose.

First, we decide that “title” and “author->>family_name” must be not null. We’re fine with a book written by, say, Sting or Prince—but we won’t allow two books by Sting with the same title.

Now we create the unique index:

Ensure that every “books.doc” value has the expected structure

For this example, I’ll assume that you know the following from external documentation:

  • A “book” JSON object will never have any keys except these: “ISBN”, “year”, “genre”, “title”, “author”, and “editors”. It might, however, not have all of them.
  • The value for the “author” key is an object with these keys: “given_name” and “family_name”. It’s OK, however, if it doesn’t have both of these.
  • The value for the “editors” object is an array, each of whose values is a primitive string.
  • The value for each of the “ISBN” key and the “year” key is JSON number. And the values for each of the remaining keys is JSON string.

You can create a single constraint that enforces conformance to all these rules. It’s as easy as this:

The “trick” behind the apparent simplicity—and maximum readability—of this “add constraint” statement is that all the implementation detail is encapsulated in a single user-defined PL/pgSQL function, “chk_book_doc_is_conformant()”. The input is the whole “book” JSON object. And the return value is a SQL boolean. Not all databases let you define a constraint with respect to a user-defined function. In fact, Oracle Database does not. But PostgreSQL does, and YSQL inherits this ability.

Caution: when a constraint is created, the YSQL system checks that every row in the table satisfies the specified rule—and if at least one doesn’t, then the DDL to create the constraint will fail. However, if you “create or replace” a user-defined function that is referenced by the constraint definition, then the existing rows are not rechecked. I therefore recommend strongly against doing this, even when—as presumably you “know”—all existing rows will still satisfy the constraint. Rather, I recommend that you handle your intended change by dropping the constraint with “alter table”, changing the function definition, and then re-creating the constraint with another “alter table” so that all rows will be checked against the new rule. This, of course, must be done during a period when you ensure that the table can receive no changes. (This discussion isn’t specific to JSON or to constraints, or to YugabyteDB. For example, you can create an index with respect to a user-defined function. And the same practice recommendation applies there too—and in other SQL database systems like PostgreSQL and Oracle Database.)

This “encapsulate the detail” paradigm is the de facto standard, followed by all professional developers, almost instinctively. YSQL lets you write stored procedures—inheriting this ability, too, from PostgreSQL. But not one of the other distributed SQL databases support user-defined functions, procedures, or triggers. This is a huge differentiator.

Notice that the rules that the other constraints that I implemented with ordinary tests in the “check” clause itself can be brought into this single user-defined function. It brings the “single point of definition and maintenance” benefit. This is another pattern that the professional programmer instinctively prefers.

The encapsulated logic is very straightforward:

  • Get the list of keys in the top-level “book” JSON object. For this, you use the “jsonb_object_keys()” function, documented here.
  • For each key in turn, get the JSON data type of the value. For this, you use “jsonb_typeof()”, documented here. Check that the (“key name”, “key datatype”) pair is found in the list of expected such pairs (for the top-level object).
  • Go down to the “author” JSON object, get its list of keys, and again check that the (“key name”, “key datatype”) pair is found in the list of expected such pairs (for the “author” object).
  • Go down to the “editors” JSON array and traverse its values. For this, you simply use the -> operator with the loop runner as its right-hand operand. Use “jsonb_typeof()” to check that the JSON data type of each value is string. The loop needs the count of values in the array. You get this with the “jsonb_array_length()” function, documented here.

The code, though straightforward, inevitably ends up at a size that is too big to show right here. I’ve copied it in the Appendix. I added the “add constraint” statement that I showed you above. And I added a few tests that provoke the “constraint violated” error. If you want to get a close understanding of the code, then create and populate the “books” table using the code from the start of this post. Then just copy and paste all the code from the Appendix. Then write your own tests to double-check that my logic meets its promise. Then you’ll be ready to read the code.

Making targeted changes to a JSON object

Suppose that you found out that “Joy Luck Club” had been edited by three people and that you were missing Xinyu. First, let’s inspect the starting state:

Notice that I used the “jsonb_array_elements_text()” function, documented here. It’s a builtin table function (see my earlier blog post series on table functions in general, starting here). It produces this result:

I could have simply said “select … (doc->>’editors’)…” and seen the value as the RFC 7159 text of a JSON array, thus:

Now you use the “jsonb_insert()” function, documented here. This is one of those useful functions that’s available only for a jsonb input.

I expect that you’re wondering about my use of “999999”. Array values are ordered, and the -> and ->> operators let you address them by the index (i.e. position) in the array. (This is a dense sequence starting at zero.) If you ask to insert a new value into an array at an index that’s bigger than the present number of values in the array, then it goes at the end. My “999999” is just my convention for this. (There is no “append” syntax to ask explicitly for the new value to go after the previous last value.) Sometimes, the order matters for what the JSON document represents. But often, as with a book’s editors, you think of this as just a set. But you can’t get away from the fact that there always is an order. A JSON array is quite different from a SQL table in this respect. You can read about all these details in the documentation.

After inserting Xinyu, the query that I used above now produces this, as expected:

Notice that this is not “in place” edit. (YSQL has no way to do this. It inherits this from PostgreSQL.) Rather the old jsonb value is simply replaced by the derived new value. There are other operators and functions that derive a new JSON value from a starting value according to your specification.

You can:

  • insert a new value before the present first value in an array
  • insert a new value between two existing values in an array
  • insert a new key-value pair in an object
  • update the value for an existing key-value pair in an object
  • update an existing value at a specified index in an array
  • delete an existing key-value pair in an object
  • delete the value at a specified index in an array

Conclusion

The examples in this blog post rely on just a tiny fraction of YugabyteDB’s JSON functionality:

  • The jsonb data type and typecast operators, especially ::jsonb
  • jsonb_pretty()
  • The -> and ->> operators
  • a partial index created on an expression that uses these operators
  • a constraint created on JSON expressions
  • jsonb_typeof()
  • jsonb_array_elements_text()
  • jsonb_insert()
  • jsonb_object_keys()
  • a constraint created on a PL/pgSQL function that encapsulates rich JSON functionality

This small subset of YSQL’s JSON functionality gets you a remarkably long way.

My examples show how easy it is for developers who are used to JSON in a document-based NoSQL context and who want to make the move to YugabyteDB simply to start with what they know. They continue to draw upon their existing skills as they make the incremental transition to a “normal” approach (pun intended). Eventually they will be able to use all of YSQL’s rich fully-relational SQL functionality.

Appendix

This appendix presents the complete source code of the function “chk_book_doc_is_conformant()” that implements the constraint described in the section ‘Ensure that every “books.doc” value has the expected structure’ subsection in the ‘Adding constraints’ main section in this post. It also shows a few tests that provoke the intended “constraint violation” error.

Because the length of the code is about 125 lines, I’ll start with just the skeleton. The comments map to the bullets following the text The encapsulated logic is very straightforward.

If you followed all the steps that I described in this post, you will already have created some constraints (and updated one row in the table). To get a clean start, just drop the “books” table and recreate it using the “create table” and “insert into books…” statements from the start of this post.

We’ll also need to recreate the two unique indexes, first on “ISBN”:

and then on (“title”, “author_family_name”, “author_given_name”):

Next, before we create the function, we need to create the SQL type upon which it will depend:

Now for the promised 125 lines:

Now, add the constraint:

Now you can test the constraint. Simply creating it on the existing set of six documents served as the positive test. Now create the rule violations that you want to test to confirm that you get the “constraint violated” error when you should. Remember to change the “ISBN” for each new attempt because this has a unique index. You can change a key name or the data type of value—either for a key or in the “editors” array. Notice that this example has the typo “given_namX”:

This mistake provokes this error text:

I inserted some whitespace to make the message easier to read.

Here are a few more tests. I commented each with the message output by “chk_book_doc_is_conformant()”:

Related Posts

Developer Advocate