Polymorphism in SQL part one – anyelement and anyarray

Technical Product Manager

June 24, 2020

The term “polymorphism”, and its inflections, has various meanings in the context of programming languages in general. Roughly, it means “chameleon-like”. Of course, Wikipedia has something to say about it, here. PostgreSQL uses it with this meaning: “Parametric polymorphism: when one or more types are not specified by name but by… symbols that can represent any type”.

The term “variadic” has a universal meaning, and again Wikipedia has something to say about it, here. “A variadic [subprogram] has indefinite “arity”, [that is, it] accepts a variable number of arguments”.

In this two part post, I explain how YugabyteDB stored procedures implement these two central programming language notions.

Part One (this post) explains how polymorphism is implemented by declaring a subprogram with formal parameters of the pseudo-types “anyelement” and “anyarray”.
Part Two explains how marking a subprogram’s formal parameter with the keyword “variadic” implements that notion. I also bring the two notions together to show you how to implement a non-aggregate version of any built-in aggregate function so that you can use it outside the context of a SQL subquery, just as the “least()” and “greatest()” built-ins already (but uniquely) do this for the “min()” and “max()” aggregate functions.

You can see, then, that I’m stretching the technical meaning of “polymorphism” in the title of this post series. I’m focusing on the chameleon metaphor to capture the idea that I’m showing two powerful techniques that let functions automagically adapt to the way in which they’re invoked.

YugabyteDB is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. YugabyteDB uses its own special distributed document store called DocDB. But it provides SQL and stored procedure functionality by re-using the “upper half” of the standard PostgreSQL source code. This is explained in the two part blog post “Distributed PostgreSQL on a Google Spanner Architecture”: (1) Storage Layer; and (2) Query Layer.

If you’re already used to writing PostgreSQL subprograms in your day job, then I recommend that you read the two parts of my essay in a single short sitting.

If you’re newer to this field, then you might find these earlier posts helpful:

Everything that I say in this post applies in the same way to vanilla PostgreSQL as it does to YugabyteDB. And all the code examples that I describe have identical outcomes in the two environments. There are no special distributed SQL considerations.

Understanding “anyelement” and “anyarray”

A user-defined function is usually created like this:

create function f(p1 in int, p2 in boolean, ...)
  returns text
  immutable
  language plpgsql
as $body$
begin
  return <a text value>;
end;
$body$;

The function’s requirements are stated by specifying how the output value is determined by its input values. And commonly the rules can be stated only in terms of inputs of specified data types. It’s the same story for a user-defined procedure—and the same if the implementation uses “language plpgsql” or “language sql”.

Note: My earlier post, Four Compelling Use Cases for PostgreSQL Table Functions, shows examples of these two ways to implement user-defined functions.

Sometimes, though, the requirements reflect the same general rule with a specialization for each of several different input data types. Here’s a compelling example:

Return an N-character text value that represents values of various data types according to rules that are specific to each different data type. The output should be right-justified, and should represent NULL as “<null>”.

A proper requirements statement would specify more detail. But I’ll simply show you two implementations that treat some interesting cases in a natural way. User-defined functions and procedures can be created in overload sets where each set is for a function or procedure with a certain name and members of a set are distinguished from each other by the the data types of their formal parameters and/or by the number of these. So you could meet the present requirements by creating lots (and lots) of functions all called, say, “display()”. Here’s an example for a “boolean” input:

create function display(val in boolean)
  returns text
  immutable
  language plpgsql
as $body$
begin
  return
    case val
      when true  then '  true'
      when false then ' false'
      else            '<null>'
    end::text;
end;
$body$;

And here’s an example for a “bigint” input:

create function display(val in bigint)
  returns text
  immutable
  language plpgsql
as $body$
begin
  return
    case
      when val is null                     then lpad('<null>', 7)
      when abs(val) < 1000                 then lpad(val::text, 7)
      when abs(val) between 10^3 and 10^6  then lpad(round(val/10^3)::text||'K', 7)
      when abs(val) between 10^6 and 10^9  then lpad(round(val/10^6)::text||'M', 7)
      when abs(val) between 10^9 and 10^12 then lpad(round(val/10^9)::text||'G', 7)
      else                                      lpad('too big', 7)
    end::text;
end;
$body$;

You can imagine similar overloads for other primitive scalar data types like “numeric”, “double precision”, “money”, “timestamp”, “timestamptz”, “interval” and so on. Maintenance would be simpler if the code for each data type were implemented in a single function. This would optimally self-document, in a single place, which data types it supported. And it could, for example, define a single manifest constant “width” to be used as the second argument for “lpad()” for values of all the supported data types. This is where “anyelement” comes to the rescue.

The anyelement pseudo-type

YSQL inherits the PostgreSQL type system. This implements several so-called pseudo-types. A pseudo-type cannot be used as a column data type, or for a local PL/pgSQL variable, but it can be used to declare a function’s formal parameters and result type. This blog post focuses on “anyelement” and “anyarray”.

Here’s the basic proof of concept. Make sure that you first drop any previously existing versions of the function “display()”.

-- Such a user-defined type is known as a "row" type.
create type rt as (b boolean, i int);

create function display(val in anyelement)
  returns text
  immutable
  language plpgsql
as $body$
declare
  val_type constant regtype := pg_typeof(val);
begin
  return
    case val_type
      when pg_typeof(null::boolean) then '"boolean" input'
      when pg_typeof(null::int)     then '"int" input'
      when pg_typeof(null::bigint)  then '"bigint" input'
      when pg_typeof(null::rt)      then '"rt" input'
      else                               'unsupported data type'
    end::text;
end;
$body$;

Test it like this:

select display(true)
union all
select display(43)
union all
select display(567890123456789)
union all
select display((true, 42)::rt)
union all
select display(now()::timestamp);

This is the result:

        display        
-----------------------
 "boolean" input
 "int" input
 "bigint" input
 "rt" input
 unsupported data type

We see, then, that “anyelement” stands for any scalar (that is non-array) data type—simple (like “boolean” or “bigint”) or composite (like a user-defined “row” type). A function with a pseudo-type formal parameter like “anyelement” (or “anyarray”) is called a polymorphic function. We can now use the approach demonstrated by the proof of concept to encapsulate “boolean”, “bigint”, and “rt” implementations of “display()” into a single polymorphic function, like this. As before, make sure that you first drop any previously existing versions of the function “display()”. And if the type “rt” already exists, you can leave in in place.

create type rt as (b boolean, i int);

create function display(val in anyelement)
  returns text
  immutable
  language plpgsql
as $body$
declare
  val_type constant regtype := pg_typeof(val);
  pad constant int := 22;
begin
  case val_type
    when pg_typeof(null::boolean) then
      return
        case val
          when true  then lpad('true',   pad)
          when false then lpad('false',  pad)
          else            lpad('<null>', pad)
        end::text;

    when pg_typeof(null::bigint) then
      return
        case
          when val is null then
            lpad('<null>', pad)
          when abs(val) < 1000 then
            lpad(val::text, pad)
          when abs(val) between 10^3 and 10^6 then
            lpad(round(val/10^3)::text||'K', pad)
          when abs(val) between 10^6 and 10^9 then
            lpad(round(val/10^6)::text||'M', pad)
          when abs(val) between 10^9 and 10^12 then
            lpad(round(val/10^9)::text||'G', pad)
          else
            lpad('too big', pad)
        end::text;

    when pg_typeof(null::rt) then
      return lpad(val::text, pad);

    else
      return lpad('unsupported data type'::text, pad);
  end case;
end;
$body$;

Test it like this:

(with v as (values
             (true),
            (false),
    (null::boolean))
select
display(column1)
from v)

union all
(with v as (values
              (123),
           (234567),
        (345678901),
     (456789012345),
  (567890123456789),
     (null::bigint))
select
display(column1)
from v)

union all
select display((true, 42)::rt)

union all
select display(now());

This is the result:

        display         
------------------------
                   true
                  false
                 <null>
                    123
                   235K
                   346M
                   457G
                too big
                 <null>
                 (t,42)
  unsupported data type

The anyarray pseudo-type

This is the natural partner to “anyelement”. It stands for an array of any data type. This simple example demonstrates the idea:

create function array_display(arr anyarray)
  returns setof text
  immutable
  language sql
as $body$
  select display(unnest(arr));
$body$;

You take advantage, here, of a very useful built-in that materializes the elements of an array as a regular SQL table. Read about this, and more, in the “Array data types and functionality” major section of the YugabyteDB documentation—and, in particular, read about the “unnest()” built-in.

Test it like this:

select array_display(array[true, false, null]::boolean[])
union all
select array_display(array[
  123, 234567, 345678901, 456789012345, 567890123456789, null]::bigint[])
union all
select array_display(array[now()]::timestamp[]);

This is the result:

     array_display      
------------------------
                   true
                  false
                 <null>
                    123
                   235K
                   346M
                   457G
                too big
                 <null>
  unsupported data type

You’ll notice that, because the actual input values are identical to those used to demonstrate the bare “display()” function, the result is identical too.

Conclusion to Part One

So far, I’ve shown you just one of the two interesting and useful features that this two-part post explains:

How to use the “anyelement” and “anyarray” pseudo-types to write a polymorphic subprogram with a dynamically determined polymorphic return value.

YugabyteDB inherits both features from PostgreSQL thanks to its unique architecture that simply uses its SQL processing code as is.

I can’t resist saying here (but I’ll say it again in the conclusion to Part Two) that Oracle Database currently supports neither anything equivalent to the pseudo-types like “anyelement” and “anyarray”, nor the ability to create variadic user-defined subprograms. And, as I never miss a chance to point out, the other distributed SQL databases with which we compare YugabyteDB (CockroachDB, TiDB, and our common inspiration Google Cloud Spanner) don’t support stored procedures at all.

Please go on now to read Part Two. It explains how to write a variadic subprogram.

June 24, 2020

Polymorphism in SQL part one – anyelement and anyarray

Related Posts

Explore Distributed SQL and YugabyteDB in Depth