The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

Polymorphism in SQL part two – variadic functions

Developer Advocate

YugabyteDB is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. YugabyteDB uses its own special distributed document store called DocDB. But it provides SQL and stored procedure functionality by re-using the “upper half” of the standard PostgreSQL source code. This is explained in the two part blog post “Distributed PostgreSQL on a Google Spanner Architecture”: (1) Storage Layer; and (2) Query Layer.

Make sure that you read Part One of this two-part series, and in particular the common introduction, before continuing here. Part One explains how polymorphism is implemented by declaring a subprogram with formal parameters of the pseudo-types “anyelement” and “anyarray”.

This part explains how marking a subprogram’s formal parameter with the keyword “variadic” implements the second critical notion.

I finish by bringing the two notions together to show you how to implement a non-aggregate version of any built-in aggregate function so that you can use it outside the context of a SQL subquery, just as the “least()” and “greatest()” built-ins already (but uniquely) do this for the “min()” and “max()” aggregate functions.

Understanding variadic subprograms

You may have noticed that some of YSQL’s built-in functions can be invoked using a comma-separated list of any number of actual arguments. The built-ins “greatest()” and “least()” are like this.

select
  (select greatest(17, 42, 13) as r1),
  (select greatest(6, 4, 8, 2, 8, 6, 3, 1) as r2);

This ability brings very useful end-user functionality, and you can imagine that it isn’t feasible to implement dedicated overloads of such a function with respectively one, then two, then three, and so on up to some present maximum of N formal parameters. Rather, one single subprogram with a so-called “variadic” formal parameter provides the functionality. In other words, a subprogram with such a parameter can be invoked using a comma-separated list of an arbitrary number of actual arguments. “Variadic” is a generic notion that lets you ask the question “Does language X support variadic functions?”) The good news is that YugabyteDB inherits from PostgreSQL the ability to implement user-defined subprograms with “variadic” formal parameters.

As mentioned, the “variadic” ordinary functions “greatest()” and “least()” are functionally equivalent to the aggregate functions “max()” and “min()”. It’s just that their invocation styles are different. It has always struck me as odd that other aggregate functions like, for example, “sum()” and “avg()”, don’t have built-in ordinary “variadic” equivalents. This is true not only for PostgreSQL (and therefore for YugabyteDB); it’s also true for other databases.

I’ll show you, now, how to write a “variadic” version of “avg()” that respects the same semantics for inputs with a mixture of NULLs and NOT NULL values as does the built-in aggregate function. I’ll use a naïve scheme. The following section shows a much better method. My aim, here, is just to demonstrate the technique that you use to implement a “variadic” function. Here it is:

create function mean(arr variadic numeric[])
  returns numeric
  immutable
  language plpgsql
as $body$
declare
  n int not null := 0;
  tot numeric := 0;
  v numeric;
begin
  foreach v in array arr loop
    continue when v is null;
    tot := tot + v;
    n := n + 1;
  end loop;
  return
    case n
      when 0 then null
      else        tot/n::numeric
    end;
end;
$body$;

Test it like this:

select mean(6, 3, null, 4);

The result is “4.33…”. Notice that, while on the “outside” the actual arguments are presented as the comma-separated list “6, 3, null, 4”, on the “inside” these values are available as the elements of the array “arr”. The identical functionality could have been provided by implementing “mean()” like this:

create function mean(arr numeric[])
  returns numeric
  immutable
  language plpgsql
as ...

and by invoking it thus:

select mean(array[6, 3, null, 4]);

So “variadic” boils down to no more than syntax sugar. But it is no less useful for that. It’s far more convenient, and clear (when, but only when, you have the use case for this), just to write the input values as an ordinary list than to clutter the invocation with an array constructor whose purpose is entirely technical and has nothing to do with the semantics of the function.

This last point brings us to an interesting realization. It’s not very common, in an ordinary procedural program in any language to handle an unknown number of items using lots of specifically declared scalar variables. This is exactly where an array (in SQL and PL/pgSQL), or a functional equivalent in some other programming language, finds its use. Here’s a thought experiment:

The need arises, in some subprogram, to assign a number of emergent values. You’re sure that the possible number of values lies between two and nine. So you decide to implement this using nine variables, “v1”, “v2”, and so on through “v9”. You need to calculate some function of however many values you end up with in a particular execution of the subprogram. So you write a case statement like this:

x :=
  case no_of_values
    when 2 then my_variadic_fn(v1, v2)
    when 3 then my_variadic_fn(v1, v2, v3)
    when 4 then my_variadic_fn(v1, v2, v3, v4)
    ...
    when 9 then my_variadic_fn(v1, v2, v3, v4,... v9)
  end;

Now you want to ask a “how to” question in connection with another part of this subprogram and you post the whole text on, say Stack Exchange—even saying that, so far, it all works as intended. You’d be pilloried. No end of folks would seize the chance to tell you how silly your scheme is and that you ought to use an array! No amount of protesting that you’re cross-your-heart-hope-to-die certain that there’ll never be more than nine values would change their opinion.

Fortunately, you don’t need to overthink this. You can have your cake and eat it simply by implementing both the straight “array” variant of “mean()” and re-use that to implement a “variadic” variant. You need to realize only that such a function pair cannot be mutual overloads, so you must distinguish between them by name, like this. Be sure to drop the previous version of “mean()” first.

create function mean(arr numeric[])
  returns numeric
  immutable
  language plpgsql
as $body$
declare
  n int not null := 0;
  tot numeric := 0;
  v numeric;
begin
  foreach v in array arr loop
    continue when v is null;
    tot := tot + v;
    n := n + 1;
  end loop;
  return
    case n
      when 0 then null
      else        tot/n::numeric
    end;
end;
$body$;

create function mean_v(arr variadic numeric[])
  returns numeric
  immutable
  language plpgsql
as $body$
begin
  return mean(arr);
end;
$body$;

Test it like this:

do $body$
begin
  assert
    (mean(array[6, 3, 4]) = mean_v(6, 3, 4)),
  'assert failed';
end;
$body$;

Putting anyelement/anyarray and variadic together

I’ll use the aggregate functions “max()”, “min()”, “sum()”, and “avg()” as examples here. I’ll show you how to implement them all as user-defined “variadic” polymorphic functions. The implementation will show us one final critical notion: how to implement a function whose return data type polymorphically reflects the data type of the actual arguments with which it is invoked.

You’re likely to want an implementation for each of several input data types. Try “\df avg”  to see what I mean. This is the result:

... | Result data type | Argument data types | Type 
----+------+------------------+---------------------+------
... | numeric          | bigint              | agg
... | double precision | double precision    | agg
... | numeric          | integer             | agg
... | interval         | interval            | agg
... | numeric          | numeric             | agg
... | double precision | real                | agg
... | numeric          | smallint            | agg

Notice that for some input data types (“numeric”, “double precision”, and “interval”), the output data type is the same as that of the input. But for the remaining input data types (“smallint”, “int”, “bigint”, and “real”), the output data type is different from that of the input. This is readily understood by considering the average of two integers, say, “3” and “4”. Common sense tells you that the answer that you want is “3.5” (a “numeric” value) —and that you would not want the answer to be rounded to the nearest integer. This difference between input and output data types, for some overloads, is significant. I’ll come back to this point presently.

You can implement the function like this, as a “language sql” function:

create function mean(arr variadic anyarray)
  returns anyelement
  immutable
  language sql
as $body$
  with v1 as (
    select unnest(arr) as element)
  select avg(element)
  from v1;
$body$;

(Here again, you take advantage of the “unnest()” built-in.)

You might prefer to implement the function as a “language plpgsql” so that you can use an “assert” to check, in line with proper practice, that an invocation meets the specified prerequisites.

create or replace function mean(arr variadic anyarray)
  returns anyelement
  immutable
  language plpgsql
as $body$
declare
begin
  assert
    (array_ndims(arr) = 1),
  'my_variadic() works only for 1-d arrays';

  return (
    with v1 as (
      select unnest(arr) as element)
    select avg(element)
    from v1);
end;
$body$;

Notice that, while the input formal parameter is declared as “anyarray”, the return value is declared, as you’d expect, as the scalar “anyelement”. This is critical. It’s this that says that the function’s return data type should polymorphically reflect the data type of the actual arguments with which it is invoked. But its meaning is maybe not quite what you’d hope for. And if you don’t understand the meaning, then you risk run-time errors with the “language sql” implementation and, worse, silently getting wrong results with the “language plpgsql” implementation. I’ll discuss this point after showing the positive tests.

The implementation takes advantage of the pre-existing set of overloads of the built-in avg() function. And not only is it, therefore, terse; also, and more importantly, you get the proper semantics for these operations (especially with respect to possible NULL inputs) for free. This is huge. You can trivially implement “my_sum()”, “my_greatest()”, and “my_least)” simply by replacing “avg()” in the templates above with, respectively, “sum()”, “max()”, and “min()”. Test the user-defined polymorphic variadic “mean()” like this:

select mean(10.5::numeric, 20.5::numeric)::text
union all
select mean(13.5::numeric, 23.5::numeric, 33.5::numeric, 43.5::numeric)::text
union all
select mean(20.5::double precision, 30.5::double precision)::text
union all
select mean(20.5::double precision, 30.5::double precision)::text
union all
select mean('2 4:10:40'::interval, '4 8:20:50'::interval)::text
union all
select mean('2 4:10:40'::interval, '4 8:20:50'::interval, '6 2:30:30'::interval)::text;

This is the result:

        mean         
---------------------
 15.5000000000000000
 28.5000000000000000
 25.5
 25.5
 3 days 06:15:45
 4 days 05:00:40

Test the polymorphic input/output behavior like this:

select pg_typeof(mean(10.5::numeric, 20.5::numeric))::text
union all
select pg_typeof(mean(13.5::numeric, 23.5::numeric, 33.5::numeric, 43.5::numeric))::text
union all
select pg_typeof(mean(20.5::double precision, 30.5::double precision))::text
union all
select pg_typeof(mean(20.5::double precision, 30.5::double precision))::text
union all
select pg_typeof(mean('2 4:10:40'::interval, '4 8:20:50'::interval))::text
union all
select pg_typeof(mean('2 4:10:40'::interval, '4 8:20:50'::interval, '6 2:30:30'::interval))::text;

This is the result:

    pg_typeof     
------------------
 numeric
 numeric
 double precision
 double precision
 interval
 interval

Now is the time to come back, as promised, to the fact that for some overloads of “avg()”, the output data type differs from the input data type. (There are similar cases for some of the overloads of “sum()”, “max()”, and “min()”.)

The rule used to determine the actual data type of the return value, for a given invocation of a user-defined polymorphic function with an “anyarray” input and an “anyelement” output is mechanical—and therefore not always what you want. It’s simply that when “anyarray” is “some_type[]”, “anyelement” is set to be the corresponding scalar “some_type”. But consider this:

select pg_typeof(avg(column1))::text
from (values (3::int), (4::int)) as a;

This is the result (just as “\df avg” predicts):

 pg_typeof 
-----------
 numeric

It turns out that, for a “language sql” function, this difference between the actual data type of the computed value, and the calculated data type of the “anyelement” return, is fatal. Try this, after having created the “language sql” variant of “mean()”.

select mean(3::int, 4::int);

This is the outcome:

ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is numeric.

I think that the error text would be better if if said this:

...mismatch in function whose declared "anyelement" return was computed to be integer

But that’s just me being fussy. It’s clear what it means.

Now repeat the same test after having created the “language plpgsql” variant of “mean()”. This is the result:

 mean 
------
    4

So you silently got the wrong answer—and, in my book, that’s worse than a clear run-time error!

The workaround, however, is simple. And it works for both the “language sql” and the “language plpgsql” variants. Just write the invocation like this:

select mean(3::numeric, 4::numeric);

Now you get the right result, with no error:

        mean        
--------------------
 3.5000000000000000

While the workaround is easy to write, it does require that you embrace the relatively large mental model that I just explained and program with awareness and care. But, then, I might argue that you always have to program with awareness and care!

Conclusion

I’ve shown you two interesting and useful features that YSQL stored procedures inherit from PostgreSQL:

  • How to use the “anyelement” and “anyarray” pseudo-types to create a polymorphic user-defined function (or procedure) with a dynamically determined polymorphic return value.
  • How to use the “variadic” keyword to create a user-defined function or procedure that can accept a comma-separated list of any number of input actual arguments.

And I then showed you this:

  • How to use these two features together to implement ordinary versions of any aggregate function.

Finally, it’s worth pointing out that Oracle Database currently supports neither anything equivalent to the pseudo-types like “anyelement” and “anyarray”, nor the ability to create variadic user-defined subprograms. And, as I never miss a chance to point out, the other distributed SQL databases with which we compare YugabyteDB (CockroachDB, TiDB, and our common inspiration Google Cloud Spanner) don’t support stored procedures at all.

Related Posts

Developer Advocate