Row Counts of Tables in a SQL Schema and Database: PostgreSQL vs. YugabyteDB

Founder & CTO

August 27, 2020

Getting the total row counts of data in tables across various dimensions (per-table, per-schema, and in a given database) is a useful SQL technique to have in your toolkit. This blog post will guide you to obtain per-table row counts of all tables in PostgreSQL and YugabyteDB. This can serve as a first sanity check after migrating an application with pre-existing data from PostgreSQL to YugabyteDB.

This blog post also outlines how to get the following row counts broken down per table in a schema, aggregate row counts per schema, and aggregate row counts across all tables in the database.

To perform and illustrate these processes we will create a sample database and import two popular SQL datasets—Northwind and SportsDB.

Finally, the examples in this blog post, which are essentially dynamic SQL queries on the system catalog tables, require superuser privileges. Also, note that the programmatic generation of SQL queries using catalog tables needs to handle exotic names properly. An instance of a table and a column with an exotic name is shown below.

create table "Some Exotically Named Table"(
  k bigserial primary key,
  "Some Exotically Named Column" text
);

While this post does not explicitly discuss the challenges posed by the example above, the SQL functions below handle these cases correctly and do so by incorporating some of the important and well-known techniques necessary to prevent SQL injection attacks.

Join us at our free online Distributed SQL Summit on September 12 to hear about our new Connection Manager during the keynote.

Create an example database

In order to create a test setup, I simply installed YugabyteDB on my laptop, created a database example, and loaded the Northwind dataset. This only took a few minutes. For the purpose of simplicity, we’re going to use the default yugabyte user for the operations below. However, creating a dedicated user for each of these datasets with the appropriate privileges is the recommended best practice.

Create an example database and connect to it.

yugabyte=# CREATE DATABASE example;
yugabyte=# \c example

Create the Northwind tables and import the dataset into the northwind schema.

example=# CREATE SCHEMA northwind;
example=# SET SCHEMA 'northwind';
example=# \i northwind_ddl.sql
example=# \i northwind_data.sql

You can verify that the tables have been created by running the following command.

example=# \d
                   List of relations
  Schema   |          Name          | Type  |  Owner
-----------+------------------------+-------+----------
 northwind | categories             | table | yugabyte
 northwind | customer_customer_demo | table | yugabyte
 northwind | customer_demographics  | table | yugabyte
 northwind | customers              | table | yugabyte
 northwind | employee_territories   | table | yugabyte
 northwind | employees              | table | yugabyte
 northwind | order_details          | table | yugabyte
 northwind | orders                 | table | yugabyte
 northwind | products               | table | yugabyte
 northwind | region                 | table | yugabyte
 northwind | shippers               | table | yugabyte
 northwind | suppliers              | table | yugabyte
 northwind | territories            | table | yugabyte
 northwind | us_states              | table | yugabyte
(14 rows)

Next, let’s import the SportsDB dataset into a new schema named sportsdb as shown below.

example=# CREATE SCHEMA sportsdb;
example=# SET SCHEMA 'sportsdb';
example=# \i sportsdb_tables.sql
example=# \i sportsdb_indexes.sql
example=# \i sportsdb_inserts.sql
example=# \i sportsdb_constraints.sql
example=# \i sportsdb_fks.sql

Create a function for row count

Recall that YugabyteDB re-uses the native PostgreSQL codebase for its query layer (or the SQL processing layer) of the database. This means that the high-level approach to solving this problem is identical in the case of both PostgreSQL and YugabyteDB.

We’ll solve this problem by first creating a user defined function (UDF), count_rows_of_table which counts the number of rows in a single table. Note that this function must be owned by a suitably privileged user, in our example we will use the yugabyte user. This function can subsequently be used in various types of queries to print the desired row counts in the various scenarios. The function definition is shown below.

create function count_rows_of_table(
  schema    text,
  tablename text
  )
  returns   integer

  security  invoker
  language  plpgsql
as
$body$
declare
  query_template constant text not null :=
    '
      select count(*) from "?schema"."?tablename"
    ';

  query constant text not null :=
    replace(
      replace(
        query_template, '?schema', schema),
     '?tablename', tablename);

  result int not null := -1;
begin
  execute query into result;
  return result;
end;
$body$;

You can test the above function by passing in a table (for example, the orders table loaded from the Northwind dataset) as shown below.

example=# SELECT count_rows_of_table('northwind', 'orders');
 count_rows_of_table
---------------------
                 830
(1 row)

Per-table row counts in a given database

The information_schema.tables table in the system catalog contains the list of all tables and the schemas they belong to. Because we are mainly interested in the user tables, we filter out all tables belonging to pg_catalog and information_schema, which are system schemas. We then call the function we defined in the previous section to get the row count for each table.

select
  table_schema,
  table_name,
  count_rows_of_table(table_schema, table_name)
from
  information_schema.tables
where 
  table_schema not in ('pg_catalog', 'information_schema')
  and table_type = 'BASE TABLE'
order by
  1 asc,
  3 desc;

The query above outputs a table that contains the row counts of all tables across the various schemas, first sorted by the table_schema column and for each table schema, sorted by the tables with the largest number of rows. If we run the above query on our test database, we should see the following output.

 table_schema |       table_name       | count_rows_of_table
--------------+------------------------+---------------------
 northwind    | order_details          |                2155
 northwind    | orders                 |                 830
 northwind    | customers              |                  91
 northwind    | products               |                  77

...
 sportsdb     | affiliations_events    |               13052
 sportsdb     | stats                  |                9398
 sportsdb     | participants_events    |                8700
 sportsdb     | events_documents       |                7915
...
(121 rows)

Aggregate row counts per schema

Next, let us say we want to get the total row count across all tables broken down per schema. This can be achieved by using the following query.

SELECT table_schema, SUM(row_count) AS total_rows FROM (
  SELECT table_schema, 
         count_rows_of_table(table_schema, table_name) AS row_count
    FROM information_schema.tables
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema') 
      AND table_type='BASE TABLE'
) AS per_table_count_subquery
  GROUP BY table_schema
  ORDER BY 2 DESC;

The above uses a subquery to first compute the totals row count per table and performs a GROUP BY operation to get the total number of rows in each schema of the current database. The resulting output is sorted by the schema with the maximum number of rows.

 table_schema | total_rows
--------------+------------
 sportsdb     |      79138
 northwind    |       3362
(2 rows)

Aggregate row count across all tables

The query below simply sums the row counts of the individual tables from the previous step to get a total row count across all the tables. This is done by running the per-table row count as a subquery called per_table_count_subquery and performing a SUM across all the row counts that are the output of that subquery.

with per_table_counts as (
  select count_rows_of_table(table_schema, table_name) as row_count
    from information_schema.tables
    where table_schema not in ('pg_catalog', 'information_schema')
      and table_type='BASE TABLE'
) select sum(row_count) as total_rows
    from per_table_counts;

Running this on the Northwind example dataset produces the following output.

 total_rows
------------
      82500
(1 row)

Conclusion

This post highlights important SQL techniques for computing row counts for PostgreSQL-compatible databases like YugabyteDB. It is crucial to exercise caution, perform rigorous testing, and undergo thorough peer reviews when developing programs that generate and execute dynamic SQL to avoid errors that can lead to incorrect results and expose the code and database to SQL injection attacks.

The code examples used above work exactly the same way across PostgreSQL and YugabyteDB thanks to the Yugabyte database’s reuse of the PostgreSQL query layer in YugabyteDB.

Feel free to try out your favorite PostgreSQL feature on YugabyteDB, and let us know how it goes on our Community Slack. If you run into any issues, just file an issue on GitHub.

August 27, 2020