The Distributed SQL Blog

Thoughts on distributed databases, open source and cloud native

Part 1: Deploying a Distributed SQL Backend for Apache Airflow on Google Cloud

VP Developer Relations

Apache Airflow is a popular platform for programmatically authoring, scheduling, and monitoring workflows. Airflow has been deployed by companies like Adobe, Airbnb, Etsy, Instacart, and Square. The advantage of defining workflows as code is that they become more maintainable, versionable, testable, and collaborative. Airflow is used to author these workflows as directed acyclic graphs (DAGs) of tasks. Airflow’s scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. Its browser-based UI makes it easy to visualize pipelines running in production, monitor their progress, and troubleshoot issues when needed.

Why Airflow with a YugabyteDB backend?

By default, Airflow uses a SQLite database for its metadata store, which both the scheduler and the web UI rely on. Typically, when Airflow is used in production, the SQLite backend is replaced with a traditional RDBMS like PostgreSQL. However, to keep PostgreSQL from becoming a single point of failure in the Airflow deployment, administrators still need to devise high-availability and failover strategies for it. There's a simpler solution: YugabyteDB, a database that Airflow can interact with just like PostgreSQL, but with high availability, support for multiple cloud and topology deployment options, and high performance built in.

What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) is PostgreSQL wire compatible.

When YugabyteDB is used in place of PostgreSQL in an Airflow deployment, the administrator no longer has to worry about maintaining complex PostgreSQL replication and failover schemes to minimize the possibility of a failed backend causing dataflows to fail. Instead, a distributed SQL backend like YugabyteDB can deliver PostgreSQL-compatibility, high availability, geo data distribution, and high performance in a single open source system.

In this post, part one of a two-part blog series, we'll show you how to configure Airflow to use YugabyteDB as a backend. In part two, we'll show you how to build an Airflow workflow that migrates data between PostgreSQL and YugabyteDB.

For the purposes of this demo, our Airflow and PostgreSQL installations will make use of virtual machines on Google Compute Engine and YugabyteDB will be installed on Google Kubernetes Engine.

In this post we’ll cover the following steps:

  • Deploy a VM for Airflow
  • Configure a firewall rule for the Airflow UI
  • Install Apache Airflow
  • Start the Airflow Service
  • Set up a GKE Cluster
  • Install YugabyteDB on GKE
  • Build the Airflow backend on YugabyteDB
  • Restart Airflow with a YugabyteDB backend

Prerequisites

Below is the environment that we’ll be using for the purposes of this blog.

Note: For the purposes of this demo, we aren't creating a particularly secure deployment; instead, we're focusing on how to wire everything up with the least amount of fuss. In a production deployment you'll want to enforce additional security measures throughout the stack.

Step 1: Deploy a VM for Airflow

For the purposes of this demo, I have specified the following configuration for the Google Compute Engine VM that will be hosting my Airflow server.

  • Name: airflowvm
  • Machine Type: n1-standard-2 (2 vCPUs, 7.5 GB memory)
  • OS: Ubuntu 18.04 LTS
  • Disk: 100 GB
  • Firewall: Allow HTTP/HTTPS traffic
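If you prefer the CLI to the Cloud Console, roughly the same VM can be created with gcloud. This is a sketch of the configuration above; verify the flags against your gcloud version and project defaults:

```sh
# Sketch only: creates a VM matching the spec above
gcloud compute instances create airflowvm \
  --machine-type=n1-standard-2 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --tags=http-server,https-server
```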

Step 2: Configure a firewall rule for the Airflow UI

To access the Airflow Web UI we’ll need to set up a firewall rule to expose an Airflow ingress on port 8080. In the Google Cloud Console you’ll want to open up the Firewall rules page and create a rule with the following characteristics:

  • Name: airflowwebui
  • Direction: Ingress
  • Targets: All instances in the network
  • Source Filter: IP ranges
  • Source IP Ranges: 0.0.0.0/0
  • Protocols and ports: TCP, 8080
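The same rule can be created from the CLI; this sketch assumes the default network, so adjust the target and network flags if your setup differs:

```sh
# Sketch: firewall rule exposing the Airflow UI on port 8080
gcloud compute firewall-rules create airflowwebui \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8080 \
  --source-ranges=0.0.0.0/0
```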

Step 3: Install Apache Airflow

Next, let’s install Apache Airflow on the VM by running the following commands:

$ cd ~/
$ sudo apt-get update
$ sudo apt-get install python3-pip
$ export AIRFLOW_HOME=~/airflow
$ sudo pip3 install apache-airflow[postgres]

Verify that Apache Airflow is installed by executing the following command:

$ airflow version
v1.10.10

Step 4: Start the Airflow service

Initialize the default backend

To get Airflow initialized with the default SQLite backend we’ll need to execute the following commands:

$ cd ~/
$ cd airflow
$ airflow initdb

Start the Airflow scheduler

Next, start the Airflow scheduler service by running the following command:

$ airflow scheduler -D

Start the Airflow web server

Start the Airflow web server with:

$ airflow webserver -p 8080 -D

Access the Airflow web UI

To access the web UI, go to http://<airflowvm IP>:8080

At this point we have Airflow up and running on GCP. The next step is to set up YugabyteDB on GKE.

Step 5: Set up a GKE Cluster

If you recall, YugabyteDB will serve two purposes in this demo. First, as a backend for Airflow and second, as a target for our example DAG that will detect and move updates from PostgreSQL tables to YugabyteDB.

For the purposes of this demo I have specified the following configuration for my GKE cluster that will host YugabyteDB:

YugabyteDB

  • Name: yugabytedbgke
  • Number of Nodes: 3
  • Machine Type: n1-standard-8 (8 vCPUs, 30 GB memory)
  • Max Pods: 48
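If you'd rather script the cluster creation than click through the Console, a gcloud sketch of the configuration above follows; note that setting a custom max-pods value requires a VPC-native cluster, so the flags here are assumptions to verify against your environment:

```sh
# Sketch: creates a 3-node GKE cluster matching the spec above
gcloud container clusters create yugabytedbgke \
  --num-nodes=3 \
  --machine-type=n1-standard-8 \
  --enable-ip-alias \
  --max-pods-per-node=48
```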

Once the GKE cluster is live, we are ready to log into the cluster and install YugabyteDB.

Step 6: Install YugabyteDB on GKE

Note: As of the writing of this post, the current version of YugabyteDB is 2.1.6. You can check for the latest version of YugabyteDB on our GitHub releases page.

Upgrade Helm

$ helm init --upgrade --wait

$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation

Create a service account

Before we can create the cluster, we need a service account that has been granted the cluster-admin role. Use the following command to create a yugabyte-helm service account granted with the ClusterRole of cluster-admin.

$ kubectl create -f https://raw.githubusercontent.com/yugabyte/charts/master/stable/yugabyte/yugabyte-rbac.yaml

serviceaccount/yugabyte-helm created
clusterrolebinding.rbac.authorization.k8s.io/yugabyte-helm created

Initialize Helm

$ helm init --service-account yugabyte-helm --upgrade --wait

$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation

Create a namespace

$ kubectl create namespace yb-demo

namespace/yb-demo created

Add the charts repository

$ helm repo add yugabytedb https://charts.yugabyte.com

"yugabytedb" has been added to your repositories

Update the repository

$ helm repo update

Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "yugabytedb" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete.

Install YugabyteDB

$ helm install yugabytedb/yugabyte --namespace yb-demo --name yb-demo --wait

To check the status of the YugabyteDB cluster, execute the command below:

$ helm status yb-demo

You can also navigate to the Services & Ingress view in the GCP Console and take note of the various endpoints, IPs, and ports.

Congrats! At this point you have a three node YugabyteDB cluster running on GKE.

Step 7: Build the Airflow backend on YugabyteDB

YugabyteDB currently doesn't support the ALTER TABLE ADD PRIMARY KEY command (GitHub Issue #1104), which means that the Airflow initdb command won't work out of the box. The simple workaround is to move the ALTER statements at the end of the initdb build script into the CREATE TABLE statements. Log into GKE and download the YugabyteDB-friendly version of the Airflow backend build script:

wget https://raw.githubusercontent.com/yugabyte/yugabyte-db/master/sample/airflowybrepo.sql

After downloading the script, you’ll need to copy the script to a yb-tserver pod if you’d like to execute it from the command line. Modify the command below to account for the directory structure of your setup:

kubectl cp /home/jimmy/airflowybrepo.sql yb-demo/yb-tserver-0:/home/yugabyte/share -c yb-tserver

Create a password for the default YugabyteDB user

For Airflow to be able to connect to the YugabyteDB cluster, we need to create a password for the default yugabyte user, which by default has none.

Log into YugabyteDB:

$ kubectl --namespace yb-demo exec -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh -h yb-tserver-0

Assign yugabyte a password:

yugabyte=# ALTER USER yugabyte WITH PASSWORD 'password';
ALTER ROLE

Create a database for the Airflow repository

Create a database called airflowybrepo for Airflow to use as a metadata repository. Either execute the following script in your shell or use your favorite GUI tool.

yugabyte=# CREATE DATABASE airflowybrepo;

yugabyte=# \c airflowybrepo;

Build the Airflow database objects

Execute the airflowybrepo.sql script:

airflowybrepo=# \i /home/yugabyte/share/airflowybrepo.sql

After it completes you should see 23 tables populated with initialization data, plus 15 sequences and 44 indexes.
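If you'd like to double-check that table count from the command line, a catalog query works; this sketch reuses the yb-demo namespace and yb-tserver-0 pod from the earlier steps:

```sh
# Count the Airflow tables just created in the public schema
kubectl --namespace yb-demo exec -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh \
  -h yb-tserver-0 -d airflowybrepo \
  -c "SELECT count(*) FROM pg_catalog.pg_tables WHERE schemaname = 'public';"
```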

Step 8: Restart Airflow with a YugabyteDB backend

Modify the backend connect string Airflow uses

Return to the airflowvm VM and update the airflow.cfg file with your YugabyteDB specific information and save the file:

$ cd airflow
$ vim airflow.cfg

sql_alchemy_conn = postgres://yugabyte:password@<YUGABYTEDB-CLUSTER-IP>:5433/airflowybrepo
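If you'd rather script the edit than update airflow.cfg by hand, the connection string can be composed and echoed first; the IP below is a placeholder, and the password is the one we set for the yugabyte user in step 7:

```shell
# Compose the backend connection string (placeholder IP; replace with your cluster IP)
YB_HOST="10.0.0.5"
YB_PASSWORD="password"
NEW_CONN="postgres://yugabyte:${YB_PASSWORD}@${YB_HOST}:5433/airflowybrepo"
echo "sql_alchemy_conn = ${NEW_CONN}"
```

You can then paste the echoed line into airflow.cfg, or apply it with a sed one-liner against `~/airflow/airflow.cfg`.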

Restart the Airflow scheduler

$ airflow scheduler -D

Restart the Airflow web server

$ airflow webserver -p 8080 -D

We can also verify that there is connectivity to the YugabyteDB backend by executing:

$ airflow checkdb

DB: postgres://yugabyte:***@104.197.36.19:5433/airflowybrepo
[2020-04-21 02:53:45,254] {db.py:430} INFO - Connection successful.

Refresh the Airflow web UI

Return to the Airflow web UI at http://<airflowvm IP>:8080 and refresh your browser.

That’s it! You now have Airflow deployed on a Google Compute Engine VM with a highly available, PostgreSQL compatible YugabyteDB backend running on Google Kubernetes Engine.

What’s next?

In part two of this blog series, we'll show you how to configure and run a workflow that detects and migrates data changes between PostgreSQL and YugabyteDB. In that post we'll cover:

  • Installing PostgreSQL on Google Cloud
  • Configuring GCP firewall rules
  • Configuring Airflow database connections
  • Creating an Airflow task file
  • Running the task
  • Monitoring and verifying the results

Stay tuned!
