Adventures in Database Administration - Starring SchemaHero!

Treva Williams

• 2022-04-12

Gone are the days when those of us not fluent in SQL found ourselves perched on dev.mysql.com, verifying just one more time that the syntax of a pages-long command to import a sensitive database we'd been trusted with was precisely correct - only to get an error message about a missing ";" or something. Oh, it was just me who did that?!

Well, never mind. Anyway, if you couldn't tell from that overly dramatic introductory sentence, database administration (DBA) is not my favorite activity - not because it's terrible, but because, for a while at least, it was intimidating.

Fortunately, as mentioned in the first sentence, many of the more daunting components of DBA have either been automated or solved in other, more accessible ways, thanks to Developers Like You. Whether out of need, frustration, or the kindness of their hearts, they have created a suite of open source tools that solve a lot of these problems.

What is GitOps?

Excellent question. Let’s talk about it a bit. So, if you’ve been in the tech space for a while, you’ve most likely heard and/or used a plethora of seemingly ever-evolving descriptors for the different specialties that fall under the DevOps umbrella, including (but not limited to):

  • CI/CD (Continuous Integration & Delivery)
  • DevOps
  • Infrastructure As Code 
  • GitOps

For the sake of this article, we’re going to focus on GitOps (but if you’d like a dive into other concepts, ping us on K8s Slack). GitOps - using Git as a single source of truth for declarative infrastructure & applications - is a Cloud Native type of Infrastructure as Code, as it builds on & combines some critical components of IaC, including (but once again, not limited to):

  • Orchestration
  • Observability
  • Declarative IaC
  • Containers & immutable infrastructure
  • DevOps best practices

The main difference is that GitOps is tailored explicitly for containerized environments. One of the key benefits of applying GitOps best practices is that, while it enables the implementation of automated CI/CD pipelines, there’s always a “single source of truth” at the infrastructure & application level. This is thanks to a myriad of available tools to compare the actual production state of the environment in question with what’s under source control, which can be set to alert when a significant mismatch occurs. 
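As a minimal sketch of that comparison (assuming your manifests live in a hypothetical ./manifests/ directory checked into Git), plain kubectl can already diff what's under source control against the live cluster state - dedicated GitOps controllers like Flux and Argo CD essentially run this kind of comparison continuously:

# compare the manifests in Git against the live cluster state;
# exits 0 if they match, 1 if drift is detected, >1 on error
kubectl diff -f ./manifests/

That exit code is exactly the kind of signal a pipeline can turn into an alert when production drifts from the repo.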

Now, what if I told you that the above concept could be applied to database schema migration?

Let’s Talk SchemaHero

Back in The Before Times, DB schema migration scripts were walls of code, usually not nearly as well documented as they should have been, that you hoped would perform the needed task promptly and with as little data loss as possible. Even if said script worked perfectly, writing it was still a pain. It’s no wonder that the mere mention of database applications in Kubernetes triggered a fight-or-flight response in some (some, being me), as adding a containerization layer to an already daunting task could leave one feeling a bit overwhelmed. 
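To give you an idea of what I mean, here's a sketch of a hand-rolled migration script - entirely hypothetical, with made-up table names:

-- 0042_add_airport_table.sql (hypothetical example)
BEGIN;

-- create the new table
CREATE TABLE airport (
  code char(4) NOT NULL PRIMARY KEY,
  name varchar(255) NOT NULL
);

-- backfill from the legacy table, then retire it
INSERT INTO airport (code, name)
  SELECT code, name FROM legacy_airport;
DROP TABLE legacy_airport;

COMMIT;

Now imagine that times fifty tables, written by hand, with no easy way to preview what will actually run against production.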

Upon further inspection, we learned that the exact opposite is true. Several tools are available that make the process easier to understand & execute while also providing a crucial verification layer - a cornerstone of GitOps best practices - that allows the administrator to double- or triple-check that what’s happening is what’s intended. 

One such tool is SchemaHero, a Kubernetes operator-slash-CLI utility that converts DB schema definitions into migration scripts. Instead of writing a series of scripts in Python/Go/Ruby/XML or whatever your preferred language, Kubernetes administrators can instead create, import, and manage containerized databases using the same YAML definitions used for virtually every other component of a Kubernetes cluster. 

What they look like now:

apiVersion: schemas.schemahero.io/v1alpha4
kind: Table
metadata:
  name: airport
  namespace: schemahero-tutorial
spec:
  database: airlinedb
  name: airport
  schema:
    postgres:
      primaryKey: [code]
      columns:
        - name: code
          type: char(4)
        - name: name
          type: varchar(255)
          constraints:
            notNull: true

Let’s do The Thing

We will start easy by going through the onboarding tutorial from the SchemaHero docs, which walks us through building a PostgreSQL database using SchemaHero. To follow along, you’ll need a running Kubernetes cluster. I’m using one of my trusty NUCs running Ubuntu 20.04.4 LTS because why not, but if you don’t have a surplus of mini servers available for playing around, you can quickly deploy a dev cluster using kURL, K3s, or even minikube if you’re feeling ambitious. 
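If you go the minikube route, for example, spinning up a throwaway cluster is a one-liner (this assumes you’ve already installed minikube and a driver such as Docker):

# start a single-node local dev cluster
minikube start

Any conformant cluster will do - the rest of this walkthrough only assumes a working kubectl context.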

Installing Krew 

Once the cluster is running, we will need to install Krew, a kubectl plugin manager used to discover and install other kubectl plugins. The following command will download & install the krew binary under `~/.krew`:

(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

Once installation is complete, update `~/.bashrc` to add krew’s bin directory to your `PATH` environment variable, then reload your shell. 

$ export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
$ exec bash

Now you should be able to access the krew command using kubectl. To be sure, let’s run a test command by listing out all plugins currently installed using krew:

$ kubectl krew list
PLUGIN      VERSION
krew        v0.4.3

If you get an error message saying that krew isn’t installed, check `~/.bashrc` to verify that `$HOME/.krew/bin` was added to your path, then reload your shell.

Once Krew is up and running, we can move on to installing the SchemaHero plugin, which is literally a single command that adds the kubectl-schemahero binary to `$PATH`:

$ kubectl krew install schemahero
Updated the local copy of plugin index.
Installing plugin: schemahero
Installed plugin: schemahero
…

Once installation is complete, you may get a scary-looking warning message similar to this:

WARNING: You installed plugin "schemahero" from the krew-index plugin repository.
   These plugins are not audited for security by the Krew maintainers.
   Run them at your own risk.

Rest assured that SchemaHero is a safe, secure, and verified plugin that works really well and is currently in the CNCF sandbox. There will also be a message pointing to official docs and advising on the next steps - adding SchemaHero components to the cluster. This is done with the command `kubectl schemahero install`, as shown below:


$ kubectl schemahero install
The SchemaHero operator has been installed to the cluster

$ kubectl get ns
NAME                STATUS   AGE
default             Active   7m12s
kube-node-lease     Active   7m13s
kube-public         Active   7m13s
kube-system         Active   7m13s
schemahero-system   Active   6s

This command will kick off the installation of the SchemaHero in-cluster operator under the `schemahero-system` namespace. Depending on your cluster’s available resources, this step should take anywhere from a few seconds to a few minutes. Once the `schemahero-0` operator pod is in Running status, we’re ready to jump into the fun part of creating a database to connect to SchemaHero.

$ kubectl get pods -n schemahero-system
NAME           READY   STATUS    RESTARTS   AGE
schemahero-0   1/1     Running   0          45s

Connecting a Database with SchemaHero

Before we can connect a database, we'll first need to create one. Fortunately, this has been simplified for the tutorial's sake by a handy YAML manifest, available in the SchemaHero GitHub repository, that will generate a Postgres server. So let's create a new namespace named schemahero-tutorial for the Postgres instance to live under:

$ kubectl create ns schemahero-tutorial
namespace/schemahero-tutorial created

Once the namespace is created, use the example config referenced above to create the Postgres instance:

$ kubectl apply -n schemahero-tutorial -f https://raw.githubusercontent.com/schemahero/schemahero/main/examples/tutorial/postgresql/postgresql-11.8.0.yaml

$ kubectl get po --namespace=schemahero-tutorial
NAME                     READY   STATUS    RESTARTS       AGE
postgresql-0             1/1     Running   0              143m

Once the `postgresql-0` container is Running, our next step is to give the SchemaHero operator we created in a previous step the information it needs to manage the database, by deploying a custom resource containing the connection details to the cluster. Now, that sounds like a lot, but it's pretty straightforward. We're going to create a new YAML database definition named `airline-db.yaml` that will set up a few parameters needed for interaction with SchemaHero.

This definition will establish the object's group, version, and kind; set a name for the database that other SchemaHero objects will use; set the namespace where SchemaHero is to watch for schemas; and, under spec, provide the credentials needed to connect and authenticate to the database.

apiVersion: databases.schemahero.io/v1alpha4
kind: Database
metadata:
  name: airlinedb
  namespace: schemahero-tutorial
spec:
  connection:
    postgres:
      uri:
        valueFrom:
          secretKeyRef:
            name: postgresql
            key: uri

The definition is added to the cluster with the command:

$ kubectl apply -f ./airline-db.yaml

We can verify that the object has deployed properly with the following command:

$ kubectl get databases -n schemahero-tutorial
NAME        AGE
airlinedb   2h

Create a new table

Next up, we’ll need to create a new table. Now, if we weren’t using SchemaHero, a command creating a new table would look something like this:

CREATE TABLE airport (
  code char(4) not null primary key,
  name varchar(255) not null
);

Let’s not do that. Instead, let’s create a new file named `airport-table.yaml` containing the following:

apiVersion: schemas.schemahero.io/v1alpha4
kind: Table
metadata:
  name: airport
  namespace: schemahero-tutorial
spec:
  database: airlinedb
  name: airport
  schema:
    postgres:
      primaryKey: [code]
      columns:
        - name: code
          type: char(4)
        - name: name
          type: varchar(255)
          constraints:
            notNull: true

In this object, we’ll once again declare Group, Version, and Kind, then establish the name of our Table - in this case, airport - and declare the namespace to deploy the Table object to. Under spec, we’ll declare the database name and the Postgres table name, then define the table schema, including the primary key, columns, and Postgres-compatible column types.

Once you’ve saved the file, deploy the table object to SchemaHero by running the command:

$ kubectl apply -f ./airport-table.yaml

Now we’re at the cool part, where we get a crucial moment to pause and verify that the DB schema information is correct before any damage has occurred. Let’s check out our pending migration with the command `kubectl schemahero get migrations`:

$ kubectl schemahero get migrations -n schemahero-tutorial
ID       DATABASE   TABLE    PLANNED  EXECUTED  APPROVED  REJECTED
eaa36ef  airlinedb  airport  2h

With this command, we get basics like the migration ID, DB name, DB table name, and how long ago the migration was planned. At this point, no actual change has yet been made to any DB schema. Let’s take a closer look at the scheduled migration with the command `kubectl schemahero describe migration`:

$ kubectl schemahero describe migration eaa36ef -n schemahero-tutorial

Migration Name: eaa36ef

Generated DDL Statement (generated at 2022-03-21T11:49:00Z): 
  create table "airport" ("code" character (4), "name" character varying (255) not null, primary key ("code"))
…

Under “Generated DDL Statement,” SchemaHero will print the *exact* SQL statement or statements that will run if the migration is approved. So let’s go ahead and approve the migration:

$ kubectl schemahero -n schemahero-tutorial approve migration eaa36ef
Migration eaa36ef approved

Since our DB schema only defines one small table, this should complete fairly quickly, which we can verify by running `kubectl schemahero get migrations` one more time:

$ kubectl schemahero get migrations -n schemahero-tutorial
ID       DATABASE   TABLE    PLANNED  EXECUTED  APPROVED  REJECTED
eaa36ef  airlinedb  airport  3h       45s       45s

We see that migration eaa36ef was approved and executed 45 seconds ago, taking almost no time to complete given the size of our schema. If you so desire, you can verify the migration using open source tools like Beekeeper Studio or whichever DB management tool you prefer.
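If you’d rather stay in the terminal, here’s one way to peek at the new table - a sketch, which assumes the tutorial’s `postgresql-0` pod is still running and that its connection URI lives in the `postgresql` secret under the `uri` key, as our Database definition above suggests:

# pull the connection URI out of the "postgresql" secret
PGURI=$(kubectl get secret postgresql -n schemahero-tutorial \
  -o jsonpath='{.data.uri}' | base64 -d)

# run psql inside the postgresql-0 pod and describe the new table
kubectl exec -n schemahero-tutorial postgresql-0 -- psql "$PGURI" -c '\d airport'

If everything worked, psql should describe an airport table with a char(4) primary key named code and a not-null varchar(255) column named name.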

Cleanup

Once done, you can clean up your cluster by deleting the schemahero-tutorial namespace with the command kubectl delete namespace. This should get rid of any pods, services, definitions, etc. that we created while familiarizing ourselves with SchemaHero:

$ kubectl delete namespace schemahero-tutorial
namespace "schemahero-tutorial" deleted

If you want to take it a step further and completely remove SchemaHero from your system (though IDK why you would do that), you’ll need to delete the schemahero-system operator and then remove the kubectl-schemahero plugin.
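One way to remove the operator (a sketch - this assumes nothing else was deployed to the namespace, and note that it leaves SchemaHero’s cluster-scoped CRDs behind) is to delete the namespace it lives in:

$ kubectl delete namespace schemahero-system

Then remove the kubectl-schemahero plugin from your system using krew: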

$ kubectl krew uninstall schemahero
Uninstalled plugin: schemahero

Or, if you want to continue on our journey of familiarizing ourselves with SchemaHero, keep the operator & plugin ready for the next edition of this blog, where we’ll venture out into uncharted territory to learn even more about how to be Heroes. 

You made it to the end! Thanks for reading and following along! Meanwhile, please feel free to join our next Community Meeting on April 21st at 12:30 PM Pacific / 3:30 PM Eastern.