replicated troubleshoot preflight checks

If you were around in the tech industry back in The Olde Days, before containerization revolutionized infrastructure, you have probably lived through at least one Host Server Down (HSD) caused by a conflict in an OS patch, a missing dependency, an out-of-control logfile or any number of fairly common yet highly annoying reasons. If you’ve experienced a HSD first-hand, you probably also know how diagnosis was *always* the most time-consuming part, especially when attempting to work across teams, time zones, & technical skill levels. 

Fast forward several years & several XaaS offerings later, & we have an infinite number of monitoring/diagnostic tools available to help extinguish the fires ASAP. Troubleshoot.sh is one of those many fire extinguishers, but after using it for a bit I believe that it stands in a league of its own for just how much it can do. 

If you’ve ever heard of Replicated or spent any amount of time on the Replicated website you’ve probably seen mention of Troubleshoot & thought to yourself “hey, what the h*ck is that?!” 

Troubleshoot.sh is a fully open source framework that can be used to assess client environments before deploying your app via something called preflight, & can help diagnose host-level issues by way of support-bundle. In the spirit of Kubernetes, Troubleshoot.sh features a declarative YAML syntax for configuring a myriad of cluster checks (called analyzers & collectors) that can be customized to suit your/your application’s needs. All of the data that a support team could possibly need can be neatly stored in portable .tar.gz archives, basically eliminating the excruciating tech “language barrier” and back-and-forth that sometimes trigger dreaded bottlenecks in support workflows – extending the amount of time needed to find a resolution by minutes, hours, or even days.  

Sounds awesome, right? Well, you haven’t seen anything yet.

Prerequisites

So the above sounds awesome but seeing is believing, so I’m gonna walk you through the process of setting up Troubleshoot on your workstation starting with preflight, but before we get into the fun part you’ll need a workstation with a running kubernetes cluster  & kubectl installed. If you don’t already have k8s running, one of the quickest & easiest ways to stand up a cluster is with MiniKube, but I’m not the boss of you.

The important thing is that a cluster in some form is running, & that kubectl is installed & in your $PATH. Once the environment is set up, we’ll need to add the krew plugin in order to have access to install Troubleshoot components.

Installing krew

kubectl krew is an open source plugin manager for kubernetes that we’ll need for installing preflight & support-bundle. Think of it kinda like apt, yum, dnf or brew, but specifically for kubectl.

Installation of Krew is already pretty simple, but to save you a bit of effort I threw everything into a bash script that will complete the setup for you.

Once the script is downloaded, execute it using the command bash krew-installer.sh. Once that’s done, we’ll need to verify that everything worked properly by running a simple command like:

~ % kubectl krew -h

If you get an error message saying the plugin isn’t available, reload your terminal with the command source ~/.bashrc, then try the help command again.

Preflight

Alright, now that krew has installed, we can setup the first component of Troubleshoot.sh, which is called preflight. preflight is used to verify that kubernetes environments meet specific requirements before deployment, declared using something called analyzers. Analyzers are yaml specifications that define a set of criteria & operations to run against data collected in a preflight check or support-bundle.

We can install preflight using the command:

~ % kubectl krew install preflight 

Once install completes, you can verify that installation was successful by running the example preflight check provided by the Replicated crew to familiarize yourself with the preflight format. When you’re done, hit the letter q to return back to the main terminal.

~ % kubectl preflight https://preflight.replicated.com 

Writing custom checks

So now that we’ve verified that preflight is functional, let’s write our own little check to see if our environment can support kURL – another Open Source project in the Replicated catalog that creates & installs custom Kubernetes distros on a single node (which can be extended to N nodes) using cURL. 

To start, we’ll need a Kubernetes yaml file like this named kurl-preflight.yaml (or whatever you want – I’m not the boss of you):

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: kURL Preflight
spec:
  analyzers: []

Under analyzers is where we can specify exactly what we want checked using – you guessed it – analyzers listed in Troubleshoot docs. Technically, the above yaml is a valid check, though it won’t return any information since we haven’t told it what to look at/for. So let’s fix that by first looking at kURL minimum requirements according to official documentation:

Minimum System Requirements

  • 4 CPUs or equivalent per machine
  • 8 GB of RAM per machine
  • 30 GB of Disk Space per machine

After reviewing the available analyzers in official docs, we can evaluate a host for everything in the above list starting at line 6 to include analyzers from the node resources list. After getting rid of the [] placeholder, let’s specify which analyzer we want to use, & also give the check a description in the checkName section:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: kURL Preflight
spec:
  analyzers:
        - nodeResources:
            checkName: One node must have 8 GB RAM, 4 CPU Cores and 30Gi disk 

Now we need to set some filters for what we want preflight to analyze using the filters parameter. Once again referencing the list of predefined filters in the docs, we can find the info we need using the allocatableMemory, cpuCapacity, & ephemeralStorageCapacity filters, as shown below:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: kURL Preflight
spec:
  analyzers:
        - nodeResources:
            checkName: One node must have 8 GB RAM, 4 CPU Cores and 30Gi disk
         filters:
           allocatableMemory: 8Gi
           cpuCapacity: "4"
           ephemeralStorageCapacity: 30Gi

Lastly, we’ll need to tell preflight how to report whether the check passes or fails under the outcomes parameter. From the provided options, min(filterName) seems most suitable for our checks. So let’s update the allocatableMemory filter first:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: kURL Preflight
spec:
  analyzers:
        - nodeResources:
            checkName: One node must have 8 GB RAM, 4 CPU Cores and 30Gi disk
       filters:
         allocatableMemory: 8Gi
       outcomes:
         - fail:
             when: "min(allocatableMemory) < 8Gi"
             message: Cannot find a node with sufficient memory 
         - pass:
             message: Sufficient memory is available

Once that’s updated, we’re ready to add outcomes to the rest of our filters following the same guide. The completed preflight check should look similar to the below:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: kurl-precheck
spec:
  analyzers:
    - nodeResources:
        checkName: Single node must have 8GB RAM
        outcomes:
          - fail:
              when: "min(memoryCapacity) <= 8Gi"
              message: Node must have at least 8Gi of memory
          - pass:
              message: Node has sufficient memory
    - nodeResources:
         checkName: Node must have at least 4 CPU cores
         outcomes:
           - fail:
               when: "min(cpuCapacity) < 4"
               message: The node must have at least 4 CPU cores
           - pass:
               message: The node has sufficient CPU cores
    - nodeResources:
         checkName: Node has sufficient disk space
         outcomes:
           - fail:
               when: "min(ephemeralStorageCapacity) < 30Gi"
               message: The node must have at least 30Gi storage
           - pass:
              message: The node has sufficient disk space

And that’s it! Now we can head over to our nearest terminal to run our fully customized preflight check with the command:

~ % kubectl preflight kurl-preflight.yaml

Pretty neat, right? That’s barely scratching the surface of all that preflight can do, & we haven’t even gotten to support-bundle yet – which is why this blog is becoming a whole series on the awesomeness of Troubleshoot.sh. Keep an eye on this space for the next part of our series where we’ll take a deep dive into support-bundle – the most powerful little kubectl plugin you should be using. 

In the meantime, you can learn more about the Replicated catalog by signing up for a 21-day trial of  Replicated Kots – or Kubernetes off the Shelf software. In addition to awesome docs, follow along with our video+blog series that will get you up & running with Kots software in no time flat. Head on over to vendor.replicated.com to get started.