Troubleshoot.sh (A Guide to Debugging Kubernetes)

Alexander Trelore
 | 
Sep 13, 2022

Managing and debugging Kubernetes (K8s) is hard. I thought I understood Kubernetes: I had provisioned clusters from scratch, installed Istio and a whole bunch of other tools, and fixed outages in the past. I then joined an organization whose expertise was Kubernetes, and oh boy, did I have a lot to learn.

This isn’t a blog post about what I’ve learned. I think that would be a useless post, since it’d be irrelevant within weeks. This is a post on how to debug clusters using concepts from Troubleshoot, which I’m hoping will stay relevant for a little while longer.

Note: This blog was originally published as an exploration by Replicated engineer Alexander Trelore on his own page here, and is republished here only lightly edited. Replicated doesn't necessarily recommend K9s to manage and visualize a cluster, but this was a nice way to help others understand Troubleshoot.

The basics of debugging Kubernetes

This section will be brief as we just make sure everyone is up to speed on the basics of debugging Kubernetes clusters.

The Kubernetes docs have a great set of resources for monitoring, logging, and troubleshooting.

Another great resource is this flowchart.

Lastly, I want to highlight a few tools that help visualize the cluster. Personally I’m a visual learner, and the more insight I have into the cluster the better. For starters, here’s k9s, a CLI tool that describes itself as a way to manage your cluster, but there are a few visualization tools within it. Secondly, there’s Lens, the visualization choice of many (it’s on my to-do list to use on stream), which just looks phenomenal.

This should serve as a baseline for everyone to get up to speed. So let’s get started with the main course!

Introducing Troubleshoot.sh, an open source tool for debugging Kubernetes

Replicated sponsors an open source project called Troubleshoot; follow the link for the official troubleshoot.sh page. There are two major sides to Troubleshoot: preflight checks and support bundles.

Preflight checks are intended to be run before an application is installed onto a cluster. They allow the administrator to know things like, "is there enough CPU for my application?", "is the Kubernetes version at least version X?", and "do certain secrets exist?"
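
As a rough sketch of how those three questions could look in a spec, here is a minimal Preflight manifest. It assumes the clusterVersion, nodeResources, and secret analyzers (plus the secret collector) behave as described in the Troubleshoot docs. The name example-preflight, the thresholds (Kubernetes 1.20.0, 4 CPU cores), and the secret mysecret are made up for illustration, so check the field names against the docs for your Troubleshoot version.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: example-preflight # hypothetical name
spec:
  collectors:
    # Assumption: the secret analyzer needs the secret to be collected first.
    - secret:
        name: mysecret
        namespace: default
        key: password
        includeValue: false # record that the key exists, not its value
  analyzers:
    # "Is the Kubernetes version at least version X?"
    - clusterVersion:
        outcomes:
          - fail:
              when: "< 1.20.0" # illustrative threshold
              message: This application needs Kubernetes 1.20.0 or later.
          - pass:
              message: The Kubernetes version is new enough.
    # "Is there enough CPU for my application?"
    - nodeResources:
        checkName: Total CPU cores
        outcomes:
          - fail:
              when: "sum(cpuCapacity) < 4" # illustrative threshold
              message: The cluster needs at least 4 CPU cores in total.
          - pass:
              message: The cluster has enough CPU cores.
    # "Do certain secrets exist?"
    - secret:
        checkName: Password secret exists
        secretName: mysecret
        namespace: default
        key: password
        outcomes:
          - fail:
              message: The secret mysecret with key password was not found.
          - pass:
              message: The secret mysecret with key password exists.
```

With the preflight plugin installed, a spec like this would typically be run with kubectl preflight ./preflight.yaml.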

Support bundles are subtly different. Their main use case is “something broke, what can I inspect and share with the software maintainers?” They allow cluster administrators to learn the same things as preflight checks, but they also export a file that users can explore and share after the fact.

Both of these have three components: collectors, redactors, and analyzers.

Collectors allow administrators to collect (as the name implies) details about their cluster, from logs to cluster information and everything in between.

Redactors (again as the name implies) allow users to redact information. The use case here is that a collector may collect sensitive information such as database connection strings. This step allows us to remove any sensitive information in a couple of different ways. A few redactors run by default.

Lastly we have analyzers (also excellently named, if I may add). Instead of manually going through an incredibly large amount of information, users can rely on analyzers to quickly highlight issues with the cluster.

Customizing Troubleshoot

For this section we’ll just spin up an empty k8s cluster, with a couple extra little things. Feel free to deploy as much or as little as you want to your cluster.

Let’s create a simple password secret with the following command:

[.pre]kubectl create secret -n default generic mysecret --from-literal=password=hunter2[.pre]

Once your demo application (or real application) is ready to be troubleshot (troubleshooted?), we’ll explore how to write collectors, redactors, and analyzers.

Writing your own Troubleshoot Collector

There are many types of collectors to choose from, ranging from collecting host-level information to copying files from pods. We’re going to copy a file from a pod.

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - copy:
        selector:
          - run=busybox
        namespace: default
        containerPath: /etc/foo
        containerName: busybox[.pre]

This will copy the path /etc/foo from the pod with the label run=busybox into the support bundle.

There are many more collectors. I’d heavily recommend exploring them and trying out the ones that may be useful.
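
To give a flavor of a few others, here is a small sketch that gathers general cluster information, cluster resources, and the logs of the same busybox pod. It assumes the clusterInfo, clusterResources, and logs collectors as described in the Troubleshoot docs; the name more-collectors and the 1000-line limit are arbitrary choices for illustration.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: more-collectors # hypothetical name
spec:
  collectors:
    # Basic information about the cluster, such as its version.
    - clusterInfo: {}
    # Kubernetes resources (pods, deployments, and so on) from the cluster.
    - clusterResources: {}
    # Logs from every pod matching the label selector.
    - logs:
        selector:
          - run=busybox
        namespace: default
        limits:
          maxLines: 1000 # illustrative cap on lines collected per container
```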

Writing your own Troubleshoot Redactor

Redactors are slightly different in that they have their own manifest Kind (i.e. it’s not SupportBundle or Preflight).

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: example
spec:
  redactors:
    - name: all files
      removals:
        yamlPath:
          - password[.pre]

There are a few different ways to redact information. We’ve chosen yamlPath because our collector is collecting a YAML file.
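
For comparison, here is a sketch of two of the other approaches: removing a literal value, and removing whatever a regular expression captures. It assumes the values and regex removal types from the Troubleshoot docs, where the capture group named mask marks the part to hide. The redactor names and the regex itself are illustrative, not taken from the original post.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: more-redactors # hypothetical name
spec:
  redactors:
    # Redact a known literal string wherever it appears.
    - name: literal password value
      removals:
        values:
          - hunter2
    # Redact whatever follows "password: " on a line.
    # Only the group named "mask" is replaced; the prefix is kept.
    - name: password assignments
      removals:
        regex:
          - redactor: '(password: )(?P<mask>.*)'
```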

Writing your own Troubleshoot Analyzer

Analyzers allow us to quickly identify issues with the cluster.

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  analyzers:
    - yamlCompare:
        checkName: Compare YAML Example
        fileName: default/busybox/busybox/etc/foo/secrets.yaml
        path: username
        value: "Alexander"
        outcomes:
          - fail:
              when: "false"
              message: The collected data does not match the value.
          - pass:
              when: "true"
              message: The collected data matches the value[.pre]

Following our YAML obsession, we’re going to use yamlCompare, which lets us specify a file, a path within it, and a value to compare against. We can also specify the pass and fail conditions.

As with collectors and redactors, there are also many analyzers.
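
As one example of another analyzer, this sketch flags unhealthy pods in the default namespace. It assumes the clusterPodStatuses analyzer and its "!= Healthy" condition as I understand them from the Troubleshoot docs, and it relies on the clusterResources collector having gathered pod data. The names and message are made up for illustration, so double check the fields before relying on it.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: more-analyzers # hypothetical name
spec:
  collectors:
    # Pod statuses come from the collected cluster resources.
    - clusterResources: {}
  analyzers:
    - clusterPodStatuses:
        name: unhealthy-pods # hypothetical name
        namespaces:
          - default
        outcomes:
          # Assumption: this condition matches any pod that is not healthy.
          - fail:
              when: "!= Healthy"
              message: A pod in the default namespace is not healthy.
```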

Putting it all together to troubleshoot Kubernetes

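Here's our secrets.yaml file:

[.pre]username: Alexander
password: "12345678"[.pre]

Create the secret from it. If mysecret still exists from the earlier example, delete it first with kubectl delete secret -n default mysecret, because kubectl create fails when a secret with that name already exists:

[.pre]kubectl create secret -n default generic mysecret --from-file=secrets.yaml[.pre]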

Here's our deployment.yaml file:

[.pre]apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    run: busybox
spec:
  containers:
    - command:
        - sleep
        - "3600"
      image: busybox
      name: busybox
      volumeMounts:
        - name: foo
          mountPath: /etc/foo/secrets.yaml # needed for volumeMounts
          subPath: secrets.yaml # needed otherwise it's a symlink
          readOnly: true
  volumes:
    - name: foo
      secret:
        secretName: mysecret
        optional: false[.pre]

Apply it to the cluster:

[.pre]kubectl apply -f deployment.yaml[.pre]

Here's our support-bundle.yaml file:

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - copy:
        selector:
          - run=busybox
        namespace: default
        containerPath: /etc/foo
        containerName: busybox
  analyzers:
    - yamlCompare:
        checkName: Compare YAML Example
        fileName: default/busybox/busybox/etc/foo/secrets.yaml
        path: username
        value: "Alexander"
        outcomes:
          - fail:
              when: "false"
              message: The collected data does not match the value.
          - pass:
              when: "true"
              message: The collected data matches the value
---
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: example
spec:
  redactors:
    - name: all files
      removals:
        yamlPath:
          - password[.pre]

Generate the support bundle from it:

[.pre]kubectl support-bundle ./support-bundle.yaml[.pre]

With this you should see a support bundle in your terminal. (You can ignore this if you used --interactive.)

You can then share this bundle with the maintainers of your cluster, or anyone else who’s interested.

Recap on using troubleshoot.sh with Kubernetes

To recap, we’ve created a secret, deployed a pod, and created a support bundle. We have redacted the user’s password and automatically confirmed that the user’s username is correct.

I hope these examples get the point across about how automation can make it easy to diagnose problems with your own cluster.