Debugging Kubernetes: Enhancements to Troubleshoot

Ada Mancini
 | 
Oct 26, 2022

We are always working to make debugging Kubernetes problems easier, and we've added some new features to the open source Troubleshoot project that will save time and energy for everyone! If you’re not familiar, Troubleshoot is a kubectl plugin providing diagnostic tools for Kubernetes applications, often used to check that application and resource environment requirements are met (with preflight checks), and also used to collect and analyze bundles of log files (with support bundles.)

First, we considered the design of support bundle manifests. For software vendors shipping their app using Replicated, there's a single support bundle spec with all the collection and analysis logic for an application and the K8s control plane. Yet that pattern doesn't map to how some teams deliver software. Many organizations, us included, use a team-based release model, contributing updates and improvements in parallel. We felt that it should be easier for teams who are releasing a project together to be able to write support bundles and preflight checks this way too.

And for users, it would be amazing if your apps and cluster components like CNI, CSI, Ingress Controller, etc. came with support bundle that were discoverable from `kubectl` in the cluster, so that if you ever run into a problem you have all the tools at hand to solve it quickly.

Many end users may not consume an app update or upgrade as soon as they become available, and that presents a problem for Troubleshoot. If you release version 1.0 of your app and in a month you realize that there's a bug that needs to be fixed, then you’ll want to alert people through your support bundle (often when they run into that bug and suggest an upgrade.) But upgrading the support bundle manifest included with the application requires an application upgrade, so then you have a chicken-or-egg problem. It would be great if app consumers could benefit from the updates in troubleshooting specs provided by vendors without having to explicitly update the app (or kURL cluster addons).

Enhancements to Troubleshoot: modular specs and updates

We're happy to announce some new updates for the Troubleshoot project to help make debugging applications more modular and dynamic, including:

  • Support bundle specs are now more modular, and a support bundle can be generated by merging specs provided from multiple apps or components.
  • The `uri` field offers a way to refer to a support bundle hosted online, which can be updated out of band from the application

Making support bundle specs more modular

Here’s a minimal example of a support bundle for debugging a “Super Cool CRUD” application:

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
     name: super-cool-crud
spec:
     collectors:
          - clusterInfo: {}
          - clusterResources: {}
          - ceph: {}
          - longhorn: {}
          - weave: {}
          - logs:
               selector:
                    - app.kubernetes.io/name=nginx
                    - app.kubernetes.io/part-of=wordpress
           - logs:
               selector:
                    - app.kubernetes.io/name=php-fpm
                    - app.kubernetes.io/part-of=wordpress
           - logs:
                selector:
                    - app.kubernetes.io/name=postgres
                    - app.kubernetes.io/part-of=wordpress
     analyzers:
          - deploymentStatus:
               checkName: Check nginx is operational
               name: nginx
          - deploymentStatus:
               checkName: Check php is operational
               name: php-fpm
          - deploymentStatus:
               checkName: Check postgres is operational
               name: postgres
           - cephStatus:
                 checkName: Check Ceph status
            - clusterPodStatuses:
                 outcomes:
                 - fail:
                         message: 'Status: {{ .Status.Reason }}'
                         when: '!= Healthy'
                  - weaveReport:
                          reportFileGlob: kots/kurl/weave/kube-system/*/weave-report-stdout.txt[.pre]

If your frontend and backend teams are working on different parts of the application independently, and in parallel, that could lead to a lot of merge conflicts if many people are attempting to keep this one spec file up to date. With a larger application, ownership of the changes here could get messy.

You could split this manifest into a few smaller manifests that map more closely to how your teams do their releases:

[.pre]# super-cool-crud-cluster.yaml
---
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
     name: super-cool-crud-cluster
spec:
     collectors:
          - clusterInfo: {}
          - clusterResources: {}
          - ceph: {}
          - longhorn: {}
          - weave: {}
      analyzers:
          - cephStatus:
                 checkName: Check Ceph status
          - weaveStatus:
                 checkName: Check Weave CNI
          - clusterPodStatuses:
                 outcomes:
                 - fail:
                      message: 'Status: {{ .Status.Reason }}'
                      when: '!= Healthy'
                - weaveReport:
reportFileGlob: kots/kurl/weave/kube-system/*/weave-report-stdout.txt[.pre]

[.pre]# super-cool-crud-frontend.yaml
---
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
     name: super-cool-crud-frontend
spec:
     collectors:
          - logs:
                selector:
                     - app.kubernetes.io/name=nginx
                     - app.kubernetes.io/part-of=super-cool-crud
          - logs:
               selector:
                     - app.kubernetes.io/name=php-fpm
                     - app.kubernetes.io/part-of=super-cool-crud
          - deploymentStatus:
                 checkName: Check php is operational
                  name: php-fpm
     analyzers:
          - deploymentStatus:
               checkName: Check nginx is operational
               name: nginx[.pre]

[.pre]# super-cool-crud-backend.yaml
---
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
name: super-cool-crud-backend
spec:
collectors:
- logs:
selector:
- app.kubernetes.io/name=postgres
- app.kubernetes.io/part-of=super-cool-crud
analyzers:
- deploymentStatus:
checkName: Check postgres is operational
name: postgres
- postgres:
checkName: Check[.pre]

Now, if your frontend team wants to add checks for HTTP request logs, or if the backend team wants to query Postgres for analysis, they don't have to wait to synchronize changes or manage merge conflicts updating a single manifest file for an entire project.

To generate a support bundle from these combined specs, you can just list them at the command line:

[.pre]kubectl support-bundle super-cool-crud-cluster.yaml super-cool-crud-
frontend.yaml super-cool-crud-backend.yaml[.pre]

All of the collectors and analyzers from all three specs will be merged into one support bundle.  Right now this functionality is available from the `kubectl support-bundle` CLI, and soon we'll be adding support for modular specs to our App Manager for Replicated apps!

Using online specs for support bundles

Replicated ships a Kubernetes addon called EKCO as part of kURL, and we want to make sure that if you are using a kURL cluster you have installed this addon in order to automate some cluster tasks, like certificate rotation. You might want to ship your first version with a manifest that checks to see if EKCO is installed:

[.pre]
     collectors:
          - data:
               name: static/ekco.yaml
               data: |
               apiVersion: apps/v1
               kind: Deployment
               metadata:
                    name: ekc-operator
                    namespace: kurl
               spec:
                    containers:
                         - name: ekc-operator
                              image: replicated/ekco:v0.1.0
     analyzers:
          - deploymentStatus:
               checkName: Check EKCO is operational
               name: ekc-operator
               namespace: kurl
               outcomes:
                    - fail:
                         when: absent
                         message: EKCO is not installed - please add the EKCO component to your kURL spec and re-run the installer script
                    - fail:when: "< 1"message: EKCO does not have any ready replicas
                    - pass:message: EKCO has at least 1 replica[.pre]

Let’s say a few months have gone by since the 0.1.0 release, and you've been iterating on this component, but in the intervening time, some issues have been opened in your GitHub project and you've discovered a bug in your certificate rotation that affects users of EKCO 0.4.0 and lower. It would be great if you could detect this potential problem and let your users know how to fix it! You might think, "let's update our support bundle spec and warn the user if they have a version below 0.5.0, so let's write an analyzer that will look for the `image:` field to see what version is running."

[.pre]analyzers:
     - textAnalyze:
          checkName: Check installed EKCO version
          fileName: static/ekco.yaml
          regexGroups: 'image: replicated/ekco:v(?P<Major>\d+)\.
(?P<Minor>\d+)\.(?P<Patch>\d+)'
          outcomes:
               - warn:
                    when: "Minor < 5"
                    message: A very important bug fix has been released in EKCO 0.5.0. Please upgrade to the latest available version.
               - pass:
                     message: EKCO is up to date[.pre]

You might also want to alert current users of a bug that may affect them and encourage them to upgrade, but previously a kURL user would have needed to upgrade their cluster to pull in the updated version of EKCO that includes the updated support bundle spec. Instead, you can now use the new URI field to reference a spec that's hosted online. If Troubleshoot parses a URI it will download that manifest and use it.

Let's try setting the `uri` to a self-link:

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
     name: ekco
spec:
     uri: https://raw.githubusercontent.com/replicatedhq/troubleshoot-specs/main/in-cluster/ekco.yaml
     collectors: ...
     analyzers: ...[.pre]

Now, whenever this spec gets processed alone, or as part of a collection of specs that have been merged together, Troubleshoot will process the most up to date version of the spec hosted online.  If Troubleshoot can't reach the URI in the spec, it falls back to what's provided in the spec file so collection & analysis doesn't fail.

Here's our completed spec file, including the new URI field:

[.pre]apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundlemetadata:
     name: ekco
spec:
     uri: https://raw.githubusercontent.com/replicatedhq/troubleshoot-specs/main/in-cluster/ekco.yaml
     collectors:
          - data:
                name: static/ekco.yaml
                data: |-
                     apiVersion: apps/v1
                     kind: Deployment
                     metadata:
                          name: ekc-operator
                          namespace: kurl
                     spec:
                          containers:
                               - name: ekc-operator
                                  image: replicated/ekco:v0.4.1
     analyzers:
          - deploymentStatus:
               checkName: Check EKCO is operational
               name: ekc-operator
               namespace: kurl
               outcomes:
                    - fail:
                         when: absent
                         message: EKCO is not installed - please add the EKCO component to your kURL spec and re-run the installer script
                    - fail:
                          when: "< 1"message: EKCO does not have any ready replicas
                    - pass:
                    message: EKCO has at least 1 replica
          - textAnalyze:
                  checkName: Check installed EKCO version
                  fileName: static/ekco.yaml
                  regexGroups: 'image: replicated/ekco:v(?P<Major>\d+)\.(?P<Minor>\d+)\.(?P<Patch>\d+)'
                  outcomes:
                        - warn:
                        when: "Minor < 5"
                        message: A very important bug fix has been released in EKCO 0.5.0.  
Please upgrade to the latest available version.- pass:message: EKCO is up to date[.pre]

Going forward, when we push updates to our project, Troubleshoot will automatically use the latest version of our spec that captures the collective experience we've gained while supporting our application.

Reach out to us if you’re interested in learning more about the open source Troubleshoot project, or would like to better understand how Replicated can help you support your customers and their K8s environments.