Step-by-Step Guide: Troubleshooting A Failed Deployment Using the Compatibility Matrix

Evans Mungai and Kaylee McHugh | Jan 4, 2024

Recently, Replicated launched the Compatibility Matrix - a new tool for spinning up ephemeral Kubernetes clusters on more than 60,000 combinations of distributions, K8s versions, and configurations. Our own internal support team uses the Compatibility Matrix to re-create customer environments when debugging issues, and any ISV can follow the same process to quickly resolve their own customers’ issues. Evans Mungai, Sr. Customer Reliability Engineer at Replicated, walks through this process.

Introduction

Hi, my name is Evans, and I'm a CRE (Customer Reliability Engineer) at Replicated. I'm here to talk to you about how we use CMX (Compatibility Matrix) clusters to support our customers.

On a typical day, I'm responding to vendors through support cases that they raise in our Vendor Portal. These cases are usually about installing applications at customer sites, customer configurations of those applications, upgrades, and so on. Some of these installations, upgrades, or configurations are challenging enough that we have to reproduce them locally or in our own test environments.

This would take quite some time because we need to create clusters, destroy clusters, install customer applications, and try to create various configuration options for these applications.

In my suite of tools, I have a number of Vendor Portal applications configured to mimic how a customer’s application would behave.

I have a Helm app, an air-gap app, and a KOTS application as well, all with various manifest and configuration modes that try to cover as much surface area of our product as possible.

At the same time, I have VMs at my disposal, which I create and delete quite often as I set up clusters - such as kURL or kind - for embedded cluster environments. If I need a cloud-provided cluster, such as GKE, AKS, EKS, or OpenShift, I can create one and try to reproduce problems there. More recently, I started using CMX (Compatibility Matrix), which is a product for creating various Kubernetes clusters.

Today, I'll demo troubleshooting a sample application on a CMX (Compatibility Matrix) cluster to show how I do this. While I'll be highlighting how I debug issues for our own customers, these exact same steps can be easily followed by any ISV to debug their own customers' issues when deploying their application.

Step 1: Getting a Customer’s Information 

Whenever a case is raised, the first thing we ask for from a vendor is a support bundle.

I have one here, and this use case that we are looking at is of a KOTS application where the admin console is running properly, but we are unable to deploy the customer's application.

The customer's application in this case is a simple nginx pod which needs to be deployed. That deployment is failing. 

Whenever we receive a support bundle, the first thing I do is check the overall health of the cluster. The support bundle contains Kubernetes manifests and various other config files that are collected by the support bundle binary at the customer side.

Step 2: Checking the Customer’s Config 

Since we have the manifests, we can use a tool called sbctl, which starts a local Kubernetes API server and serves the bundle's manifests as a read-only Kubernetes cluster.

Before we do that, let's check the current state - as you can see, there's no cluster running. Now let’s start an sbctl cluster.

Then we check whether all pods are healthy. They look healthy, which is okay. There’s nothing more to follow up on with regards to the health of the cluster. 
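As a rough sketch of this step, the commands below open a support bundle with sbctl and list its pods. The bundle path is a hypothetical placeholder, and the `--support-bundle-location` flag is written from memory of the sbctl README, so confirm it with `sbctl shell --help`. The `run` helper is my own addition: it only prints each command unless `DRY_RUN=0` is set.

```shell
# Print each command; execute it only when DRY_RUN=0 (helper added for this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical path to the bundle attached to the support case.
BUNDLE="./support-bundle.tar.gz"

# Open a shell whose kubeconfig points at the read-only API server sbctl starts.
run sbctl shell --support-bundle-location "$BUNDLE"

# This command is meant to be typed inside that sbctl shell:
# list pods in every namespace to get a first view of cluster health.
run kubectl get pods --all-namespaces
```

Because the cluster is read-only, only read commands such as `kubectl get` are useful here.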

Since we know that KOTS Admin Console could not deploy the customer's application, the next thing we do is check the logs of the KOTS Admin Console.

We're seeing a bunch of errors, which indicates something is going wrong. As we look closer, we can see that the KOTS Admin Console was applying resources. These are Kubernetes manifests that are part of the customer's release. Each resource normally produces a pair of log lines - “applying resource…” followed by “applied resource…” - but here the last “applying resource” line has no matching “applied” line. That apply failed.

So this is the resource that's failing. The rest did succeed. My next step is to look at the customer's release.
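When the logs are long, the same "applying without applied" check can be scripted. This is a minimal sketch: the log excerpt is invented for illustration and real KOTS Admin Console log lines are formatted differently, so the patterns would need adjusting. The function name `find_unapplied` is also my own.

```shell
# Hypothetical log excerpt; adjust the patterns to the real log format.
sample_log='applying resource deployment/nginx
applied resource deployment/nginx
applying resource service/nginx
applied resource service/nginx
applying resource configmap/app-config
error: failed to apply configmap/app-config'

# Print every resource that has an "applying" line but no matching "applied" line.
find_unapplied() {
  awk '
    /^applying resource / { pending[$3] = 1 }
    /^applied resource /  { delete pending[$3] }
    END { for (name in pending) print name }
  '
}

failed=$(printf '%s\n' "$sample_log" | find_unapplied)
echo "not applied: $failed"
```

In a real case, the function would read the saved Admin Console logs from standard input instead of the sample text.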

Step 3: Investigating the Customer’s Release 

The customer's release manifests are collected in the support bundle, which we're going to take a look at in a moment.

We’ll now open the customer's release. Remember - we need a resource called app-config.

Naturally this directory would contain many, many more files. It might be a bit of a process looking for what you'd like to find here, but eventually we end up finding it.

This is the config. These are the configs that couldn't be applied. 

We can see it has some configuration options here. What I would do immediately is try to deploy this in my own cluster.

Assuming by looking at this I couldn't figure out what the problem is, I'll try to re-create this resource in my own cluster.

Step 4: Bringing in the Compatibility Matrix 

Now, we're going to create a cluster in CMX (Compatibility Matrix). 

replicated cluster create --distribution openshift
(I’ll create an OpenShift one).

And as it's creating, I can already show you various versions we have available. 

As you can see, we have quite a number of them - AKS, K3s, kind - and this lets me create any kind of cluster from the same CLI, without leaving this terminal or interacting with various other cloud provider tools or interfaces. I just use the replicated cluster CLI. Now the cluster is being verified.
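The cluster steps so far might look roughly like this on the command line. The cluster name and TTL are illustrative values of my own, and the `run` helper is added for this sketch so that nothing is created unless `DRY_RUN=0` is set. The replicated CLI also expects an API token to be configured, which is not shown.

```shell
# Print each command; execute it only when DRY_RUN=0 (helper added for this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# List the distributions and versions CMX currently offers.
run replicated cluster versions

# Create a short-lived OpenShift cluster; name and TTL are illustrative.
run replicated cluster create --distribution openshift --name support-repro --ttl 1h

# Check on the cluster while it provisions.
run replicated cluster ls
```

Setting a short TTL is a design choice for throwaway reproduction clusters: the cluster cleans itself up even if you forget to remove it.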

As it's doing that, the next thing I intend to do in this case would be to create an application. 

I took the liberty of creating a demo application here, and what I'd like to do now is create an application that would deploy this thing we're seeing here. 

We'll create a new release, so we know it's a KOTS application. We're trying to create a config map - just need to create some new resources here. Let’s use one of the existing ones since this is a throwaway release.

Alright - we have two configs here. What we'll do is come to the customer's config and collect those two configuration options which are right here, and add them to my config file.

Let's create the release and then promote it to unstable. 

And I have a customer called demo-customer, whose license is here.

Well, actually, I don't have the license, but you know what? I'll just download the license.

Move it here. 

Now I have a demo-customer. Let's check that my cluster got created today. That took about a minute. And right here, we have our OpenShift cluster.

I'll now deploy my test application in the new cluster.

We hope to see the same failure that we were seeing in the support bundle.

Step 5: Seeing if the Local Error is the same as the Customer’s Error 

I'll launch a shell, which essentially fronts my cluster right here. Just like sbctl, this is going to launch a shell which exports a kubeconfig.

And with that, you should be able to run kubectl within that shell.

Let's now install the application.

I’ll go to channels, and I'll come and try to install this application. 

Let's install it into the default namespace with a shared password, “mypass” (pick a better password when you’re doing this yourself!). This should install my admin console application, and while that runs, let's look at the release.
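For reference, the shell and install steps above could be sketched as follows. The cluster ID and app slug are hypothetical placeholders, and the channel name assumes the release was promoted to Unstable as described earlier. The `run` helper is added for this sketch and only prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command; execute it only when DRY_RUN=0 (helper added for this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical values: take the ID from `replicated cluster ls` and the slug from Vendor Portal.
CLUSTER_ID="REPLACE_WITH_CLUSTER_ID"
APP_SLUG="demo-app"

# Open a shell with a kubeconfig for the CMX cluster.
run replicated cluster shell "$CLUSTER_ID"

# The next two commands are meant to be typed inside that shell.
# Install the admin console and the app from the unstable channel.
run kubectl kots install "$APP_SLUG/unstable" --namespace default --shared-password mypass

# Port-forward to the admin console to upload the license and deploy.
run kubectl kots admin-console --namespace default
```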

Now we can already see in Vendor Portal that linting is pointing us to an issue with this specific configuration. There is a warning here. When I'm looking at cases and see these kinds of things, I normally dig into them before I go on to create clusters. For this demo, I'll skip that so that we can see how installing in CMX works.

Alright, so now the next thing we'll do is go to the admin console.

Okay, let's choose a file.

Let's continue.

This is the process it takes to create a customer's application. If I were launching a VM, which takes about 5 to 30 minutes depending on my setup, I wouldn't be this far by now.

Alright, let's deploy the release.

Alright, now we see some deployment failures which is what I was expecting to see.

We can see the same problem that the customer is seeing in their environment. We can check logs in the terminal.

It's the same problem that we are seeing in the support bundle that we received. 

Step 6: Fixing the Issue

The next thing I would do is to investigate the solution. I do know what the solution is, which is the same thing that we are seeing here in the warning.

Essentially, this value needs to be a string, but the ParseBool template function used here returns an actual boolean, not a string.

What we would do now is correct the problem - we need to wrap this within a string.
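To make the difference concrete, here is a small sketch with two rendered ConfigMaps, one with a bare boolean value and one with the value quoted. The key name `feature_enabled` is hypothetical, and the `grep` check is only a crude heuristic of my own for spotting bare `true`/`false` values, not what the Vendor Portal linter does. Kubernetes rejects the first form because ConfigMap `data` values must be strings.

```shell
# Rendered output when the templated value is a bare boolean (hypothetical key name).
broken='apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  feature_enabled: true'

# The same manifest after wrapping the templated value in quotes.
fixed='apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  feature_enabled: "true"'

# Print indented "key: true|false" lines, i.e. values YAML reads as booleans.
find_bare_booleans() {
  grep -E '^[[:space:]]+[A-Za-z0-9_.-]+:[[:space:]]*(true|false)[[:space:]]*$'
}

broken_hits=$(printf '%s\n' "$broken" | find_bare_booleans || true)
fixed_hits=$(printf '%s\n' "$fixed" | find_bare_booleans || true)
echo "broken manifest: ${broken_hits:-none}"
echo "fixed manifest: ${fixed_hits:-none}"
```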

But there may be cases where you have no clue, or the linter does not catch your problem, and it would require iterating through making changes and upgrading your application. 

I’ll now update the release and promote it.

Then check for updates in KOTS.

It would require this cycle of changing, upgrading your application, deploying a new version, seeing whether that fixes the problem until you get to a point where you understand what the root cause is before getting back to the customer. 
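One pass through that cycle could be scripted roughly like this, as an alternative to clicking through Vendor Portal and the admin console. The manifests directory and app slug are hypothetical, and the replicated CLI additionally needs the app and an API token configured, which is not shown. The `run` helper is added for this sketch and only prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command; execute it only when DRY_RUN=0 (helper added for this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical values for the throwaway test application.
APP_SLUG="demo-app"
MANIFESTS_DIR="./manifests"

# Create a release from the edited manifests and promote it to Unstable.
run replicated release create --yaml-dir "$MANIFESTS_DIR" --promote Unstable

# From the cluster shell: check for the new version and deploy it.
run kubectl kots upstream upgrade "$APP_SLUG" --namespace default --deploy
```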

Let's see whether the deployment succeeds this time.

Yes, the deployment did succeed! I expected that. 

Now that we've figured out the root cause, we'd go back to the customer, take another look at the release in the support bundle we examined earlier, and suggest the changes they need to make.

Conclusion and the Value of the Compatibility Matrix 

From this demo, I hope you've gotten to see how we interact with CMX (Compatibility Matrix). 

I managed to create a cluster which is right here - an OpenShift cluster.

I managed to get quicker access to disposable k8s clusters. Using the same replicated CLI interface, I'm able to create different distros from different providers - be it cloud or embedded cluster setups - across the various Kubernetes versions you've seen.

I have a wide selection of cluster types, as I've mentioned, and a consistent interface to all of them. I hope this was helpful, and I hope you find CMX disposable clusters useful for troubleshooting your own applications. Thank you very much.
