The Hidden Pain of DIY On-prem K8s-based Software Distribution

Nikki Rouda | Aug 1, 2022

This blog is going to be a little different – rather than tell you about all the wonderful ways Replicated is helping software vendors, we’re going to explore what the experience is like for companies that try to build their own DIY software distribution tooling. This outline is based on a SaaS company and/or traditional on-premises software company delivering their app to customer Kubernetes (K8s) environments in the cloud for the first time. Think of it as an alternate history based on a composite of many people’s experiences. We hope you don’t make the same mistakes!

A timeline of hope and pain

Day 0 - The sales or product team asks engineering simple-sounding questions: “Can we deliver our SaaS application into our customer’s self-hosted Kubernetes environments?” or “Now that we’ve modernized and containerized our application, can we distribute it to customer-managed clusters in the cloud?” Either way, what they are really saying is:

“Our prospects keep asking for us to do this, and we’re leaving money on the table every time we say ‘no.’”

Day 1 - “How hard can it be?” The lead engineer spends a couple of weekends hacking on a rough solution, excited to build something new. It seems fairly straightforward to refactor the app to work in any AWS or customer-hosted environment, right? We could use Terraform, maybe.

A single location app installer
Your app only runs in one cloud environment

Day 30 - The field engineers deliver the app to their first customer-hosted K8s cluster running in an AWS Virtual Private Cloud (VPC). The proof-of-concept (POC) installation doesn’t go as smoothly as hoped, but after a couple of escalations to engineering and some patience from the customer, they finally get the app deployed. High fives!

Day 45 - The lead engineer has shipped several updates and changes to the new “on-prem” K8s installer to make it work. A production install is started in a different environment, but it’s not working the same way, and no one is quite sure why. More and more engineering time is being spent on Zoom with the customer, whose frustration is growing steadily. Other modernization, innovation, and/or backlog work is starting to take priority, and this project is starting to look a lot more complicated than expected. The impact:

The sales team is getting a bit nervous about their account and escalating to management. 

Day 60 - The project is no longer fun at all and continues to suck time and people. The Terraform scripts are failing security reviews at some companies. The lead engineer asks their manager to get them off this project ASAP because they are burning out. The company doesn’t want to halt the project because product and sales are close to closing this customer. There are a surprising number of on-prem and K8s cluster-based opportunities in the pipeline, and in this economy, the VP of Sales doesn’t want to turn away any revenue. The head of engineering begrudgingly assigns more engineers to work on the on-prem installer project, delaying the schedule for other planned app features and innovations.

A confusion of DIY app installer components
The DIY installer is getting complicated

Day 180 - A lot has gone on in the last four months. New customers are running the installer, but each one has a slightly different environment and installation requirements. A few examples:

  • While the first customer accepted the Ubuntu-based installer, the next wanted a RHEL installer. So the team spent two weeks building a second package and designing a CI/CD pipeline to build and test it in parallel with the Ubuntu-based package.
  • Two government and financial services customers needed air gap installers. Engineers decided this was too much effort with everything else going on. This represents a substantial hit to the revenue stream that drove the idea in the first place.

Day 270 - With a mix of failures and successes, the on-prem K8s install initiative carries on in fits and starts. More issues keep popping up. The install success rate is hovering around 50%: half of attempted installs end with the customer getting fed up and losing trust. Other customers and prospects keep asking for it, and a number of big accounts are now deployed with it, so it seems impossible to turn back, but the quagmire is getting deeper:

  • One customer runs into some CVEs which block an install, and it’s an all-hands-on-deck late-night scramble to patch the vulnerabilities and get everything stable again. 
  • Several customers have now (auto-)upgraded their Linux operating systems, which unfortunately broke the app packages, requiring rework and updates to the installer. Looks like this will happen at least once a quarter.
  • Mysterious storage and networking failures have required 10+ hours of hands-on troubleshooting across several weeks.
  • The first customer to install has yet to upgrade their installer and is at risk due to unpatched bugs, which were long-ago fixed in newer versions. Because the first version was not built with a self-serve upgrade path in mind, engineers spend another 10+ hours helping them perform a very manual migration to the latest version of the tool.
  • Despite management efforts to bring in other team members to the project, the lead engineer who built v1 is still constantly pulled into on-prem install support escalations.
  • One end customer had modified the base Ubuntu image to change the names of all the default network interfaces. More mysterious network issues caused problems until this change was discovered.
  • In environments where the customer brings their own Kubernetes cluster, the team has come across ten different flavors of Kubernetes ingress that need to be supported by the application configuration. Every single one has taken hours to fix, time taken away from other engineering work. (See the sketch below for the kind of environment-probing code this forces the team to write.)
  • Several end customers need enterprise long-term support (LTS) versions, which creates internal chaos and more firefighting. The team needs to hire and train a lot of support engineers on Kubernetes or just keep escalating to engineering…

Inevitable bugs are starting to cause issues
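
To make the ingress point concrete, here is a minimal sketch (in Go with client-go) of the kind of environment-probing code a DIY installer team ends up accumulating: list the IngressClasses in a customer’s cluster and flag the default so the app’s ingress resources can be templated to match. The program and its output format are illustrative assumptions, not anyone’s actual installer code.

```go
// detect_ingress.go - illustrative sketch: discover which ingress controller(s)
// a customer cluster exposes so a DIY installer can adapt its app config.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the field engineer has a kubeconfig for the customer cluster.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		fmt.Fprintln(os.Stderr, "load kubeconfig:", err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, "build client:", err)
		os.Exit(1)
	}

	classes, err := client.NetworkingV1().IngressClasses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		fmt.Fprintln(os.Stderr, "list ingress classes:", err)
		os.Exit(1)
	}

	// Report every controller found and flag the cluster default; nginx, traefik,
	// ALB, Istio, and others all show up differently in real customer clusters.
	for _, ic := range classes.Items {
		isDefault := ic.Annotations["ingressclass.kubernetes.io/is-default-class"] == "true"
		fmt.Printf("ingress class %q (controller %s, default=%v)\n", ic.Name, ic.Spec.Controller, isDefault)
	}
	if len(classes.Items) == 0 {
		fmt.Println("no IngressClass found; the app's ingress config will need manual attention")
	}
}
```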

Day 360 - One year in, the engineering team, exasperated and burnt out, holds another all-hands-on-deck meeting to reset and figure out what to do. Everyone dreads doing a rotation on the on-prem installer team; some people actively seek to get off the team. A few veteran engineers sit permanently on the team because they understand that without them, a big source of revenue would be in jeopardy. Engineering and product leadership agree to deemphasize new feature work to give the team up to 50% of their time for three months to invest back into the install tooling. While they’re at it, engineering agrees to spend significant time developing the air gap installer that more and more customers are requesting. The team develops a wishlist for everything they’d want:

  • Set up CI/CD and automated testing for all releases of the application itself in all supported environments.
  • Convert the ragtag collection of hard-to-maintain bash scripts used to collect diagnostic info into a CLI tool that can be delivered with the installer. Consolidate into a framework that allows field engineers to contribute to the list of information that gets collected. Stretch goal: Package the internal scripts used to analyze these log bundles for common errors into a tool that end customers can run in their own environment. (A minimal sketch of such a collector follows this list.)
  • Redesign around one architecture and install method, so the team can centralize and solutions architects working with customers don’t need to hack together strange custom configurations for specific customer environments.
  • Give customers the option to bring an external database instead of using a datastore embedded in the application - this should help address some of the catastrophic failures in storage and networking.
  • Offer snapshot and restore functionality that will work in the majority of customer environments - have a hunch that this will include SFTP, NFS, SAN, and maybe others. Need to do some discovery with the product team and several key customers to scope it out.
  • Automate scanning for CVEs in all code and enforce a policy of not shipping a release without patching all CVEs for which a patch is available.
  • Invest time in ensuring the build/test process for developers in local environments can be shortened from 10+ minutes to under 30 seconds.
  • Automate testing for all installer versions on a quickly growing multi-dimensional support matrix of OS versions, Kubernetes versions, add-ons, cloud providers, and other dimensions.
  • Build a specific “Area of Responsibility” for a product team to ensure they can support new versions of operating systems within 30 days of release.
  • Adopt an aggressive policy of deprecating old versions to reduce the total number of things that need to be maintained and patched.
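
As an illustration of the diagnostics item above, here is a minimal sketch of what that collector CLI might look like, again in Go. The kubectl invocations, file names, and bundle layout are assumptions made for this example, not any real product’s support-bundle format.

```go
// collect.go - illustrative sketch: replace ad hoc bash with one versioned binary
// that shells out to kubectl and writes a support bundle a customer can send back.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// collectors maps an output filename to the kubectl arguments whose output
// should be captured. Field engineers could extend this list over time.
var collectors = map[string][]string{
	"nodes.txt":   {"describe", "nodes"},
	"pods.txt":    {"get", "pods", "--all-namespaces", "-o", "wide"},
	"events.txt":  {"get", "events", "--all-namespaces", "--sort-by=.lastTimestamp"},
	"storage.txt": {"get", "pvc,pv", "--all-namespaces"},
	"ingress.txt": {"get", "ingress", "--all-namespaces"},
}

func main() {
	bundleDir := fmt.Sprintf("support-bundle-%s", time.Now().Format("20060102-150405"))
	if err := os.MkdirAll(bundleDir, 0o755); err != nil {
		fmt.Fprintln(os.Stderr, "create bundle dir:", err)
		os.Exit(1)
	}

	for file, args := range collectors {
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err != nil {
			// Keep going: a partial bundle is still more useful than none.
			out = append(out, []byte(fmt.Sprintf("\n(collection error: %v)\n", err))...)
		}
		if werr := os.WriteFile(filepath.Join(bundleDir, file), out, 0o644); werr != nil {
			fmt.Fprintln(os.Stderr, "write", file, ":", werr)
		}
	}

	fmt.Println("wrote diagnostics to", bundleDir, "- attach this directory (or a tarball of it) to the support ticket")
}
```

Even a small, versioned binary like this is easier to ship alongside the installer and extend over time than a pile of copy-pasted bash, which is the point of the wishlist item.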

Day 390 - The team is making progress, and even the lead engineers who built v1 are engaged again. A few improvements are made, and momentum is building, but there’s still so much to do. The most knowledgeable people are still getting pulled into many support escalations with currently deployed customers and new customers.

A long slow path to do updates
Updates are taking too long to deliver

Day 480 - The three-month sprint has now sprawled out to six months. With half the team still improving the build/test/distribute/support platform for on-prem installs, app feature development remains behind pace. Work on an air gap installer has not even reached the prototype phase. With half the backend team focused on infrastructure-flavored tasks, frontend engineers staffed to work on a SaaS application or other modernization efforts are consistently running out of things to do. What happens next:

Disillusioned and completely burned out, the two engineers who built v1 of the installer and have the deepest knowledge of the project leave to go join small startups founded by former colleagues. This sets the team back even further.

Some might read this cautionary tale and conclude that distributing their software to customer-managed on-prem K8s and private cloud environments simply isn’t worth the pain. But 80% of all software spending still goes to applications that aren’t pure SaaS, and most organizations now expect applications to be K8s-friendly. We’re also seeing a growing trend of applications boomeranging back from the cloud for reasons of security, compliance, performance, and cost.

There’s got to be a better way to solve the hard problems outlined above and still increase your addressable market! Ask us how Replicated can help you sell more, install faster, and efficiently support your customers.
