At RepliCon Q4 2022, one of Replicated’s Senior Product Managers, Dex Horthy, sat down with Dilan Orrino, DJ Mountney, and Jason Plum from GitLab. In this interview, they chatted about how the GitLab Enterprise team thinks of excellence in software distribution, dug into what they learned creating a product that powers the greatest companies in the world, and shared their favorite sandwiches. This interview is available as a video on our YouTube channel, and highlights from the interview are below.
Dex: Welcome, Jason, DJ, and Dilan. I would love to just have each person on the panel quickly say your name, what you work on at GitLab, and what you like to do for fun. I will also accept what is the best sandwich.
Jason: Hi everybody. I’m a Staff Distribution Engineer at GitLab. I lead the technical team in architecture and design decisions for how we produce consumable methods of deploying GitLab.
That's about as simple as I can make it, because what I want to point out is that people think, “oh, that's relatively simple.” It might be just DevOps or infrastructure as code, but we are talking about the difference between one person on Community Edition with a Raspberry Pi and gitlab.com with 10 million users, and the same code doing all of it.
Whether we're talking about the large number of customers we have with 5,000-plus users, or the insane number of customers we have with between 10 and 500, what we do has to work for all of them.
I’m an avid bowler and I also play some card games. I'm also going to answer the sandwich question - and that's a proper Cuban.
DJ: I’m the Distribution Build Engineering Manager at GitLab. I've been at GitLab about the same amount of time as Jason. We've been on the distribution team for a little over six years now working here at GitLab on distributing our product to customers of all sizes.
I'm based in Victoria, BC Canada. In terms of my favorite sandwich, I'm pretty simple in that regard. I just like a peanut butter and honey sandwich.
Dilan: I'm the Senior Product Manager here at GitLab for distribution. I work with Jason and DJ regularly to plan out what the future of software distribution looks like, and to keep sensing what our customers need, from installation to configuration all the way through to continued success with their instance. This is a challenge because we’re building a tool for everyone from the large enterprise that is fairly well versed in managing an application like GitLab to a small team of users who have a lot going on and can't necessarily dedicate somebody to just managing what we have to offer.
I’m based out here in North Carolina, I keep busy by working on the house and I’ve got a couple old cars that I like to work on. I'll answer the sandwich question as well, since everyone else did. I think a nice BLT with some fresh ingredients is always a winner.
Dex: Thank you, I’m Dex, I spent the last four years running our Customer Engineering team. As of last week, I’ve transitioned to a Product Manager on the vendor experience side. I think the best sandwich is a classic Italian hero with six different kinds of meats and cheeses and covered in vinegar and hot peppers.
Let's get into it. I know GitLab has a unique approach to how you solve the problem of distributing software into customer environments, and that you have an entire team and product group with product managers dedicated to the infrastructure that allows you to deliver out into customer environments. How is that team structured and when did it become clear that you wanted to invest in that and put an entire engineering team behind that practice?
DJ: First I'll touch on what the team is really briefly; Jason will probably go into more detail in a little bit. The distribution team exists in its current form to treat the installation and upgrade of GitLab as a product feature of GitLab that we want to work as well as possible.
So, the team is hyper-focused on delivering based on that. In terms of how we determined that it needed to exist, why does this team exist? At GitLab it was pretty organic, and the end goal of productizing wasn't the initial goal.
It’s a pretty old team at GitLab - GitLab had very few developers when it was formed. So, those developers were handling everything, including deploying to infrastructure for the .com product. But we had large customers that were self-managed, so the support team built a tool to assist with that.
The GitLab support team did that, and because we had so few developers to maintain that tool, it made sense to reduce the number of tools we were using, so we adopted that same tool for deploying to gitlab.com. As GitLab grew, we hired more people into our infrastructure department, and more folks in development and in support who specialized in those areas. But we still had this tool that had been created and was necessary, and a separate team came into existence to maintain that tooling.
Dex: It's interesting, because you look at the way that a lot of people consume GitLab and you can see this makes sense, given how open source is woven into the fabric of how y'all do everything. It should be available to pick up off the shelf and run wherever.
But it sounds like in the early days it actually came from almost a commercial drive where you have these big customers that want to run this way, and so you’re going to get some engineers on it and see if you can make that happen. Is that right?
Jason: Yeah. It was largely that, and earlier versions of GitLab meant you had to install it from source, even if you were an enterprise customer. That is doable until the application gets to a certain size in terms of complexity and components. Distribution as it is today actually came from a team that was once called Build. Our job was to build a product that was actually consumable by a customer. (Everyone is a customer to our team - whether it's gitlab.com, our dedicated product that we've just made public, or the self-managed instances, whether they be in the cloud or on-prem, everybody consumes exactly the same thing.)
Build's job was “make this thing, put all the components together and make it easy and replicable to actually consume for any number of customers at any scale”. With our distribution today, we focus largely on that same original goal, but now we're really focusing on that flexibility and scale capability.
Dex: Getting into flexibility and scaling, I understand y'all do a lot in terms of incorporating DevOps and SRE principles that are typically pointed at SaaS environments. How does the GitLab team incorporate those ideas and translate them into a world where you are managing and building software that you're not going to have direct access to?
Jason: So that's a fun one, right? That's where the flexibility I spoke about earlier comes into play. There are some components that don't run well in Kubernetes because they are highly dynamic based on user load. There are some components that run really well in Kubernetes, and everybody wants to do it in Kubernetes, but there's some understanding that you need to have of how the applications function. When we are talking about our traditional product - Omnibus GitLab - it's a package that you install: apt-get install, et cetera, right? That has everything you need to actually stand up and run it on the distribution that you're choosing to use, and it'll work. And you can technically take one that you've installed on RHEL 8 and set up a secondary node on Debian 7 (9 now), and you can link those together and they work just fine, right?
So the outcome is that this is consistent no matter how you're doing it, where you're doing it, or what your particular choice of operating system is. Whether that's “I want to have business support, so I'm using RHEL 8,” or “I'm okay with handling my own infrastructure and I'm gonna use Debian 9 by choice.” Trust me, I like Debian, but I also understand the need for business support in going with something like Red Hat or Alma.
When it comes to dealing with scale, we honestly never know what the customer's behaviors are going to be. We have to make some recommendations for them architecture-wise, but we have to accept that we also can't make those decisions for them. We don't get to enforce specific decisions on how they do things. I don't get to dictate that you always use AWS RDS for your database, because there's no such thing on-prem, and you may not live in a country that has access to AWS.
Dex: So does that present challenges in how you help your users maximize their chance for success? How do you maximize the chance that when they get this thing off the shelf and try to install it - whether they're using RDS or Azure or some MySQL that they stood up themselves - it’s going to go well?
DJ: Yeah, we have reference architectures that we publish, which are heavily tested and operate at several different scales that customers can reference. The key thing is, in our software development style, the tools that our team is working on are out there in the open. And because it’s the installation software, it’s also the most exposed to the sysadmins and the folks setting up the software. So we get a lot of engagement, where the community and our customers are able to help us make the product configurable for their environments.
And so we have those reference architectures, but a big part of it is allowing them to re-engage with us. A lot of our architectural work is actually around “how do we make customized configuration fit best into our installation product without making things unmanageable?”
Dex: Shifting gears a little bit to the product side of the conversation, Dilan, I know we talked for all of 30 seconds about metrics and making data-driven decisions around how you build and enhance the product that is the distribution vessel for all of GitLab. What sorts of things do you measure and how does that help you make decisions as a product and engineering team about where to focus your efforts?
Dilan: The main performance indicator that the distribution team is working against is our upgrade rate. I know that we've talked about installing on a lot of different types of distributions and making it work in a lot of custom environments - we're tracking those as well - but the number of places [GitLab] can run isn’t a great metric to track.
We're working on upgrade rate, which is the percentage of instances on the latest three versions of GitLab.
So that, I would say, is our big starting point, along with all the variables that come with it. If we have a good upgrade rate, it means, one, it's easy to install, because we have a lot of users using it, and it's easy to continue to use, maintain, and stay up to date, because GitLab releases every month. I don't know if it's already been said, but that is its own challenge; it allows us to be incredibly agile and release features all the time. If our customers are not constantly upgrading, they may not be able to access the latest things we're doing.
And maybe even more important is security. We currently only maintain the latest three versions from a patch release perspective, for security reasons - DJ or Jason can clarify what that means. What I'm getting at is that upgrade rate is important in many facets.
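The metric Dilan describes can be sketched in a few lines. This is a hypothetical illustration, not GitLab's actual telemetry code: the version strings, the fleet data, and the `SUPPORTED` set of "latest three" minor releases are all made up for the example.

```python
# Hypothetical sketch of the "upgrade rate" metric described above:
# the percentage of instances running one of the latest three minor
# versions. Versions and fleet data are invented for illustration.

SUPPORTED = {"15.6", "15.5", "15.4"}  # example "latest three" minor releases

def minor(version: str) -> str:
    """Reduce a full version like '15.5.2' to its minor series '15.5'."""
    major, minor_part, *_ = version.split(".")
    return f"{major}.{minor_part}"

def upgrade_rate(instance_versions: list[str]) -> float:
    """Percentage of instances on a supported (latest-three) minor version."""
    on_supported = sum(1 for v in instance_versions if minor(v) in SUPPORTED)
    return 100.0 * on_supported / len(instance_versions)

fleet = ["15.6.1", "15.4.0", "14.10.5", "15.5.2", "13.12.15"]
print(f"{upgrade_rate(fleet):.0f}%")  # 3 of 5 instances are current
```

Segmenting the same calculation by instance size or install method gives the follow-on questions Dilan raises below.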
So if we start there on upgrade rate, there are a lot of extra things that come with that. We can distill it down from upgrade rate to, okay, “what types of distributions are upgrading the best?” and “what size of instances are upgrading?” “Is a 10-person GitLab instance upgrading more frequently, or better, than a 10,000-user one?”
So we try to make decisions: should we be building tools for these larger enterprises, or for a startup with a few folks working there? And I guess that's just scratching the surface. From there, we try to see which milestones were troublesome.
So this is a good example: we do have required upgrades. Jason mentioned we want you to be able to go from one milestone to another. Sometimes it's unavoidable that you can't jump straight from 14.1 to 15.3, because in between there's not only a major release that we've done, but also a minor release where we've implemented a required upgrade.
Again, these are sometimes unavoidable based on features that we'd like to add in other product areas. But what we've found is that they are troublesome for customers, and they have a large impact on our upgrade rate. So I'm just kind of walking down a product decision tree here. And now we're like, hey, we need to build a tool, or we need to figure out these required upgrades or get rid of them.
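The "required stops" problem Dilan is walking through amounts to a small path calculation: given where a customer is and where they want to go, which mandatory intermediate versions must they pass through? The stop list below is invented for illustration; GitLab's real upgrade path is documented separately and differs.

```python
# Hypothetical sketch of planning an upgrade path through required stops.
# REQUIRED_STOPS is an invented example list, not GitLab's actual path.

REQUIRED_STOPS = ["14.3", "14.9", "14.10", "15.0"]  # ordered example stops

def as_tuple(version: str) -> tuple[int, int]:
    """Parse '14.10' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def upgrade_path(current: str, target: str) -> list[str]:
    """Every required stop strictly between current and target, then target."""
    lo, hi = as_tuple(current), as_tuple(target)
    stops = [s for s in REQUIRED_STOPS if lo < as_tuple(s) < hi]
    return stops + [target]

print(upgrade_path("14.1", "15.3"))
# → ['14.3', '14.9', '14.10', '15.0', '15.3']
```

The point of the sketch is that a "jump" from 14.1 to 15.3 is really several sequential upgrades, which is exactly why unannounced required stops hurt the upgrade rate.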
Somebody in support posted that we’ve been getting a lot of support requests saying it’s difficult to upgrade, and it usually has to do with these required upgrades. The required upgrades we’ve had over the years are getting more and more frequent, so we don’t want to do that anymore.
The full fix, “end required upgrades,” is much more challenging. So the MVC, which we work with a lot from a product perspective at GitLab - our minimal viable change - is simply “let’s take the unexpectedness out of a required upgrade.” Sometimes we just have to put one in without a lot of notice, or maybe no notice, because we want to keep the features rolling. An easy change is just “let’s schedule them now.”
In 16.0 we’re discussing: let’s announce a required upgrade halfway through, or maybe two spaced out, so that they’re not unexpected and a customer can plan their upgrades around them.
Dex: Then you have the breaking features funneled into those mid-release milestones instead of having them popping up randomly throughout the process as we need to ship new product.
It's really fascinating, that idea of upgrades being the core of the most important thing that y'all track - how fast you can get people onto new versions - because I'm sure as a PM it's important to you to be able to ship product, and we could get into the interplay of agile and DevOps and the definition of done.
What good is a feature if you ship it and only 10% of your customers can even use it because they're on old versions? But the other thing - and I want to hear a little bit from Jason on this - is that idea that any instance over a year old is just a support nightmare waiting to happen. If they try to upgrade and jump through four versions, that's gonna be a problem. And if they're not upgrading and you go to fix something, you're carrying the debt of your support staff and your team needing to know about all these old things and how they worked in 10 different flavors.
But Jason, I know you are really close to end customers, and while being a staff engineer working on architecture, you've also spent a lot of time in the field. I would love to hear a little bit about how the team cultivates this sort of DevOps mindset of sharing knowledge between the field roles and the engineering roles - making sure that the people who are building are out in the field seeing it work, and the people who are out in the field seeing it work are brought in to understand and empowered to improve the reliability of the software.
Jason: So let me preface this real quick. When it comes to why we care so much about the UX of upgrades: we want you to be secure and to make sure you're on the latest patch, just so you're getting the best functionality and the best features and, most importantly, so that you are secure. But here's the thing: if your customers can't upgrade because they're afraid to, what are they doing with your product?
Your greatest value is a retained customer. That being said, our team does work with customers. It's actually a KPI: everybody on the team joins at least one customer call per quarter, because we want them to see the perspective that the customers have in the way that they're doing things.
Because our team needs multiple perspectives on how things work to be able to do what we do at the scale we do it, I tend to refer to this as four viewpoints. You have the people that are individually managing a component, right? These are the people that are maybe the rabbits living under the trees in the forest.
And then you have the people that are managing the runtime of that application or that application as a deployment, right? Or even interconnected within the application. These are the people that are caring about a particular grove of trees. They might see the rabbits, but they're not gonna go chasing them.
You've got one level up from that, which is the traditional system admin that cares about the platform, right? You're caring about your cluster and how the application's behaving in that. You don't necessarily even need to care about the particular application until you start hearing problems coming from that grove and you'll never see the individual rabbits.
There's one more perspective - even higher than that - which is the platform engineers, the people that have to manage company-wide tasks, right? The tools that build the tools that make the things that work. Those are the people that are standing on the mountain looking at multiple views, multiple forests.
They generally don't care about any particular infestation of rabbits. They care that the forest looks healthy, right? So you have the perspectives of the architects, the engineers, the developers, the bug hunters. You've got all of these perspectives that add value to what we produce. Because for a customer to enjoy the product, no matter how many features it has, it has to be reliable, it has to be easy to maintain, and it's gotta be easy for whoever has to do the work to know where to go find the work.
So we regularly involve our support engineers: give us feedback, we'll support you, you support us. Our solutions architects - we provide them with information about what their architecture should look like. Are customers living up to that? Is there gonna be a strain based on what the user patterns actually are?
Because you can design the best system in the world, but it's used by humans, so something is gonna be different than you expect. You can say “this can handle 3,000 requests per second,” and that handles all 3,000 of your users just fine. Then you attach CI and fleet management, and it goes from 3,000 RPS to 10,000 RPS, and your system's trying to carry the load, right? We have these perspectives all the way through, and when we have a problem because a component is misbehaving, we go and look, and then we'll call in the experts from the teams developing that component. We'll be like, “Look, here's what we're seeing. This seems a little odd. Can you help us understand what this behavior is?” A network issue, a disk I/O issue - these kinds of things are easy to trace on a virtual machine. In Kubernetes? Whoa, we've got a different ballgame.
Dex: We are technically at time. I wanna thank the GitLab folks. Thank you so much for joining!