AI Training Data Anonymization with Subsalt

At RepliCon, our quarterly virtual conference about all things software distribution, Ian Zink, Senior Developer Advocate at Replicated, chatted with Luke Segars, CTO and co-founder of Subsalt. Subsalt makes high-quality anonymized data available throughout your enterprise, ensures compliance with HIPAA and other data privacy laws, and unlocks the value of your regulated data. They use Replicated to offer an on-prem deployment model of their platform. The transcript of our interview is below. You can also view the recording here.

Ian: Thanks for being here, Luke.

Luke: Great to be here. Thanks for having me.

Ian: I always love to hear about the awesome things you're doing over at Subsalt. It's such a cool area you're in right now, right at that intersection of AI and data privacy. I just wanted to have you on to talk a little bit about what your company is and what it does, and then we'll get into how Replicated fits into all that. If you could tell us a little bit about what you're doing and how it all works, that would be awesome.

Luke: For sure. Thanks for having me. Like you said, I'm the CTO of Subsalt. My co-founders and I founded the company about 18 months ago, so we're still pretty early on. We de-identify sensitive data for healthcare institutions, financial institutions, and other groups who have a lot of useful data that is generally hard to use because it's sensitive. As a result, it ends up not getting used, or it takes six months to get access to it. Our product is a cloud-native piece of data infrastructure, and we've been Replicated clients for coming up on a year now. We use Replicated for all of our customer deployments.

We started as a managed SaaS offering, but it quickly became clear that we were going to need to deploy close to the data because in a lot of cases it can't leave customer VPCs. That's when we turned to Replicated and started working with you all, and honestly, tips like the ones you were just sharing really helped us get started and turn our previously cloud-hosted product into something we could actually deploy into customer environments.

Ian: Awesome. So, how big is your current team?

Luke: We are six people. Six people strong.

Ian: Oh wow. You've actually grown since the last time I asked you that question, I think! That's pretty awesome.

Luke: Oh yeah, we're blowing up. That's three technical people, including myself. When we originally did our first VPC deployment, we were two technical people, which honestly blew my mind that it was even possible. It would've been a really big challenge without having some tool to help us along the way.

Ian: When I think about your company and its size, and I think ‘they're shipping updates and doing all these things,’ it's actually really cool that you're able to, as such a tiny team, manage a SaaS product and an on-prem product and get those updates out to your customers.

Luke: It blew my mind. I'm actually pretty amazed as well, if I may say so. This worked really well, and honestly, we now promote the on-prem deployment model as much as, if not more than, our managed version. It just makes sense, especially in healthcare, to be able to deploy something close to the data, which is often pretty locked up. I would call it our primary deployment method now.

Ian: Oh, wow. That's really cool. Talk a little bit about how you deploy Replicated and what a customer engagement looks like. How you get it out there, and then how the data flows around.

Luke: We de-identify data sets. Medical records are a good conceptual model for what a relevant data set would be. What our product actually does is run a bunch of generative models on the underlying data sets.

We can learn patterns in, for example, a COVID testing data set, then get rid of the real data and produce what's called synthetic data that is extremely high fidelity and guaranteed to be de-identified under HIPAA in the case of healthcare, so teams can use it much more permissively. You don't have to be quite as strict about who can get access to it. It doesn't have to be just the short list of data scientists; it can actually be a bigger group within the company, or perhaps outside of the company, and we do all that with generative AI methods. We have a Helm-based deployment that we typically ship into customer environments for a proof of value.

This would be early-contract or even pre-contract in some cases. We're letting them install it, feel it out, run some updates, run some data through, and that's done with Replicated. And then, we can pretty easily flip their license when they convert to a non-trial license, a full-time license, at which point they can run updates and kind of manage the tool on their own.

We actually use a lot of the best practices you were just running through in terms of how we ship updates, probably because we learned them from you all. We have our CI/CD system, and whenever we push a new version of the product, it pushes those images out to Replicated’s container registry.

Then we notify customers, and they can just go into KOTS and click update, and it usually works pretty well. The first time we used it, we actually got explicit compliments from their DevOps team on how straightforward it was, which was great to hear. I wasn't expecting it, but they were outwardly complimentary about how straightforward it was.
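For teams setting up a similar release flow, here is a rough sketch of what that CI step could look like. The app slug, image name, channel, and credentials below are placeholders rather than Subsalt's actual configuration, and the exact registry login and CLI flags will depend on your Replicated account and the current docs.

```yaml
# Hypothetical CI job (GitHub Actions syntax); all names and slugs are placeholders.
name: release
on:
  push:
    tags: ["v*"]

jobs:
  release:
    runs-on: ubuntu-latest
    env:
      REPLICATED_APP: subsalt                                   # placeholder app slug
      REPLICATED_API_TOKEN: ${{ secrets.REPLICATED_API_TOKEN }} # vendor portal token
    steps:
      - uses: actions/checkout@v4

      # Build the application image and push it to Replicated's registry so
      # customer installs can pull it using their license.
      - name: Build and push image
        run: |
          docker login registry.replicated.com \
            -u "${{ secrets.REPLICATED_REGISTRY_USER }}" \
            -p "$REPLICATED_API_TOKEN"
          docker build -t registry.replicated.com/subsalt/app:"$GITHUB_REF_NAME" .
          docker push registry.replicated.com/subsalt/app:"$GITHUB_REF_NAME"

      # Cut a new release from the app manifests and promote it to a channel so
      # customers see the update in the KOTS admin console.
      # (Assumes the replicated CLI is already installed on the runner.)
      - name: Create Replicated release
        run: replicated release create --yaml-dir ./manifests --promote Unstable
```

Once a release is promoted, the "notify customers and click update in KOTS" step Luke describes picks up from there.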

Ian: Yeah, that's great. You mentioned license stuff a little bit. Can you talk a little bit about how you use license management? I think you've done some interesting things there as far as managing how your customers use features in the license management functionality.

Luke: Yeah, interesting. I don't know if these would fall into the best-practices category, but for us, a lot of the stuff you all have built into licenses made it very easy to convert from a SaaS-based product, built by a team that's just more familiar with building SaaS products, into something that has deployment-specific variables and some credentials to manage. How do we package all that up for 10 different customers who need 10 different sets of API keys, for example?

We use license fields for that. It's very easy for someone setting up an account within Subsalt: they get a series of text boxes to fill out, drop in the relevant credentials, and then they can more or less press publish. We're never exchanging sensitive keys with the customer's team at all. They just receive them and don't even have to worry about it. They get the license file, drag and drop it, and we can manage everything else from our side. It let us simplify a lot of that. We don't have to worry as much about - okay, there are six different API keys, how do we coordinate all of them? They can just grab a JSON file, move it to the right place, and then it just kind of works.
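A minimal sketch of that pattern, assuming a custom license field feeding a KOTS HelmChart manifest, might look like the following. The field name, chart name, and version are hypothetical, not Subsalt's actual setup.

```yaml
# Hypothetical KOTS HelmChart manifest; the chart details and the custom
# license field "partner_api_key" are placeholders.
apiVersion: kots.io/v1beta1
kind: HelmChart
metadata:
  name: subsalt
spec:
  chart:
    name: subsalt
    chartVersion: 1.2.3
  values:
    # Resolved at install/update time from the customer's license file, so the
    # vendor fills in the credential once and never has to hand keys to the
    # customer's team directly.
    partnerApiKey: 'repl{{ LicenseFieldValue "partner_api_key" }}'
```

Here, partner_api_key would be defined as a custom field on each customer's license in the vendor portal, so every install picks up its own credentials automatically.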

Ian: That's great! We have a couple minutes left, but I did have one question that I keep thinking about when I see Subsalt. All this stuff with ChatGPT and these large language models… what do you think about that, and how does it interact with on-prem? I just want some general thoughts, because I'm sure you've got a lot of thoughts on that.

Luke: Yeah. Well, I remember taking AI classes back in college, and I remember the list of things I thought wouldn’t be possible with AI in our lifetimes. That list is shattered at this point. It's actually pretty amazing how quickly LLMs in particular have advanced. We were just talking within the team earlier this week about how on-prem LLMs exist now. You can go online and just get one, which means we could consider integrating LLMs into our on-prem products. That wasn't something I saw coming at all; it's pretty amazing.

I was actually just talking to someone else on the team this morning about how a lot of the use cases that come out are conversational use cases. I think there's a ton of interesting stuff around data management and semantic understanding of data. How do you actually understand which of these fields are sensitive according to HIPAA? That's a really hard piece of software to write by hand, because it's not like ‘if the column name is social security number, then mark it as sensitive’. It's a very complicated gray-area sort of problem.

I actually think there are a lot of interesting non-conversational use cases for that kind of stuff, which is just taking gray areas and pulling structure out of them. That's what I personally am most excited about, as well as some of the developer experience stuff. You know, helping to write more maintainable code, automated code reviews, that kind of stuff, I think is going to be pretty interesting internally as well.

But in terms of advancing the product and bringing things like LLMs into customer environments, being able to do some of that behind-the-scenes, non-conversational stuff for us as a data infrastructure product - I think we haven't even started scratching the surface.

Ian: I hadn't thought about LLMs as a data anonymizer before, but that's actually a pretty exciting use case. It feels like our industry has hit a real pivotal moment here with AI and data, and it's really interesting to think about how it's going to impact everything.

Luke: Totally agree. And yeah… the way I think of what we do, and how I think of what you all do, is making things that were previously really hard a lot less hard. For us, it's a data access thing - questions have been pervasive in the data industry for a while around how to do things like metadata extraction.

How do you find which fields are sensitive? How do you find which combinations of fields might be sensitive - for example, this is sensitive only if there's a birthdate column somewhere else too? There's a bunch of fuzzy stuff like that which is really hard to codify. So I think we might be getting to a point where some of that can magically be done behind the scenes without throwing $25K at a bunch of GPUs and waiting two months.

Ian: We did have one question from the audience I wanted to get to, which was: how are you handling the need for GPU processing, if at all?

Luke: So, we definitely do need GPU processing. We did most of our GPU-related configuration work in Helm, and it has worked relatively seamlessly.

We typically have customers set up two node pools. We have some processes that use GPUs, like our data pipelines that actually do the machine learning. Then we have other processes, like our application servers, that are just CPU based. Within Helm we can set different resource requirements for each. From a Replicated perspective, we'll do some preflight checks to make sure the node pools exist and make sure we have access to GPUs. If we don't, we can call out that that needs to happen or the pipelines aren't going to run, so users don't have to dig through Kubernetes logs or something to figure out what's going wrong. They can know that before they get too far down the path. So far we've deployed on two of the three major clouds with our Replicated installer. It's worked pretty seamlessly, to be honest. Well, seamlessly once you get through the Helm stuff, but seamlessly from a KOTS installer perspective.
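To make that concrete, here is a hedged sketch of the two pieces Luke mentions: Helm values that split GPU pipelines from CPU-only app servers across two node pools, and a preflight check that confirms the GPU pool exists before the pipelines try to run. The node labels, resource sizes, and analyzer details are illustrative assumptions, not Subsalt's actual chart.

```yaml
# Illustrative Helm values: node-pool labels and resource sizes are assumptions.
pipelines:
  nodeSelector:
    node-pool: gpu                 # assumed label on the GPU node pool
  resources:
    limits:
      nvidia.com/gpu: 1            # schedules the ML pipelines onto GPU nodes
      memory: 16Gi
appServer:
  nodeSelector:
    node-pool: cpu                 # assumed label on the CPU-only node pool
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
---
# A preflight check in the same spirit (in practice a separate manifest), using
# the troubleshoot.sh nodeResources analyzer to verify the GPU node pool exists
# before installation proceeds.
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: gpu-node-pool
spec:
  analyzers:
    - nodeResources:
        checkName: GPU node pool exists
        filters:
          selector:
            matchLabel:
              node-pool: gpu       # assumed label, matching the values above
        outcomes:
          - fail:
              when: "count() < 1"
              message: No nodes labeled node-pool=gpu were found. The data pipelines need GPU nodes to run.
          - pass:
              message: GPU node pool detected.
```

Surfacing that failure at preflight time is what lets customers fix their cluster before an install, instead of discovering the problem later in Kubernetes logs.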

Ian: Luke, thanks so much for joining us today. I could always spend more time talking with you, but I'm sure we'll talk before too long about some of the other things we're working on together. So thanks so much for joining us.

Luke: Absolutely. Thanks for having me, and thanks for all you all do.

Interested in exploring how Replicated can help your company support an on-prem deployment model like Subsalt's? Request a demo anytime.