Announcing: Customer Data Export Improvements

Dex Horthy
Nov 15, 2023

Today, we’re excited to announce new and improved ways to export and explore data about your customers and their instances. 

We've heard from customers that they want instance data in the hands of their analysts so that they can combine it with other data, such as CRM records, and build custom analyses and reporting. We are introducing three ways to do that.

  1. CSV -- for doing quick analyses, like creating a pivot table on how many instances are running per app version
  2. Instance Export Endpoint -- for constructing repeatable reporting, e.g. extracting configuration details daily and using them to drive a Tableau dashboard
  3. Bulk events export -- for the team that wants to analyze time series data, like “instances on each Kubernetes version over time” or “time to install by customer cohorts”

Why we built Data Export

In working with the 100+ software vendors who use Replicated to deliver their customer-hosted software, we’ve found that teams working at scale need good data to make the right decisions. Exposing this data in the Vendor Portal via several collection and reporting features was a great start, but it wasn’t enough once teams hit 20-30 customers and beyond. These teams use data about their customers’ instances and usage to drive decisions at the product, sales, and strategic levels, and many software vendors wanted to consume this information from more centralized places than the Vendor Portal:

  1. In a CRM system like Salesforce or Gainsight
  2. In a Data warehouse like Redshift, Snowflake, or BigQuery
  3. In a BI tool like Looker, Tableau, or PowerBI 

At the core, we wanted to deliver features that would allow analysts, analytics engineers, data engineers, and others to make the most of this data. With new options for CSV Instances Export, JSON Instances Export, and Bulk Event Export, vendors can now review data in the Vendor Portal or export it via APIs or CSVs into any other system. We hope that in making this usage data available, we’ll enable your team to make better decisions about where to focus efforts across product, sales, engineering, and customer success.

The Instances CSV

While some teams have been using the existing export methods for years, we wanted to address several shortcomings in the current methods:

  1. The Customers CSV lacks many of the details that we can now provide since delivering the enhanced Instance Detail page
  2. The Customers CSV only delivers one row per customer, so when a customer has multiple instances, instance-specific fields like app status and app version have to be aggregated arbitrarily. For example, the Customers CSV shows the app version of the most recently checked-in instance, which may or may not be ideal for a specific use case.

The new Instances CSV addresses both of these by adding a number of useful columns and delivering one row per instance, so that you and your team can decide if and how you want to aggregate data for a customer.

Once you have it, you can process that CSV or move it into your tool of choice, such as Google Sheets.

Some notable new data points here are:

  1. Improved accuracy of cloud provider reporting, plus reporting for cloud provider region
  2. Reporting of KOTS version and Kubernetes Version
  3. Fields like customer_created_at, instance_first_seen_at, and instance_first_ready_at for analyzing install timelines
  4. JSON payloads for Custom Metrics and Entitlements expanded into independent columns

For a full list of columns with data definitions, see: Export Customer and Instance Data.
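
If you would rather script the quick analyses described above than build them in a spreadsheet, something like the following could be a starting point. It is only a sketch: the sample rows are made up, `customer_name` and `app_version` are assumed column names (check the docs for the real ones), and the timestamp format is assumed to be ISO 8601. Only `instance_first_seen_at` and `instance_first_ready_at` come from the field list above.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up sample standing in for a downloaded Instances CSV.
# customer_name and app_version are ASSUMED column names.
SAMPLE_CSV = """customer_name,app_version,instance_first_seen_at,instance_first_ready_at
Acme,1.2.0,2023-11-01T10:00:00Z,2023-11-01T10:30:00Z
Acme,1.3.0,2023-11-02T09:00:00Z,2023-11-02T11:00:00Z
Globex,1.3.0,2023-11-03T08:00:00Z,
"""


def parse_ts(value):
    """Parse an ISO 8601 timestamp (assumed format); None for empty cells."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def instances_per_version(rows):
    """Pivot-table equivalent: number of instances per app version."""
    return Counter(row["app_version"] for row in rows)


def minutes_to_ready(rows):
    """Minutes between first check-in and first ready state, per instance."""
    durations = []
    for row in rows:
        seen = parse_ts(row["instance_first_seen_at"])
        ready = parse_ts(row["instance_first_ready_at"])
        if seen and ready:  # skip instances that never became ready
            durations.append((ready - seen).total_seconds() / 60)
    return durations


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
version_counts = instances_per_version(rows)
ready_minutes = minutes_to_ready(rows)
print(dict(version_counts))
print(ready_minutes)
```

Because the CSV has one row per instance, the aggregation choice (count, latest, average) stays in your hands rather than being baked into the export.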

The Instance JSON Endpoint

While CSVs provide a simple standard for export, we also know that for some teams, CSV management can be complex. JSON APIs offer benefits like typed data and OpenAPI schemas, and generally tend to be easier to consume and parse. This is true for both simple scripts and for workflow orchestrators like Airflow or Meltano.

We identified multiple issues with the existing JSON methods for exporting data:

  1. There are multiple possible ways to get similar data, but no go-to endpoint optimized for export
  2. Existing endpoints all include some amount of noise that bloats payloads and makes them harder to work with
  3. Some existing endpoints lack sane defaults and controls for filtering out inactive instances or archived customers

To that end, we’re publishing a new endpoint for exporting instances data as JSON.

JSON export is further documented at Export Customer and Instance Data in our docs site.
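
Once you have a JSON payload in hand, a common next step is to drop inactive instances and flatten nested objects into columns before loading them into a warehouse. The sketch below shows one way to do that. The response body and every field name in it (`instance_id`, `is_active`, `custom_metrics`, `entitlements`) are invented for illustration and are not the actual schema; refer to the docs for the real field names.

```python
import json

# INVENTED response body for illustration only; not the real schema.
SAMPLE_RESPONSE = """
[
  {"instance_id": "abc", "is_active": true,
   "custom_metrics": {"seats_used": 12}, "entitlements": {"max_seats": 50}},
  {"instance_id": "def", "is_active": false,
   "custom_metrics": {}, "entitlements": {"max_seats": 10}}
]
"""


def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


instances = json.loads(SAMPLE_RESPONSE)
# Keep only active instances (is_active is a hypothetical field).
active_rows = [flatten(item) for item in instances if item.get("is_active")]
print(active_rows)
```

Dotted keys such as `custom_metrics.seats_used` map cleanly onto column names in most SQL databases and BI tools, though you may prefer a different separator.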

Bulk Exporting events

Knowing the state of every instance is valuable, but we find that analysts and analytics engineers are also trying to answer a lot of questions that revolve around knowing the history of an instance. For example:

  1. Based on historical trends, what % of instances could be expected to experience downtime in a given week? Is this getting better or worse over time?
  2. What % of trial licenses install successfully? How long does it take for trial licenses to be installed? What % of those convert to paid licenses? How long does conversion take? How does this differ across different cohorts of customers?
  3. Across all issued customer licenses, how many installation attempts are in-progress? How many were started but have now been abandoned?

By analyzing historical time-series data, vendors can:

  1. Identify trends and potential problem areas before they become overwhelming
  2. Demonstrate the impact and success of recent product, process, or tooling initiatives
  3. Empower teams to understand how they’re doing and apply all of their creative thinking to solving the core problems in on-premise software

While some of these time-series style views could be “hacked” by querying the current state of all instances regularly and snapshotting the changes, we wanted to provide first-class support for understanding the history of instance upgrades. 

To that end, we’re publishing a new endpoint that allows for fetching events for all instances and customers.


From this data, any historical / time series views can be constructed. The data can be filtered by date range, event type, and more.
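
As a rough illustration of how a view like “instances on each Kubernetes version over time” might be built from events, here is a small sketch. The event shape used here, a `(date, instance_id, new_version)` tuple, is a simplification we made up for the example; real events carry more fields and different names, so treat this as a pattern rather than a drop-in script.

```python
from collections import Counter

# MADE-UP events: (date, instance_id, new Kubernetes version).
# ISO dates are used so that string ordering matches time ordering.
EVENTS = [
    ("2023-11-01", "a", "1.26"),
    ("2023-11-01", "b", "1.26"),
    ("2023-11-03", "a", "1.27"),
]


def versions_over_time(events, dates):
    """For each date, count instances by their latest known version."""
    ordered = sorted(events)
    current = {}  # instance_id -> latest version seen so far
    series = {}
    idx = 0
    for day in sorted(dates):
        # Replay every event that happened on or before this day.
        while idx < len(ordered) and ordered[idx][0] <= day:
            _, instance_id, version = ordered[idx]
            current[instance_id] = version
            idx += 1
        series[day] = dict(Counter(current.values()))
    return series


series = versions_over_time(EVENTS, ["2023-11-01", "2023-11-02", "2023-11-03"])
for day, counts in series.items():
    print(day, counts)
```

Replaying events in order and carrying the last known value forward is what fills in days with no events, which is the part that is awkward to reproduce with periodic snapshots.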

Tying it all Together

To demo this functionality and what’s possible with it, we’ve open-sourced a few Example Analytics Notebooks showing some analyses that can be done with the Replicated data. These include a Kaplan-Meier analysis of time-to-install, an example of using Meltano to load CSV data into a SQL database and query it back out, and some time-series analysis of Kubernetes version adoption. We hope this serves as inspiration and guidance as you explore these features!

Next Steps

We’re looking for feedback on this functionality. If you’d like to be a design partner, please log a feature request. While API and CSV exports are available today, we’ll continue improving the integration points that enable you to move this data into the systems where you need it. We’re also actively developing functionality for enabling telemetry collection from air-gapped environments, and will aim to include Data Export in that work. If you’d like to be an alpha tester for air-gapped telemetry, let us know.

Want to learn more about what Replicated does to help vendors distribute software to self-hosted environments? We would love to show you -- click here to schedule a demo.