Prometheus Alerting

Observability has emerged front-and-center in the DevOps toolkit for optimizing organizational performance and delivering high-output, high-quality software. Observability methods and tools are evolving rapidly, from scraping and sampling to tracing and indexing. Observability solutions provide visibility into critical health and performance indicators in your technology stack, bringing monitoring, alerting, and reporting together into one management solution. But teams who primarily distribute apps into end-user environments face a new set of challenges.

When DevOps orgs deliver their code to their own servers (first-party), they know precisely how to approach their infrastructure to decipher system state and behavior and solve problems. They benefit from direct access to the monitoring stack, application SMEs, and metrics. Within minutes, a team member can check their Grafana dashboards or Splunk queries (tools they know intimately) and gain a deep understanding of the system.

The same simplicity does not extend to open-source software (OSS) and commercial off-the-shelf (COTS) applications, where end users deploy someone else's code into their own infrastructure. As an end user, you do not know the ins and outs of a third-party or community application's architecture, and you likely have no idea which logs, metrics, or traces to inspect to diagnose an issue.

A COTS maintainer won’t have access to the end user’s infrastructure because of the high security risk involved. Then there’s the effort. A COTS software provider can’t just SSH into foreign infrastructure to query process logs or play 20 questions with kubectl. Imagine approaching a COTS maintainer with an issue and saying, “Here’s my problem, why don’t you give me your public key? I’ll give you a VPN config, and you can access my network and tell me what’s wrong with my database instance.” Not going to happen.

So what can COTS maintainers do instead? With close to a hundred observability tools in the CNCF ecosystem across logging, tracing, and monitoring, there are tens of thousands of possible stacks an end-user might implement in their environment. That’s tens of thousands of different stacks that a COTS maintainer might need to support in their application to allow all their end-users to observe the application in their environment effectively.

As it stands, the current (painful) state of the art for enabling observability in end-user environments involves sharing a runbook and instructing end users on how to configure scraping, metrics, and thresholds for each service in the stack. The downsides of this more manual approach include:

  • Manual errors
  • More heavy lifting
  • More documentation
  • User resistance – some end-user teams opt out entirely, making troubleshooting harder for them and for your team if you get pulled in

You don’t want to spend your precious resources building manual observability processes. For end-users running tons of workloads on thousands of applications, manual configuration of observability stacks with runbooks won’t scale. The following guide will walk you through what we believe is the state of the art for a vendor/user partnership for monitoring COTS applications leveraging OSS cloud-native observability tools.

Prometheus Operator for End-to-End Kubernetes Cluster Monitoring

The following developer-focused section walks through an example project that allows Replicated users to integrate directly with KOTS and an embedded Prometheus monitoring stack to push vendor-defined alerts into an end customer’s monitoring and incident pipeline. Prometheus scrapes time series data as a source for generating alerts. Teams benefit from a multi-dimensional time series data model that enables visibility, metrics, and diagnostics to fix issues.

This guide covers some practical problem solving for the COTS observability issues, building on the foundation laid by the prometheus-operator and kube-prometheus projects. The example assumes we’re working in a cluster that already has kube-prometheus installed. If you’re shipping Kubernetes applications to end-users who don’t have a cluster yet, you can use kURL to automatically embed a fully functional kube-prometheus stack alongside your application. kURL, Replicated’s open-source Kubernetes distribution creator, features well-integrated support for Prometheus. You can find instructions for including the kURL Prometheus add-on in kURL’s installer specification for Metrics & Monitoring.
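As a sketch, an installer spec that includes the Prometheus add-on looks roughly like the following (the name and version pins here are illustrative; check the kURL add-on documentation for current options):

```yaml
apiVersion: cluster.kurl.sh/v1beta1
kind: Installer
metadata:
  name: my-app            # illustrative installer name
spec:
  kubernetes:
    version: "latest"
  containerd:
    version: "latest"
  prometheus:
    version: "latest"     # embeds the kube-prometheus stack
```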

Because this add-on is included by default, you can also quickly bootstrap a single-node cluster with kube-prometheus ready to go by spinning up a supported Linux VM and running:

curl | sudo bash

We will demonstrate the end-to-end integration between kURL, KOTS, and Prometheus which, when shipped together, can automatically configure scraping and alerting on app-level Prometheus metrics, without the end user needing to understand architecture or alertable thresholds. We’ll use a trivial application that exposes an example “temperature” metric and configure Prometheus to trigger an alert when the temperature goes above a certain threshold.

The Flaky App

We’ll monitor a flaky app called flaky-app. Most notably, this app exposes a temperature_celsius metric. When the value rises above 85, a warning alert should fire; above 90, a critical alert should fire.

The deployment and service can be found in ./monitoring/flaky-app.yaml. The Go source code can be found in ./cmd/flaky-app. The frontend is static HTML in bad_javascript.go (it certainly cannot be described as good JavaScript by any measure).
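For flavor, here’s a minimal sketch of how an app like this might serve a temperature gauge at /metrics in the Prometheus text exposition format, using only the standard library. The function names and the temperature value are illustrative, not the actual flaky-app source.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// formatMetrics renders a single gauge in the Prometheus text
// exposition format: HELP and TYPE comments, then the sample.
func formatMetrics(temp float64) string {
	return fmt.Sprintf(
		"# HELP temperature_celsius Current temperature reading.\n"+
			"# TYPE temperature_celsius gauge\n"+
			"temperature_celsius %g\n", temp)
}

// metricsHandler serves the current temperature at /metrics,
// the path Prometheus scrapes by default.
func metricsHandler(temp float64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, formatMetrics(temp))
	}
}

func main() {
	// Exercise the handler in-process rather than binding a real port.
	srv := httptest.NewServer(metricsHandler(87.5))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body))
}
```

The real application would store the temperature behind its API endpoints; the exposition format shown here is what Prometheus expects when it scrapes the target.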

If you want to try out the example, you can do so by running the following kURL installer script. During the installation process, you’ll be asked to upload a license file, which is available in the GitHub repo.

curl -sSL | sudo bash

For more information on how to install the application with kURL (embedded cluster), see the documentation.

Once the Kubernetes cluster is deployed, you can access the Kubernetes Off-the-Shelf (KOTS) application installer on port 8800. Once the application is deployed, you can access it on port 3000.

The application stores a single temperature value in memory. It has controls and API endpoints to modify the temperature up or down.

[Image: Prometheus alerting 1]

Changing the temperature will show a visual state change in the application, and we’ll explore how this affects the monitoring systems on the backend.

[Image: Prometheus alerting 2]

Most notably, the application will expose the value of this temperature at /metrics for Prometheus to pick up.

[Image: Prometheus alerting 3]

Monitoring Metrics

Part of the application is a ServiceMonitor custom resource for the Prometheus operator.
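The resource looks roughly like this sketch (the label selectors, port name, and scrape interval are assumptions; the real definition ships with the app):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flaky-app
  labels:
    release: prometheus    # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: flaky-app       # selects the flaky-app Service
  endpoints:
    - port: http           # named port on the Service
      path: /metrics
      interval: 30s
```

The Prometheus operator watches for ServiceMonitor resources and translates them into scrape jobs, which is why no manual edits to the Prometheus config are needed.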

When we go to our Prometheus instance (available on port 30900), we should see the Prometheus configuration updated with a scrape job for this service:

[Image: Prometheus alerting 5]

When this configuration is picked up, an additional Prometheus target should be available.

[Image: Prometheus alerting 6]

We can now graph the value of temperature_celsius over time using the graph viewer:

[Image: Prometheus alerting 7]

The same graph is also added to the Application Installer dashboard.

[Image: Prometheus alerting 7]

Temperature Alerting

To match our 85- and 90-degree alert thresholds, flaky-app defines two alert rules in ./monitoring/flaky-app-alertrules.yaml.
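A PrometheusRule resembling this sketch would encode the two thresholds (the alert names, `for` durations, and annotations here are illustrative; the actual rules live in ./monitoring/flaky-app-alertrules.yaml):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flaky-app
  labels:
    release: prometheus    # must match the operator's ruleSelector
spec:
  groups:
    - name: flaky-app.rules
      rules:
        - alert: TemperatureHigh
          expr: temperature_celsius > 85
          for: 1m                      # avoid flapping on brief spikes
          labels:
            severity: warning
          annotations:
            summary: Temperature above 85C
        - alert: TemperatureCritical
          expr: temperature_celsius > 90
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: Temperature above 90C
```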

When we browse to the Prometheus Alerts page, we should see both alerts in the Prometheus dashboard:

Navigating to AlertManager on port 30903, we can also see these alerts firing:

[Image: Prometheus alerting 9]

Configuring Alert Sinks

By default, no alert sinks are configured in AlertManager. In this project, we configure two: a webhook sink and an SMTP email sink.

To demonstrate this easily, the project includes an optional request bin container to capture and inspect webhook payloads, and a MailCatcher container to capture and inspect email alerts. We’ll take a look at these shortly. For now, the first step is to write our kots-config.yaml to display these alerting options to the user:
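A kots-config.yaml along these lines could surface the two sinks to the end user (the group, item names, and template functions below are illustrative, not the project’s exact config):

```yaml
apiVersion: kots.io/v1beta1
kind: Config
metadata:
  name: alert-config
spec:
  groups:
    - name: alerting
      title: Alerting
      items:
        - name: webhook_enabled
          title: Send alerts to a webhook
          type: bool
          default: "0"
        - name: webhook_url
          title: Webhook URL
          type: text
          # only show this field when the webhook sink is enabled
          when: '{{repl ConfigOptionEquals "webhook_enabled" "1"}}'
        - name: smtp_enabled
          title: Send alerts by email (SMTP)
          type: bool
          default: "0"
        - name: smtp_host
          title: SMTP host
          type: text
          when: '{{repl ConfigOptionEquals "smtp_enabled" "1"}}'
```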

[Image: Prometheus alerting 10]

[Image: Prometheus alerting 11]

[Image: Prometheus alerting 12]

To convert this config into something AlertManager understands, we deployed alertmanager-secret.yaml, which defines the routing and receiver configuration for AlertManager.

Note that this is deployed to the monitoring namespace and effectively patches the default AlertManager config that ships with kube-prometheus. The secret has two routes configured – one for the webhook and one for the SMTP MailCatcher. An alternative CRD-based AlertmanagerConfig method might work, but has not been tested.
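The alertmanager.yaml inside that secret might look roughly like this sketch (the receiver names and in-cluster service URLs are hypothetical stand-ins for the project’s actual values):

```yaml
route:
  receiver: "null"
  routes:
    - receiver: webhook
      continue: true    # keep matching so the email route also fires
    - receiver: email
receivers:
  - name: "null"
  - name: webhook
    webhook_configs:
      # hypothetical in-cluster request-bin service
      - url: http://requestbin.default.svc.cluster.local/capture
  - name: email
    email_configs:
      - to: alerts@example.com
        from: alertmanager@example.com
        # hypothetical in-cluster MailCatcher SMTP endpoint
        smarthost: mailcatcher.default.svc.cluster.local:1025
        require_tls: false
```

The `continue: true` on the first route is what lets a single alert fan out to both sinks instead of stopping at the first match.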

Once this is deployed, we can confirm the configuration in AlertManager (it can sometimes take up to 5 minutes for AlertManager to pick up new configuration changes):

[Image: Prometheus alerting 13]

To view the actual alerts, we can use the links in the flaky-app UI or the kotsadm dashboard:

[Image: Prometheus alerting 14]

In requestbin, we can see the alert payloads:

[Image: Prometheus alerting 15]

In mailcatcher, we can see the email alerts:

[Image: Prometheus alerting 16]


kube-prometheus empowers app maintainers to operationalize and scale their software distribution to end customers with a pragmatic, minimalist approach. In the above use case, we explored how Replicated’s OSS tools make it easy for organizations to build on top of core Prometheus primitives to take advantage of event stamping and alerting, whether you’re on-prem, in the multi-cloud, or even in an air-gapped data center (what we call “multi-prem”). There are plenty of other metrics you could start alerting on, such as the four golden signals:

  • Saturation: CPU usage, memory usage, disk I/O
  • Errors: e.g., HTTP 500s
  • Traffic: number of requests over time
  • Latency: response times
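As a starting point, here are hedged PromQL sketches for each signal (the metric names assume node-exporter and a conventional `http_requests_total` counter with a matching duration histogram, which your stack may name differently):

```promql
# Saturation: node CPU usage above 90% (node-exporter)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

# Errors: rate of HTTP 5xx responses
sum(rate(http_requests_total{code=~"5.."}[5m])) > 0

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Latency: 95th percentile response time from a histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```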

Click here to schedule a demo and discover how Replicated helps software vendors distribute their Kubernetes apps to the Enterprise.