End-to-End Prometheus Alerting Example with Replicated
Observability has emerged front and center in the DevOps toolkit for optimizing organizational performance and delivering high-output, high-quality software. Observability methods and tools are evolving rapidly, from scraping and sampling to tracing and indexing. Observability solutions provide visibility into critical health and performance indicators in your technology stack, bringing monitoring, alerting, and reporting together into one management solution. But teams who primarily distribute apps into end-user environments face a new set of challenges.
When DevOps orgs deliver their own code to their own servers (first-party), they know precisely how to approach their infrastructure to decipher system state and behavior and solve problems. They benefit from direct access to the monitoring stack, application SMEs, and metrics. Within minutes, a team member can consult their Grafana dashboards or Splunk queries, tools they know intimately, and gain a deep understanding of the system.
This simplicity does not hold for open-source software (OSS) and commercial off-the-shelf (COTS) applications, where end users deploy someone else's code into their own infrastructure. As an end user, you don't know the ins and outs of a third-party or community application's architecture, and you likely have no idea which logs, metrics, or traces to inspect to understand how to fix an issue.
A COTS maintainer won’t have access to the end user’s infrastructure because of the high-security risk involved. Then there’s the effort. A COTS software provider can’t just SSH into foreign infrastructure to query process logs or play 20 questions with kubectl. Imagine approaching a COTS maintainer with an issue and saying, “Here’s my problem, why don’t you give me your public key? I’ll give you a VPN config, and you can access my network and tell me what’s wrong with my database instance.” Not going to happen.
So what can COTS maintainers do instead? With close to a hundred observability tools in the CNCF ecosystem across logging, tracing, and monitoring, there are tens of thousands of possible stacks an end-user might implement in their environment. That’s tens of thousands of different stacks that a COTS maintainer might need to support in their application to allow all their end-users to observe the application in their environment effectively.
As it stands, the current (painful) state of the art for enabling observability in end-user environments involves sharing a runbook and instructing end users on how to configure scraping, metrics, and thresholds for each service in the stack. The downsides of this more manual approach include:
- Manual errors
- More heavy lifting
- More documentation
- User resistance – some end-user teams opt out entirely, which makes troubleshooting harder both for them and for your team when you get pulled in.
You don't want to spend your precious resources building manual observability processes. For end users running tons of workloads across thousands of applications, manual configuration of observability stacks via runbooks won't scale. The following guide walks through what we believe is the state of the art for a vendor/user partnership for monitoring off-the-shelf applications using OSS cloud-native observability tools.
Prometheus Operator for End-to-End Kubernetes Cluster Monitoring
The following developer-focused section walks through an example project that allows Replicated users to integrate directly with KOTS and an embedded Prometheus monitoring stack to push vendor-defined alerts into an end customer’s monitoring and incident pipeline. Prometheus scrapes time series data as a source for generating alerts. Teams benefit from a multi-dimensional time series data model that enables visibility, metrics, and diagnostics to fix issues.
This guide covers practical problem solving for these COTS observability issues, building on the foundation laid by the prometheus-operator and kube-prometheus projects. The example assumes we're working in a cluster that already has kube-prometheus installed. If you're shipping Kubernetes applications to end users who don't have a cluster yet, you can use kURL, Replicated's open-source Kubernetes distribution creator, to automatically embed a fully functional kube-prometheus stack alongside your application. kURL features well-integrated support for Prometheus; you can find instructions for including the Prometheus add-on in kURL's installer specification under Metrics & Monitoring.
Because this add-on is included by default, you can also quickly bootstrap a single-node cluster with kube-prometheus ready to go by spinning up a supported Linux VM and running:
```shell
curl https://k8s.kurl.sh/latest | sudo bash
```
We will demonstrate the end-to-end integration between kURL, KOTS, and Prometheus which, when shipped together, can automatically configure scraping and alerting on app-level Prometheus metrics, without the end user needing to understand architecture or alertable thresholds. We’ll use a trivial application that exposes an example “temperature” metric and configure Prometheus to trigger an alert when the temperature goes above a certain threshold.
The Flaky App
We'll monitor a flaky app called flaky-app. Most notably, this app exposes a metric named `temperature_celsius`. When this value rises above 85, a warning alert should fire; above 90, a critical alert should fire.
If you want to try out the example, you can do so by running the following kURL installer script. During the installation process, you'll be asked to upload a license file, which is available in the GitHub repo.
```shell
curl -sSL https://k8s.kurl.sh/monitoring | sudo bash
```
For more information on how to install the application with kURL (embedded cluster), see the documentation.
Once the Kubernetes cluster is deployed, you can access the Kubernetes Off-The-Shelf (KOTS) application installer on port 8800. Once the application is deployed, you can access it on port 3000.
The application stores a single temperature value in memory. It has controls and API endpoints to modify the temperature up or down.
Changing the temperature will show a visual state change in the application, and we’ll explore how this affects the monitoring systems on the backend.
Most notably, the application will expose the value of this temperature at `/metrics` for Prometheus to pick up.
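For reference, a gauge exposed at `/metrics` looks roughly like this in the Prometheus exposition format (the HELP text and sample value here are illustrative):

```
# HELP temperature_celsius Current temperature reported by flaky-app
# TYPE temperature_celsius gauge
temperature_celsius 72
```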
Part of the application is a ServiceMonitor custom resource for the Prometheus operator.
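A ServiceMonitor for this app might look roughly like the following sketch; the names, labels, and port are assumptions rather than the repo's exact manifest, and the label selectors must match both your app's Service and the `serviceMonitorSelector` of your Prometheus custom resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flaky-app
  labels:
    app: flaky-app          # illustrative; must satisfy the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: flaky-app        # must match the labels on the app's Service
  endpoints:
    - port: http            # must match a *named* port on the Service
      path: /metrics
      interval: 15s
```

The prometheus-operator watches for ServiceMonitor resources and regenerates the Prometheus scrape configuration automatically, which is what removes the need for end users to hand-edit scrape jobs.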
When we go to our Prometheus instance (available on port 30900), we should see the Prometheus configuration updated with a scrape job for this service:
When this configuration is picked up, an additional Prometheus target should be available.
We can now graph the value of `temperature_celsius` over time using the graph viewer:
The same graph is also added to the Application Installer dashboard.
To add alerting rules matching our 85 and 90 degree thresholds, the flaky-app defines two alert rules in ./monitoring/flaky-app-alertrules.yaml.
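As a sketch, a PrometheusRule implementing those two thresholds might look like this; the alert names, `for` durations, and labels are illustrative (the actual definitions live in ./monitoring/flaky-app-alertrules.yaml), and the `prometheus: k8s` / `role: alert-rules` labels are kube-prometheus's default rule selector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flaky-app-alertrules
  labels:
    prometheus: k8s       # default ruleSelector labels in kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: flaky-app.rules
      rules:
        - alert: TemperatureWarning
          expr: temperature_celsius > 85
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "flaky-app temperature above 85C"
        - alert: TemperatureCritical
          expr: temperature_celsius > 90
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "flaky-app temperature above 90C"
```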
When we browse to Prometheus Alerts we should see two alerts in the Prometheus dashboard:
Navigating to port `:30903`, we can also see this alert firing in AlertManager:
Configuring Alert Sinks
By default, no alert receivers are configured in AlertManager. In this project, we also configure two alert sinks: a webhook and SMTP email.
To demonstrate this easily, the project includes an optional request bin container to capture and inspect webhook payloads, and a MailCatcher container to capture and inspect email alerts. We’ll take a look at these shortly. For now, the first step is to write our kots-config.yaml to display these alerting options to the user:
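A sketch of what that KOTS config screen could look like follows; the group and item names here are illustrative, not the repo's exact file:

```yaml
apiVersion: kots.io/v1beta1
kind: Config
metadata:
  name: flaky-app-config
spec:
  groups:
    - name: alerting
      title: Alerting
      items:
        - name: webhook_enabled
          title: Send alerts to a webhook
          type: bool
          default: "0"
        - name: webhook_url
          title: Webhook URL
          type: text
          # only show this field when the webhook sink is enabled
          when: '{{repl ConfigOptionEquals "webhook_enabled" "1"}}'
        - name: smtp_enabled
          title: Send alerts via SMTP email
          type: bool
          default: "0"
        - name: smtp_host
          title: SMTP host
          type: text
          when: '{{repl ConfigOptionEquals "smtp_enabled" "1"}}'
```

KOTS renders these items as a form in the admin console, and their values can then be templated into downstream manifests such as the AlertManager secret.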
To convert this config into something AlertManager understands, we deploy alertmanager-secret.yaml, which configures the alerting routes for AlertManager.
Note that this secret is deployed to the monitoring namespace and effectively patches the default AlertManager config that ships with kube-prometheus. It has two routes configured: one for the webhook and one for the SMTP MailCatcher. An alternative CRD-based AlertmanagerConfig method might work but has not been tested.
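The AlertManager configuration carried by that secret might look roughly like the following; in stock kube-prometheus it lives in the `alertmanager-main` secret under the key `alertmanager.yaml`, and the receiver names, service URLs, and addresses below are assumptions (in the real project they are templated from the KOTS config values):

```yaml
route:
  receiver: "null"              # kube-prometheus's default catch-all receiver
  routes:
    - receiver: webhook
      matchers:
        - severity =~ "warning|critical"
      continue: true            # keep matching so the email route also fires
    - receiver: email
      matchers:
        - severity =~ "warning|critical"
receivers:
  - name: "null"
  - name: webhook
    webhook_configs:
      # hypothetical in-cluster URL for the request bin container
      - url: http://requestbin.default.svc.cluster.local/alerts
  - name: email
    email_configs:
      - to: alerts@example.com
        from: alertmanager@example.com
        # hypothetical in-cluster address for the MailCatcher SMTP port
        smarthost: mailcatcher.default.svc.cluster.local:1025
        require_tls: false
```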
Once this is deployed, we can confirm the configuration in AlertManager (it can sometimes take up to 5 minutes for AlertManager to pick up new configuration changes):
To view the actual alerts we can use the links from the flaky-app UI, or the kotsadm dashboard:
In requestbin, we can see the alert payloads:
In mailcatcher, we can see the email alerts:
kube-prometheus empowers app maintainers to operationalize and scale their software distribution to end customers with a pragmatic, minimalist approach. In the use case above, we explored how Replicated's OSS tools make it easy for organizations to build on top of core Prometheus primitives to take advantage of event stamping and alerting, whether you're on-prem, in the multi-cloud, or even in an air-gapped data center (what we call "multi-prem"). There are plenty of other metrics you could start alerting on, closely mirroring the four "golden signals" of monitoring:
- Saturation: CPU usage, memory usage, disk I/O
- Errors: failed requests, e.g., HTTP 500s
- Traffic: number of requests over time
- Latency: response times
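As a sketch, rules for these signals might look like the following; the metric names assume node-exporter is present and that the app follows the common `http_requests_total` / `http_request_duration_seconds` instrumentation conventions, and all thresholds are illustrative:

```yaml
groups:
  - name: golden-signals
    rules:
      # Saturation: average CPU busy fraction across all cores
      - alert: HighCPUSaturation
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 10m
        labels:
          severity: warning
      # Errors: more than 5% of requests returning HTTP 5xx
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      # Traffic: request rate drops near zero
      - alert: TrafficDrop
        expr: sum(rate(http_requests_total[5m])) < 1
        for: 15m
        labels:
          severity: warning
      # Latency: 99th-percentile response time above 1 second
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
```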
Click here to schedule a demo and discover how Replicated helps software vendors distribute their Kubernetes apps to the Enterprise.