Monitoring the edge with Prometheus pt. 1

Learn how team balena uses Prometheus to monitor devices at the edge, and how you can deploy something similar.

A method of self-similar monitoring for the edge.

Properly monitoring a fleet of devices is an evolving art. One of the current leaders in the server world for application and hardware monitoring is Prometheus, both for bare metal and as a first-class citizen in the Kubernetes world. To reduce the friction between the edge and the cloud, this project will deploy a Prometheus stack to monitor an entire fleet of balenaCloud devices (from a balena device, no less!).

We have showcased Prometheus a few other times, and this tutorial expands on those to provide a fair bit more functionality.

This demo is the first part of a series on how to monitor your stack & fleet with Prometheus, covering everything from service discovery to instrumentation to alerting. Stay tuned for future updates!

The finished product:

[Screenshot: node_exporter metrics visualized in a Grafana dashboard]

Goals

Here are our goals with this tutorial:

  1. Prometheus monitoring stack monitoring a discovered fleet (from the fleet, no less!)
  2. Integrated Grafana for visualization also deployed to balena device(s)
  3. Service discovery mechanism to automatically detect new devices
  4. Basic machine monitoring deployed to a device using one open source exporter to expose service metrics

  Note: for this tutorial, we originally limited ourselves to one open source exporter (exporters are services that expose metrics for Prometheus to ingest) for simplicity’s sake. See the addendum below for deploying any number of exporters!

Requirements

  1. Two applications, one to do the monitoring (let us call this monitor) and one to be monitored (call this application thingy)
  2. This design is especially powerful if the application is multicontainer, though it need not be.

Set up

monitor application

  1. The monitoring stack can be deployed to many different locations. In this example, we will deploy it to balenaCloud and run it on a device within the fleet. Sign up for free if you don’t already have an account.

  2. Start by creating an application to deploy to. For the sake of the demo, let us call it monitor.

  3. Since the service discovery is configured purely via environment variables, we will want to preset a few to ensure our monitoring starts up without a hitch.

  4. Generate an API key and save it in your application as an environment variable named API_KEY (a CLI sketch for setting these variables follows this list).

  5. If you plan to monitor remotely (i.e. via the public URLs), set an environment variable USE_PUBLIC_URLS to true.

  6. Next, clone the example repository and push it to your newly-created application using balena push.

  7. If you now enable the public URL for the device(s) running the monitor application and navigate to it, you should be able to view your very own Grafana instance.

    • Note: the default username/password is admin/admin; we recommend changing it as soon as you log in for the first time (Grafana will prompt you).
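
For steps 4 through 6, a minimal CLI sketch is shown below. It assumes the balena CLI is installed and logged in; newer CLI versions use --fleet where older ones use --application, and the repository URL and directory name here are placeholders for the example repository linked above.

    # Set the service discovery variables on the monitor application
    balena env add API_KEY <your-api-key> --application monitor
    balena env add USE_PUBLIC_URLS true --application monitor

    # Clone the example repository and push it to the application
    git clone <example-repository-url>   # placeholder; use the repository linked above
    cd <example-repository>              # placeholder directory name
    balena push monitor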

thingy application

  1. If you do not already have an application running that you would like to instrument, you can create a new application for demo purposes. Let us call this application thingy.
  2. At a bare minimum, to get the most from your device you will want to run node_exporter, which exports machine metrics like packet counters and memory usage. We will use this exporter to show how to configure and scrape a device, but there are many other useful exporters that may interest you as well:

    • MQTT exporter
    • Redis exporter
    • OpenVPN exporter
    • PostgreSQL exporter

    Scan the list of available exporters for any other open-source code you may be running. If an exporter exists for your preferred database/message queue/application, it is always good practice to track it. Since there are many pre-baked exporters and dashboards, you can monitor almost everything you did not write with minimal setup. The real power comes when instrumenting your own code; more on that in another post!

  3. Using our node_exporter example, add your exporter to your docker-compose.yml so it runs on each device and exposes its metrics (a minimal compose sketch follows this list). If you are not using multicontainer mode, you can just daemonize the node_exporter process as part of your single-container application.

  4. If using public URLs, ensure that the public URLs are enabled for the devices you want to monitor.

  5. Find or create a dashboard in Grafana to visualize what you need from the data you are now collecting (make sure the datasource type is Prometheus!).

  6. Drop the dashboard JSON blob into the grafana/dashboards directory, following the node_exporter example in the repository.
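
For step 3, a minimal docker-compose.yml sketch is below. The service names, image tag, and port mapping are assumptions to adapt to your own project; node_exporter listens on 9100 by default, and mapping it to port 80 lets the balenaCloud public URL reach it.

    version: '2.1'
    services:
      my-service:
        build: ./my-service     # your existing application service (placeholder)
      node_exporter:
        image: prom/node-exporter
        restart: always
        ports:
          # expose node_exporter's default port (9100) on port 80 so it can be
          # scraped via the device's public URL
          - "80:9100"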

Multi-exporter setup

In order to configure multiple exporters, such as node_exporter and perhaps blackbox_exporter (to monitor your own backend remotely), you can use our example project called the meta-exporter. Simply add the meta-exporter docker-compose.yml setup to your project and copy over the source code. You will need to set the META_EXPORTER_PORT environment variable to whatever single port you plan to expose (port 80 in the case that you are using public URLs to scrape), as well as a comma-separated list of the {{port}}/{{endpoint}} pairs for the exporters you have configured. In the example setup, both node_exporter and blackbox_exporter have been configured to export data about the device’s OS as well as the balenaCloud backend, and you can find a few of these configurations in the meta-exporter example ready to adapt. This setup is quite powerful in terms of collecting lots of debugging information!
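
As an illustration, with node_exporter on its default 9100/metrics and blackbox_exporter on its default 9115/probe, the variables could be set roughly as follows. EXPORTERS is a placeholder name for the list variable; check the meta-exporter source for the variable name it actually reads.

    balena env add META_EXPORTER_PORT 80 --application thingy
    # "EXPORTERS" is a placeholder; substitute the variable the meta-exporter expects
    balena env add EXPORTERS "9100/metrics,9115/probe" --application thingy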

Upon completion, baletheus (the service discovery component included in the example repository) should log a message letting you know it is updating the registry of devices:

[Screenshot: baletheus log output]
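
Under the hood, this registry is simply a list of scrape targets that Prometheus picks up. Assuming baletheus writes a file-based target list (check the example repository for the exact path and the labels it attaches), the Prometheus side is a standard file_sd_configs job, roughly:

    scrape_configs:
      - job_name: 'balena-devices'
        file_sd_configs:
          - files:
              # placeholder path; point this at wherever baletheus writes its registry
              - /targets/devices.json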

Bonus points

The real power of PromQL (Prometheus’ query language) comes when filtering by labels, which are metadata attached to different timeseries. Since baletheus by default exposes a bevy of labels to Prometheus, it is trivial to begin dissecting your data by commit, OS version, or device type. This feature will allow you to track changes side-by-side and be more confident than ever when promoting a new OS version or code release to production.
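
As an illustration, the PromQL sketch below compares CPU usage across OS versions using standard node_exporter metrics. The os_version label name is an assumption about what baletheus attaches; substitute whichever labels appear in your own setup.

    # average non-idle CPU usage across all devices sharing an OS version
    avg by (os_version) (
      rate(node_cpu_seconds_total{mode!="idle"}[5m])
    )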

At this point, you should be able to monitor any number of (single) exporters and create beautiful graphs and visualizations for those devices/exporters/applications. This tutorial is just the tip of the iceberg; Grafana and Prometheus both have incredibly active communities and are evolving every day, and there is plenty more worth investigating beyond the scope of this tutorial.

Final notes

Grafana and Prometheus are both fairly robust, resource-intensive applications. While it is possible to deploy a full monitoring stack following the instructions above, if you have any data retention requirements we recommend either streaming the timeseries to a persistent backend or deploying the stack directly in the cloud for any production deployment. Prometheus makes use of persistent storage (which can shorten the life of some media like SD cards), while Grafana can be configured almost entirely up front through provisioning.

This tutorial has been adjusted to make Grafana as lightweight as possible to run on an edge device. Since this tutorial attempts to minimize disk writes, upon every subsequent deploy the admin password will need to be reset.
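
If re-entering the password after every deploy becomes tiresome, one option is to set Grafana’s standard GF_SECURITY_ADMIN_PASSWORD environment variable on the application so the admin password is applied at startup rather than stored on disk (same --application/--fleet caveat as above); for example:

    balena env add GF_SECURITY_ADMIN_PASSWORD <a-strong-password> --application monitor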

Alternatively, feel free to configure a more persistent storage medium. One of the niceties of a pull-based monitoring system is that you can redeploy the same stack in multiple places without reconfiguring the clients, saving the headache of changing the whole fleet. Tell us how you monitor your own stack & fleet in the forums!

Pictures are worth more

[Diagram: data flow for baletheus]

Glossary

Prometheus:
Pull-based monitoring system and time series database

Grafana:
Visualization platform for time series data

Alertmanager:
Companion component of the Prometheus project that handles alerting

Service discovery:
Supported mechanism to add new scrape targets to the Prometheus backend

Sidecar:
Process that runs alongside an application, aggregating data and exporting it when scraped by Prometheus

Exporter:
Sidecar process that runs alongside an application and returns metrics describing the state of the application


Notable Replies

  1. fbinky says:

    Andrew, this is very cool, but it is very out of date. For example “application” → “fleet.” Also, the docker images you are looking for are not found (removing the version numbers seems to help). But then there’s a compatibility problem with go/prometheus. Is there an officially supported baletheus?

  2. @fbinky – Thanks for the note! Great catch on this being a bit of an outdated project. Sorry that you’ve run into this bit of friction.

    Is there an officially supported baletheus?

    I dig that name haha. We actually haven’t officially turned it into a block, so a few folks are investigating how we can update it and turn it into an official block now. We’ll keep you informed. Thanks again for the great feedback.

Continue the discussion at forums.balena.io
