Monitoring linux stats with Prometheus.io

This is the first of two tutorials on monitoring machine metrics of your device fleet with Prometheus.io.

prometheus

Prometheus is a tool, initially built by soundcloud to monitor their servers, it is now open-source and completely community driven. It works by scraping “targets” which are endpoints that post key-attribute machine parseable data. Prometheus then stores each scrape as a frame in a time series database allowing you to query the database to execute graphs and other functions like alerts.

tl;dr This post runs through the configuration required to get prometheus.io running on a fleet of balena devices. Skip ahead if you’d like to go straight to deploying and running the demo.

Collecting

The first task is collecting the data we’d like to monitor and report it to a URL reachable by the Prometheus server. This is done by pluggable components which Prometheus calls exporters. We’re going to use a common exporter called the node_exporter which gathers Linux system stats like CPU, memory and disk usage. Here is a full list of the stats the node_exporter collects.

Once the exporter is running it’ll host the parseable data on port 9100, this is configurable by passing the flag -web.listen-address ":<PORT>" when spawning the exporter.

Once running visit https://<your-device-ip>:9100/metrics, you’ll see all the machine metrics from the node_exporter in plain-text.

Scraping

We now need to make sense of the data being collected that’s where the Prometheus server comes in. prometheus.yml holds the configuration that tells Prometheus how and which exporters to scrape. Let’s go run through it and explain the configurations.

The scrape_interval is the interval that Prometheus will scrape it’s targets which are exporter endpoints. This will control the granularity of the time-series database. We have set it to scrape every 5s this is for demo purposes, usually, you’d use something like the 60s.

global:
  scrape_interval: "5s"

Rule files specify a list of files from which rules are read, these rules and trigger webhooks to set alerts more on that a little later.

rule_files:
    - 'alert.rules'

The scrape_configs allow you to set targets for each scrape job. We will give it a job_name composed of the exporter name and a trimmed version of the RESIN_DEVICE_UUID we then add target endpoints which is the node_exporter server we mentioned in the previous section. Then we add some balena specific labels to make identifying the target easier.

scrape_configs:
  - job_name: "node"
    static_configs:
    - targets:
        - "localhost:9100"
      labels:
        resin_app: RESIN_APP_ID
        resin_device_uuid: RESIN_DEVICE_UUID

These labels are replaced at runtime with real values from balena environment using the config.sh.

Once the Prometheus server is running, activate and visit the device’s public URL.

On the prometheus dashboard, there won’t be much to start let’s add a graph. Select add graph and add 100 * (1 - avg by(instance)(irate(node_cpu{job='node',mode='idle'}[5m]))) to the expression input. This query will find the average % CPU between the last two data points going back as far as 5 minutes, if no data point before that exists. Now we are able to query scraped data, great!

Prometheus also comes with a pre-configured console for node_exporter jobs to view this visiting https://<your-device-url>/consoles/node.html. Voila!

Alerts

So now we have a way to collect data as well as query that data. Let’s set up some alerts, these will query the scraped instances and send us an email if those queries evaluate to an undesirable value.

The alerting is handled by a separate component, the alertmanager.

The first thing to do is create some rules that Prometheus can check after each scrape. We have defined several rules in alert.rules to get you started, let’s take a look at one of them.

ALERT cpu_threshold_exceeded
  IF (100 * (1 - avg by(job)(irate(node_cpu{mode='idle'}[5m])))) > THRESHOLD_CPU
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
    description = "This device's CPU usage has exceeded the threshold with a value of {{ $value }}.",
  }

Here we define the alert and pass it the query we mentioned earlier as an if statement. If the statement finds that the value exceeds our THRESHOLD_CPU it will trigger the alert. We have also added ANNOTATIONS to make the alert more descriptive.

We then need to configure the Alertmanager to handle the actions required when an alert is fired. The configuration can be found in alertmanager.yml. The Alertmanager isn’t a full SMTP server itself, however it can pass on emails to something like Gmail which can send them on our behalf.

route:
  group_by: [Alertname]
  # Send all notifications to me.
  receiver: email-me
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

When the first notification was sent, wait 'group_interval' to send a batch


of new alerts that started firing for that group.



  group_interval: 5m



If an alert has successfully been sent, wait 'repeat_interval' to


resend them.



  repeat_interval: 3h




templates:




'/etc/ALERTMANAGER_PATH/default.tmpl'




receivers:




name: email-me
email_configs:


to: GMAIL_ACCOUNT
from: GMAIL_ACCOUNT
smarthost: smtp.gmail.com:587
html: '{{ template "email.default.html" . }}'
auth_username: "GMAIL_ACCOUNT"
auth_identity: "GMAIL_ACCOUNT"
auth_password: "GMAIL_AUTH_TOKEN"

Here we catch all alerts and send the to the receiver email-me.
Take note of templates: - '/etc/ALERTMANAGER_PATH/default.tmpl' This overrides the default templates used for notifications with our own which you can see in default.tmpl.

Testing the alerts

Visit https://<your-device-url>/alerts, you’ll see a couple alerts we defined earlier in alert.rules The easiest one to test is the first, service_down, this alert is triggered when the targets (the instances prometheus server is scraping) drops to zero. Because we only have one target instance (node_exporter) we may simply kill that process to trigger the alert.

Using the web terminal run $ pkill node_exporter. If you then refresh https://<your-device-url>/alerts you’ll see the alert is firing. You will also get an email with the alert description and which device it is affecting.

Running the demo

Find the code here.

Provision you’re device(s) with balena
git clone https://github.com/balena-io-playground/balena-prometheus && cd balena-prometheus
git add remote balena <your-resin-app-endpoint>
git push balena master
Add the following variables as Application-wide environment variables

Key	Value	Default	Required
GMAIL_ACCOUNT	your Gmail email		*
GMAIL_AUTH_TOKEN	you Gmail password or auth token		*
THRESHOLD_CPU	max % of CPU in use	50
THRESHOLD_FS	min % of filesystem available	50
THRESHOLD_MEM	min MB of mem available	800
LOCAL_STORAGE_RETENTION	Period of data retention	360h0m0s

Coming up

This demo provides some basic fleet monitoring, but there it doesn’t provide a complete view of your entire fleets statistics in a single dashboard. In the next post we’ll connect all the devices to a central grafana dashboard for fleet-wide, as well as a device-specific views – stay tuned!

grafana

Any questions? – ask us on gitter!