Fleet-wide Machine Metrics Monitoring in 20 Minutes

Great! Our clickbait title worked! But seriously, stick around: this tutorial will get you a Prometheus + Grafana monitoring setup for resin devices in no time.

A couple of weeks ago we kicked the tires of Prometheus to see if its monitoring capabilities could be extended beyond stock servers to the real world of dispersed IoT devices. We created a resin application that ran the node exporter (collects Linux stats), the Prometheus server (scrapes node exporter data) and the Alert Manager (sends notifications when rules are broken).

Our previous attempt was great because all the monitoring logic ran on the device, so there was no need to set up a central server or database. But it was limiting for the same reason: you couldn't see your entire fleet's statistics in one place.

For a detailed explanation of what we did last time read the blog post.

Revisiting the project, I made some significant changes, with significantly better outcomes.

What are we building?

This time, we are building a machine metrics monitoring mechanism for resin.io devices. Each device will host its machine metrics via its resin-enabled web URL.

A central Prometheus server will then use the resin API to discover these devices and scrape their metrics. A Grafana frontend will query the Prometheus server's data and display two dashboards: a "Fleet view" and a "Single device view".

The best part is that this is all containerized so setup will be (almost) automatic.

The Device / Server Split

The app is split into two parts: a device portion and a server portion. The device portion is only responsible for running the node_exporter and exposing it via its resin web URL. The server portion is responsible for running the Prometheus server and the Alert Manager, as well as two new services: Discovery and Grafana.

The device portion is deployed via resin.io, while the server portion can be run on your local machine or hosted on a remote server. For full instructions on deployment, skip ahead.
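To make the device side concrete: node_exporter serves its readings as plain text, one sample per line, and that is what the Prometheus server fetches on each scrape. The sketch below parses a few such lines in Node.js. The sample values and the `example_metric` line are made up for illustration, and the parser is deliberately minimal (it does not unescape label values or handle special values such as `+Inf`), so treat it as a way to read the format rather than a replacement for a real client library.

```javascript
// Sample of the plain-text format a device serves (values are made up;
// `example_metric` is a hypothetical metric used only to show labels).
const sample = [
  '# HELP node_load1 1m load average.',
  '# TYPE node_load1 gauge',
  'node_load1 0.21',
  'example_metric{device="abc123",mode="idle"} 42.5',
].join('\n');

// Parse one line into { name, labels, value }.
// Returns null for comments, blank lines and anything that does not match.
function parseLine(line) {
  const trimmed = line.trim();
  if (trimmed === '' || trimmed.startsWith('#')) {
    return null;
  }
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(trimmed);
  if (!match) {
    return null;
  }
  const labels = {};
  const labelText = match[2] || '';
  const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  let pair;
  while ((pair = labelPattern.exec(labelText)) !== null) {
    labels[pair[1]] = pair[2]; // escaped characters are kept as-is
  }
  return { name: match[1], labels: labels, value: Number(match[3]) };
}

// Parse a whole scrape body, dropping comment and blank lines.
function parseMetrics(text) {
  return text.split('\n').map(parseLine).filter(function (s) {
    return s !== null;
  });
}

const samples = parseMetrics(sample);
console.log(samples);
```

Each parsed sample carries a metric name, an optional set of labels and a numeric value, which is all the server needs to store a time series per device.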

Discovery

Discovery is a custom Node.js service that uses the resin API to keep the Prometheus server aware of all of your resin devices.

It works by using the resin Node.js SDK to poll the resin API for devices belonging to the configured application and then formats and writes the results to a JSON file that is readable by the Prometheus server.

resin.auth.login(credentials, function(error) {
  if (error != null) {
    throw error;
  }

  console.log("Successfully authenticated with resin API");
  setInterval(function() {
    resin.models.device.getAllByApplication(process.env.RESIN_APP_NAME).then(function(devices) {
      // format array and save it as json file
      saveJson(_.map(devices, format));
    }).catch(function(err) {
      // log the failure and keep polling on the next interval
      console.error("Device discovery failed:", err);
    });
  }, process.env.DISCOVERY_INTERVAL);
});

The Prometheus server watches this JSON file and updates its targets accordingly. Prometheus therefore stays in sync with resin.io to within one DISCOVERY_INTERVAL, which is set to 30000 ms (30 seconds) by default.
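The formatting and saving helpers are not shown in the snippet above, so here is a minimal sketch of what they might look like. Prometheus file-based discovery expects a JSON array of target groups, each with a `targets` list and optional `labels`. Everything else here is an assumption made for illustration: the helper names `toTargetGroup` and `writeTargetsFile`, the `uuid` and `name` device fields, the `<uuid>.resindevice.io` host pattern, the port and the label names. Check them against your SDK version and device URLs before relying on them.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// ASSUMPTION: device web URLs follow <uuid>.resindevice.io and the exporter
// is reachable on port 80. Adjust both to match your own devices.
const HOST_SUFFIX = '.resindevice.io';
const EXPORTER_PORT = 80;

// Hypothetical helper: turn one device record into a file-discovery target group.
// The `uuid` and `name` fields and the label names are illustrative.
function toTargetGroup(device) {
  return {
    targets: [device.uuid + HOST_SUFFIX + ':' + EXPORTER_PORT],
    labels: {
      resin_device_uuid: device.uuid,
      resin_device_name: device.name,
    },
  };
}

// Hypothetical helper: write the target groups to a temporary file, then
// rename it into place, so a reader is less likely to see a half-written file.
function writeTargetsFile(targetGroups, filePath) {
  const tmpPath = filePath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(targetGroups, null, 2));
  fs.renameSync(tmpPath, filePath);
}

// Usage with a made-up device list, written to the OS temp directory.
const devices = [{ uuid: 'abc123', name: 'kitchen-pi' }];
const outFile = path.join(os.tmpdir(), 'resin-targets.json');
writeTargetsFile(devices.map(toTargetGroup), outFile);
console.log(fs.readFileSync(outFile, 'utf8'));
```

Writing to a temporary file and renaming it is a design choice rather than a requirement: it keeps the watched file valid JSON at every moment, which matters when another process may re-read it at any time.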

All the discovered targets are viewable by visiting <prometheus-server-ip>:80/targets.

[Image: all_devices]
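If you would rather check target health from code than from the targets page, one option is to query the built-in `up` series, which Prometheus records as 1 for a target whose last scrape succeeded and 0 otherwise. The sketch below assumes your Prometheus version exposes the `/api/v1/query` HTTP endpoint. The host name `prometheus.example`, the helper names and the hand-written response are placeholders, and no network call is made here.

```javascript
// Build the instant-query URL for a PromQL expression.
// ASSUMPTION: the server exposes the v1 HTTP API at /api/v1/query.
function buildQueryUrl(baseUrl, expr) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', expr);
  return url;
}

// Reduce an instant-query response for `up` to { instance, healthy } pairs.
function summariseUp(response) {
  if (response.status !== 'success') {
    throw new Error('Prometheus query failed');
  }
  return response.data.result.map(function (series) {
    return {
      instance: series.metric.instance,
      // each value is a [timestamp, "string value"] pair
      healthy: series.value[1] === '1',
    };
  });
}

// Hand-written example in the shape of an instant-query response (made-up data).
const exampleResponse = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      { metric: { __name__: 'up', instance: 'abc123.resindevice.io:80' }, value: [1460000000, '1'] },
      { metric: { __name__: 'up', instance: 'def456.resindevice.io:80' }, value: [1460000000, '0'] },
    ],
  },
};

console.log(buildQueryUrl('http://prometheus.example', 'up').href);
console.log(summariseUp(exampleResponse));
```

In a real check you would fetch the built URL, parse the JSON body and pass it to `summariseUp`. Any device reported as unhealthy is one that Prometheus knows about but could not scrape.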

Grafana

We also added Grafana, a popular dashboarding tool with a handy Prometheus data source plugin. The great thing about Grafana is that it is very easily configurable. Below we automatically load the Prometheus data source on startup via the API:

curl 'http://admin:admin@127.0.0.1:3000/api/datasources' -X POST -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"name":"Prometheus","type":"prometheus","url":"http://localhost:80","access":"proxy","isDefault":true}'  

And then point Grafana to a directory of JSON files with pre-made dashboards in grafana.ini:

[dashboards.json]
enabled = true  
path = /var/lib/grafana/dashboards  

I've created two basic dashboards: all_devices and single_device. The all_devices dashboard gives you a quick summary of the entire fleet for common metrics like CPU, memory and disk usage.

[Image: all_devices dashboard]

The single_device dashboard gives you a more detailed view of the same metrics using Grafana's templating feature.

[Image: single_device dashboard]

Running the App

Deploying the device portion

  1. Provision your device(s) with resin.io
  2. git clone git@github.com:resin-io-projects/resin-prometheus-device.git && cd resin-prometheus-device
  3. git remote add resin <your-resin-app-endpoint>
  4. git push resin master
  5. Enable your device's resin web URL

Of course, you'd typically run an actual application alongside the node_exporter. To do this, just add your app's logic to start.sh:

echo "Your application code should go here!"

# Start the node exporter
cd /etc/node_exporter-$NODE_EXPORTER_VERSION.$DIST_ARCH \
  && ./node_exporter -web.listen-address ":80"

Running the server portion

As mentioned, this can be run locally or on a remote server.

  1. git clone git@github.com:resin-io-projects/resin-prometheus-server.git && cd resin-prometheus-server
  2. Add the required environment variables in the Dockerfile or at runtime.
  3. Optional: if you'd like persistent Grafana storage, run: docker run -d -v /var/lib/grafana --name grafana-storage busybox:latest (and add --volumes-from grafana-storage to the docker run command in step 5)
  4. docker build -t prometheus .
  5. docker run -t -i -p 80:80 -p 3000:3000 --name resinMonitor prometheus

Once both the device-side and server-side code are running, you should be presented with two dashboards as well as an alert management system that will alert when any rules are broken.

No login is required to view the dashboards, but if you'd like to edit them you can use the default Grafana login:

user: admin  
password: admin  

Once you have made your dashboard changes, export them as JSON, save them to the /dashboards folder alongside the existing ones, then rebuild the container.


To Summarise

We have set up a basic machine metrics monitoring solution for IoT devices in just a few commands. Resin allowed us to deploy the same code (the node_exporter) to multiple devices, as well as giving us a web-accessible URL for each device. At the same time, the resin API allowed us to easily integrate those devices with third-party services like Prometheus and Grafana.

Prometheus gave us some super extendable services (check out all their others) and Grafana allowed us to automatically set up and customise visualisations. Safe to say, these three work really well together.

I hope to keep improving this project over time. If you'd like to help, please peruse the issues and submit a PR if you get the feels. As always, if you have any questions, ask me @craig-mulligan on resin chat.
