
Device Diagnostics

Getting Started

The balenaCloud Dashboard can run a set of diagnostics on a device to determine its current condition. In most cases, this should be the first step in diagnosing an issue, before accessing the device via SSH. Examining diagnostics and health checks first gives you a good idea of the state a device is in before you SSH into it, and ensures that the information can be accessed later if required (should a device end up in a catastrophic state). This helps greatly in a support post-mortem, should one be required.

Currently, the diagnostics feature is only available via the Dashboard.

To run device diagnostics through the balenaCloud dashboard, head to the Diagnostics tab in the sidebar and click the Run checks button to start the tests.

Read more about other diagnostics you can run after checking your device:

  1. Device Diagnostics
  2. Supervisor State

Check Descriptions

As part of the diagnostics suite, you will find a group of checks that can be collectively run on-device. Below is a description of each check, what it means, and how to triage a failure.

A check in this context is defined as a function that returns a result (good/bad status plus some descriptive text), whereas a command is simply a data collection tool without any filtering or logic built in. Checks are intended to be used by everyone, while typically command output is used by support/subject matter experts and power users.

diagnostics

Diagnostics are split into three separate sections: Device health checks, Device diagnostics and Supervisor state.

check_balenaOS

Summary

This check confirms that the device is running balenaOS 2.x or later. It also confirms that the OS release has not since been removed from production.

As of May 1, 2019, balenaOS 1.x has been deprecated. These OSes are now unsupported. For more information, read our blog post: https://www.balena.io/blog/all-good-things-come-to-an-end-including-balenaos-1-x/.

Triage

Upgrade your device to the latest balenaOS 2.x (contact support if running 1.x).

Depends on

Parts of this check depend on a fully functional networking stack (see check_networking).

check_under_voltage

Summary

This check looks for under-voltage messages in the kernel log. Often seen on Raspberry Pi devices, these messages indicate that the power supply is insufficient for the device and any peripherals that might be attached. These errors often precede seemingly erratic behavior.

Triage

Replace the power supply with a known-good supply (supplying at least 5V / >2.5A).
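As a rough illustration of what this check looks for, the sketch below counts under-voltage messages in kernel log text read from standard input. The function name count_under_voltage is hypothetical, and the matched phrase "Under-voltage detected" is an assumption based on what Raspberry Pi kernels typically log; the actual check may match different text.

```shell
# Hypothetical helper: count kernel log lines that report under-voltage.
# Assumes the kernel logs the phrase "Under-voltage detected"
# (matched case-insensitively); other boards may log something else.
count_under_voltage() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -ci 'under-voltage detected' || true
}

# Usage on a device (not run here):
#   dmesg | count_under_voltage
```

A non-zero count suggests that the power supply is worth replacing before investigating other symptoms.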

check_memory

Summary

This check confirms that a device's memory usage is below a given threshold (set to 90% at the moment). Oversubscribed memory can lead to OOM events (learn more about the out-of-memory killer here).

Triage

Using a tool like top, scan the process table for which process(es) are consuming the most memory (%VSZ) and check for memory leaks in those services.
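To see roughly how a memory threshold like this could be evaluated by hand, the sketch below computes used memory as a percentage from a meminfo-style file. The function names are hypothetical, and the formula (MemTotal minus MemAvailable) is an assumption; the diagnostics suite may calculate usage differently.

```shell
# Hypothetical helper: print used memory as an integer percentage.
# Reads a meminfo-style file (defaults to /proc/meminfo) and assumes it
# contains MemTotal and MemAvailable lines, as on modern Linux kernels.
mem_used_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Hypothetical threshold test: succeeds while usage is below the limit
# (second argument, 90 by default to mirror the threshold described above).
mem_below_threshold() {
    [ "$(mem_used_pct "$1")" -lt "${2:-90}" ]
}

# Usage on a device (not run here):
#   mem_used_pct
#   mem_below_threshold "" 90 && echo "memory OK"
```

Passing the file path as an argument keeps the helper easy to try against a saved copy of /proc/meminfo from a diagnostics snapshot.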

check_container_engine

Summary

This check confirms the container engine is up and healthy. Additionally, this check confirms that there have been no unclean restarts of the engine. These restarts could be caused by crashlooping. The container engine is an integral part of the balenaCloud pipeline.

Triage

It is best to let balena's support team take a look before restarting the container engine. At the very least, take a diagnostics snapshot before restarting anything.

check_supervisor

Summary

This check confirms the Supervisor is up and healthy. The Supervisor is an integral part of the balenaCloud pipeline. The Supervisor depends on the container engine being healthy (see check_container_engine). There is also a check to confirm the running Supervisor is a released version, and that the Supervisor is running the intended release from the API.

Triage

It is best to let balena's support team take a look before restarting the Supervisor. At the very least, take a diagnostics snapshot before restarting anything.

check_localdisk

Summary

This check combines a few metrics about the local storage media and reports back any potential issues.

test_disk_space

Summary

This test confirms that a device's disk utilization is below a given threshold (set to 90% at the moment). If a local disk fills up, there are often knock-on issues in the Supervisor and release containers.

Triage

Run du -a /mnt/data/docker | sort -nr | head -10 in the hostOS shell to list the ten largest files and directories. Large files in /mnt/data/docker/containers often indicate a leak in a container that can be cleaned up (runaway logs, too much local data, etc). Further info can be found in the Device Debugging Masterclass.
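Before hunting for large files, it can help to confirm how full the data partition actually is. The sketch below extracts the utilization percentage from df -P output. The function name df_used_pct is hypothetical, and it assumes the POSIX output format in which the fifth column is the capacity.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the
# utilization percentage (without the % sign) of the first filesystem listed.
# Assumes the POSIX format, where line 2 holds the data and column 5 is capacity.
df_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage on a device (not run here):
#   df -P /mnt/data | df_used_pct
```

A value at or above 90 would correspond to the threshold this test currently uses.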

test_write_latency

Summary

This test compares each partition's average write latency to a predefined target (1s). There are some caveats to this test that are worth considering. Since it attempts to categorize a distribution with a point sample, the reported sample size should always be considered. Smaller sample sizes are prone to fluctuations that do not necessarily indicate failure. Additionally, the metric sampled is merely the number of writes disregarding the size of each write, which again may be noisy with few samples. Writes come primarily from application workloads and less often from operating system operations. For more information, see the relevant kernel documentation.
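The averaging described above can be approximated from the kernel's cumulative I/O counters. The sketch below is an assumption about how such a figure might be derived, not the suite's actual implementation: it divides the time spent writing by the number of completed writes for each device in /proc/diskstats-style input, and prints the write count so the sample size can be judged. The function name is hypothetical.

```shell
# Hypothetical helper: read /proc/diskstats-style lines on stdin and print
# "<device> <average write latency in ms> <write count>" for every device
# that has completed at least one write.
# Column 8 is writes completed and column 11 is milliseconds spent writing,
# following the field order in the kernel's I/O statistics documentation.
avg_write_latency_ms() {
    awk '$8 > 0 { printf "%s %.2f %d\n", $3, $11 / $8, $8 }'
}

# Usage on a device (not run here):
#   avg_write_latency_ms < /proc/diskstats
```

Because the counters accumulate from boot, a device with only a handful of writes can show a misleading average, which is why the write count is printed alongside it.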

Triage

Slow disk writes could indicate faulty hardware or heavy disk I/O. It is best to investigate the hardware further for signs of degradation.

test_disk_expansion

Summary

This test confirms that the host OS properly and fully expanded the partition at boot (>80% of the total disk space has been allocated).

Triage

Failure to expand the root filesystem can indicate an unhealthy storage medium or potentially a failure during the provisioning process. It is best to contact support, replace the storage media and re-provision the device.

test_data_partition_mounted

Summary

This test confirms that the data partition for the device has been mounted properly.

Triage

Failure to mount the data partition can indicate an unhealthy storage medium or other problems on the device. It is best to contact support to investigate further.

check_timesync

Summary

This check confirms that the system clock is actively disciplined.

Triage

Confirm that NTP is not blocked at the network level, and that any specified upstream NTP servers are accessible. If absolutely necessary, it is possible to temporarily sync the clock using HTTP headers (though this change will not persist across reboots). Further info can be found in the Device Debugging Masterclass.
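The HTTP header workaround can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name http_date_header is hypothetical, example.com stands in for any reachable HTTP server, and the device's date command is assumed to accept the HTTP date format. The Device Debugging Masterclass remains the authoritative reference for the exact procedure.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the Date header, with carriage returns stripped.
http_date_header() {
    tr -d '\r' |
        awk 'tolower($1) == "date:" { sub(/^[^ ]+ +/, ""); print; exit }'
}

# Usage on a device (not run here; setting the clock requires root, and
# example.com is a placeholder for a server the device can reach):
#   date -s "$(curl -sI http://example.com | http_date_header)"
```

As noted above, a clock set this way does not persist across reboots, so the underlying NTP problem still needs to be resolved.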

Depends on

This check depends on a fully functional networking stack (see check_networking).

check_temperature

Summary

This check looks for evidence of high temperature and CPU throttling.

test_temperature_now

Summary

If the device has temperature sensors, this test confirms that the temperature is below 80°C (at which point throttling begins).
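A comparable reading can be taken manually from the Linux thermal sysfs interface. The sketch below is an assumption about where the sensors are exposed, not a description of the suite's own logic: it assumes each thermal zone reports millidegrees Celsius in a temp file, and both function names are hypothetical.

```shell
# Hypothetical helper: print the highest temperature, in whole degrees
# Celsius, found under a thermal sysfs directory (defaults to the usual
# Linux location). Assumes each zone's temp file holds millidegrees Celsius.
# Prints nothing if no sensors are found.
max_temp_c() {
    cat "${1:-/sys/class/thermal}"/thermal_zone*/temp 2>/dev/null |
        awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 }
             END { if (NR > 0) printf "%d\n", max / 1000 }'
}

# Hypothetical test: succeeds if there are no sensors, or if the hottest
# zone is below 80 degrees Celsius (the threshold described above).
temp_ok() {
    t=$(max_temp_c "$1")
    [ -z "$t" ] || [ "$t" -lt 80 ]
}

# Usage on a device (not run here):
#   max_temp_c
#   temp_ok || echo "temperature at or above 80C"
```

Treating a device without sensors as passing mirrors the "if there are sensors" condition in the description.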

test_throttling_dmesg

Summary

This looks for evidence of CPU throttling in kernel log messages.

check_os_rollback

Summary

This check confirms that the host OS has not recorded any failed boots or rollbacks.

Triage

More information is available here. Contact support to investigate fully.

check_networking

Summary

This check tests for common networking failures at install locations that would prevent a healthy container lifecycle. More information on networking requirements can be found here.

test_upstream_dns

This test confirms that certain FQDNs are resolvable by each of the configured upstream DNS addresses. Only the failed upstream DNS addresses will be shown in the test results.

test_wifi

This test confirms that, if a device is using Wi-Fi, the signal level is above a threshold.

test_ping

This test confirms that packets are not dropped during an ICMP ping.
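To reproduce a similar measurement by hand, the packet loss figure can be pulled out of ping's summary line. The sketch below assumes the summary contains the phrase "N% packet loss", as common ping implementations print; the function name ping_loss_pct is hypothetical, and the target host is left to the reader.

```shell
# Hypothetical helper: read ping output on stdin and print the reported
# packet loss percentage (without the % sign).
# Assumes a summary line containing "N% packet loss".
ping_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage on a device (not run here; replace <host> with a reachable address):
#   ping -c 5 <host> | ping_loss_pct
```

Any value other than 0 indicates dropped packets on the path to the chosen host.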

test_ipv4_stack

This test confirms that the device can reach a public IPv4 endpoint when an IPv4 route is detected.

test_ipv6_stack

This test confirms that the device can reach a public IPv6 endpoint when an IPv6 route is detected. If necessary, you can disable IPv6 entirely on a device that is experiencing issues.

test_balena_api

This test confirms that the device can communicate with the balenaCloud API. Commonly, firewalls or MiTM devices can cause SSL failures here.

test_dockerhub

This test confirms that the device can communicate with Docker Hub.

Depends on

This test depends on the container engine being healthy (see check_container_engine).

test_balena_registry

This test is an end-to-end check that tries to authenticate with the balenaCloud registry, confirming that all other points in the networking stack are behaving properly.

Depends on

This test depends on the container engine being healthy (see check_container_engine).

Triage

Depending on which part of this check failed, there are various fixes and workarounds. Most failures, however, involve a restrictive local network or an unreliable connection.

check_user_services

Summary

Any checks with names beginning with check_service_ come from user-defined services. These checks interrogate the engine to see if any services are restarting uncleanly/unexpectedly or failing health checks. We allow users to provide their own health checks using the HEALTHCHECK directive defined in the Dockerfile or docker-compose file. Any health check output will be collected as-is, truncated to 100 characters, and shown as output along with the exit code.

Triage

Investigate the logs of whichever service(s) are restarting uncleanly or failing health checks. This issue could be a bug in the error handling or start-up of the unhealthy services. These checks are limited in scope to user services and should be triaged by the developer.

Depends on

This check depends on the container engine being healthy (see check_container_engine).

DIAGNOSE_VERSION=4.22.18