What is Prometheus

To report on the health and performance of Euler cluster nodes, we use an open-source monitoring tool called Prometheus. With it, users can query the real-time status of various node components and services. Euler’s Prometheus service is available at https://euler.wacc.wisc.edu/status. However, it is only reachable from the UW VPN.

How to Use Prometheus

The first step to using Prometheus is to follow this link. Your device must be connected to the UW VPN or the website will be unreachable. Once you are on the page, you should see a field with the placeholder Expression…. In this field, you may search for the metric of interest or type a custom PromQL expression.

[Figure: Prometheus web interface]

Most practical metrics will be prefixed with “node” or “nvidia”.

After entering an expression or metric, click Execute. A list of all nodes with matching data will appear under the Console tab. The Console tab displays only the most recent value for the expression. Switch to the Graph tab for a graphical view, which plots the results of the expression over time.

[Figure: Prometheus web interface, Graph tab]

By default, the time axis is displayed in the GMT time zone.

Underneath the graph is a list of every entry that matches the given expression. Entries may differ by node and other properties. Clicking an entry hides all other entries so that only the selected one is shown. Multiple entries may be focused this way.

Frequently Used Expressions

  • node_load15, a measure of CPU load: the average number of processes running or waiting to run, taken over the past 15 minutes
  • nvidia_gpu_duty_cycle, a measure of GPU utilization: the percentage of time over the past sample period during which one or more kernels were executing
  • “node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes”, the difference between these two metrics reports memory in use. It is similar to the value in the used column of free -b
  • node_network_info, which always returns 1 but lets users see network device information for each node, such as IPv6 address, device name, and status
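The memory expression in the list above subtracts one metric from another. PromQL pairs up series from the two sides by their labels, so each node's available memory is subtracted from that same node's total. The following is a minimal Python sketch of that pairing, not a Prometheus client; the node names and byte values are hypothetical sample data.

```python
def memory_used(total, available):
    """Subtract per-node values, keeping only nodes that appear in both inputs.

    This mirrors how a PromQL binary operator pairs series with identical
    labels and drops series that have no partner on the other side.
    """
    return {node: total[node] - available[node]
            for node in total if node in available}


GIB = 1024 ** 3

# Hypothetical sample values in bytes, keyed by the instance label.
mem_total = {"nodeA:10900": 64 * GIB, "nodeB:10900": 128 * GIB}
mem_available = {"nodeA:10900": 48 * GIB}  # nodeB reported no value

used = memory_used(mem_total, mem_available)
print(used)
```

In this sketch nodeB is absent from the result because it has no matching MemAvailable sample, which is also why a node can be missing from the output of the real expression.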

Advanced Expression Usage

Expressions are written in PromQL, Prometheus’s own query language, which allows for complex and customizable queries. Below are short guides to common PromQL features, with example syntax.

Instant Vector Selectors

To narrow the data returned by an expression to nodes with specific properties, use instant vector selectors. In PromQL, these properties are called labels, and the conditions placed on them are called label matchers. Most labels are specific to a metric, but the labels “instance” and “job” are present on every metric. Label matchers must be enclosed in “{ }” directly after the metric name.

For example, node_network_info{device="eth1",instance="vaughan04.wacc.wisc.edu:10900"} will return the value 1 as long as the device eth1 is reporting on node vaughan04. The “device” label applies to the metric node_network_info, but may not exist on other metrics.

All Euler nodes have multiple network interfaces with different device names. Not all nodes will have a device named “eth1”; a node may be online even if only one device is in the up operating state.
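When the same metric is queried for many nodes or devices, it can help to assemble the selector text programmatically before pasting it into the Expression field. The helper below is a hypothetical sketch (the function name `selector` is not part of Prometheus) that builds an instant vector selector with equality matchers only.

```python
def selector(metric, **labels):
    """Return PromQL text for `metric` filtered by label="value" matchers."""
    if not labels:
        return metric
    parts = []
    for name, value in labels.items():
        # Escape backslashes and double quotes so the value stays one string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return metric + "{" + ",".join(parts) + "}"


expr = selector(
    "node_network_info",
    device="eth1",
    instance="vaughan04.wacc.wisc.edu:10900",
)
print(expr)
```

The printed text is the same selector used in the example above, so it can be checked against the web interface directly.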

Regex matching

If typing out the full hostname when selecting an instance becomes tedious, you may use regex matching to shorten the expression or to match several label values with one pattern. Use the label matching operators =~ (matches) or !~ (does not match) to enable regex matching. The pattern must match the entire label value.

For example, nvidia_gpu_duty_cycle{instance="germain.wacc.wisc.edu:10901"} selects the same node as nvidia_gpu_duty_cycle{instance=~"germain.*"}. Both expressions will return values for metric nvidia_gpu_duty_cycle on node germain.

The full set of label matching operators (=, !=, =~, and !~) is documented at the Prometheus QL link at the bottom of this page.
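Prometheus regex matchers are anchored, meaning the pattern has to cover the whole label value rather than just part of it. The Python sketch below imitates that behaviour with `re.fullmatch`. It is an approximation: Prometheus uses RE2 syntax, which differs from Python's `re` module for some advanced patterns, and the instance list here is only illustrative.

```python
import re


def matches(pattern, value):
    """Imitate =~ : True only when the pattern matches the entire value."""
    return re.fullmatch(pattern, value) is not None


# Illustrative instance label values taken from the examples on this page.
instances = [
    "germain.wacc.wisc.edu:10901",
    "euler54.wacc.wisc.edu:10901",
]

selected = [name for name in instances if matches("germain.*", name)]
print(selected)

# Without the trailing .* the pattern no longer covers the full value.
print(matches("germain", "germain.wacc.wisc.edu:10901"))
```

This is why the shortened pattern needs the trailing `.*`: the bare hostname prefix alone would select nothing.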

Range Vector Selectors

To select metric values within a specific time range, use range vector selectors. The range is a duration enclosed in square brackets “[ ]” at the end of a vector selector. Available units include s, m, h, d, w, and y.

For example, node_load15{instance="euler.wacc.wisc.edu"}[5m] will return all reported values for metric node_load15 from the past 5 minutes for the Euler head node.

Expressions that return a range vector cannot be graphed.
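To make the duration units concrete, the sketch below converts a single-unit duration such as 5m into seconds. It assumes the Prometheus convention that a week is always 7 days and a year is always 365 days, and it deliberately handles only the units listed above; the function name is hypothetical.

```python
import re

# Seconds per unit; w is fixed at 7 days and y at 365 days.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}


def duration_seconds(text):
    """Convert a single-unit duration like '5m' or '2d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


print(duration_seconds("5m"))
print(duration_seconds("2d"))
```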

Offset modifiers

Expressions with an offset modifier return metric values as they were recorded at a specified offset before the current time. The same result may be achieved using the Graph tab. The offset modifier applies to both instant and range vector selectors, but must be placed immediately after the selector.

For example, node_network_info offset 2d will return values for metric node_network_info as they were reported two days ago.

Metric history is continuously deleted after a certain period. Queries for values that have expired will return no results.
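The effect of an offset can be pictured as moving the evaluation time backwards before looking up the latest sample. The Python sketch below is a simplified model with hypothetical sample data; real Prometheus also ignores samples that are too old relative to the evaluation time, which this sketch does not attempt to reproduce.

```python
def value_at(samples, eval_time, offset=0):
    """Return the newest sample value at or before eval_time - offset.

    `samples` is a list of (timestamp_seconds, value) pairs. Returns None
    when no sample exists that far back, e.g. because history has expired.
    """
    target = eval_time - offset
    candidates = [(t, v) for t, v in samples if t <= target]
    if not candidates:
        return None
    return max(candidates, key=lambda sample: sample[0])[1]


# Hypothetical samples scraped once per minute.
samples = [(0, 1.0), (60, 2.0), (120, 3.0)]

print(value_at(samples, eval_time=120))             # no offset
print(value_at(samples, eval_time=120, offset=60))  # one minute earlier
print(value_at(samples, eval_time=120, offset=500)) # before any stored data
```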

Functions

PromQL functions perform additional computation on metric values within an expression. A full list of available functions may be found here.

For example, the function avg_over_time(range-vector) calculates the average value of all points in the specified interval. A good use of this function is calculating the average temperature of a GPU, such as avg_over_time(nvidia_gpu_temperature_celsius{instance="euler54.wacc.wisc.edu:10901",minor_number="1"}[4m]). This expression calculates the average temperature in Celsius of a GPU in euler54 over the last 4 minutes.
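As a rough illustration of what avg_over_time computes, the sketch below averages the samples that fall inside a trailing window. The temperature readings are hypothetical, and the window is treated as excluding its oldest edge, which is an assumption: boundary handling has differed between Prometheus versions.

```python
def avg_over_time(samples, eval_time, window):
    """Average the values whose timestamps lie in (eval_time - window, eval_time].

    `samples` is a list of (timestamp_seconds, value) pairs. Returns None
    when the window contains no samples.
    """
    in_window = [v for t, v in samples if eval_time - window < t <= eval_time]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


# Hypothetical GPU temperature readings in Celsius, one per minute.
temps = [(0, 60.0), (60, 62.0), (120, 64.0), (180, 66.0), (240, 68.0)]

# The 150-second window ending at t=240 covers the samples at 120, 180 and 240.
average = avg_over_time(temps, eval_time=240, window=150)
print(average)
```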

External Resources

Nvidia metrics

Prometheus QL

Node metrics