Detailed CPU metrics

Among HiveMQ’s hundreds of metrics is a group which offers detailed insights into the load the broker’s CPUs are experiencing.

It is advisable to add these metrics to your main monitoring dashboard, as they can give vital hints about the source of congestion, should any occur.

com.hivemq.system.os.global.cpu.total.usage.iowait
com.hivemq.system.os.global.cpu.total.usage.irq
com.hivemq.system.os.global.cpu.total.usage.nice
com.hivemq.system.os.global.cpu.total.usage.softirq
com.hivemq.system.os.global.cpu.total.usage.steal
com.hivemq.system.os.global.cpu.total.usage.sys
com.hivemq.system.os.global.cpu.total.usage.user

iowait

A rise in this metric indicates that the processor is idle while waiting for outstanding I/O, typically disk reads or writes, to complete. If iowait rises while overall CPU usage stays low, the disks in use are usually not fast enough for the broker's operations, or may even be malfunctioning. On systems with high iowait, it is worth investigating general I/O performance, for example by checking disk or network utilization and latency.

However, it is important to interpret iowait in conjunction with the other CPU usage metrics and the specific workload of the system. In some cases, high iowait is expected and does not necessarily indicate a problem. Task-related metrics (LINK) can give further hints that the system is not performing its tasks at the necessary speed.
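The rule of thumb above, high iowait while the CPU is otherwise lightly used, can be written as a small check. This is only a minimal sketch: the function name and both thresholds are assumptions chosen for illustration, not HiveMQ recommendations, and the inputs are assumed to be percentages between 0 and 100.

```python
def iowait_suggests_slow_disks(iowait_pct, user_pct, sys_pct,
                               iowait_threshold=20.0, busy_threshold=50.0):
    """Return True when iowait is high while the CPU does little real work.

    All inputs are percentages (0-100). Both thresholds are hypothetical
    defaults; tune them against the baseline of your own deployment.
    """
    busy_pct = user_pct + sys_pct  # time spent executing user and kernel code
    return iowait_pct >= iowait_threshold and busy_pct < busy_threshold


# High iowait with little real work: worth checking the disks.
print(iowait_suggests_slow_disks(iowait_pct=35.0, user_pct=10.0, sys_pct=5.0))

# High iowait on an already busy CPU: interpret together with the workload.
print(iowait_suggests_slow_disks(iowait_pct=35.0, user_pct=60.0, sys_pct=20.0))
```

A check like this is a starting point for an alert rule, not a diagnosis; a positive result should prompt a look at disk latency and utilization.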

irq / softirq

irq measures the percentage of CPU time spent processing hardware interrupt requests (IRQs). IRQs are signals that hardware devices send to the CPU to request attention or to report a specific event. High irq usage can indicate heavy hardware activity and may be a sign of hardware issues or misconfigurations.

Similarly, softirq measures the percentage of CPU time spent processing software interrupt requests (softirqs). These are software-generated interrupts that the kernel uses for network-related tasks, timers, and other kernel activities. High softirq usage may indicate a heavy load on the network stack.

nice

Processes can be assigned nice values to determine the priority at which they receive CPU time. The nice metric measures the percentage of CPU time spent running user-space processes that have a positive nice value, that is, processes running at lowered priority.

steal

This metric applies only when the system runs in a virtualized environment. It represents the percentage of CPU time that the virtual machine spends waiting for the hypervisor to make a physical CPU available. In virtualized environments, the physical CPU is shared among multiple virtual machines, and steal time measures how long the virtual machine has to wait for its turn on the physical CPU. High steal time can indicate resource contention on the host machine.

sys / user

This pair of metrics measures the percentage of CPU time spent in kernel space (system-level operations) and in user space (user-level processes). Kernel space includes the core operating system processes and device drivers, whereas user space includes application-level processes.
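To make the categories above concrete, the following sketch shows how such percentages are commonly derived on Linux from two samples of the aggregate "cpu" line in /proc/stat. Whether HiveMQ computes its metrics this way is not stated in this article and is an assumption here; the function names and the sample values are made up for illustration.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat.
# The trailing guest fields are skipped because the kernel already
# counts guest time as part of user and nice.
FIELDS = ("user", "nice", "sys", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of cumulative tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def usage_percentages(before, after):
    """Return each category's share of the CPU time elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two hypothetical samples taken a few seconds apart.
sample_before = parse_cpu_line("cpu  1000 50 300 8000 100 10 20 20 0 0")
sample_after = parse_cpu_line("cpu  1600 50 500 9000 250 10 30 60 0 0")

for name, value in usage_percentages(sample_before, sample_after).items():
    print(f"{name}: {value:.1f}%")
```

Because the counters are cumulative since boot, a single reading says little; only the difference between two samples describes the current load.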

Prometheus queries

rate(com_hivemq_system_os_global_cpu_total_usage_iowait{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_irq{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_nice{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_softirq{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_steal{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_sys{job = "$job"}[1m])

rate(com_hivemq_system_os_global_cpu_total_usage_user{job = "$job"}[1m])
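The queries above all follow one pattern: the dotted metric name with its dots replaced by underscores, wrapped in rate(...) with a job selector and a one-minute window. As a minimal sketch, they can be generated from the list of metric suffixes. The helper name prometheus_query is hypothetical, and "$job" is kept verbatim on the assumption that it is a variable substituted by the dashboard.

```python
METRIC_PREFIX = "com.hivemq.system.os.global.cpu.total.usage"
METRIC_SUFFIXES = ("iowait", "irq", "nice", "softirq", "steal", "sys", "user")


def prometheus_query(suffix, job="$job", window="1m"):
    """Build the rate() query for one CPU usage metric (hypothetical helper)."""
    # Prometheus metric names use underscores where the HiveMQ names use dots.
    name = f"{METRIC_PREFIX}.{suffix}".replace(".", "_")
    return f'rate({name}{{job = "{job}"}}[{window}])'


for suffix in METRIC_SUFFIXES:
    print(prometheus_query(suffix))
```

Generating the queries this way keeps the seven dashboard panels consistent if the window or the job label ever changes.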

 

InfluxDB queries

SELECT mean("usage_user") + mean("usage_guest") + mean("usage_guest_nice") + mean("usage_iowait") + mean("usage_irq") + mean("usage_nice") + mean("usage_softirq") + mean("usage_steal") + mean("usage_system") FROM "cpu" WHERE $timeFilter GROUP BY time($shortInterval), "node" fill(null)