Detailed CPU metrics
- 1.1 iowait
- 1.2 irq / softirq
- 1.3 nice
- 1.4 steal
- 1.5 sys / user
- 2 Prometheus queries
- 3 InfluxDB queries
Among HiveMQ’s hundreds of metrics is a group which offers detailed insights into the load the broker’s CPUs are experiencing.
It is advisable to add these metrics to your main monitoring dashboard, as they can give vital hints about the source of congestion, should any occur.
com.hivemq.system.os.global.cpu.total.usage.iowait
com.hivemq.system.os.global.cpu.total.usage.irq
com.hivemq.system.os.global.cpu.total.usage.nice
com.hivemq.system.os.global.cpu.total.usage.softirq
com.hivemq.system.os.global.cpu.total.usage.steal
com.hivemq.system.os.global.cpu.total.usage.sys
com.hivemq.system.os.global.cpu.total.usage.user
iowait
A rise here indicates that the processor is waiting on disk reads or writes before it can perform its tasks. If the metric rises while overall CPU usage is low, the disks in use are usually not fast enough for the broker's operations and may even be malfunctioning. On systems with high iowait, it can be beneficial to investigate the system’s general I/O performance, for example by checking disk or network utilization and latency.
However, it is important to interpret the iowait metric in conjunction with the other CPU usage metrics and the specific workload of the system. In some cases, high iowait is expected and does not necessarily indicate a problem. For further hints that the system is not performing its tasks at the necessary speed, look at the task-related metrics (LINK).
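The rule of thumb above (high iowait while the CPU is otherwise mostly idle) can be sketched as a small check. This is only an illustration: the function name, the dictionary layout, and both thresholds are hypothetical and would need tuning for a real workload.

```python
def iowait_suspicious(usage, iowait_threshold=20.0, busy_threshold=50.0):
    """Return True when iowait is high while the CPU is otherwise mostly idle.

    `usage` maps a CPU category name ("iowait", "user", "sys", "nice") to a
    percentage of CPU time. Both thresholds are hypothetical defaults, not
    values recommended by HiveMQ.
    """
    # Time the CPU actually spends executing code, in user or kernel space.
    busy = usage.get("user", 0.0) + usage.get("sys", 0.0) + usage.get("nice", 0.0)
    return usage.get("iowait", 0.0) >= iowait_threshold and busy < busy_threshold


if __name__ == "__main__":
    sample = {"iowait": 35.0, "user": 5.0, "sys": 3.0}
    print(iowait_suspicious(sample))
```

A positive result is a prompt to look at disk latency and the task-related metrics, not proof of a fault, since some workloads legitimately spend a lot of time waiting on I/O.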
irq / softirq
irq measures the percentage of CPU time spent processing hardware interrupt requests (IRQs). IRQs are signals sent by hardware devices to the CPU to request attention or to report a specific event. High irq usage can indicate heavy hardware activity and may be a sign of hardware issues or misconfiguration.
Similarly, softirq measures the percentage of CPU time spent processing software interrupt requests (softirqs). These are software-generated interrupts used for handling network-related tasks, timers, and other kernel activities. High softirq usage may indicate a heavy load on the network stack.
nice
Processes can be assigned nice values, which determine the priority at which they receive CPU time. The nice metric shows the percentage of CPU time spent running user-space processes whose priority has been lowered (a positive nice value), so it helps you see how much CPU time goes to low-priority work.
steal
If the system runs in a virtualized environment, this metric represents the percentage of CPU time during which the virtual machine is waiting for the hypervisor to make a physical CPU available. In virtualized environments the physical CPU is shared among multiple virtual machines, and steal time measures how long the virtual machine has to wait for its turn. High steal time can indicate resource contention on the host machine.
sys / user
This pair of metrics measures the percentage of CPU time spent in kernel space (system-level operations) and in user space (user-level processes). Kernel space includes the core operating system processes and device drivers, whereas user space includes application-level processes.
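On Linux, the categories described above match the columns of the aggregate cpu line in /proc/stat. The following is a minimal sketch of how percentages per category can be derived from two samples of that line. It assumes a modern Linux kernel that reports at least eight counters, and it is not a description of how HiveMQ itself collects these metrics. The function names and sample values are illustrative.

```python
# Column order of the aggregate "cpu" line in /proc/stat on Linux.
# The kernel calls the third column "system"; it is named "sys" here to match
# the metric name. The trailing guest columns are left out because guest time
# is already included in the user and nice counters.
FIELDS = ("user", "nice", "sys", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def usage_percent(before, after):
    """Share of elapsed CPU time per category between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed between the samples; avoid dividing by zero.
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    # Made-up sample lines; on a real host, read the first line of /proc/stat
    # twice, a few seconds apart.
    first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0 0 0")
    second = parse_cpu_line("cpu 150 0 70 1500 250 5 5 20 0 0")
    for name, value in usage_percent(first, second).items():
        print(f"{name:8s} {value:5.1f}%")
```

Because the counters are cumulative since boot, only the difference between two samples is meaningful, which is why the percentages are computed from deltas.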
Prometheus queries
rate(com_hivemq_system_os_global_cpu_total_usage_iowait{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_irq{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_nice{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_softirq{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_steal{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_sys{job = "$job"}[1m])
rate(com_hivemq_system_os_global_cpu_total_usage_user{job = "$job"}[1m])
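The `$job` placeholder in these queries appears to be a dashboard variable, so it has to be replaced with a concrete job name when the queries are run elsewhere. As a hedged sketch, the snippet below builds the same expressions and sends them to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL, the job name `hivemq`, and the helper names are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

METRIC_PREFIX = "com_hivemq_system_os_global_cpu_total_usage_"
CATEGORIES = ("iowait", "irq", "nice", "softirq", "steal", "sys", "user")


def build_query(category, job, window="1m"):
    """Build the PromQL expression from the list above for one CPU category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown CPU category: {category}")
    return f'rate({METRIC_PREFIX}{category}{{job="{job}"}}[{window}])'


def instant_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def fetch(base_url, promql, timeout=5.0):
    """Run the query and return the list of result series from the response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


if __name__ == "__main__":
    # Hypothetical job name and server address; adjust both to your setup.
    for category in CATEGORIES:
        print(instant_query_url("http://localhost:9090", build_query(category, "hivemq")))
```

Calling `fetch("http://localhost:9090", build_query("iowait", "hivemq"))` would return one entry per matching time series, provided a Prometheus server is reachable at that address.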
InfluxDB queries
SELECT mean("usage_user") + mean("usage_guest") + mean("usage_guest_nice") + mean("usage_iowait") + mean("usage_irq") + mean("usage_nice") + mean("usage_softirq") + mean("usage_steal") + mean("usage_system") FROM "cpu" WHERE $timeFilter GROUP BY time($shortInterval), "node" fill(null)