Effective Monitoring and Alerting for HiveMQ Clusters
Monitoring a HiveMQ cluster requires careful selection of metrics and alert thresholds to ensure system stability while minimizing unnecessary alarms. This article highlights key metrics, best practices for alerting, and strategies for maintaining cluster health.
Instructions
CPU and Memory Monitoring
CPU and memory are fundamental metrics for any HiveMQ deployment. However, brief, isolated spikes are often misleading and rarely indicate a real problem.
Best practice:
Set thresholds based on sustained usage rather than instantaneous values.
Example: Trigger an alert if CPU or memory usage exceeds 90% for more than 10 minutes.
By focusing on sustained usage, you avoid false alarms caused by brief spikes during normal operations.
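As a concrete sketch, the Prometheus-style alerting rules below encode this sustained-usage approach for CPU and memory. They assume the standard node_exporter metrics (node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes); if your CPU and memory data comes from a different source, adapt the expressions accordingly.

```yaml
# Sketch: sustained CPU and memory alerts (assumes node_exporter metrics).
groups:
  - name: hivemq-host-resources
    rules:
      - alert: HiveMQHighCpuSustained
        # Busy CPU percentage per instance, averaged over the last 5 minutes.
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        # "for" enforces the sustained-usage rule: the condition must hold
        # for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% for more than 10 minutes on {{ $labels.instance }}"
      - alert: HiveMQHighMemorySustained
        # Memory usage percentage derived from available vs. total memory.
        expr: |
          100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% for more than 10 minutes on {{ $labels.instance }}"
```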
Heap Usage and Garbage Collection (GC)
Monitoring Java heap usage and GC activity is crucial for JVM-based systems like HiveMQ.
Key metrics to alert on:
Heap usage exceeding 80%
GC pause times rather than frequency alone
Specific HiveMQ metrics:
com_hivemq_jvm_garbage_collector_G1_Old_Generation_time
com_hivemq_jvm_garbage_collector_G1_Old_Generation_count
Focusing on pause times and heap usage provides a more accurate indication of memory pressure than just counting GC events.
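A hedged sketch of these thresholds as Prometheus alerting rules is shown below. The heap metric names (com_hivemq_jvm_memory_heap_used, com_hivemq_jvm_memory_heap_max) and the 5-seconds-of-GC-per-5-minutes threshold are assumptions for illustration; verify the exact names and units against your HiveMQ Prometheus endpoint and tune the numbers for your workload.

```yaml
# Sketch: heap and GC-pause alerts for HiveMQ's JVM metrics.
groups:
  - name: hivemq-jvm
    rules:
      - alert: HiveMQHeapUsageHigh
        # Heap metric names are assumptions based on HiveMQ's Prometheus
        # naming scheme; confirm them on your /metrics endpoint.
        expr: com_hivemq_jvm_memory_heap_used / com_hivemq_jvm_memory_heap_max > 0.80
        for: 5m
        labels:
          severity: warning
      - alert: HiveMQOldGenGcTimeHigh
        # Time spent in G1 Old Generation collections over the last 5 minutes,
        # using the counter named in this article (typically milliseconds).
        # More than 5 seconds of old-gen GC in 5 minutes is an example
        # threshold, not a universal recommendation.
        expr: increase(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[5m]) > 5000
        for: 5m
        labels:
          severity: warning
```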
Cluster Node Monitoring
Cluster health is critical for high availability.
Best practice:
Alert if the number of nodes drops below the expected count.
Example:
< 3 nodes when you have a 3-node cluster
Maintaining awareness of node count ensures that your cluster can continue serving clients reliably.
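One simple way to express this, sketched below, is to count the HiveMQ scrape targets Prometheus can currently reach. The job label hivemq is an assumption about your scrape configuration; if your deployment exposes a dedicated cluster-size metric, that can be used instead.

```yaml
# Sketch: alert when fewer nodes than expected are reachable (3-node cluster).
groups:
  - name: hivemq-cluster
    rules:
      - alert: HiveMQClusterNodeMissing
        # "up" is 1 for every target Prometheus can scrape; summing it over
        # the (assumed) hivemq job gives the number of reachable nodes.
        expr: sum(up{job="hivemq"}) < 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 3 HiveMQ nodes are reachable in a 3-node cluster"
```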
Overload Protection and Backpressure
Monitoring overload protection (OP) and backpressure mechanisms is essential to prevent system failure under high load.
Recommendations:
Track the percentage of clients affected by OP.
Set alerts to trigger only if conditions are sustained (e.g., OP level = 10 for 5 minutes).
Short spikes are normal and shouldn’t trigger alerts. By alerting only on sustained conditions, you ensure that notifications reflect true system stress.
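The sketch below shows how such a sustained condition could be encoded. The overload-protection metric name is a placeholder assumption; check which overload-protection and backpressure metrics your HiveMQ version actually exports and substitute them.

```yaml
# Sketch: fire only when overload protection stays at its maximum level.
groups:
  - name: hivemq-overload-protection
    rules:
      - alert: HiveMQOverloadProtectionSustained
        # Placeholder metric name; replace it with the overload-protection
        # level metric exposed by your HiveMQ deployment.
        expr: com_hivemq_overload_protection_level >= 10
        # Short spikes are ignored; the level must stay at 10 for 5 minutes.
        for: 5m
        labels:
          severity: warning
```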
Other Useful Alerts
Additional metrics can help identify problems before they impact operations:
| Metric | Recommended Threshold / Notes |
|---|---|
| TLS handshake failures | Rising failure counts typically point to certificate or client configuration problems |
| License expiry | Alert ahead of the expiry date |
| Disk usage (PVC enabled) | Monitor available space to prevent write errors |
| I/O wait times | Sustained high I/O wait points to slow or contended storage |
Monitoring these metrics ensures comprehensive coverage of both system performance and operational compliance.
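For the disk usage row in particular, a PVC-backed Kubernetes deployment could build on the kubelet volume-stats metrics, as in the sketch below. The PVC name pattern (hivemq.*) and the 10% free-space threshold are illustrative assumptions.

```yaml
# Sketch: low-disk-space alert for HiveMQ persistent volume claims.
groups:
  - name: hivemq-storage
    rules:
      - alert: HiveMQPvcSpaceLow
        # kubelet exposes available and total bytes per PVC; the claim-name
        # regex below is an assumption, adjust it to your PVC names.
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"hivemq.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"hivemq.*"} < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% free space on PVC {{ $labels.persistentvolumeclaim }}"
```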
Best Practices for Cluster Upgrades
When performing rolling upgrades:
Always add a new node first, then remove the old one.
This ensures the cluster maintains sufficient capacity and avoids unnecessary overload risk.
Following this strategy prevents downtime and keeps the system responsive during maintenance.
Each cluster has unique requirements, so monitoring metrics and thresholds must be tailored to its deployment and expected load.