Effective Monitoring and Alerting for HiveMQ Clusters

Monitoring a HiveMQ cluster requires careful selection of metrics and alert thresholds to ensure system stability while minimizing unnecessary alarms. This article highlights key metrics, best practices for alerting, and strategies for maintaining cluster health.

 Instructions

  1. CPU and Memory Monitoring

    1. CPU and memory are fundamental metrics for any HiveMQ deployment. However, isolated spikes occur during normal operation and are a poor basis for alerting.

      Best practice:

      • Set thresholds based on sustained usage rather than instantaneous values.

        • Example: Trigger an alert if CPU or memory usage exceeds 90% for more than 10 minutes.

      By focusing on sustained usage, you avoid false alarms caused by brief spikes during normal operations.
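
      As an illustration, the sustained-usage pattern maps directly onto the for clause of a Prometheus alerting rule. The following is a minimal sketch, assuming metrics are scraped via the HiveMQ Prometheus Monitoring Extension; the CPU metric name is a placeholder (adapt it, and the threshold's unit, to what your deployment actually exposes), and an analogous rule can cover memory.

        groups:
          - name: hivemq-cpu
            rules:
              - alert: HiveMQSustainedHighCpu
                # Placeholder metric name: replace with the CPU usage gauge your broker exposes.
                # Threshold assumes a 0-1 fraction; use 90 if the metric reports percent.
                expr: com_hivemq_system_os_global_cpu_total_usage > 0.9
                for: 10m    # fire only if the condition holds continuously for 10 minutes
                labels:
                  severity: warning
                annotations:
                  summary: "CPU usage above 90% for more than 10 minutes on {{ $labels.instance }}"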

  2. Heap Usage and Garbage Collection (GC)

    1. Monitoring Java heap usage and GC activity is crucial for JVM-based systems like HiveMQ.

      Key metrics to alert on:

      • Heap usage exceeding 80%

      • GC pause times rather than frequency alone

      Specific HiveMQ metrics:

      • com_hivemq_jvm_garbage_collector_G1_Old_Generation_time

      • com_hivemq_jvm_garbage_collector_G1_Old_Generation_count

      Focusing on pause times and heap usage provides a more accurate indication of memory pressure than just counting GC events.
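
      A sketch of matching rules, under the same Prometheus assumption: the heap metric names used here are assumptions to be verified against your metrics endpoint, while the GC time metric is the one listed above.

        groups:
          - name: hivemq-heap-gc
            rules:
              - alert: HiveMQHeapUsageHigh
                # Heap metric names are assumptions; verify them against the metrics your broker exports.
                expr: com_hivemq_jvm_memory_heap_used / com_hivemq_jvm_memory_heap_max > 0.8
                for: 5m    # a short sustain period avoids alerting on momentary spikes
                labels:
                  severity: warning
                annotations:
                  summary: "Heap usage above 80% on {{ $labels.instance }}"
              - alert: HiveMQOldGenGcPauseTimeHigh
                # Old-generation GC time accumulated over the last 5 minutes.
                # The unit (ms vs. s) depends on the exporter; the threshold assumes milliseconds.
                expr: increase(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[5m]) > 5000
                labels:
                  severity: warning
                annotations:
                  summary: "More than 5s of old-generation GC pauses in the last 5 minutes on {{ $labels.instance }}"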

  3. Cluster Node Monitoring

    1. Cluster health is critical for high availability.

      Best practice:

      • Alert if the number of nodes drops below the expected count.

        • Example: < 3 nodes when you have a 3-node cluster

      Maintaining awareness of node count ensures that your cluster can continue serving clients reliably.
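
      A minimal sketch of such a rule for a 3-node cluster; the cluster-size metric name is an assumption, so point the expression at whichever gauge reports the number of nodes in your deployment.

        groups:
          - name: hivemq-cluster
            rules:
              - alert: HiveMQClusterNodeMissing
                # Assumed metric name for the cluster-size gauge; adjust to your deployment.
                expr: com_hivemq_cluster_nodes_count < 3
                for: 2m    # tolerate brief blips, e.g. during a rolling restart
                labels:
                  severity: critical
                annotations:
                  summary: "HiveMQ cluster reports fewer than 3 nodes"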

  4. Overload Protection and Backpressure

    1. Monitoring overload protection (OP) and backpressure mechanisms is essential to prevent system failure under high load.

      Recommendations:

      • Track the percentage of clients affected by OP.

      • Set alerts to trigger only if conditions are sustained (e.g., OP level = 10 for 5 minutes).

      Short spikes are normal and shouldn’t trigger alerts. By alerting only on sustained conditions, you ensure that notifications reflect true system stress.
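
      A sketch of this sustained-condition alert; the overload-protection level metric name is an assumption and should be replaced with the gauge your broker actually exposes.

        groups:
          - name: hivemq-overload-protection
            rules:
              - alert: HiveMQSustainedOverloadProtection
                # Assumed metric name for the overload protection level gauge.
                expr: com_hivemq_overload_protection_level >= 10
                for: 5m    # short spikes are expected and should not page anyone
                labels:
                  severity: critical
                annotations:
                  summary: "Overload protection level at or above 10 for 5 minutes on {{ $labels.instance }}"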

  5. Other Useful Alerts

    1. Additional metrics can help identify problems before they impact operations:

      Metric                        Recommended Threshold / Notes
      TLS handshake failures        com_hivemq_tls_handshakes_failed_count
      License expiry                Alert when < 30 days using com_hivemq_license_days_till_expire
      Disk usage (PVC enabled)      Monitor available space to prevent write errors
      I/O wait times                com_hivemq_system_os_global_cpu_total_usage_iowait > 10%

Monitoring these metrics ensures comprehensive coverage of both system performance and operational compliance.
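
As a starting point, the metrics in the table translate into Prometheus rules such as the sketch below. The disk-usage alert is omitted because the appropriate metric depends on your environment (node_exporter, kubelet metrics, etc.), and the iowait threshold assumes the metric reports a percentage.

      groups:
        - name: hivemq-misc
          rules:
            - alert: HiveMQTlsHandshakeFailures
              expr: increase(com_hivemq_tls_handshakes_failed_count[5m]) > 0
              labels:
                severity: warning
              annotations:
                summary: "TLS handshake failures observed in the last 5 minutes on {{ $labels.instance }}"
            - alert: HiveMQLicenseExpiringSoon
              expr: com_hivemq_license_days_till_expire < 30
              labels:
                severity: warning
              annotations:
                summary: "HiveMQ license expires in less than 30 days"
            - alert: HiveMQHighIoWait
              # Threshold assumes the metric reports percent; use 0.1 if it is a fraction.
              expr: com_hivemq_system_os_global_cpu_total_usage_iowait > 10
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "I/O wait above 10% for 5 minutes on {{ $labels.instance }}"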

  6. Best Practices for Cluster Upgrades

    1. When performing rolling upgrades:

      • Always add a new node first, then remove the old one.

      • This ensures the cluster maintains sufficient capacity and avoids unnecessary overload risk.

      Following this strategy prevents downtime and keeps the system responsive during maintenance.

Each cluster has unique requirements, so monitoring metrics and thresholds must be tailored to its deployment and expected load.

 

 Related articles