Replication task metrics

All members of a HiveMQ cluster log the following messages while replication is in progress:

INFO  - Starting cluster replication process. This may take a while. Please do not shut down HiveMQ.
[...]
INFO  - Replication is still in progress. Please do not shut down HiveMQ.

While these messages are being logged, the cluster is at risk of data loss if more than (replica count - 1) brokers are removed from the cluster.

Once all necessary data has been exchanged, the brokers log the following:

INFO  - Finished cluster replication successfully in 30000 ms.
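When collecting these logs automatically, the duration can be extracted from the completion line. The following is a minimal sketch that assumes the message text matches the example above exactly; the function name and the way log lines are supplied are hypothetical and not part of HiveMQ.

```python
import re

# Matches the completion line quoted above; the captured number is milliseconds.
# Assumption: the message wording is exactly as shown in the example log line.
FINISHED = re.compile(r"Finished cluster replication successfully in (\d+) ms\.")


def replication_duration_ms(log_lines):
    """Return the duration from the first completion line, or None if absent."""
    for line in log_lines:
        match = FINISHED.search(line)
        if match:
            return int(match.group(1))
    return None
```

A return value of None means the completion line has not been logged yet, so replication should be treated as still in progress.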

This indicates that all necessary persistent data has reached its replication factor on the target hosts. However, this log message does not mark the completion of all replication-related tasks. Individual brokers may still experience I/O and CPU load stemming from the replication process.

To confirm that the cluster has returned to its baseline load for the traffic it is handling, the metrics give further insight.

A join process (where a fresh broker with no state joins the cluster) is complete once the following metrics return to 0:

com.hivemq.internal.singlewriter.*
      topic-tree.remove-locally.queued
      client-session-subscription-persistence.remove-locally.queued
      client-session-persistence.remove-locally.queued
      client-queue-persistence.remove-local.queued
      client-event-persistence.remove-bucket.queued

Furthermore, the replication batch metrics should also have returned to 0:

com.hivemq.replication.batches-queued
com.hivemq.replication.batches-sent
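The check on the metrics listed above can be sketched as a small function over a metrics snapshot. This is an illustration only: it assumes the snapshot is a plain mapping from metric name to numeric value (how it is obtained, for example from a monitoring system, is left open), and it assumes the indented names are appended to the com.hivemq.internal.singlewriter. prefix to form the full metric names.

```python
# Assumption: full names are the prefix plus each indented suffix from the list above.
SINGLEWRITER_PREFIX = "com.hivemq.internal.singlewriter."
SINGLEWRITER_SUFFIXES = [
    "topic-tree.remove-locally.queued",
    "client-session-subscription-persistence.remove-locally.queued",
    "client-session-persistence.remove-locally.queued",
    "client-queue-persistence.remove-local.queued",
    "client-event-persistence.remove-bucket.queued",
]
REPLICATION_METRICS = [
    "com.hivemq.replication.batches-queued",
    "com.hivemq.replication.batches-sent",
]


def pending_replication_work(snapshot):
    """Return the watched metrics that are non-zero or missing in the snapshot."""
    names = [SINGLEWRITER_PREFIX + s for s in SINGLEWRITER_SUFFIXES]
    names += REPLICATION_METRICS
    pending = {}
    for name in names:
        value = snapshot.get(name)
        # A missing metric is reported too, since it cannot be confirmed as 0.
        if value is None or value != 0:
            pending[name] = value
    return pending
```

An empty result means every watched metric is at 0 for that broker. Missing metrics are deliberately reported as pending so that a misconfigured scrape is not mistaken for a settled broker.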

Observing these metrics in addition to the log line above can be helpful when performing rolling upgrades in clusters that are operating close to the limits of their hardware’s capabilities.

Ensuring that replication-related tasks are resolved before rotating the next node is the least impactful way to perform such topology changes.
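In an automated rolling upgrade, this waiting step could be expressed as a polling loop. The sketch below is not a HiveMQ tool: is_settled is a hypothetical callable supplied by the operator (for example, one that checks the metrics above on every broker), and the timeout, interval, and the requirement of several consecutive settled readings are assumed defaults chosen for illustration.

```python
import time


def wait_until_settled(is_settled, timeout_s=1800.0, interval_s=10.0,
                       required_consecutive=3, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll is_settled() until it holds for several consecutive checks.

    Returns True once the streak is reached, False if the timeout expires first.
    """
    deadline = clock() + timeout_s
    streak = 0
    while True:
        # Any unsettled reading resets the streak to guard against brief dips to 0.
        streak = streak + 1 if is_settled() else 0
        if streak >= required_consecutive:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A rolling-upgrade script would call this after each node rejoins and move on to the next node only when it returns True. Requiring consecutive readings is a design choice to avoid acting on a momentary 0 between batches.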