Reducing load during a rolling upgrade

The “Finished cluster replication” log message only confirms that data loss will not occur. Background replication-related tasks may still be running.

Wait for these tasks to finish before continuing the rolling upgrade. Individual brokers may still experience I/O and CPU load from the replication process. To confirm that the cluster has returned to its baseline load for the traffic it is handling, check the metrics described at https://hivemq.atlassian.net/wiki/spaces/KB/pages/2500591652

Observing these metrics in addition to the mentioned log line can be helpful when performing rolling upgrades in clusters that are operating close to the limits of their hardware’s capabilities.

Context

During a rolling upgrade, you will see the following messages in the logs:

INFO - Starting cluster replication process. This may take a while. Please do not shut down HiveMQ.
[...]
INFO - Replication is still in progress. Please do not shut down HiveMQ.

While replication is ongoing, every member of the HiveMQ cluster logs these messages.

Issue:

While these messages are being logged, the cluster is at risk of data loss if more than (replica count - 1) brokers are removed from the cluster.
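
As a rough illustration of the threshold above, the arithmetic can be sketched as follows. The function names and the example replica count of 2 are illustrative assumptions and not part of HiveMQ; substitute the replica count configured for your cluster.

```python
def max_brokers_safely_removed(replica_count: int) -> int:
    """Number of brokers that can leave while replication is in progress
    without exceeding the 'replica count - 1' threshold."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count - 1


def removal_risks_data_loss(replica_count: int, brokers_removed: int) -> bool:
    """True if removing this many brokers during replication exceeds the threshold."""
    return brokers_removed > max_brokers_safely_removed(replica_count)


# Example with an assumed replica count of 2:
print(removal_risks_data_loss(replica_count=2, brokers_removed=1))
print(removal_risks_data_loss(replica_count=2, brokers_removed=2))
```

With the assumed replica count of 2, removing one broker stays within the threshold, while removing two does not.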

Solution:

Once all necessary data has been exchanged, the brokers log the following:

INFO - Finished cluster replication successfully in 30000 ms.

This indicates that all necessary persistent data has been replicated to the target hosts and has reached its replication factor.
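
If you want to automate the check for this message, the following is a minimal sketch that derives the replication state from a broker's log lines. It relies only on the message texts quoted in this article; the function name and the state labels are hypothetical, and the location of the log file depends on your installation.

```python
import re
from typing import Iterable, Optional, Tuple

# Message fragments taken from the log lines quoted in this article.
STARTED = "Starting cluster replication process"
IN_PROGRESS = "Replication is still in progress"
FINISHED = re.compile(r"Finished cluster replication successfully in (\d+) ms")


def replication_status(lines: Iterable[str]) -> Tuple[str, Optional[int]]:
    """Return (state, duration_ms) based on the most recent replication message.

    state is "finished", "in_progress", or "unknown" (hypothetical labels);
    duration_ms is only set when the state is "finished".
    """
    state, duration_ms = "unknown", None
    for line in lines:
        match = FINISHED.search(line)
        if match:
            state, duration_ms = "finished", int(match.group(1))
        elif STARTED in line or IN_PROGRESS in line:
            # A newer replication run supersedes an earlier "finished" message.
            state, duration_ms = "in_progress", None
    return state, duration_ms


sample = [
    "INFO - Starting cluster replication process. This may take a while. Please do not shut down HiveMQ.",
    "INFO - Replication is still in progress. Please do not shut down HiveMQ.",
    "INFO - Finished cluster replication successfully in 30000 ms.",
]
print(replication_status(sample))
```

Run this against the log of every broker in the cluster, for example by passing an open file object as `lines`. A "finished" state on all brokers only covers the data-safety condition; the load check described under Important still applies.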

Important

Individual brokers may still experience I/O and CPU load from the replication process. To confirm that the cluster has returned to its baseline load for the traffic it is handling, check the metrics described at https://hivemq.atlassian.net/wiki/spaces/KB/pages/2500591652