Restart cluster statefully when join replication process is taking too long

🤔 Problem

This problem can occur with HiveMQ broker versions before 4.18 when Client Event History is enabled and the cluster has been running for a long time with few client connections.

It shows up during the cluster join replication process: a node takes much longer than expected to join, up to a seemingly never-ending join.

🌱 Solution

Check the metrics and confirm that the Client Event Count keeps growing while the node is in the joining state.
Check the config.xml and confirm that Client Event History is enabled.
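The config.xml check can be scripted. The following is a minimal sketch, not an official HiveMQ tool: the function name `client_event_history_state` and the example path are assumptions, and it expects each element on its own line, as in the fragment shown in this article.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of <enabled> found inside the
# <client-event-history> element, or nothing if the element is absent.
client_event_history_state() {
    # $1: path to config.xml
    sed -n '/<client-event-history>/,/<\/client-event-history>/p' "$1" \
        | sed -n 's:.*<enabled>\([a-z]*\)</enabled>.*:\1:p' \
        | head -n 1
}

# Example call; the path is an assumption, adjust it to your installation:
# client_event_history_state /opt/hivemq/conf/config.xml
```

If the function prints `true`, Client Event History is enabled on that node.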

On all nodes of the HiveMQ cluster, edit the HiveMQ config.xml, disable the Client Event History, and save the file.

<client-event-history>
    <enabled>false</enabled>
    <lifetime>604800</lifetime> <!-- 7 days -->
</client-event-history>

Stop all nodes except one. Stop only the nodes that cannot finish joining the cluster, one at a time, and watch hivemq.log for a successful shutdown before stopping the next node.
On the last remaining node of the cluster, modify the run.sh file so that the node performs a stateful restart.
```
JAVA_OPTS="$JAVA_OPTS -DstatefulCluster=true"
```
Restart the service on the last node of the cluster (stateful restart).
Start the service on the remaining nodes of the cluster one by one, monitoring the logs for a successful start and the end of the join replication process.
Upgrade the cluster to version 4.18 or later to ensure that the Client Event History issue is fixed.
Re-enable the Client Event History on all nodes of the cluster and restart the cluster in a stateful manner.
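Watching the logs during the one-by-one stop and start steps can be partly automated. The sketch below is an illustration only: `wait_for_new_log_line` is a hypothetical helper, and the pattern to wait for is deliberately left as a parameter, because the exact log messages depend on the HiveMQ version. It assumes the log file already exists and is not rotated while waiting.

```shell
#!/bin/sh
# Hypothetical helper: waits until a line matching a pattern is appended
# to a log file. Lines already present when the function is called are
# ignored, so messages from earlier starts do not count.
wait_for_new_log_line() {
    log_file=$1      # path to the log, e.g. hivemq.log
    pattern=$2       # grep pattern to wait for (site-specific)
    remaining=$3     # timeout in seconds
    # Number of lines already in the log at call time.
    seen=$(grep -c '' "$log_file")
    while [ "$remaining" -gt 0 ]; do
        # Search only the lines appended after the call started.
        if tail -n "+$((seen + 1))" "$log_file" | grep -q -- "$pattern"; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

Call it right before stopping or starting a node, with the path to hivemq.log, a pattern taken from your own logs, and a timeout in seconds. A non-zero exit status means the pattern did not appear in time and the node should be checked manually before continuing with the next one.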
