Restart cluster statefully when join replication process is taking too long

 Problem

This problem can occur with HiveMQ broker versions before 4.18 when Client Event History is enabled and the cluster has been running for a long time with few client connections.

It shows up during the cluster replication process as a join that takes longer than expected and, in the worst case, never finishes.

 Solution

  1. Check the metrics and make sure that the Client Event Count is growing while the node is in the joining state
    com_hivemq_persistence_executor_client_events_tasks

  2. Check the config.xml and make sure that Client Event History is enabled
    /opt/hivemq/conf/config.xml

    <client-event-history>
        <enabled>true</enabled>
        <lifetime>604800</lifetime> <!-- 7 days -->
    </client-event-history>

  3. If both checks are positive, then use the following workaround:

  4. On every node of the HiveMQ cluster, edit the HiveMQ config.xml to disable Client Event History, then save the file.

    <client-event-history>
        <enabled>false</enabled>
        <lifetime>604800</lifetime> <!-- 7 days -->
    </client-event-history>

  5. One by one, stop all nodes of the cluster except the last one, stopping only the nodes that cannot finish joining the cluster. After stopping a node, watch hivemq.log for a successful shutdown, and only then stop the next node.

  6. On the last node of the cluster, modify the run.sh file so that it ensures a stateful restart.

    JAVA_OPTS="$JAVA_OPTS -DstatefulCluster=true"

  7. Restart the service on the last node of the cluster (stateful restart).

  8. Start the service on the remaining nodes of the cluster one by one, monitoring the logs for a successful start and the end of the join replication process.

  9. Upgrade the cluster to HiveMQ 4.18 or later to ensure that the Client Event History issue is fixed.

  10. Re-enable Client Event History on all nodes of the cluster and restart the cluster in a stateful manner.
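
The config check in step 2 and the config change in step 4 can be scripted when the cluster has many nodes. The following is a minimal sketch, not an official HiveMQ tool: the function names ceh_enabled and ceh_disable are hypothetical, and it assumes the <client-event-history> element is written as in the snippets above. The second function prints the modified config to standard output instead of editing the file, so the result can be reviewed before it replaces config.xml.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HiveMQ) for steps 2 and 4.
# Assumption: <client-event-history> contains a literal
# <enabled>true</enabled> element, as in the config snippets above.

# Exit status 0 if Client Event History is enabled in the given file.
ceh_enabled() {
  awk '
    /<client-event-history>/             { inside = 1 }
    inside && /<enabled>true<\/enabled>/ { found = 1 }
    /<\/client-event-history>/           { inside = 0 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Print the config with Client Event History disabled.
# The input file itself is left unchanged.
ceh_disable() {
  awk '
    /<client-event-history>/   { inside = 1 }
    inside { sub(/<enabled>true<\/enabled>/, "<enabled>false</enabled>") }
    /<\/client-event-history>/ { inside = 0 }
    { print }
  ' "$1"
}

# Example usage (path taken from step 2; the output path is arbitrary):
#   ceh_enabled /opt/hivemq/conf/config.xml && echo "Client Event History is enabled"
#   ceh_disable /opt/hivemq/conf/config.xml > /tmp/config.xml.new
```

Only the <enabled> element inside <client-event-history> is changed; other <enabled> elements in config.xml are left as they are. Compare the generated file with the original before copying it into place.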