Disaster recovery runbook for EC2/VM instances

This article explains how to recover your cluster and its persistent data when the HiveMQ cluster is down or when you observe the following warning in your hivemq.log:

Not all replicas are currently reachable. More nodes than the replication factor of X have left the cluster in a too short time frame.
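
To check a node for this warning from the command line, a minimal sketch could look like the following. The log path /opt/hivemq/log/hivemq.log is an assumption derived from the /opt/hivemq install location used later in this article, and the function name has_replica_warning is purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the replica warning appears in the given log file.
has_replica_warning() {
    logfile="$1"
    # -F matches the text literally, -q suppresses normal output
    grep -Fq "Not all replicas are currently reachable" "$logfile"
}

# Assumed default log location; adjust it to your installation.
LOG="${HIVEMQ_LOG:-/opt/hivemq/log/hivemq.log}"
if [ -f "$LOG" ] && has_replica_warning "$LOG"; then
    echo "Replica warning found in $LOG"
fi
```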

 Instructions

  1. The first step is to save the persistent data and restore the availability of the cluster so that clients can reconnect.

    1. Move the data folder of every crashed/stopped node to a new location, i.e. a different directory, another machine, or a storage system.

      1. For VMs or EC2 instances, for example, move the data folder to /tmp:

        mv /opt/hivemq/data /tmp
      2. Repeat the above step until the data folders of all instances have been moved to a new location.

    2. Restart the nodes and let them form a cluster to restore availability. Once this is done, clients can reconnect.
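
When several nodes save their data to a shared location, a plain mv risks one folder overwriting another. The sketch below shows one way to guard against that. The function name save_data_folder and the data-<label> naming scheme are assumptions for illustration and not part of HiveMQ; note that the saved folder is then named data-<label> rather than data, so later copy commands must use that name.

```shell
#!/bin/sh
# Hypothetical helper: move one node's data folder into a save location
# under a unique name, refusing to overwrite an existing folder.
save_data_folder() {
    src="$1"        # e.g. /opt/hivemq/data
    dest_root="$2"  # e.g. /tmp
    label="$3"      # unique per node, e.g. the host name
    target="$dest_root/data-$label"
    if [ ! -d "$src" ]; then
        echo "source folder $src not found" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "$target already exists, not overwriting" >&2
        return 1
    fi
    mv "$src" "$target"
}

# Usage on a stopped node (paths taken from the example above):
# save_data_folder /opt/hivemq/data /tmp "$(hostname)"
```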

  2. Once the cluster is up and running, the next step is to recover the persistent data.

    1. Start a new machine/pod with sufficient CPU and memory (minimum 4 CPUs, 4 GB RAM) and sufficient disk space (at least 2.5 times the combined size of all data folders). The available hardware resources, especially the disks (SSDs preferred), directly affect the duration of the recovery. Make sure the machine has Java installed and the same systemctl settings as the HiveMQ machines.
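
The 2.5x disk space rule can be checked with a little arithmetic. The following is a rough sketch, not an official sizing tool: the helper names are hypothetical, the folder names in the usage comment are placeholders, and comparing against the currently available space (rather than the total disk size) is a conservative assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the combined size of the given folders in KiB.
combined_size_kb() {
    total=0
    for dir in "$@"; do
        # du -sk prints "<size-in-KiB><TAB><path>"; keep only the size
        size=$(du -sk "$dir" | awk '{print $1}')
        total=$((total + size))
    done
    echo "$total"
}

# 2.5 times a size, in integer arithmetic (x * 5 / 2, rounded down).
required_kb() {
    echo $(( $1 * 5 / 2 ))
}

# Available space in KiB on the filesystem that holds the given path.
available_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Usage (folder names and /recovery are placeholders):
# need=$(required_kb "$(combined_size_kb data-node1 data-node2 data-node3)")
# have=$(available_kb /recovery)
# if [ "$have" -ge "$need" ]; then echo "enough disk space"; else echo "need $need KiB, have $have KiB"; fi
```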

    2. Copy all data folders from the previously saved location to this machine. For example, copy the data folders from the /tmp directory of each node. Make sure you give each destination folder a unique name to avoid overwriting the contents of another data folder.

      scp -i key.pem -r ec2-user@<private ip>:/tmp/data <destination folder name>

      After copying all data folders, each one appears under its own unique name on your machine.
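
With several nodes, the scp command above has to be repeated once per node with a different destination name. As a sketch of how that could be scripted, the helper below prints one command per node so it can be reviewed before running. The function name, the data-<ip> naming scheme, and the IP addresses in the usage comment are illustrative assumptions; the key file and /tmp/data path come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one scp command per node, giving each copied
# data folder a unique destination name (data-<ip>).
print_copy_commands() {
    key="$1"
    shift
    for ip in "$@"; do
        echo "scp -i $key -r ec2-user@$ip:/tmp/data data-$ip"
    done
}

# Dry run with placeholder addresses; pipe the output to sh to execute it.
# print_copy_commands key.pem 10.0.0.11 10.0.0.12 10.0.0.13
```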

    3. Once all data folders are copied, download the HiveMQ Recovery tool. You can download it in a browser or run the following wget command on your machine.

      wget https://www.hivemq.com/releases/tools/hivemq-recovery-tool-4.9.0.zip
    4. Unzip the downloaded tool.

    5. Next, run the HiveMQ Recovery tool with all data folders as parameters. You can also check our detailed documentation.
      The command to run

      For example:

      Note: This step can take from minutes to hours, depending on the amount of data that has to be restored.

    6. Once the recovery command has finished, you will find a new folder containing a backup file in your export folder location.
      For example:

  3. You now have a backup file ready to be restored on the running HiveMQ cluster. There are two ways to restore the backup: via the Control Center web UI or via the REST API.

    1. If you choose to restore via the Control Center, follow the steps below:

      1. Upload the backup file generated by the recovery tool via the browser on the Admin > Backup page of HiveMQ’s Control Center.

      2. The import progress is shown in the Control Center, and a message is displayed once the import has completed. You can also verify the import progress in your monitoring dashboard.