Understanding Health API Status Codes and Responses

This article clarifies why the main /health endpoint can return an HTTP 503 Service Unavailable status code while other health endpoints like /liveness and /readiness return an HTTP 200 OK.


Symptom

You may observe that the main /health endpoint of the Health API returns a status of DOWN with an HTTP status code of 503. However, at the same time, the /liveness and /readiness endpoints both report a status of UP and return an HTTP status code of 200.


Cause and Explanation

This behavior is by design and is crucial for maintaining service stability, especially in containerized environments like Kubernetes (K8s). The different health endpoints serve distinct purposes.


Health Status to HTTP Code Mapping

The Health API maps the overall health status to specific HTTP status codes, following a pattern common in frameworks like Spring Boot:

  • UP: 200 OK

  • UNKNOWN: 200 OK

  • DEGRADED: 200 OK

  • DOWN: 503 Service Unavailable

If the JSON payload of the /health endpoint shows "status": "DOWN", the endpoint returns an HTTP 503 status code. This is the expected result of the mapping above, not a malfunction of the Health API itself.
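
The mapping above can be written as a small lookup table. This is a minimal sketch for illustration only, not the broker's implementation; the names STATUS_TO_HTTP and http_code_for are hypothetical.

```python
from http import HTTPStatus

# Status-to-HTTP mapping as listed in this article.
# Only DOWN produces a non-200 response.
STATUS_TO_HTTP = {
    "UP": HTTPStatus.OK,
    "UNKNOWN": HTTPStatus.OK,
    "DEGRADED": HTTPStatus.OK,
    "DOWN": HTTPStatus.SERVICE_UNAVAILABLE,
}


def http_code_for(status: str) -> int:
    """Return the HTTP status code for a health status (hypothetical helper)."""
    return int(STATUS_TO_HTTP[status.upper()])
```

Note that a DEGRADED status still maps to 200, so a check that looks only at the HTTP code cannot tell DEGRADED apart from UP.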


The Different Scopes of Health Endpoints

The key to understanding this behavior is that /health, /liveness, and /readiness are not the same; they aggregate different sets of components.

  • /liveness: This endpoint is intentionally hard-coded to return UP (200 OK). In Kubernetes, a failing liveness probe causes the pod to be restarted. Always returning UP prevents K8s from restarting a broker just because a non-critical component, such as an extension, has failed.

  • /readiness: This endpoint signals whether the broker is ready to accept traffic. It aggregates critical components like the cluster and MQTT listeners. It will return 200 OK even in a DEGRADED state because the broker is still operational. For example, if a hot-reload of a certificate fails, the old certificate is still in memory, and the broker can function. Marking it as not ready would unnecessarily stop traffic to the entire cluster.

  • /health: This is the main system health endpoint. It provides a comprehensive, aggregated view of all health components in a tree-like structure. If any single component reports a DOWN status, the overall /health status will also be DOWN, resulting in a 503 response.
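
The difference in scope explains the symptom described above. The following sketch models only the two rules stated in this article: /liveness always reports UP, and /health is DOWN as soon as any one component is DOWN. The function names and the component names in the usage note are hypothetical, and the real aggregation logic in the broker may be more involved.

```python
def liveness_status(component_statuses: dict[str, str]) -> str:
    """Liveness ignores component state and always reports UP."""
    return "UP"


def health_is_down(component_statuses: dict[str, str]) -> bool:
    """The main health status is DOWN if any single component is DOWN."""
    return any(status == "DOWN" for status in component_statuses.values())
```

With a hypothetical component map such as {"cluster": "UP", "my-extension": "DOWN"}, liveness_status still returns "UP" (HTTP 200) while health_is_down returns True (HTTP 503), which is exactly the combination described in the Symptom section.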


Solution and Diagnosis

A DEGRADED or DOWN status indicates a problem that needs to be resolved, even if the broker remains operational.


1. Diagnose the Root Cause

To find out which specific component is failing, you must inspect the JSON response from the /health endpoint. Traverse the component tree within the JSON to identify the leaf component that is reporting the unhealthy status and review its details.
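
The traversal can be scripted. The sketch below assumes a Spring Boot-style payload in which every node has a "status" field, optional "details", and optional child nodes under a "components" key. That layout, the SAMPLE payload, and the component names are assumptions for illustration; check the actual /health response of your broker and adjust the key names if they differ.

```python
import json


def unhealthy_leaves(node, path="health"):
    """Yield (path, status, details) for each leaf component that is DOWN or DEGRADED."""
    children = node.get("components") or {}
    if not children:
        # Leaf node: report it only if it is unhealthy.
        status = node.get("status")
        if status in ("DOWN", "DEGRADED"):
            yield path, status, node.get("details", {})
        return
    for name, child in children.items():
        yield from unhealthy_leaves(child, f"{path}/{name}")


# Hypothetical payload; real component names and details will differ.
SAMPLE = json.loads("""
{
  "status": "DOWN",
  "components": {
    "cluster": {"status": "UP"},
    "extensions": {
      "status": "DOWN",
      "components": {
        "my-extension": {
          "status": "DOWN",
          "details": {"reason": "configuration missing"}
        }
      }
    }
  }
}
""")

if __name__ == "__main__":
    for path, status, details in unhealthy_leaves(SAMPLE):
        print(path, status, details)
```

Reporting only leaf components keeps the output focused on the root cause, since parent nodes merely inherit the unhealthy status from their children.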

The HiveMQ Platform Operator 2.0 simplifies this process by automatically aggregating the health status and details from all nodes, allowing you to quickly see which components are unhealthy via the custom resource status and K8s events.


2. Set Up Proper Monitoring

Because a DEGRADED or DOWN status does not always mean the service is unavailable, base your monitoring and alerting on the provided metrics rather than solely on the HTTP status code of the main /health endpoint. This lets you alert on specific issues without triggering automated actions that could take down an entire, still-functioning cluster.


3. Example: Fixing a Failed Extension

A common cause for a DOWN status is a misconfigured or failed extension. For example, if an extension fails to start because its configuration is missing or invalid, it will report a DOWN status, which propagates up to the main /health endpoint.

To fix this at runtime without a broker restart, you can:

  1. Provide a valid, minimal configuration for the failing extension to allow it to start successfully.

  2. Once the extension is running, you can disable it again if it's not needed.