HTTP HealthChecks for a Resilient Platform

Published by Chris O'Dell (@ChrisAnnODell)
June 22, 2017, 10:38 am

HTTP Healthchecks are a simple concept. A server exposes an endpoint over HTTP which is periodically called by another coordinating service. In many cases this other service is a loadbalancer using the check to decide whether the instance should continue to receive traffic. The aim is to present external users with a resilient system that tolerates faults.

Healthchecks can also be displayed on a dashboard, such as simple-dashboard, providing a basic level of real-time monitoring.

HTTP Healthchecks are not a new concept

The widely used loadbalancing software haproxy introduced HTTP checks for its server monitoring functionality in version 1.1.16, released in 2005. The functionality allowed haproxy to remove unresponsive machines from the pool serving incoming requests. Initially this monitoring was based on TCP for fast checks, but the developers found that intermediary services such as firewalls could acknowledge a request before it reached the server. Higher-level HTTP support was added, and to keep it fast, only the HTTP status code would be checked: responses of 2xx and 3xx would be considered valid.

Initially, haproxy supported only OPTIONS requests to the root URI, but support for specifying the HTTP method and URI soon followed. With the ability to specify endpoints, applications on the servers could include a self-reporting healthcheck endpoint that takes application-level information into account when deciding on a response code. As the body is ignored, text would sometimes be included to give a human-readable description.

Enter The Cloud

Cloud loadbalancers such as AWS’s ELBs use HTTP Healthchecks in the same manner – to remove instances from the pool if they are deemed unresponsive. They also use the checks to add instances to the pool, which enables some of the “self-healing” functionality of the cloud: if an instance becomes unresponsive, the ELB can terminate it and the Auto Scaling Group can initialise a replacement. If you take advantage of auto scaling, your HTTP Healthchecks therefore need to be fast and to reflect the readiness of your application to receive requests.

Implementing HTTP Healthchecks

By default, most web servers will return a 200 at the root URI unless it sits behind some level of authorisation. As this response is produced by the web server itself, it only confirms that the server is running, not that the application is. This can lead to situations where the application has experienced a fatal error, but the web server is unaffected and still returns a 200 to every health check. Users of the application experience the errors whilst monitoring shows no issues. A more representative option is a healthcheck endpoint coded into the application.

Expose a healthcheck endpoint from your application

From within your application, expose a dedicated endpoint for the sole purpose of healthcheck reporting, for example /health/check. As this endpoint is processed by the application, it is a better indicator of the application’s overall health and its ability to process requests. A body can also be attached to the response giving more information about the service, for example the current version number and any error details. As this endpoint is likely to be unsecured, be sure not to include any sensitive information such as login details or stack traces.

> GET /health/check HTTP/1.1
> Host: localhost
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: max-age=0, must-revalidate, no-cache, no-store
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: application/json; charset=utf-8
{
 "name": "paymentsapi",
 "version": "1.12.123"
}
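As a concrete illustration, here is a minimal sketch of such an endpoint using Python’s standard-library http.server. The service name and version string are taken from the example response above; a real service would use its own framework and values.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health/check":
            self.send_error(404)
            return
        # The body is informational only; checkers should rely on the status code.
        body = json.dumps({"name": "paymentsapi", "version": "1.12.123"}).encode()
        self.send_response(200)
        # Prevent intermediaries from caching a stale health status.
        self.send_header("Cache-Control", "max-age=0, must-revalidate, no-cache, no-store")
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8080), HealthHandler).serve_forever()
```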

Common Mistakes

Slow responding healthchecks

Healthchecks need to be designed to be lightweight and fast. The aim is to get the status of the service without impacting customer-facing traffic, so there should be no need for any heavy processing. This is especially important under high load. If the healthcheck responds slower than the checker’s timeout, you run the risk of having your server marked as down; if the entire pool responds slowly to healthchecks, your entire fleet could be removed and you’ll have an outage.
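One way to keep the endpoint fast is to move any self-checks off the request path entirely, so the endpoint only reads a precomputed value. The sketch below is illustrative: the HealthCache class, its check_fn callback, and the 5-second interval are assumptions, not a named library.

```python
import threading
import time

class HealthCache:
    """Run expensive self-checks on a background timer so the healthcheck
    endpoint only reads a cached value and responds immediately."""

    def __init__(self, check_fn, interval=5.0):
        self._check_fn = check_fn          # application's own self-check
        self._interval = interval
        self._lock = threading.Lock()
        self._healthy = True               # optimistic until the first check runs

    def refresh_once(self):
        # Run one self-check and cache the boolean result.
        result = bool(self._check_fn())
        with self._lock:
            self._healthy = result

    def start(self):
        # Refresh periodically in a daemon thread, off the request path.
        def loop():
            while True:
                self.refresh_once()
                time.sleep(self._interval)
        threading.Thread(target=loop, daemon=True).start()

    def status_code(self):
        # What the /health endpoint returns: no work done here.
        with self._lock:
            return 200 if self._healthy else 503
```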

Relying on the body of the response

Much like with slow-responding healthchecks, if the service checking for health has to parse the response body, that parsing can become a bottleneck under high load. A body format is also more susceptible to change and fragility than the standard HTTP response codes. By not relying on the standard HTTP codes, we lose the advantage of loadbalancing services managing the lifecycle of instances for us.

Using the status of dependencies to determine application health

Probably the most common mistake is to include the status of dependencies when determining the application’s health. In scenarios where healthchecks are used to determine whether a server is in or out of service (and whether it should be terminated), incorporating the status of downstream resources can lead to cascading failures.

If the healthcheck reports a failure due to a transitory issue connecting to a backend resource, for example a database, do we really want to pull the instance out of service? Doing so creates flapping behaviour, with delayed response times as each instance is gradually brought back into service.

To avoid this, use the status code of the healthcheck endpoint to report the application’s own health and nothing more. Details of downstream issues can be included in the body and should be logged to an aggregated logging system. This combination ensures the instance stays in service when there’s no application fault whilst also making downstream issues visible.

> GET /health/check HTTP/1.1
> Host: localhost
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: max-age=0, must-revalidate, no-cache, no-store
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: application/json; charset=utf-8
{
 "name": "paymentsapi",
 "version": "1.12.123",
 "errors": [
  {
   "message": "Error connecting to database"
  }
 ]
}
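This pattern can be sketched as a small response builder: the status code stays 200 whenever the application itself is healthy, while downstream errors are only reported in the body. The function name health_response and its parameters are illustrative, not from any particular framework.

```python
import json

def health_response(app_name, version, app_healthy, downstream_errors=()):
    """Build a (status_code, body) pair for a healthcheck endpoint.

    The status code reflects only the application's own health; downstream
    problems are surfaced in the body for humans and log aggregation, but
    never fail the check on their own."""
    body = {"name": app_name, "version": version}
    if downstream_errors:
        # Reported for visibility, not used by the loadbalancer.
        body["errors"] = [{"message": msg} for msg in downstream_errors]
    status = 200 if app_healthy else 503
    return status, json.dumps(body)
```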

Chaining HTTP Healthchecks

In this mistake, an application’s healthcheck calls the healthcheck endpoints of its dependency APIs. The reasoning is that your application needs data from another service and cannot operate without it. Chaining healthchecks initially seems like a good idea, but it has similar drawbacks to the mistake above.

Another devastating effect of this mistake occurs in architectures where a series of dependencies are accessed in a chain. If an intermittent failure occurs in one of the lower services, each subsequent service amplifies the blip and the servers come down like dominoes.

Alternative approaches include considering if the data from the other service or action is truly essential.  Would it be perfectly acceptable to work with cached data for a certain period of time?  Could we apply graceful degradation and “turn off” affected functionality?  Could we re-architect such that messaging is used with retries to create a loose connection between services?  The next consideration is preventing thundering herds once the service is back online by employing circuit breakers and exponential backoffs.
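As a rough illustration of one of these techniques, here is a minimal circuit-breaker sketch. The class, thresholds, and error type are hypothetical, not a production implementation: after a number of consecutive failures the circuit opens and calls fail fast until a reset timeout elapses, giving the recovering service breathing room.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures
    the circuit opens and calls fail fast until `reset_timeout` seconds
    have passed, at which point one trial call is allowed through."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hammering a struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```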

Now you’re thinking with distributed systems

It’s at this point that things get really interesting.  When HTTP Healthchecks on their own are not sufficient to ensure resilience it’s time to look into distributed systems theory and the techniques for working with them.

“A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.” – Wikipedia

You should become familiar with the CAP Theorem, the Byzantine Generals Problem and more. An excellent introduction is “Distributed Systems Theory for the Distributed Systems Engineer”.

I mentioned a couple of techniques earlier: the use of caching, circuit breakers with exponential backoffs, and graceful degradation.  There are many more techniques and it’s a vibrant area of research.

All in all, remember that it is inherently impossible to make distributed systems fully reliable. The best we can do is develop tolerance and recovery strategies. I’ll leave you with this witty article from James Mickens, “The Saddest Moment”.
