Health-checking is the process of monitoring the health state of each component of a distributed system in order to decide how traffic should be distributed across service instances.
While conceptually simple, health-checking systems can often become as complex as the services they monitor. The operational readiness of a single service instance might be defined by multiple metrics, all of which must be efficiently processed and reduced into a single binary outcome. To complicate things further, the definition of operational readiness may vary according to the state of the overall service cluster: if too many instances are degraded, it is often beneficial to relax service levels rather than withdraw most of the cluster from rotation. Building a scalable health-checking system that addresses all of these concerns while remaining reactive to failures can be extremely challenging.
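To make the two ideas above concrete, here is a minimal sketch in Python of reducing several metrics to one binary outcome and relaxing the thresholds when too much of the cluster is degraded. It is an illustration only, not the system described in this talk: the metric names, limits, `max_unhealthy_fraction`, and `relaxed_slack` are all hypothetical values chosen for the example.

```python
from typing import Dict

Metrics = Dict[str, float]

# Hypothetical per-metric upper limits; names and values are illustrative only.
LIMITS: Metrics = {"latency_ms": 200.0, "error_rate": 0.05}


def is_healthy(metrics: Metrics, limits: Metrics, slack: float = 1.0) -> bool:
    """Reduce several metrics to one boolean.

    An instance is healthy only if every metric is within slack * limit.
    A missing metric is treated as a failure.
    """
    return all(
        metrics.get(name, float("inf")) <= limit * slack
        for name, limit in limits.items()
    )


def evaluate_cluster(
    cluster: Dict[str, Metrics],
    limits: Metrics,
    max_unhealthy_fraction: float = 0.5,  # assumed tolerance before relaxing
    relaxed_slack: float = 1.5,           # assumed multiplier for relaxed limits
) -> Dict[str, bool]:
    """Evaluate every instance, relaxing limits if too many fail strictly."""
    strict = {name: is_healthy(m, limits) for name, m in cluster.items()}
    unhealthy = sum(1 for ok in strict.values() if not ok)

    # If most of the cluster looks degraded, re-evaluate with looser limits
    # instead of removing most instances from rotation at once.
    if cluster and unhealthy / len(cluster) > max_unhealthy_fraction:
        return {
            name: is_healthy(m, limits, relaxed_slack)
            for name, m in cluster.items()
        }
    return strict
```

With these assumed values, an instance reporting 250 ms latency fails the strict check but passes once more than half of the cluster is degraded and the limit is relaxed to 300 ms. A production system would also need to smooth noisy measurements and avoid flapping between states, which this sketch does not attempt.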
This talk explains how we leveraged Serf to build a production distributed health-checking system that we use at Fastly, a globally distributed edge cloud. Our design borrows techniques from machine learning, signal processing and control theory to drive stable traffic allocation while quickly and accurately identifying failures.