How can developers safely rely on third-party web services without sacrificing their own SLA? Once you start using an API, you must monitor it. Unfortunately, there is no standard way of doing so. Perhaps it's time to ask SaaS providers to be transparent about their health status.
Web applications use more and more third-party web services. In production, they are used for analytics, log storage, A/B testing, comments, search, emailing, media, queuing, etc. In development, it's for version control, automated build, backup, authentication, you name it. Using third-party web services accelerates development, but it also weakens applications. Indeed, an outage in one of the services may cause an outage of the whole application.
Developers must implement failure scenarios for when third-party services fail. But they must also monitor these services to be notified whenever one breaks. And this is where the problem begins: each service provider offers a different way to determine its API status - if any.
Most services don't provide any specific health check functionality at all. You just have to trust their capacity to be up 100% of the time. Those providing status information do so for humans, not for machines. Their status page usually shows a history of recent outages, and gives an overall status for the platform. Each service reports status and uptime in a different way. For instance, compare the following status pages:
This kind of page is where you go when you already believe that a service has a problem. These pages don't help to detect outages, but only to confirm them. If you need to make sure such a service is up and running, and receive notifications when it's not, you must identify which resources are critical for your usage, and ping these resources regularly. The status pages don't make this task easier - unless you want to do web scraping to get uptime data from HTML.
Some services understand this requirement, and also provide a status resource, designed for machines. For instance, here is the type of response returned by the recently released GitHub Status API:
GET /api/status.json
Host: status.github.com

HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Content-Length: 55

{"status": "good","last_updated":"2012-12-11T08:14:17Z"}
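As a minimal sketch, a monitoring script could decode a body of this shape as follows. The function name `parse_github_status` is hypothetical, and treating only "good" as healthy is an assumption of mine, not documented behavior; the only grounded input is the sample payload shown above.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def parse_github_status(body: str) -> Tuple[bool, datetime]:
    """Decode a status payload shaped like the sample response above.

    Returns (is_healthy, last_updated). Treating only "good" as healthy
    is an assumption of this sketch.
    """
    payload = json.loads(body)
    is_healthy = payload["status"] == "good"
    # The timestamp ends with "Z" (UTC), so attach the UTC timezone explicitly.
    last_updated = datetime.strptime(
        payload["last_updated"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return is_healthy, last_updated
```

Even for a payload this small, the client has to know the field names and the vocabulary of status values, and those differ from one provider to the next.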
Unfortunately, there are too few of these service status APIs. Worse: the few services offering such an API all have a different interface.
It would be fabulous if all third-party web services provided a standardized API to let monitoring systems check their health status. The basic requirements for this API are: simple to request, with a response easy to decode programmatically, and a low HTTP footprint.
Monitoring systems need to know if a resource is fully operational, and when the last health check was made. This can easily be expressed by answering a GET request, something like:
GET /up HTTP/1.1
Host: www.example.com
If the server returns HTTP status 200, it means that the system is up and running. The Last-Modified header is the perfect place to mention the last time the system actually checked its own health.
HTTP/1.1 200 OK
Last-Modified: Sun, 09 Dec 2012 14:11:45 GMT
Any other response type (500, timeout...) means that the system suffers an outage to some extent. A 404 response means that the provider didn't implement any status resource for machines.
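The three outcomes described above can be sketched as a small decision function. The names `Health` and `classify` are hypothetical, and lumping redirects and other unexpected codes in with outages is my reading of "any other response type".

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    UP = "up"
    OUTAGE = "outage"
    NOT_IMPLEMENTED = "not implemented"


def classify(status_code: Optional[int]) -> Health:
    """Map the outcome of a request to /up onto the three cases above.

    Pass None when the request timed out or the connection failed.
    """
    if status_code == 200:
        return Health.UP
    if status_code == 404:
        return Health.NOT_IMPLEMENTED
    # 500, timeouts, refused connections, and (by assumption) anything else.
    return Health.OUTAGE
```

A monitoring tool would alert on `Health.OUTAGE` only, and fall back to pinging critical resources when it gets `Health.NOT_IMPLEMENTED`.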
What's in the response body is entirely up to the service provider: an HTML report, a JSON object, an XML document... I don't think the body requires standardization, since all the necessary information for machines is already in the HTTP response headers. In fact, to make things even simpler, since the response doesn't need a body, a HEAD request is enough for machines. Let's keep GET for browsers and humans.
HEAD /up HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Last-Modified: Sun, 09 Dec 2012 14:11:45 GMT
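On the monitoring side, this exchange could be issued with nothing but Python's standard library. The sketch below is one possible client, not a reference implementation: the function name `head_up`, the 5-second timeout, and the choice to return `None` on network errors are all assumptions. It speaks plain HTTP; an HTTPS service would need `http.client.HTTPSConnection` instead.

```python
import http.client
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple


def head_up(host: str, port: int = 80, timeout: float = 5.0
            ) -> Tuple[Optional[int], Optional[datetime]]:
    """Send HEAD /up; return (status code, time of the last self-check).

    Both values are None when the request times out or cannot connect.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/up")
        response = conn.getresponse()
        status = response.status
        last_modified = response.getheader("Last-Modified")
    except (OSError, http.client.HTTPException):
        # Timeouts, DNS failures, refused connections, malformed replies.
        return None, None
    finally:
        conn.close()

    checked_at = None
    if last_modified:
        try:
            checked_at = parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            pass  # unparseable date: report the status alone
    return status, checked_at
```

Comparing the returned date with the current time also lets a monitor spot a service that still answers 200 but stopped checking itself hours ago.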
How should a given service check its own health? Implementation is the service's responsibility. A genuine health check should involve a remote monitoring tool like Pingdom or my own Uptime. The information can also come from a user-centric analytics service like Google Analytics. Alternatively, a server-side script checking the availability of internal infrastructure components (database, storage, web server, CPU...) could do the trick. However, this resource should always be dynamic, and provide honest feedback on the service status from a customer's point of view.
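To make the server-side script option concrete, here is a rough sketch of a provider-side /up resource built on Python's standard library. Everything in it is illustrative: `CHECKS` holds hypothetical probes (a real service would test its database, storage, and so on), `STATE`, `run_self_check`, `up_response`, and `UpHandler` are names I made up, and answering 503 when a probe fails is my own choice, since the proposal only requires "anything other than 200".

```python
import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler
from typing import Callable, List, Tuple

# Hypothetical internal probes; each returns True when its component is healthy.
CHECKS: List[Callable[[], bool]] = [lambda: True]

# Result of the most recent self-check, refreshed by run_self_check().
STATE = {"healthy": False, "checked_at": 0.0}


def run_self_check() -> None:
    """Run every probe and record the outcome and the time of the check."""
    STATE["healthy"] = all(check() for check in CHECKS)
    STATE["checked_at"] = time.time()


def up_response(healthy: bool, checked_at: float) -> Tuple[int, str]:
    """Return the status code and the Last-Modified value for /up."""
    status = 200 if healthy else 503  # 503 is this sketch's choice for "down"
    return status, formatdate(checked_at, usegmt=True)


class UpHandler(BaseHTTPRequestHandler):
    def do_HEAD(self) -> None:
        if self.path != "/up":
            self.send_error(404)
            return
        status, last_modified = up_response(STATE["healthy"], STATE["checked_at"])
        self.send_response(status)
        self.send_header("Last-Modified", last_modified)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # The body is optional, so GET can reuse the header-only response.
    do_GET = do_HEAD
```

To try it locally, call `run_self_check()` once and then serve the handler with `http.server.HTTPServer(("", 8000), UpHandler).serve_forever()`; port 8000 is arbitrary. A real service would rerun the self-check on a schedule so that Last-Modified keeps moving, and would ideally feed it with data from an external monitor rather than internal probes alone.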
We don't need anything more. If all the web services we use in our web applications could provide this HEAD /up resource, monitoring them all would be easy, and adding a new dependency on yet another service wouldn't require reinventing the wheel.
I would love the HEAD /up resource, or something similar, to become broadly available, just like robots.txt or favicon.ico. If you are a SaaS provider, in my opinion, you should provide such a tool to your customers.
I've written a Head-Up Request For Comments on GitHub. If you feel concerned about this problem, whether from a customer point of view or from a SaaS provider point of view, please come and discuss in the related Head-Up Google Group. There are a lot of things you may want to discuss: serving the /up resource from a cookie-less subdomain, adding more health status information in response to a GET /up request, providing this data in JSON or XML... The discussion could last for a long time. Or, as we all have jobs, we could work towards a quick and pragmatic consensus on a very small set of features, and let the implementations start quickly.
If an agreement were ever reached, that would be for the greater good. I can imagine new services aggregating and comparing health status information from many APIs in a smart way (like api-status.com but with a larger selection of services). The net effect would probably be to increase the general quality of service of SaaS platforms. All right, I'm dreaming. But if you share the same dream, I'm looking forward to your support!
Published on 11 Dec 2012