Google: Short And Sweet, Here's Why Gmail Was Down
Ben Treynor, Google vice president of engineering and the company's site reliability czar, called the Gmail outage a "big deal." The outage lasted roughly 100 minutes Tuesday and affected millions of consumer and business users of the service. Treynor wrote in a blog post:
"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service."
In Google's words, here's what caused Gmail to get temporarily TKOed:
This morning [Tuesday] (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem -- we do this all the time, and Gmail's Web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct Web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!" This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the Web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
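The chain reaction Google describes can be illustrated with a toy model. The Python sketch below is not Google's routing logic; the router capacities, the load figures, and the rule that traffic is split evenly among healthy routers are all assumptions made for illustration. It shows how a small rise in load can knock out the weakest router, push its share onto the others, and take down the whole pool even though total capacity exceeds total demand.

```python
# Toy model of a cascading overload. All numbers are hypothetical.

def simulate_cascade(capacities, total_load):
    """Return the count of healthy routers after each round.

    Assumption: load is split evenly across healthy routers, and any
    router whose share exceeds its capacity drops out of the pool.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_router = total_load / len(healthy)
        survivors = [c for c in healthy if c >= per_router]
        if len(survivors) == len(healthy):
            break  # stable: no router is overloaded
        healthy = survivors
        history.append(len(healthy))
    return history


# Ten hypothetical routers with slightly different capacities (sum = 1365).
CAPACITIES = [100, 105, 110, 120, 130, 140, 150, 160, 170, 180]

print(simulate_cascade(CAPACITIES, 1000))               # load the pool can absorb
print(simulate_cascade(CAPACITIES, 1040))               # 4% more load
print(simulate_cascade(CAPACITIES + [180, 180], 1040))  # same load, two extra routers
```

In this model the pool is stable at a load of 1000, collapses completely at 1040, and is stable again at 1040 once two routers are added, which loosely mirrors the fix of bringing more request routers online.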
Treynor wrote that Google's engineering team was on top of the failures in seconds and brought additional request routers online to distribute traffic among them, helping the Gmail Web interface to come back online.
Google said it will investigate further to ensure an outage like Tuesday's doesn't happen again. It is also increasing request router capacity beyond peak demand to provide some headroom and improving failure isolation among request routers.
"We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements," he wrote.