Google: Memory Bug Caused Google Docs Cloud Outage
The Wednesday Google Docs cloud outage, which made Google Document Lists, Google Documents, Google Drawings and Google Apps Scripts inaccessible for the majority of Google Apps users, was caused by a change that had been designed to improve real-time collaboration within the document list. That change exposed a memory management bug, Google Engineering Director Alan Warren wrote in a blog post detailing the Google Docs outage.
"Every time a Google Doc is modified, a machine looks up the servers that need to be updated," Warren explained. "Due to the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines -- making them run out of memory even faster. This meant that eventually the servers couldn't properly process a large fraction of the requests to access document lists, documents, drawings and scripts which led to the outage you saw on Wednesday."
Warren wrote that Google's automated monitoring noticed that attempts to access documents were failing at an increased rate, and that Google was alerted a minute later when the failure rate increased sharply. Once engineers realized the problem was connected to the feature change, they started rolling it back, 23 minutes after the first alert, Warren wrote. At the same time, Google doubled the capacity of the lookup service to soften the impact of the memory management bug. The rollback completed 24 minutes later, and 5 minutes after that the outage was over.
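The intervals in that timeline can be added up. The short calculation below is only a sketch: it assumes the three intervals run back to back starting from the first alert, and the variable names are illustrative rather than taken from Warren's post.

```python
from datetime import timedelta

# Offsets measured from the first alert, read as consecutive intervals (an assumption).
rollback_started = timedelta(minutes=23)
rollback_completed = rollback_started + timedelta(minutes=24)
outage_over = rollback_completed + timedelta(minutes=5)

total_minutes = int(outage_over.total_seconds() // 60)
print(total_minutes)  # 52
```

On that reading, the outage ended roughly 52 minutes after the first alert.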
Warren said that Google is scrutinizing the timeline of the Google Docs outage and is putting steps in place to avoid a future cloud outage and to decrease the time needed to discover and fix any problems that arise. Google is also working to limit the scope of impact that any single problem can have.
"We intend to take all these steps; some are not easy, but we're committed to keeping Google's services exceptionally reliable," Warren wrote. "In the meantime, rest assured that we take every outage very, very seriously, and as always we'll post a full incident report of what happened to the Apps Dashboard once our investigation is complete. Again, we apologize for the inconvenience and frustration which the outage has caused."
The Google Docs outage last week was the first of a pair of high-profile cloud outages. Later in the week -- late Thursday into Friday -- Microsoft Office 365 and some Windows Live online services, such as Hotmail and SkyDrive, were also knocked out of commission. Microsoft blamed that several-hour cloud outage on a DNS issue.