Google Cloud Network Disruption Blamed On Configuration Change
A configuration change was the culprit behind Google Cloud’s almost four-hour network service disruption on Sunday that slowed or prevented the use of Google services including Google Cloud Platform, YouTube, Gmail and Google Drive, according to the cloud provider.
The configuration change, which was slated for a small number of servers in a single region, was incorrectly applied to a larger number of servers in several neighboring regions, causing those regions to stop using more than half of their available network capacity, Google Cloud engineering vice president Benjamin Treynor Sloss said in a blog post.
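Google has not published the internals of its configuration tooling, but the failure mode Sloss describes, a change meant for one region matching far more than intended, can be sketched in a few lines. The region names and wildcard selector below are purely illustrative assumptions, not Google's actual syntax:

```python
# Hypothetical illustration of a mis-scoped rollout: a selector broader than
# intended matches several neighboring regions, not just the single target.
from fnmatch import fnmatch

REGIONS = ["us-east1", "us-east4", "us-central1", "europe-west1"]

def targets(selector: str) -> list[str]:
    """Return the regions a configuration change would be applied to."""
    return [region for region in REGIONS if fnmatch(region, selector)]

print(targets("us-east1"))  # intended scope: one region
print(targets("us-*"))      # overly broad scope: several neighboring regions
```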
“The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not,” Sloss said. “The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.”
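The triage Sloss describes is, in effect, priority-based load shedding: when demand exceeds capacity, latency-sensitive traffic is admitted first and bulk traffic is dropped. A minimal sketch of that idea, with invented flow names, sizes and a capacity figure chosen only for illustration (real network quality-of-service machinery is far more involved):

```python
# Minimal sketch of priority-based load shedding under congestion.
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    size: int                 # units of capacity the flow wants
    latency_sensitive: bool   # e.g. search queries vs. bulk video traffic

def triage(flows: list[Flow], capacity: int) -> list[Flow]:
    """Admit latency-sensitive flows first, dropping bulk traffic that no longer fits."""
    admitted, used = [], 0
    # Latency-sensitive traffic first, then smaller flows before larger ones.
    for flow in sorted(flows, key=lambda f: (not f.latency_sensitive, f.size)):
        if used + flow.size <= capacity:
            admitted.append(flow)
            used += flow.size
    return admitted

flows = [Flow("search query", 2, True), Flow("gmail sync", 5, True),
         Flow("youtube stream", 80, False), Flow("storage transfer", 60, False)]
print([f.name for f in triage(flows, capacity=50)])  # the bulk flows are shed
```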
The disruption lasted from 11:45 a.m. to 3:40 p.m. on Sunday.
Although Google engineering teams detected the issue within seconds, the diagnosis and correction took far longer than their target of a few minutes, according to Sloss. The same network congestion that created the service degradation slowed the teams’ ability to restore the correct configurations, he said.
“The Google teams were keenly aware that every minute which passed represented another minute of user impact and brought on additional help to parallelize restoration efforts,” he said.
The impact of the disruption, which reduced regional network capacity, varied widely, according to Sloss.
“For most Google users, there was little or no visible change to their service -- search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup and so on,” he said. “However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage, which use large amounts of network bandwidth to operate.”
YouTube had a 2.5 percent drop in views for one hour, while Google Cloud Storage saw a 30 percent reduction in traffic. About 1 percent of active Gmail users experienced problems with their accounts, but that represents millions of users who couldn’t receive or send email. And low-bandwidth services like Google Search recorded a short-lived increase in latency as they switched to serving from unaffected regions and then returned to normal.
Fast Company reported that the disruption also affected third-party apps and services that use Google Cloud for hosting, including Snapchat and Apple’s iCloud services, as well as Nest-branded smart home products. It cited Twitter reports of people unable to use their Nest thermostats, smart locks and cameras.
“…Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration,” Sloss said. “We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.”