Facebook: Automated System Failure Led To Outage

The automated system is supposed to check for invalid configuration values in Facebook's cache and replace them with updated values from the persistent store, but that didn't work because the value in the persistent store was itself invalid, Robert Johnson, director of software engineering at Facebook, wrote on his Facebook wall in a post titled "Bad Day at Facebook."

"Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it," Johnson wrote. "Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second."

To make matters worse, every time a user got an error attempting to query one of the databases, the automated system interpreted it as an invalid value, and deleted the corresponding cache key, Johnson continued.

"This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover."
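The loop Johnson describes can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Facebook's code: the names `ToyDatabase`, `OverloadedError`, `is_valid`, and `run` are hypothetical, the database is modeled as serving a fixed number of queries per tick, and all clients are assumed to read one shared cache key at the same moment. The database already holds the corrected value, mirroring the point at which the original problem had been fixed.

```python
class OverloadedError(Exception):
    """Raised by the toy database once it exceeds its per-tick capacity."""


class ToyDatabase:
    def __init__(self, capacity, value):
        self.capacity = capacity  # queries it can serve per tick (assumed)
        self.value = value        # the already-corrected configuration value
        self.served = 0

    def new_tick(self):
        self.served = 0

    def query(self):
        self.served += 1
        if self.served > self.capacity:
            raise OverloadedError("too many queries this tick")
        return self.value


def is_valid(value):
    # Hypothetical validity check: only "good" counts as valid.
    return value == "good"


def run(num_clients, capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick."""
    db = ToyDatabase(capacity, value="good")
    cache = {"config": "bad"}  # shared cache starts with the invalid value
    load_per_tick = []
    for _ in range(ticks):
        db.new_tick()
        # All clients read the shared cache at the same moment.
        seen = cache.get("config")
        queries = 0 if is_valid(seen) else num_clients
        for _ in range(queries):
            try:
                cache["config"] = db.query()
            except OverloadedError:
                if delete_on_error:
                    # The error is treated as an invalid value, so the
                    # cache key is deleted, undoing any earlier success.
                    cache.pop("config", None)
        load_per_tick.append(queries)
    return load_per_tick


print(run(10, 3, 4, delete_on_error=True))
print(run(10, 3, 4, delete_on_error=False))
```

In this model, deleting the key on error keeps every client querying in every tick, while leaving the cache untouched on error lets the load fall to zero after the first tick.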

Facebook had to shut off the site to stop all traffic to the database cluster. Once the databases had recovered and the root cause was fixed, users were slowly allowed back onto the site, Johnson said.

"For now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes," he wrote. "We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously."
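Johnson's post does not say which design patterns Facebook had in mind. One widely used way for clients to handle transient spikes more gracefully is to retry with exponential backoff and random jitter, so that failed clients spread out their retries instead of hitting the database again all at once. The following is a minimal sketch of that general technique, not Facebook's design; the function name `fetch_with_backoff`, its parameters, and the use of `ConnectionError` to signal a failed query are all assumptions made for illustration.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base=0.1, cap=5.0,
                       sleep=time.sleep, rng=random):
    """Call fetch(), retrying on ConnectionError with jittered backoff.

    The wait before retry number n is drawn uniformly from
    [0, min(cap, base * 2**n)] seconds. The final failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
```

A caller would pass its own query function as `fetch`. The `sleep` and `rng` parameters exist only so the waiting behavior can be replaced in tests.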

It was Facebook's worst outage in more than four years, and it may have inadvertently made millions of people more productive at work Thursday, but user feedback on Johnson's explanation was generally positive.

"The fact you were able to bring this back up within 2.5 hours is impressive for a system supporting so many hundred million users. Especially with your tiny staff. Well done guys!," wrote one user.