The internet momentarily went dark last week following a global outage at American cloud computing services provider Fastly.
Around 6:00 p.m. (Philippine time) on June 8, online users worldwide encountered “service unavailable” or “connection failure” errors while accessing popular websites.
Sites reported down or experiencing issues include Amazon, BuzzFeed, CNN, PayPal, Pinterest, Reddit, Spotify, The Guardian, The New York Times, Twitch, Twitter, UK Government website, and Vimeo, among others.
The incident was immediately traced to Fastly, which at the time said it was investigating potential impact to performance with its content delivery network (CDN) services.
A CDN is a geographically distributed network of servers that work together to provide fast delivery of internet content. When a leading CDN provider faces technical issues, this will likely affect every website they support. (Read: Airdrop security flaw affects 1.5 billion Apple users)
“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them,” Fastly Senior Vice President of Engineering and Infrastructure Nick Rockwell said.
According to the company’s official report on the global CDN disruption, the incident affected Asia-Pacific, South America, North America, South Africa, India, and Europe.
Fastly’s worldwide outage was due to an undiscovered software bug that surfaced when it was triggered by a valid customer configuration change. They detected the disruption within a minute, identified and isolated the cause, then disabled the configuration.
In just under an hour, 95% of their network was operating as normal. It took less than three hours overall until Fastly observed recovery of all services and declared the incident resolved.
“Even though there were specific conditions that triggered this outage, we should have anticipated it,” Rockwell said.
Just a month before the outage, Fastly began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances. Exactly that happened on June 8 when a customer pushed a valid configuration change, including the conditions that set off the bug, causing 85% of their network to return errors.
“We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support,” the company official said.
Moving forward: Fastly is deploying the bug fix across their network as quickly and safely as possible. They are conducting a complete post mortem of the processes and practices followed during the outage, figuring out why they didn’t detect the bug during software quality assurance and testing processes. Doing so will help them evaluate ways to improve their remediation time.
“We have been—and will continue to—innovate and invest in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up,” Rockwell said.
Fastly said it will continue to update its community regarding its progress. Meanwhile, their customers are encouraged to email firstname.lastname@example.org for more information.
If you like reading our content, why not show your appreciation by treating us to a cup of coffee? (or two, if you’re feeling generous)