On Thursday, June 12, 2025, a massive Google Cloud failure brought down some of the most-used applications, including music streaming giant Spotify, communication hub Discord, and Google’s own email service, Gmail.

The failure disrupted services for millions of users worldwide and lasted several hours, highlighting how fragile and interconnected our digital world is and how heavily it relies on cloud infrastructure. Let’s break down when it started, what went wrong, which platforms were hit, and what comes next.

Outage Timeline: When Did it Start and Which Platforms Did it Affect?

The cloud failure began around 10:45 a.m. PT, when many users reported that they could not use their apps and kept seeing pop-up error messages saying the app couldn’t connect. The downtime lasted until 6:18 p.m. PT on Thursday.

During this period, Downdetector, a platform that tracks the real-time service status of online services and websites around the world, recorded a surge of complaints from tens of thousands of users across major platforms, including Spotify, Cloudflare, Discord, Snapchat, Twitch, Anthropic, Shopify, all Google Cloud services, and Workspace products.

Downdetector recorded over 10,000 incident reports related to Google Cloud and more than 44,000 reports for Spotify around 2:46 p.m. ET in the U.S. It also logged over 4,000 reports for Google Meet and Google Search and more than 8,000 reports for Discord.

What Caused the Cloud Failure?

The disruption resulted from an error in a new feature added to Google’s Service Control to manage quota policies for Google Cloud and its customers. Let’s break that down into simpler terms.

Think of Google’s cloud service as a company with branches in different regions and many departments that people want access to. To work with the company or a specific department, you have to go through intermediaries called APIs (Application Programming Interfaces). An API is a set of rules and protocols that allows different software applications to communicate with each other, share data, and use each other’s functionality.

APIs are like a set of instructions the middlemen provide so customers can use the company’s services or access its data. This way, customers don’t have to start from scratch, understand the complex inner workings of the system, or jump through hoops to work with each department individually; the API handles it all.

A common example is online shopping: once you have selected your purchase and are about to pay, the website uses an API to communicate with a payment gateway (a wallet or bank) to process your payment.
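The shopping example can be sketched in a few lines of Python. This is only an illustration: `PaymentGateway`, `FakeGateway`, and `checkout` are hypothetical names made up for this sketch, not a real payment provider's API.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The 'API': the only thing the shop needs to know about the gateway."""

    def charge(self, amount_cents: int, card_token: str) -> bool: ...


class FakeGateway:
    """Stand-in gateway for illustration; approves charges up to a limit."""

    def __init__(self, limit_cents: int) -> None:
        self.limit_cents = limit_cents

    def charge(self, amount_cents: int, card_token: str) -> bool:
        # A real gateway would contact a bank; here we only check simple rules.
        return bool(card_token) and 0 < amount_cents <= self.limit_cents


def checkout(gateway: PaymentGateway, cart_total_cents: int, card_token: str) -> str:
    # The shop calls the agreed interface and never sees the gateway's internals.
    return "paid" if gateway.charge(cart_total_cents, card_token) else "declined"
```

The shop only depends on the `charge` call, so the gateway behind it can change without the shop's code changing.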

To keep things organized, there is a management and control center (Google’s API management and control center) that checks the requests coming in through the APIs and ensures they follow the rules and use the right API. A key part of this system is Service Control, which is in charge of managing and enforcing policies for Google’s APIs.

On May 29, Google added a new check to its Service Control system to better manage quota policies. However, the code had a flaw: it had no way to handle errors if it encountered something unexpected.

On June 12, a rule containing unintended blank fields was accidentally added to the system. When Service Control tried to read this rule, it hit the blank fields and crashed. And since the rulebook was copied globally, the crash affected Service Control instances around the world.
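To make this failure mode concrete, here is a minimal Python sketch of a quota check with and without handling for a blank value. It is an illustration under stated assumptions, not Google's actual Service Control code: `QuotaPolicy`, `check_quota_unsafe`, and `check_quota_safe` are hypothetical names, and allowing the request when the rule is malformed is just one possible design choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; real Service Control policies differ."""

    name: str
    limit: Optional[int]  # None models an unintended blank value


def check_quota_unsafe(policy: QuotaPolicy, used: int) -> bool:
    # Assumes the limit is always set. A blank (None) limit raises
    # TypeError, which stands in for the crash described above.
    return used < policy.limit


def check_quota_safe(policy: QuotaPolicy, used: int) -> bool:
    # Validate first. On a malformed policy, allow the request
    # ("fail open") instead of crashing the whole service.
    if policy.limit is None:
        return True
    return used < policy.limit
```

With a policy such as `QuotaPolicy("demo", None)`, the unsafe version raises an exception, while the safe version keeps serving requests.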

Google’s Site Reliability Engineering team responded quickly, identified the problem, and took steps to disable the faulty rule. It took 40 minutes for the rule to be disabled globally and for the Service Control systems to come back online.

However, things didn’t bounce back immediately. Some large regions, like us-central-1, were overwhelmed when Service Control restarted. Because Service Control lacked a “wait a minute and try again” strategy (known as exponential backoff), the restarts made the situation worse and delayed full recovery by an extra 2 hours and 40 minutes, as Google had to restart systems and redirect internet traffic.
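A “wait a minute and try again” strategy is commonly implemented as exponential backoff with random jitter: each retry waits longer than the last, and the randomness keeps many clients from retrying at the same instant. The sketch below is a generic illustration, not Google's implementation; the function names, the base delay of 0.5 seconds, and the 30-second cap are assumptions chosen for the example.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn, max_attempts: int = 5, sleep=time.sleep):
    """Call fn(), retrying on ConnectionError with growing, randomized waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
```

Without the wait, every restarted task retries immediately and at the same time, which is how a recovering system can be overwhelmed again.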

The Aftermath

“We deeply apologize for the impact this outage has had,” Google wrote in an incident report following the downtime. “Google Cloud customers and their users trust their businesses to Google, and we will do better.” Thomas Kurian, CEO of Google’s Cloud unit, also took to X (formerly Twitter) to apologize for the disruption the downtime caused customers.

Google further pledged to prevent a repeat of the incident by implementing measures to:

  • Prevent its APIs from failing due to invalid or corrupt data.
  • Prevent situations where metadata is sent globally without proper protection, testing, and monitoring.
  • Improve how the system tests for and handles invalid data.
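One common way to approach the second and third measures is to validate a change first and then roll it out one region at a time, stopping as soon as something looks unhealthy. The following Python sketch shows the idea under stated assumptions: the policy schema, the region names, and the `apply` and `is_healthy` callbacks are all hypothetical and do not describe Google's tooling.

```python
from typing import Callable, Dict, List


def validate_policy(policy: Dict[str, object]) -> bool:
    """Reject policies with missing or blank required fields (hypothetical schema)."""
    required = ("name", "limit")
    return all(policy.get(key) not in (None, "") for key in required)


def staged_rollout(
    policy: Dict[str, object],
    regions: List[str],
    apply: Callable[[str, Dict[str, object]], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply the policy region by region and stop at the first unhealthy one.

    Returns the regions the policy was applied to.
    """
    if not validate_policy(policy):
        return []  # invalid data never leaves the starting gate
    applied: List[str] = []
    for region in regions:
        apply(region, policy)
        applied.append(region)
        if not is_healthy(region):
            break  # halt so the problem stays contained to this region
    return applied
```

Putting a small "canary" region first in the list means a bad change is caught there before it can reach every region at once.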

The Interdependence Problem: A Wake-Up Call for the Future of Cloud Reliability

The Google Cloud failure is a stark reminder of the risks of relying on a few major cloud providers. The downtime affected many independent platforms and services built on Google’s Cloud infrastructure. While this interconnectedness offers scalability and efficiency, it also creates a single point of failure that can disrupt the daily lives of millions.

The incident has reignited conversations and debates around cloud reliability, the need for robust disaster recovery services, and diversifying cloud dependencies. As the internet continues to evolve, this incident will shape future discussions on cloud computing, network resilience, and stability.
