Amazon Web Services Inc. has said it will make changes to its cloud Service Health Dashboard in the wake of a major outage last week that took multiple connected services, including financial apps and food delivery platforms, offline for several hours.
In a report published last Friday on the impact of the event, Amazon said the problems began in its US-East-1 data center region in Virginia at 10.30 a.m. ET on Tuesday, Dec. 7.
Amazon blamed an “automated activity” that was meant to scale capacity for one of its services hosted in the main AWS network. That activity apparently triggered “unexpected behavior” from a large number of clients within the internal network. Due to this, multiple devices connecting an internal Amazon network with an AWS network became overloaded.
The incident hurt AWS cloud services such as AWS EC2, which provides virtual server capacity for multiple enterprises. Many services were taken offline for several hours, resulting in widespread disruption for Amazon’s customers. Reports said popular streaming services such as Netflix and Disney+ went down, while connected devices such as Amazon.com Inc.’s Ring security cameras and iRobot Corp’s Roomba vacuums also stopped working.
Amazon suffered too, because many of its warehouse and delivery employees use applications powered by AWS to do their jobs. Reports said Amazon workers were unable to scan packages or see their delivery routes for much of Tuesday as they waited for AWS engineers to restore service.
Some AWS services came back online within a few hours, but others, such as the developer tool AWS EventBridge, didn’t return fully until 9.40 p.m. ET.
AWS is generally a very reliable service. The last major incident affecting AWS occurred in 2017, when an employee accidentally turned off more servers than intended during repairs of a billing system. But Tuesday’s outage was a big blow to AWS’s reputation, undermining claims that cloud infrastructure is reliable and enterprise-ready. AWS apologized to its customers for the disruption.
AWS also admitted it struggled to keep customers aware of what was happening during the incident. It had problems updating its Service Health Dashboard, which is the primary status page for AWS customers. Many customers also complained they were unable to create support tickets during the disruption.
“As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue,” AWS said.
AWS has promised to take action, with a new version of the Service Health Dashboard arriving in early 2022 that will make it easier to understand service impact. It’s also planning to launch a new support system architecture that spans multiple AWS regions to ensure there will be no delays in communicating with customers.