Here’s a timeline of what went wrong, and when it was fixed. Note, in particular, the window from roughly 1:00 AM to 1:48 PM PST when several of Amazon’s availability zones were partially unavailable. (For a glossary of Amazon Web Services terminology, see the bottom of this post.)
Created by StevePro on Apr 25, 2011
Last updated: 04/26/11 at 03:08 PM
Tags: AWS Amazon Major Fail EC2 Crash Cloud
Amazon reports that all RDS databases are back online
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon reports that all EBS volumes are back online.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon re-enables RDS APIs in the affected zone, but not all databases have been recovered:
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon finishes re-enabling their APIs for all recovered volumes in the affected zone. Not all EBS volumes have been recovered yet, however.
We continue to see stability in the service and are confident now that the service is operating normally for all API calls and all restored EBS volumes.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
We're back! Awesome. Let us know if you encounter any errors. #bringbackthetimelines #jr203
http://bit.ly/gxALHs
4 out of 20 machines are still acting up so when our genius engineers fix those we should be good to go. #bringbackthetimelines
http://bit.ly/h2OmfK
Amazon is still wrestling with control plane congestion:
Quick update. We’ve tried a couple of ideas to remove the bottleneck in opening up the APIs, each time we’ve learned more but haven’t yet solved the problem. We are making progress, but much more slowly than we’d hoped. Right now we’re setting up more control plane components that should be capable of working through the backlog of attach/detach state changes for EBS volumes. These are coming online, and we’ve been seeing progress on the backlog, but it’s still too early to tell how much this will accelerate the process for us.
Amazon reports that “control plane” congestion is limiting the speed at which they can recover the remaining volumes.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon reports that a “majority” of EBS volumes in the affected zone have been recovered. The remaining volumes will require a more time-consuming recovery process.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon restores access to a “majority” of multi-AZ RDS databases. (There’s nothing in the Amazon timeline to indicate when all of the multi-AZ RDS databases came back online.)
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
EBS volumes and EC2 instances are now working correctly in all but one availability zone.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Oh no. AWS is having some issue. That means Dipity is down. Hoping they can work it out quickly. http://bit.ly/guUfZZ?
Amazon reports that RDS databases replicated across multiple Availability Zones are not failing over as expected. This is a big deal, because these multi-AZ RDS databases are intended to be an expensive, highly reliable option for storing data.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes
Amazon admits they are seeing problems with EBS volumes and EC2 instances in US-EAST-1. The outage affects multiple availability zones. Amazon later described the problem as follows:
A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes

