Amazon explains outage that took out a large chunk of the internet

Amazon has explained the Web Services outage that knocked parts of the internet offline for several hours on December 7th, and has promised more clarity if this happens in the future. As CNBC reports, Amazon revealed that an automated capacity scaling feature led to “unexpected behavior” from internal network clients. Devices connecting that internal network to AWS were swamped, stalling communications.

The nature of the failure kept teams from quickly pinpointing and fixing the problem, Amazon added. Engineers had to rely on logs to find out what happened, and their internal tools were also affected. They were “extremely deliberate” in restoring service to avoid breaking still-functional workloads, and had to contend with a “latent issue” that prevented networking clients from backing off and giving systems a chance to recover.
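
For readers unfamiliar with the term, "backing off" generally means that a client waits longer between retries after each failure, so an overloaded system is not hit with a constant stream of repeat requests. The sketch below illustrates one common form of the idea, capped exponential backoff with random jitter. It is not Amazon's code: the function names, the default values and the choice of ConnectionError as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Return a randomized wait (seconds) for a zero-based retry attempt.

    The upper bound doubles with each attempt until it reaches `cap`.
    The random factor ("full jitter") spreads clients out so they do not
    all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_backoff(request, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call `request()` up to `attempts` times, backing off between failures.

    `request` is a hypothetical zero-argument callable that raises
    ConnectionError when the remote service is unavailable.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            # Out of retries: let the caller see the final error.
            if attempt == attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

A client that retries immediately and indefinitely does the opposite of this, which is roughly the kind of behavior that keeps a congested network from recovering.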

The AWS division has temporarily disabled the scaling that led to the problem, and won’t switch it back on until there are solutions in place. A fix for the latent glitch is coming within two weeks, Amazon said. There’s also an extra network configuration to shield devices in the event of a repeat failure.

You might have an easier time understanding crises the next time around. A new version of AWS’ service status dashboard is due in early 2022 to provide a clearer view of any outages, and a multi-region support system will help Amazon get in touch with customers that much sooner. These won’t bring AWS back any faster during an incident, but they may eliminate some of the mystery when services go dark — important when victims include everything from Disney+ to Roomba vacuums.