Design for failure: lessons learnt from the Sydney AWS outage

Sydney’s wild weather brought down an availability zone in AWS’s AP-SOUTHEAST-2 Region on Sunday night.

Websites went down, customer service calls went up, Twitter went nuts, engineers scrambled to find workarounds and management started asking “Why?”

If your website crashed, you know by now that it’s probably because your application wasn’t designed to survive the loss of an availability zone.

One outage is no reason to decide that the cloud isn’t right for you, or that you should switch service providers. But it should make you revisit your architecture.

Failure in cloud services is inevitable, regardless of your provider. Outages happen, so you must design for failure. Your application’s availability does not have to be limited by the availability of the infrastructure underneath it. With enough redundancy, close to 100% uptime should be achievable even when your cloud provider has an outage, regardless of its size.
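To make that concrete, here is a rough back-of-the-envelope sketch in Python. The 99% figure is a made-up number for illustration, not a published AWS figure, and the calculation assumes that deployments fail independently of one another, which real outages do not always respect.

```python
def combined_availability(single, copies):
    """Chance that at least one of `copies` deployments is up.

    Assumes each deployment is up with probability `single` and that
    failures are independent (a simplifying assumption).
    """
    return 1 - (1 - single) ** copies


# Hypothetical 99% availability for a single deployment.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Under those assumptions each extra independent copy removes most of the remaining downtime, which is why the application can be far more available than any single piece of infrastructure it runs on.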


The pain faced by many due to Sunday’s AWS outage could have been avoided. Here are our top tips:

Be redundant across multiple availability zones. Spread your applications over at least two AZs within a region. It’s as simple as adding a new AZ to your load balancer and starting an instance in that AZ. This Udemy course gives you a good place to start.
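One way to sanity-check this is to ask whether enough instances would still be running if any single AZ disappeared. The sketch below works on a hand-written inventory; the instance names, the dictionary layout and the `min_healthy` threshold are all assumptions for illustration, and in practice you would build the list from your own inventory or your provider's API.

```python
from collections import Counter


def az_spread(instances):
    """Count running instances per availability zone."""
    return Counter(i["az"] for i in instances if i["state"] == "running")


def survives_single_az_loss(instances, min_healthy=1):
    """True if at least `min_healthy` running instances remain
    after losing any one availability zone."""
    spread = az_spread(instances)
    if not spread:
        return False
    total = sum(spread.values())
    return all(total - count >= min_healthy for count in spread.values())


# Hypothetical inventory: everything sits in one AZ.
fleet = [
    {"id": "web-1", "az": "ap-southeast-2a", "state": "running"},
    {"id": "web-2", "az": "ap-southeast-2a", "state": "running"},
]
print(survives_single_az_loss(fleet))  # False

# Add one instance in a second AZ.
fleet.append({"id": "web-3", "az": "ap-southeast-2b", "state": "running"})
print(survives_single_az_loss(fleet))  # True
```

Raising `min_healthy` lets you check that the surviving AZs can carry your real load, not just that something is still running.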

Be redundant across multiple regions. Being redundant across multiple AZs doesn’t guarantee that you will ride out every outage. iTnews said of Sydney’s AWS outage: “API call failures in the affected availability zone also meant that those hosted there were unable to failover elsewhere, despite having multi-zone redundancy in place for such events.” The best practice is to have both multi-zone and multi-region failover in place.
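As a minimal sketch of the failover decision, the Python below walks a priority-ordered list of endpoints and picks the first one that passes a health check. The endpoint URLs, the region choices and the `simulated_health` function are hypothetical stand-ins; a real check would probe your own services, ideally from outside the affected region.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that errors out (timeout, API failure)
            # is treated the same as an unhealthy endpoint.
            continue
    return None


# Hypothetical endpoints in priority order: two Sydney AZs, then another region.
ENDPOINTS = [
    {"region": "ap-southeast-2", "az": "ap-southeast-2a", "url": "https://syd-a.example.com"},
    {"region": "ap-southeast-2", "az": "ap-southeast-2b", "url": "https://syd-b.example.com"},
    {"region": "us-west-2", "az": "us-west-2a", "url": "https://pdx-a.example.com"},
]


def simulated_health(endpoint):
    """Stand-in health check: pretend the whole Sydney region is unreachable."""
    return endpoint["region"] != "ap-southeast-2"


chosen = pick_endpoint(ENDPOINTS, simulated_health)
print(chosen["region"])  # us-west-2
```

The health check is passed in as a function so that the decision logic does not depend on API calls in the region that is failing, which is the trap described in the iTnews quote.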

Try Chaos Monkey. Netflix’s Chaos Monkey tool randomly terminates instances in your environment, deliberately causing failures so that you can find and fix weaknesses before a real outage exposes them.
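The idea can be tried on paper before you run anything against real infrastructure. The Python below is a toy simulation in the spirit of Chaos Monkey, not its actual interface: the function names, the instance list and the `min_healthy` threshold are invented for illustration.

```python
import random


def terminate_random_instance(instances, rng):
    """Pick one instance at random and return it with the survivors."""
    victim = rng.choice(instances)
    remaining = [i for i in instances if i is not victim]
    return victim, remaining


def chaos_round(instances, min_healthy, rng=None):
    """Simulate losing one random instance.

    Returns the victim and whether at least `min_healthy`
    instances are still left afterwards.
    """
    rng = rng or random.Random()
    victim, remaining = terminate_random_instance(instances, rng)
    return victim, len(remaining) >= min_healthy


# Hypothetical fleet of three web servers that needs two to carry the load.
fleet = ["web-1", "web-2", "web-3"]
victim, still_ok = chaos_round(fleet, min_healthy=2, rng=random.Random(42))
print(f"terminated {victim}, service still healthy: {still_ok}")
```

If a round like this reports an unhealthy service, you have found a single point of failure without waiting for a storm to find it for you.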

Visualize your infrastructure. Being able to see a diagram of your current infrastructure quickly shows you if it’s tolerant to failure. In seconds you can see where the vulnerabilities are and go about fixing them. Plus, it’s much easier to explain to your team and management why your infrastructure needs to be designed a certain way.

[Image: Unbalanced_VPC.png]


The quickest way to get this view is with Hava. 30 seconds and you’ve got a diagram of your actual AWS infrastructure. There’s no dragging and dropping or trawling through consoles. 

If you’re not certain how resilient your architecture is, sign up for a two-week trial of Hava and see for yourself.


