Strategy

When your cloud becomes a thunderstorm: Lessons from the Azure outage

Ed Price

First, it started raining.

On Sept. 4, 2018, at 3:42 a.m. Central Daylight Time, a severe thunderstorm produced lightning strikes that caused a power outage at a Microsoft data center in San Antonio, Texas. Microsoft Azure, a leading cloud services provider, fell victim to an act of Mother Nature.

Less than an hour later, Microsoft reported that "storage servers in the data center began shutting down as a result of unsafe temperatures." The shutdown resulted in an outage of dozens of Microsoft services in the south-central region of the U.S., including Office 365 and Azure cloud services. While most services were restored in just over 24 hours, it took three days to restore everything fully.

This widespread outage is a weighty—and for many companies, expensive—reminder that careful consideration of business continuity planning (BCP) and disaster recovery planning (DRP) is essential when migrating applications to the cloud. In this article, I’ll detail the various BCP and DRP options available to organizations, and make recommendations on which options work best, depending on how critical an application is to day-to-day business functions.

The cloud failed

Nearly all large enterprises are migrating their business-critical applications to the cloud. Those unfortunate enough to host in the south-central region of the U.S. suffered a total blackout of those applications, causing millions of dollars in lost productivity. For one company we work with, the storm and outage disrupted business for a full 16 hours.

In hindsight, it's easy to suggest this was just a case of bad BCP or DRP. These plans are simply cost-versus-risk analyses, and most businesses treat the cloud as infallible: it never fails, so it's simply expected to work. Since the perceived risks were low, typical BCPs and DRPs reflected that assumption.

Then it happened: The cloud went down. 

The same day, two non-regional Azure resources that are usually considered fail-proof, Azure Active Directory and Azure Resource Manager, also failed. This storm should serve as a wake-up call: We all must reconsider how we evaluate risk when creating BCPs and DRPs.

Cloud migration, business continuity planning, and disaster recovery planning

We are consistently asked to recommend or help migrate business-critical products to the cloud, and we have learned enough to propose specific flavors of BCP and DRP.


Redeploy after the disaster

In this approach, the application is redeployed from scratch at the time of the disaster event. This approach is best applied to non-critical applications that don’t require a guaranteed recovery time. Redeployment also requires non-regional, shared resources—like Azure Active Directory and Azure Resource Manager—on which a new environment can be created. The system can then be rebuilt and redeployed on a new cloud hosting site, restoring services within hours or days, depending on the complexity of the DRP.


Warm spare (active/passive)

With this approach, a secondary hosting service is created in an alternate region, and roles are deployed to guarantee minimal capacity; however, the roles don’t receive production traffic. This approach is useful for applications that have not been designed to distribute traffic across regions. This model also has a strong dependency on non-regional and shared resources to be able to migrate to the warm spare as the new production environment. 
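To make the manual-failover step concrete, here is a minimal sketch of a warm-spare decision. It is not tied to any real Azure API: `probe` and `redirect` are hypothetical hooks that, in practice, would wrap a health-check endpoint and a DNS or traffic-manager update.

```python
def failover_if_down(primary: str, spare: str, probe, redirect) -> str:
    """Return the region that should now serve production traffic.

    probe(region) -> bool reports region health; redirect(region)
    performs the (manual, in a warm-spare model) traffic switch.
    Both are hypothetical stand-ins for real monitoring/DNS tooling.
    """
    if probe(primary):
        return primary        # primary healthy: nothing to do
    redirect(spare)           # point production traffic at the spare
    return spare

# Usage: simulate the primary region going dark.
served = failover_if_down(
    "south-central-us", "east-us-2",
    probe=lambda region: region != "south-central-us",  # primary is down
    redirect=lambda region: None,                       # no-op for the demo
)
```

The point of the sketch is the dependency it exposes: the `redirect` step itself must not rely on the failed region, or on shared services that failed with it.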


Hot spare (active/active)

Hot spare applications are designed to receive a production load in multiple regions. The cloud services in each region might be configured for a higher capacity than required for disaster recovery purposes. Alternatively, the cloud services might scale out, as necessary, at the time of a disaster and failover. This approach requires a substantial investment in application design, but has significant benefits: a quick and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage of capacity.
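As a sketch of how active/active capacity shifts under failure (the region names and the even-split routing policy are illustrative assumptions, not how any particular load balancer behaves):

```python
def distribute(load: int, regions: dict) -> dict:
    """Split `load` requests as evenly as possible across healthy regions."""
    healthy = [r for r, up in regions.items() if up]
    if not healthy:
        raise RuntimeError("no healthy regions: total outage")
    share, rem = divmod(load, len(healthy))
    return {r: share + (1 if i < rem else 0) for i, r in enumerate(healthy)}

# Both regions take production traffic in normal operation...
normal = distribute(1000, {"south-central-us": True, "east-us-2": True})
# ...and the survivor absorbs everything (and may scale out) after an outage.
after_outage = distribute(1000, {"south-central-us": False, "east-us-2": True})
```

Because every region serves live traffic all the time, each recovery location is effectively tested continuously, which is the benefit the paragraph above describes.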


Multi-cloud provider (Azure/AWS/Google/IBM)

With this approach, the application will be architected to ensure that no dependencies are inherent to a cloud-specific service. This way, if a disaster occurs with any given cloud provider, the application can restart or shift all traffic to the unaffected cloud. This approach requires the most significant investment in application design and infrastructure planning, but the result is the highest availability. A multi-cloud approach is costly and complex but offers the lowest guaranteed recovery time with no dependencies on a single cloud provider. 
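One way to keep cloud-specific dependencies out of the application is to code against a neutral interface and select a per-provider adapter at deploy time. The sketch below is illustrative: the class and method names are invented, and real adapters would wrap Azure Blob Storage, Amazon S3, or Google Cloud Storage SDK calls.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Provider-neutral blob storage interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in adapter; real ones would wrap Azure Blob, S3, or GCS."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

# Swapping adapters (or failing over between providers) never touches app code.
store = InMemoryStore()
store.put("invoice/42", b"...")
```

The design cost shows up here too: every provider-specific feature the application wants must be re-expressed through the neutral interface, which is part of why this model is the most expensive to build.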


Ultimately, BCP and DRP are business risk decisions. The organization we work with decided that a warm-spare model would be sufficient, even for its business-critical application: a production instance at one Azure site (in this case, the South Central US region, which failed) and a warm spare at another Azure site.

The organization's second warm spare site was a bit unusual: It received database updates, but, unlike most backup sites, did not receive web or application traffic. 

Under this model, if the 'hot' site ever has an issue, manual intervention is required to redirect traffic to the 'warm' backup site. At that point, the databases sync, resulting in a small outage window, but the site is up and running within the agreed-upon service-level agreement. Everything was set. Or was it?

The perfect storm (in this case, a literal thunderstorm)

We had co-developed a BCP/DRP for a business-critical application. We had tested the plan to ensure that the failover to the backup warm spare cloud site happened quickly and smoothly. Our customer was happy: The balance of risk and cost was palatable, and the plan had been demonstrated, through testing, to work. 

Things went from bad to worse.   

Violent storms caused a power outage and the eventual shutdown of servers at the Azure data center where we hosted the hot site. The result was a total outage of all services—business-critical system, down!

We had tested and validated solid BCP/DRPs to spin the application up at the alternative warm site cloud data center. No problem, right? Not quite. Not only was the specific Azure cloud data center site down, but some of the non-regional services (namely Azure Active Directory, Azure Service Manager, and Azure Resource Manager) were also down or experiencing severe degradation at the same time.  

These non-regional or shared services are critical, as they're used in building infrastructure-as-a-service (IaaS) and application service environments (ASEs) in a platform-as-a-service (PaaS) model. 

We had a plan, but couldn't execute it

The tools we'd planned on weren't available. We couldn't use Azure's administrative user interfaces for Azure Service Manager and Azure Resource Manager. Nor could we run remote PowerShell scripts or use the other methods at our disposal to bring up the infrastructure (the ASE, web, and other components in our solution), since those non-regional Azure services were also unavailable or degraded. The Azure Resource Manager outage left many Azure users in the south-central region unable to bring up new environments. (Note: You can check out Microsoft's status history of the event here.)

The Azure outage is proof: disaster recovery plans can fail

A BCP/DRP is key to any application, whether it's hosted on-site or by a cloud provider. The key to a plan that works is to identify dependencies (in this case, Azure Service Manager and Azure Active Directory) and consider the impact to your business if those dependencies fail or are severely degraded.

It's also wise to assess your application and define key SLAs based on how business-critical it is. Once key stakeholders have agreed upon the criticality level, you can define SLAs for recovery time objective (RTO) and recovery point objective (RPO). A business-critical application should have an RTO of less than four hours, which means that, at a minimum, business-critical applications should be built in an active-passive model. A safer bet for applications that directly affect revenue operations is to adopt an active-active or multi-cloud model.
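A worked example of checking an incident against those two SLAs. The four-hour RTO comes from the guideline above; the 15-minute RPO default is an arbitrary illustration, not a recommendation.

```python
def meets_sla(downtime_min: float, data_loss_min: float,
              rto_min: float = 240, rpo_min: float = 15) -> bool:
    """True if both recovery time (RTO) and data loss (RPO) stayed in target."""
    return downtime_min <= rto_min and data_loss_min <= rpo_min

# The 16-hour disruption described earlier blows a 4-hour RTO...
incident_ok = meets_sla(downtime_min=16 * 60, data_loss_min=5)
# ...while a 90-minute recovery with 5 minutes of lost data passes.
drill_ok = meets_sla(downtime_min=90, data_loss_min=5)
```

Agreeing on these two numbers first makes the model choice almost mechanical: the RTO you can promise is bounded by how much standby infrastructure you pay for.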

How critical are your applications?

Here is a helpful way to define how critical an application is to business functionality.

  • Very high: Absolutely mission-critical for business. Safety of life and limb is on the line.
  • High: An outage causes serious brand damage and financial loss with a long-term business impact.
  • Medium: Applications connected to the internet that process financial or private customer information.
  • Low: Typically, these are internal applications with non-critical business impact.
  • Very Low: Applications with no material business impact.

Lessons from the Azure outage might echo Murphy's Law: Anything that can go wrong will go wrong. In this case, that includes cloud services. When evaluating BCP and DRP, it's important to consider the costs if something goes wrong. As we've seen in some painful headlines from the past year, failure to consider unlikely scenarios can result in catastrophic business consequences.

Be ready: Another storm is coming

Unsure about the needs of your application? Here are some ‘health check’ steps you can take in your assessment: 

  • Identify your Service Level Agreements for the application 
  • What is your Recovery Time Objective (RTO)? 
  • What is your Recovery Point Objective (RPO)? 
  • Identify and establish your resiliency model 
  • Identify your optimal DR model: Redeploy, active-passive, active-active, or multi-cloud 
  • Establish automated monitoring and failover 
  • Identify any dependencies on cloud or third-party services, especially those tied to a single cloud provider (in this instance, Azure non-regional services) 
  • Simulate failure of shared or non-regional components 
  • Test at least annually, and optimally two or more times a year
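The "simulate failure of shared or non-regional components" step above can be sketched as a drill harness. Everything here is illustrative: a recovery plan is modeled as a function over a map of dependency health, and a plan that still calls an unavailable control plane fails the drill.

```python
def run_drill(plan, dependencies: dict) -> bool:
    """Execute a recovery plan with some dependencies down; True if it survives."""
    try:
        plan(dependencies)
        return True
    except RuntimeError:      # the plan needed a dependency that was down
        return False

def naive_plan(deps):
    # Assumes the provider's control plane is up to build a new environment.
    if not deps["resource_manager"]:
        raise RuntimeError("resource_manager unavailable")

def hardened_plan(deps):
    # Pre-provisioned warm spare: no control-plane call needed at failover time.
    pass

outage = {"resource_manager": False}  # roughly what Sept. 4 looked like
```

Running the naive plan against this simulated outage fails the drill, while the hardened plan passes; that is exactly the gap our tested-but-untested-enough plan had.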

Rethinking your plan? Devbridge can help your company mitigate risk and avoid getting caught in a storm.