Submitted by Harold Mack
Recovery from the loss of a data center was the focus of a recently completed WashU IT tabletop exercise. If you have been in WashU IT long enough, you might remember several years ago when power went out in the primary WashU IT data center. Services were restored in 4 hours.
Is this a story about a lottery winner? Was resolution in 4 hours miraculous? No, WashU IT recovers computing services routinely. Equipment failures, software maintenance that does not behave as expected, and human error are among the many everyday failures that can disrupt complex computer systems. Those recoveries happen quickly, and notices are sent announcing them. Safeguard tools are used to minimize the time services are interrupted. These disruptions do not sound like disasters, and they are not. How do we avoid disasters? We prepare.
WashU IT prepares for potential disasters in ways that yield operating benefits, as in the recovery of the primary data center referenced above. Our data centers are protected with several sources of electrical power, all of which are overridden by fire suppression. This is because air should not be circulated in a burning data center, so fire suppression shuts down all power.
A leaky water valve was to blame for the power outage in the WashU IT data center. While there was not a fire, equipment interpreted the leaky water valve to mean that fire suppression was underway and that power should be shut off. The good news is that WashU IT personnel sprang into recovery mode, enlisting the help of Facilities Planning and Management to address the water leak and restore power, and completing recovery in 4 hours.
It is important to note that recovering from a disaster takes a long time. Gathering teams to do recoveries, communicating with people affected, collecting tools that shorten recovery times – these are activities practiced in anticipation of a disaster.
WashU IT recently completed an exercise – referred to as a tabletop exercise – as a way to anticipate how to manage a potential disaster. This intentional planning ensures a simulation can take place, with no data center being harmed during the exercise. Tabletop exercises are an effective way to practice disaster recovery as a mental exercise.
For the 2022 Tabletop Exercise, recovery teams imagined the impacts of a bomb explosion. In this scenario, the bomb damaged, but did not destroy, the West Campus building, and the West Campus Data Center (WCDC) was unusable for more than a week. Part of the realism was to recover services without knowing the condition of the WCDC.
Recovery time – the amount of time necessary before an application could be used – varied based on the level of disaster recovery preparation.
- Critical applications – those that have duplicate servers running in two data centers – could be used in 4 hours. Because this was an imagined scenario, each time estimate was based on what “could or would” be needed.
- A second tier of applications that have servers in two data centers (but with only the server in the failed data center running) could be used in 24 hours. Recovery of this second tier was accomplished by an operation called disaster recovery failover.
- A third tier of applications and files was protected by backup copies. Backup copies can be used to rebuild servers, which would take a full 5 days. Some of the third tier could be rebuilt using growth capacity in the Research Data Center. Other parts of the third tier would be rebuilt in Azure, Microsoft’s public cloud.
- A fourth group of servers was run by WashU IT for departments and schools. WashU IT has no information about the disaster recovery preparations done by the departments and schools. Working in conjunction with the departments and schools, computing capacity would be located to replace the capacity in the failed data center.
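As a rough illustration only, the tiers above can be written down as a small lookup table. The sketch below is in Python, and its names (`RecoveryTier`, `TIERS`, `recovery_estimate`) are hypothetical; it is not a WashU IT tool. The hour values are the tabletop estimates listed above, with the 5-day figure converted to 120 hours and the fourth group marked as unknown because its preparations are not known to WashU IT.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryTier:
    """One recovery tier from the tabletop exercise (illustrative model only)."""
    name: str
    method: str
    estimated_hours: Optional[int]  # None means no estimate is available


# Estimates taken from the exercise description; names are hypothetical.
TIERS = [
    RecoveryTier("critical", "duplicate servers running in two data centers", 4),
    RecoveryTier("second", "disaster recovery failover", 24),
    RecoveryTier("third", "rebuild from backup copies", 5 * 24),
    RecoveryTier("departmental", "depends on department and school preparations", None),
]


def recovery_estimate(tier_name: str) -> str:
    """Return a one-line summary of the estimated recovery time for a tier."""
    for tier in TIERS:
        if tier.name == tier_name:
            if tier.estimated_hours is None:
                return f"{tier.name}: unknown ({tier.method})"
            return f"{tier.name}: about {tier.estimated_hours} hours ({tier.method})"
    raise KeyError(f"no such tier: {tier_name}")


if __name__ == "__main__":
    for tier in TIERS:
        print(recovery_estimate(tier.name))
```

Keeping the unknown case explicit, rather than guessing a number, mirrors the exercise: the fourth group has no estimate until departments and schools share their preparations.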
The tabletop exercise differed from the real-world recovery. After power was restored in the real world, services were provided with the same equipment used before fire suppression shut down power. In the tabletop exercise, by contrast, computing was moved after the imagined disaster to wherever capacity existed: some to a different WashU IT data center and some to Azure.
Both recoveries are built on fundamental processes: communicating with people affected by the service interruption, having an incident manager direct a conversation among the engineers who perform the recovery, and taking an inventory of everything that needs to be recovered.
Each year, the tabletop exercise identifies ways to improve recovery from a disaster. While these ways are imagined during simulations, they often lead to enhanced standard operations for WashU IT and the customers we serve.