Continuing our series on why enterprises need great backups, today we are going to look more closely at disaster recovery. Enterprises rely on data backups to ensure their businesses can recover successfully in the event of a disaster, whether natural or human-caused. When a disaster impacts your business, how do you work through the process and ensure a successful disaster recovery?
In my experience working at an enterprise in the financial industry, our team had a well-thought-out disaster recovery strategy: a remote co-location facility and a manual process for initiating a failover. The disaster recovery failover software had no integration with our enterprise data backups. Our disaster recovery process was tested routinely with all hands on deck, but not in a way that reflected an actual failover and its impact on users. Testing was done in a networking bubble so that no user downtime would occur even during disaster recovery testing. This strategy posed several limitations, including one important fact: no real end user in the business had any clue what to do during a real disaster. Why put all the time and energy into architecting and designing something that you only test in a bubble?
Ready or Not
One day, a fire alarm went off in the building where the data center was located. A team member attempted to relocate and initiate a failover, but fire trucks blocked anyone from coming or going, and fire evacuation procedures kept people out of the parking structure. The failover could not be initiated, because the team responsible for initiating it had no way to do so.
At that moment we also realized that our general user population would have no idea how to use the system in a disaster recovery failover scenario. So it was time to assess risk and adjust the plan. Even the best-laid plans need modification and realignment.
Thankfully, the fire was quickly put out, the data center was safe, and no actual failover was needed. But it was an eye-opener for the team and, most importantly, the business: expectations for the disaster recovery process needed to be reset.
Lessons Learned
- Toolset re-evaluation required – This conversation was necessary to ensure the toolset actually aligned with keeping the business running regardless of the event. It included an evaluation of our backup and recovery processes, as well as additional capabilities those vendors might be able to offer.
- Testing processes – We evaluated testing with the general user population, but the goal was still to avoid impacting users. For the long term, we needed a tool that would allow disaster recovery testing to run in the background.
- The ability to fail over – Ensuring a failover could be initiated no matter where the team was located also became a key requirement.
The final takeaway from this incident was to implement a backup solution that supports the disaster recovery process end to end, with automation and orchestration being key.
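To make that last point concrete, here is a minimal sketch of what an automated, remotely triggerable failover step might look like. Everything in it is illustrative: the health-check URL and the promote/repoint functions are hypothetical placeholders for whatever your backup and orchestration vendor actually provides, not a specific product's API.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an orchestrated DR failover step.

All endpoints and function bodies below are hypothetical placeholders;
a real implementation would call your backup/orchestration vendor's API.
"""
import sys
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.com/health"  # hypothetical endpoint
TIMEOUT_SECONDS = 10


def primary_is_healthy() -> bool:
    """Return True if the primary site answers its health check."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except OSError:
        return False


def promote_secondary() -> None:
    """Placeholder: promote the co-lo replica via the vendor's API."""
    print("Promoting secondary site to primary (vendor API call goes here)")


def repoint_users() -> None:
    """Placeholder: redirect users via DNS or load balancer update."""
    print("Updating DNS / load balancer to point users at the secondary site")


if __name__ == "__main__":
    # Because this runs from anywhere with network access, the team does not
    # need to be physically on-site to initiate the failover.
    if primary_is_healthy():
        print("Primary site is healthy; no failover needed.")
        sys.exit(0)
    promote_secondary()
    repoint_users()
    print("Failover initiated.")
```

The specific tooling matters less than the two properties the incident exposed: the failover can be triggered from anywhere, and the same steps can be exercised in testing without touching production users.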
For other entries in this blog series, check out my posts on major disasters gone both right and wrong, ransomware, Exchange Online, and even Exchange on-premises.