Disaster recovery (DR) is the process of restoring access to applications and data, and functionality to IT infrastructure, after disruptive events such as a fire, flood, system failure, human error, or ransomware attack. The goal is to stay operational during the disaster and return systems to normal as soon as possible. Disaster recovery is closely related to business continuity, and the two are known together as BCDR. Although the two terms are often used interchangeably, there is a key difference. Whereas disaster recovery restores technology systems as rapidly as possible, business continuity is focused on keeping the organization as a whole operational. As such, disaster recovery is an important part of business continuity strategies.
Data and the digital technologies that create, store, process, and analyze it are essential for the running of a business. When a disaster strikes, business and reputation can be significantly impacted due to operational failures and the non-availability of mission-critical IT systems. Businesses must get access to data and restore functionality to systems and infrastructure as swiftly as possible. This is where DR comes in.
A robust disaster recovery strategy is critical to help organizations:
Recovery from a disaster can be a manual or automated process, depending on the business operation. Typically, digital DR involves getting mission-critical business applications, databases, and other IT systems up and running as quickly as possible to reduce downtime and prevent data loss.
Most organizations create a formal action plan that outlines the who, what, and how of IT recovery. The order in which resources are brought back up, along with the time objectives for completing the recovery, is typically detailed in the runbook.
Organizations have options for restoring applications and data replicated and mirrored to a secondary site. For example, the alternate site may have company-owned servers with mirrored applications and data waiting to be activated as a failover option in case of a disaster. Other organizations may choose a DR as a service capability from a cloud provider.
In all cases, automated disaster recovery combines point-in-time snapshots, replication, and automated failover and failback orchestration.
A disaster recovery plan, sometimes also called a DR runbook, is a core element of any business continuity plan. Once the plan is written, it should be regularly tested and modified to ensure it remains operational.
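Disaster recovery plans typically include two key metrics, the recovery time objective (RTO) and the recovery point objective (RPO), which are used to prioritize the order in which key applications and data are brought up based on their criticality. The RTO is the maximum acceptable time to restore a system, and the RPO is the maximum acceptable window of data loss. Both are typically measured in minutes, hours, days, or weeks.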
The most effective DR plans detail the people and processes responsible for bringing up each system, and the order in which systems are restored, to address system dependencies and minimize downtime. Teams that use automated DR solutions to orchestrate their DR runbooks and processes can respond quickly and fail over when an incident occurs.
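One way to picture the ordering problem is as a dependency graph: a system can only be restored after everything it depends on is back. The sketch below is an illustration, not a description of any particular DR product. It uses Python's standard `graphlib` module (Python 3.9+), and the system names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical systems, each mapped to the systems it depends on.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage", "network"},
    "app-server": {"database"},
    "web-frontend": {"app-server"},
}


def recovery_order(deps):
    """Return a bring-up order in which every system follows its dependencies."""
    # static_order() yields dependencies before the systems that need them
    # and raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
for step, system in enumerate(order, start=1):
    print(f"{step}. restore {system}")
```

A circular dependency makes `static_order()` raise `CycleError`, which is a useful signal that the runbook itself needs to be revised before an incident occurs.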
Similar to self-managed DR, disaster recovery as a service (DRaaS for short) also offers an automated way for organizations to control their data recovery and application availability service-level agreements, but without the cost and complexity of deploying and operating the secondary site themselves. The solution gives organizations the ability to recover rapidly while taking advantage of the cloud.
With DRaaS, organizations can spin up on-demand, pay-as-you-go cloud infrastructure only when it’s needed. That eliminates costly and hard-to-manage secondary data centers that sit idle most of the time. Using disaster recovery services, teams can achieve near-zero downtime and minimize data loss across many service-level agreements (SLAs) for a variety of applications.
Creating a disaster recovery plan, or DR runbook, first requires an assessment of all of the people, processes, and technologies involved in IT. Without this information in hand before an unexpected, negative event (whether that’s a hurricane, flood, ransomware attack, or human mistake), it’s impossible to get back up and running fast. The DR plan may or may not be a component of a larger business continuity plan for restoring additional operations. DR plans typically focus on restoring IT systems from downtime as rapidly as possible.
Your DR plan should outline and include:
From boardrooms to backrooms, all employees have some responsibility for safeguarding their organization’s data. CIOs and other IT leaders typically take the lead in setting up disaster recovery plans and technologies by working with executives and teams to prioritize the data, applications, and IT infrastructure that need to be protected. An important part of this process is defining which resources are mission-critical (absolutely required to operate) and which are business-critical (important to have, but not something whose loss will disrupt revenue or safety). Another important element of the process is determining the service-level agreements (SLAs) that others across the business expect for specific capabilities. This can help IT teams determine whether they want to have on-site recovery responsibilities or team with a service or cloud provider to recover data, apps, and infrastructure. For example, will they choose on-premises or a cloud option, such as DRaaS with AWS, Microsoft Azure, or Google Cloud? Today, disaster recovery in AWS, DR in Azure, and DR in Google Cloud are growing in popularity.
Once strategic protection decisions are made, teams looking to set up disaster recovery plans and services can discuss how to restore operations in more detail. This is where the DR runbook comes in, as it includes information about the people, processes, and technology requirements for recovery. A DR runbook cannot sit on a shelf, however; it must be tested regularly to ensure it remains relevant. Ease of maintenance and testing of DR capabilities will be another important consideration at this point.
Disaster recovery plan testing means working through each of the steps outlined in the runbook to ensure the organization’s disaster recovery plan doesn’t have any gaps or errors. Testing the DR plan helps ensure that IT systems can and will be restored in the most timely and effective manner possible should the worst-case scenario occur.
For some, a DR solution that unifies backup and automated DR, reducing the complexity and cost of separate point solutions, will be highly attractive because it supports both on-premises and cloud workloads with near-zero downtime and data loss.
As with every IT initiative, disaster recovery service and solution costs vary. Depending on the plan to isolate data physically or virtually, recovery costs can involve physically retrieving information from an offsite location hundreds of miles away from the primary location. Depending on the scale of the data or how unpredictable the weather is at their locations, some organizations may choose to set up one or more secondary sites, which often involves installing multiple instances of costly hardware and software to replicate and store an exact copy of production data, and keeping it running 24/7 just in case. Recent technology advancements, such as cloud computing and next-gen data management, are significantly reducing DR costs. This is good news because, depending on its severity, downtime can be catastrophic for an organization.
The financial costs of disasters such as cyberattacks are already in the billions of dollars and are projected to rise to $256 billion in the next decade, but those costs don’t include potential loss of revenue, customer loyalty or satisfaction, and employee productivity. Disasters happen and are much more costly to businesses that are unprepared, that is, those without disaster recovery solutions.
Disaster recovery testing gives IT teams the confidence that they can meet business recovery SLAs. Testing also helps confirm that internal and external compliance requirements are met. With the rise in cyberattacks, proven DR testing may also soon become a prerequisite to qualify for cyber insurance.
Testing of a disaster recovery plan and services can be automated or manual. Either way, comprehensive testing covers three essential elements: people, processes, and technology.
Teams conducting testing should ensure a full review of the roles responsible for recovery, the documents outlining recovery, recovery time objective and recovery point objective (RTO/RPO) commitments, and more.
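As an illustration of reviewing RTO/RPO commitments, the sketch below compares the results of a recovery drill against its targets. It assumes the common definitions: the RTO is the longest tolerated downtime, and the RPO is the longest tolerated window of lost data. The `DrillResult` class, the `orders-db` system name, and all timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Outcome of one recovery drill for one system (hypothetical structure)."""
    system: str
    rto: timedelta              # target: maximum tolerated downtime
    rpo: timedelta              # target: maximum tolerated data-loss window
    outage_start: datetime      # when the simulated outage began
    last_good_copy: datetime    # timestamp of the snapshot or replica restored
    service_restored: datetime  # when the system was usable again

    def achieved_recovery_time(self) -> timedelta:
        return self.service_restored - self.outage_start

    def achieved_data_loss(self) -> timedelta:
        # Data written between the last good copy and the outage is lost.
        return self.outage_start - self.last_good_copy

    def meets_objectives(self) -> bool:
        return (self.achieved_recovery_time() <= self.rto
                and self.achieved_data_loss() <= self.rpo)


drill = DrillResult(
    system="orders-db",
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
    outage_start=datetime(2024, 1, 10, 9, 0),
    last_good_copy=datetime(2024, 1, 10, 8, 50),
    service_restored=datetime(2024, 1, 10, 9, 40),
)
print(drill.system, "meets objectives:", drill.meets_objectives())
```

In this made-up drill the system is back after 40 minutes with 10 minutes of data lost, so both the one-hour RTO and the 15-minute RPO are met.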
On the process side, testing should also include a review of what happens and what is needed for alerting, procedures, hardware, software, networking, data protection, backup and recovery snapshots, ransomware recovery, rollbacks, and more.
Testing should occur at least once per quarter, with best-in-class organizations testing monthly.
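A simple way to keep to such a cadence is to track when each plan was last tested and flag the ones that are overdue. The following is a minimal sketch under stated assumptions: a quarter is approximated as 91 days, and the plan names and dates are hypothetical.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=91)  # assumption: one quarter approximated as 91 days


def overdue_plans(last_tested, today, max_age=QUARTER):
    """Return the names of plans whose last test is older than max_age."""
    return sorted(
        name for name, tested in last_tested.items()
        if today - tested > max_age
    )


# Hypothetical record of the most recent DR test per plan.
last_tested = {
    "erp": date(2024, 6, 1),
    "crm": date(2024, 4, 1),
    "email": date(2024, 1, 15),
}

overdue = overdue_plans(last_tested, today=date(2024, 6, 30))
print("Overdue for testing:", overdue)
```

Passing a shorter `max_age`, such as `timedelta(days=31)`, applies the stricter monthly cadence.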
Organizations are prepared for rapid recovery should the unexpected happen if they have these five elements of a disaster recovery plan in place already:
There are multiple ways to implement disaster recovery, but we recommend looking for solutions that allow you to address a wide range of SLAs and recovery times while minimizing downtime, reducing overall system and operational complexity, and reducing costs by avoiding duplicate or idle secondary infrastructure. Also look for the flexibility to self-manage your DR deployment or have it managed for you, in the cloud or through a DRaaS model.
Organizations typically architect disaster recovery sites to best meet their needs. The most popular options are:
Your disaster recovery team will typically be a subset of your business continuity team. Roles on that team include the CIO along with IT resilience, crisis response, and security response roles.
Members of the team responsible for DR will typically be technical professionals with data center—compute, storage, networking, and cloud—responsibilities because the primary goal of a DR plan is to recover applications, data, and infrastructure quickly and completely.
The most reliable disaster recovery or disaster recovery as a service (DRaaS) will enable organizations to:
Disaster recovery can be complex and expensive. A business running hundreds of applications needs to tier these applications in terms of criticality, define separate policies, work with multiple vendors for each tier, and manage them all through disparate consoles. Cohesity has introduced a solution that helps customers recover from a disaster almost instantly, and it does so for every tier of application deployed. With a unified and automated DR failover and failback orchestration solution, complex, expensive, and fragmented solutions are a thing of the past.
Cohesity’s reliable disaster recovery and business continuity solution: