Disasters in a Hadoop environment have many possible origins: a major natural disaster that takes out an entire data center, an extended power outage that makes the Hadoop platform unavailable, a DBA accidentally dropping an entire database, an application bug corrupting data stored on HDFS, or, worse, a cyberattack. Whatever the cause, proper Hadoop disaster recovery mechanisms need to be in place before an application rolls out to production, so that data is protected in each of these scenarios.
The actual mechanism used to protect this data depends on a number of factors, including:
- The criticality of the application
- The maximum amount of data the organization can afford to lose, or Recovery Point Objective (RPO)
- The maximum time applications can be down during recovery, or Recovery Time Objective (RTO)
- The available budget for the appropriate Hadoop disaster recovery infrastructure
Protecting Data During Hadoop Disaster Recovery
Multiple replicas in Hadoop are a great way to protect against hardware failures such as a disk drive or server failure. However, replicas alone do not protect against natural disasters, human error, application corruption, or cyberattacks. One or more of the following three options must be put in place to protect against these scenarios:
- Back up data on a regular basis using an enterprise-class solution, with backup copies kept in the same data center, in a remote data center, or in the cloud. Regular backups help protect against human error and application corruption, and if they are stored in a remote location they can also protect against natural disasters. Data recovery may take longer (higher RTO), and the recovered data may not be the most current, depending on how frequently backups are taken (a minimal snapshot-based sketch follows this list).
- Replicate data asynchronously from the production Hadoop cluster to a standby Hadoop cluster in a different data center. Because replication mirrors data from production to the standby cluster, it does not retain older copies of the data and so cannot recover data lost to human error or application corruption. It does, however, protect against a natural disaster or power outage in the data center where the production Hadoop cluster is located. Since the data is always available on the standby cluster, recovery time (RTO) is short; RPO depends on how frequently data is copied to the remote data center.
- Replicate data synchronously from the production Hadoop cluster to another Hadoop cluster in a different data center. Synchronous replication will not protect against human error or application corruption, but it will safeguard data in case of a data center outage. This solution yields the best RPO (no data loss) and RTO (very quick failover to the other Hadoop cluster).
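As a concrete illustration of the first option, here is a minimal sketch, using the standard Hadoop Java FileSystem API, of enabling HDFS snapshots on a critical directory and taking a timestamped, point-in-time snapshot that a backup job can later copy off-cluster. The cluster URI, directory path, and snapshot naming scheme are hypothetical; the CLI equivalents are `hdfs dfsadmin -allowSnapshot` and `hdfs dfs -createSnapshot`.

```java
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * Sketch: take a point-in-time HDFS snapshot of a critical directory.
 * A separate backup or DistCp job can then copy the .snapshot path to a
 * remote cluster, a backup appliance, or cloud storage.
 */
public class DailySnapshot {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical production namenode URI and protected directory.
    URI cluster = URI.create("hdfs://prod-nn:8020");
    Path criticalDir = new Path("/data/warehouse/orders");

    FileSystem fs = FileSystem.get(cluster, conf);

    // Snapshots must first be allowed on the directory, typically once,
    // by an administrator (equivalent to: hdfs dfsadmin -allowSnapshot).
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).allowSnapshot(criticalDir);
    }

    // Create a timestamped snapshot, e.g. s20240101-0200.
    String name = "s" + new SimpleDateFormat("yyyyMMdd-HHmm").format(new Date());
    Path snapshot = fs.createSnapshot(criticalDir, name);
    System.out.println("Created snapshot: " + snapshot);
  }
}
```

Because a snapshot is a read-only, point-in-time view exposed under the directory's `.snapshot` path, files deleted or corrupted by a bad job or an accidental command can be copied back out of it, which is exactly the protection that replication alone does not provide.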
Key Questions About Data Replication for Hadoop Disaster Recovery
So why not just use synchronous data replication to protect against a data center failure? There are several serious factors to consider before deploying an active, synchronous data replication solution for Hadoop disaster recovery.
Does your application need real-time data replication? Is your application so critical that you cannot incur any downtime or data loss?
In the real world, very few applications (particularly transactional applications) require this kind of stringent RPO and RTO. If your application is one of those few critical applications, active replication may make sense, but it comes with its own limitations and cost considerations, noted below.
Are your application users willing to take the performance hit?
Synchronously replicating data will negatively impact application performance. Every change made on the production system must be transmitted to, and acknowledged by, the remote Hadoop cluster before the application can proceed with the next change. The size of the impact depends on the network connectivity between the two clusters, which will most likely be a relatively slow wide area network (WAN) connection.
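As a rough, back-of-the-envelope illustration of that latency cost (the round-trip figures below are assumptions, not measurements), the WAN round-trip time puts a hard ceiling on how many acknowledged writes a single synchronous writer can complete per second:

```java
/**
 * Back-of-the-envelope estimate: an upper bound on acknowledged writes
 * per second for a single synchronous writer, ignoring all other costs.
 * The round-trip times are illustrative assumptions.
 */
public class SyncReplicationCeiling {
  public static void main(String[] args) {
    double lanRttMs = 0.5;   // assumed intra-data-center round trip
    double wanRttMs = 40.0;  // assumed inter-data-center round trip

    System.out.printf("LAN ceiling: %.0f writes/sec per writer%n", 1000.0 / lanRttMs);
    System.out.printf("WAN ceiling: %.0f writes/sec per writer%n", 1000.0 / wanRttMs);
    // With a 40 ms WAN round trip, a single writer that waits for the
    // remote acknowledgement can complete at most ~25 writes per second,
    // versus ~2,000 per second when only the local cluster must acknowledge.
  }
}
```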
Are you willing to risk potential disruption to the production environment?
Synchronous data replication solutions require software to be installed on the production Hadoop cluster. This software intercepts all writes to the file system, which can destabilize the production system and requires extensive testing before going into production. In addition, any disruption on the WAN will bring your application to a halt, because data changes can no longer be transmitted to the remote cluster or acknowledged. This can result in downtime and disruption to your production applications.
Can your WAN handle the additional network traffic?
With active real-time replication, all changes (temporary or permanent) are sent over the network to the remote Hadoop cluster. This places significantly more load on the WAN than an asynchronous approach, which can skip short-lived intermediate data and therefore transmits far less over the network.
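One common way to keep replication traffic from saturating the WAN is to run the copy asynchronously and throttle it. The sketch below assumes the Hadoop 3.x DistCp Java API (the CLI equivalent is roughly `hadoop distcp -update -bandwidth 20 <src> <dst>`); the cluster URIs, paths, and the 20 MB/s per-mapper cap are illustrative assumptions, not recommendations.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

/**
 * Sketch: asynchronous, bandwidth-throttled replication of a directory
 * from the production cluster to a standby cluster using DistCp.
 */
public class ThrottledReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical source (production) and target (standby) paths.
    Path source = new Path("hdfs://prod-nn:8020/data/warehouse/orders");
    Path target = new Path("hdfs://dr-nn:8020/data/warehouse/orders");

    DistCpOptions options = new DistCpOptions.Builder(
            Collections.singletonList(source), target)
        .withSyncFolder(true)   // like -update: copy only what has changed
        .withMapBandwidth(20)   // like -bandwidth 20: ~20 MB/s per map task
        .build();

    Job job = new DistCp(conf, options).execute();
    System.out.println("DistCp job successful: " + job.isSuccessful());
  }
}
```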
Do you have the budget required for an active disaster recovery solution?
Typically these solutions have much higher hardware, software, and networking costs.
Do you already have basic data protection in place should you need to recover from a disaster?
Human error, application corruption, and ransomware attacks are far more likely than a natural disaster that takes out an entire data center. Protecting data against these likelier events should be a higher priority for an enterprise. Implementing an active disaster recovery solution will not protect data in these scenarios, because every change (intentional or accidental) is propagated to the disaster recovery copy almost instantaneously.
Make the Best Hadoop Disaster Recovery Choice
Although real-time replication delivers the best possible RPO and RTO, it comes with limitations and considerations that need to be carefully thought through. Implementing an active Hadoop disaster recovery solution must be done in the context of the application's criticality to get the best return on investment. Otherwise, it can result in unnecessary expenditure, affect the availability of the production Hadoop system, and consume excessive resources in managing the production Hadoop environment.
Apache and Hadoop are trademarks of the Apache Software Foundation.