What is a Disaster?
Disasters that impact workloads and their critical data are not always natural. Human actions, whether unintentional or intentional, can also disrupt a business for anywhere from a few hours to several days. In general, disasters may include:
- Power Outage
- Network Outage
- Infrastructure Failure
- Cyber Attacks
- Earthquake
- War
- Flood
- Pandemic, etc.
What is Disaster Recovery (DR) and why is it important?
Disaster recovery helps you quickly recover applications and data after a disaster strikes your primary data center. Data is the new oil for today’s organizations. Loss of data can damage your business by hurting customer experience and productivity, and therefore revenue. A natural disaster can destroy an entire data center and leave you with no option to recover data. So, it’s crucial to have a Disaster Recovery plan in place to mitigate the effects of disasters.
Having a DR plan in place allows your organization to define its Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Quantifying these parameters helps you select the right tools and infrastructure to implement the DR solution. The shorter the service interruption, the smaller the operational and financial impact on your customers.
DR is supported through replication, which can be synchronous (sync), asynchronous (async) or near-sync, to achieve different RPOs. Sync replication (also known as mirroring) has strict latency requirements, usually less than 5 ms, while async and near-sync replication can usually work with latencies of 20-80 ms. For async and near-sync replication, however, bandwidth matters more than latency, because the link has to keep up with the RPO. The RPO is usually 2-20 minutes for near-sync and an hour or more for async.
For separate fault zones a few kilometers apart, a stretched cluster can be set up with nodes in each zone and volumes configured to mirror data across the zones. This makes the cluster resilient to a disaster striking one of the fault zones, with zero RPO/RTO, so no explicit DR is needed. For distant fault zones where latencies are higher, the recommendation is to set up DR.
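To make the bandwidth point concrete, here is a minimal sketch of the arithmetic: the data that changes during one RPO window has to be shipped within that window. The function names are hypothetical, megabytes are taken as 10^6 bytes, and protocol overhead, compression and bursty change rates are ignored, so treat the result as a lower bound rather than a sizing guide.

```python
def required_bandwidth_mbps(changed_mb: float, rpo_minutes: float) -> float:
    """Minimum sustained bandwidth (megabits per second) needed to ship
    `changed_mb` megabytes of changed data within one RPO window."""
    if rpo_minutes <= 0:
        raise ValueError("rpo_minutes must be positive")
    megabits = changed_mb * 8            # 1 byte = 8 bits
    window_seconds = rpo_minutes * 60
    return megabits / window_seconds


def link_keeps_up(link_mbps: float, changed_mb: float, rpo_minutes: float) -> bool:
    """True if the link can carry one window's worth of changes in time."""
    return link_mbps >= required_bandwidth_mbps(changed_mb, rpo_minutes)


if __name__ == "__main__":
    # Illustrative numbers: 1500 MB of changes per 20-minute near-sync window.
    print(required_bandwidth_mbps(1500, 20))   # 10.0
    print(link_keeps_up(25, 1500, 20))         # True
```

If the link cannot sustain this rate, replication falls behind and the effective RPO grows with every interval.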
More about RTO and RPO
Recovery Time Objective (RTO) is the maximum time it should take to restore services after a disaster has occurred. For a guaranteed RTO of, say, 20 minutes, if a disaster occurs at 1pm, services should be up and running on the DR site by 1:20pm. RTO determines the acceptable time window during which the service is unavailable. RTO also dictates your Disaster Recovery planning. If your RTO is very low, you essentially have to automate processes and have the required services already running on the DR site before the disaster strikes.
Recovery Point Objective (RPO) is the acceptable amount of data loss, measured as the time elapsed since the last recovery point. RPO also determines the backup plan you need to adopt so that data loss stays within the acceptable limit. If your RPO is, say, 4 hours, your data backup must be triggered at least every 4 hours. If a disaster strikes at 5pm, you should be able to retrieve data up to at least 1pm.
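The RTO example can be written as a small check. This is only an illustrative sketch: the function names and the calendar date are made up, and only the 1pm disaster and the 20-minute RTO come from the text.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest time by which services must be running on the DR site."""
    return disaster_time + rto


def rto_met(disaster_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if services were restored within the RTO window."""
    return restored_time <= recovery_deadline(disaster_time, rto)


if __name__ == "__main__":
    disaster = datetime(2024, 1, 1, 13, 0)    # 1pm on an arbitrary day
    rto = timedelta(minutes=20)
    print(recovery_deadline(disaster, rto))   # 2024-01-01 13:20:00
    print(rto_met(disaster, datetime(2024, 1, 1, 13, 18), rto))   # True
```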
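The RPO example works the same way, counting backwards from the disaster. Again a hedged sketch with hypothetical names and an arbitrary date: the recovery point you fall back to must be no older than the disaster time minus the RPO.

```python
from datetime import datetime, timedelta


def oldest_acceptable_recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Earliest recovery point that still satisfies the RPO."""
    return disaster_time - rpo


def rpo_met(last_recovery_point: datetime, disaster_time: datetime, rpo: timedelta) -> bool:
    """True if the most recent recovery point is recent enough."""
    return last_recovery_point >= oldest_acceptable_recovery_point(disaster_time, rpo)


if __name__ == "__main__":
    disaster = datetime(2024, 1, 1, 17, 0)    # 5pm on an arbitrary day
    rpo = timedelta(hours=4)
    print(oldest_acceptable_recovery_point(disaster, rpo))      # 2024-01-01 13:00:00
    print(rpo_met(datetime(2024, 1, 1, 12, 0), disaster, rpo))  # False: backup is 5 hours old
```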
Traditional Disaster Recovery vs. Cloud Disaster Recovery
Traditional DR requires you to maintain a dedicated IT infrastructure at the DR site, including sufficient servers, routers and switches, a high-bandwidth network and maintenance staff. Maintaining a DR site with such high resource requirements can become complex and expensive. However, if you manage infrastructure in a dark site or air-gapped environment that has no connectivity to the public cloud, or you don’t want your data to leave your environment for security reasons, traditional DR may still be a good choice.
Cloud Disaster Recovery solutions give you the freedom to optimize your resources. You don’t need to build a physical DR site with large amounts of hardware or worry about maintaining it. Cloud providers offer a pay-as-you-go pricing model in which you pay only for the compute and storage resources that you actually use. Scaling the resources up or down is easy on the cloud. The overall storage and network infrastructure can also be better in some scenarios. Cloud providers usually have regions in different geographical locations, which improves resilience. The limitation is that your infrastructure must be able to stay connected to the public cloud.
DR Planning and Validation
The following are the general steps to set up DR and validate it:
- Assess your IT infrastructure and shortlist the critical applications that need a DR plan.
- Determine the kind of DR you need: on-prem multi-site or across cloud. For cloud, choose the provider that offers the lowest RTO and RPO.
- Document your data availability and application uptime needs. Define your RPO and RTO accordingly.
- Provision a DR site. Configure network connectivity between primary and DR sites.
- Develop a detailed migration plan and effort estimate.
- Build and deploy the DR pilot solution.
- Perform a fire drill on some applications and ensure the process is successful.
- Test actual fail-over and fail-back on the applications. Ensure data is consistent.
- Validate that the RPO and RTO values are adhered to while performing fail-over and fail-back operations.
- Automate the above process to the maximum extent possible.
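The validation step in the list above lends itself to automation. The following is a minimal sketch, not tied to any particular DR product: `DrillResult` and `find_violations` are hypothetical names, and the measured values are assumed to come from your own fire-drill or fail-over tooling.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DrillResult:
    app: str
    measured_rto: timedelta   # time from fail-over trigger to service restored
    measured_rpo: timedelta   # age of the newest data available on the DR site


def find_violations(results: List[DrillResult],
                    target_rto: timedelta,
                    target_rpo: timedelta) -> List[str]:
    """Return one message per application and objective that missed its target."""
    violations = []
    for r in results:
        if r.measured_rto > target_rto:
            violations.append(f"{r.app}: RTO {r.measured_rto} exceeds target {target_rto}")
        if r.measured_rpo > target_rpo:
            violations.append(f"{r.app}: RPO {r.measured_rpo} exceeds target {target_rpo}")
    return violations


if __name__ == "__main__":
    # Illustrative measurements for two made-up applications.
    results = [
        DrillResult("orders", timedelta(minutes=15), timedelta(hours=3)),
        DrillResult("billing", timedelta(minutes=25), timedelta(hours=5)),
    ]
    for line in find_violations(results, timedelta(minutes=20), timedelta(hours=4)):
        print(line)
```

Running a check like this after every drill turns the RPO and RTO targets into a pass/fail gate instead of a manual review.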
Diamanti’s solution for Disaster Recovery and advantages
Diamanti lets you easily back up and recover your applications and data with minimal steps. For cloud native applications, a good DR solution should allow data recovery at container granularity, rather than at node level. DR must be configurable at the application level so that the entire application can be recovered in the event of a disaster, not just a few microservices. Diamanti’s DR solution is capable of doing this. It also allows you to automate disaster recovery across complex hybrid cloud environments.
Configuring DR on an application with minimal steps. Primary site: Diamanti. DR site: AWS cluster
Diamanti uses asynchronous replication to implement DR, and during every replication interval only the incremental data is copied to the DR site. If an application is running on the primary site with a 200 GB volume and only 100 MB of data has effectively changed since the last replication interval, then only those 100 MB of data blocks are copied to the DR site. This saves a lot of storage and networking resources on the DR site, because you don’t need to store multiple copies of the data, which would be very costly, especially on the cloud.
Diamanti, with Spektra, its multi-cluster management solution, can manage clusters on-premises as well as on major cloud providers such as AWS, GCP and Azure. It simplifies application deployment across clusters and streamlines DR setup across hybrid cloud environments. The following combinations are supported:
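The idea of copying only changed blocks can be illustrated with a toy example. This is not Diamanti’s implementation, which the text does not describe: the sketch below assumes a fixed-size volume, detects changes by comparing per-block SHA-256 digests, and uses a hypothetical 4 KiB block size. Production systems typically track changed blocks through snapshots rather than rehashing the whole volume.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """SHA-256 digest of every fixed-size block in the volume image."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(previous: List[bytes], current: bytes,
                   block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Map block index -> block contents for blocks that differ from the
    digests recorded at the previous replication interval."""
    delta = {}
    for index, digest in enumerate(block_digests(current, block_size)):
        if index >= len(previous) or previous[index] != digest:
            start = index * block_size
            delta[index] = current[start:start + block_size]
    return delta


def apply_blocks(replica: bytearray, delta: Dict[int, bytes],
                 block_size: int = BLOCK_SIZE) -> None:
    """Write the shipped blocks into the replica at their original offsets."""
    for index, block in delta.items():
        start = index * block_size
        replica[start:start + len(block)] = block


if __name__ == "__main__":
    primary = bytearray(16 * BLOCK_SIZE)        # 16-block volume, all zeros
    replica = bytearray(primary)                # DR copy after the last interval
    last_sync = block_digests(bytes(primary))

    primary[5 * BLOCK_SIZE + 10] = 0xFF         # writes touch blocks 5 and 9
    primary[9 * BLOCK_SIZE] = 0x01

    delta = changed_blocks(last_sync, bytes(primary))
    apply_blocks(replica, delta)
    print(sorted(delta), replica == primary)    # [5, 9] True
```

Only 2 of the 16 blocks cross the wire here, which is the same effect as shipping 100 MB instead of 200 GB in the example above.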
Primary Site | Disaster Recovery Site |
On-prem cluster | Another on-prem cluster |
On-prem cluster | Any cloud cluster |
Cloud cluster | Same provider cloud cluster in different region |
Cloud cluster | Cloud cluster from a different provider |
Cloud cluster | On-prem cluster |
Diamanti’s Spektra UI showing creation of Diamanti, Azure, AWS and GCP clusters from a single view.
Spektra managing on-prem and AWS cloud cluster from a single page
Diamanti lets you set up any of these configurations seamlessly. The procedure remains the same irrespective of the combination you choose, so no manual effort is required to integrate with different platforms, which makes the solution platform agnostic. The same architecture is leveraged to support stateful workload and application migration from one cluster to another, across different cloud providers.
Diamanti’s DR solution is very easy to configure: it can be set up during application creation or later with a few clicks. Fail-over and fail-back of the application also happen with a single click, irrespective of the platform. Diamanti supports sync and async replication, and the RPO can range from 5 minutes to a few hours.
DR settings at an application level
Fail-over and fire-drill operations are triggered with a single click
Fire Drill Feature
A fire-drill is a zero-downtime DR test that mimics the fail-over of stateful applications across different sites. A successful fire-drill operation gives confidence that stateful applications failing over to the DR site will behave as expected when a real disaster happens. With a single click, a user can create a new instance of the application on another Kubernetes cluster and verify data integrity on the DR site. During the fire-drill, the application in production remains untouched and continues to function with no downtime.
Diamanti’s simplified configuration takes the stress out of DR planning. Diamanti helps you set it up end to end in a few minutes. Regardless of your business vertical and complexity, Diamanti can handle your DR requirements. Contact us to schedule a demo and see the DR implementation in action.