The prevailing assumption is that containers are ideal only for stateless applications. However, organizations increasingly see the value of containerizing databases for the same reasons that they containerize their web applications: the ability to release more frequently, the ease of moving from development to staging to production, and the portability to run the same workload on any infrastructure. In fact, a recent Diamanti survey found that databases have emerged as a top use case for container adoption.
Cloud native infrastructure inherently enables stateful applications to take advantage of elasticity and flexibility. However, catastrophic events such as hardware failures, power failures, and natural disasters can cause data to be lost or to remain unavailable for an extended period of time, which makes recovery tricky for stateful applications. One important requirement for cloud native storage is therefore the ability to recover seamlessly from such events.
Two key parameters for measuring storage availability services are:
- Recovery Time Objective (RTO)
How much time will it take to recover the data? Some databases are more critical than others. It is important to have minimal downtime and get back to normalcy as soon as possible.
- Recovery Point Objective (RPO)
How far back can the data be recovered? An application may or may not experience loss of data depending on the type of event. Each data service provides its own recovery point.
The appropriate data services should be chosen based on the application’s RTO and RPO requirements for different types of failures.
Tape backup is one of the oldest methods of data protection. Even though tape backup is reliable, it might take days to back up and recover data, especially if the tapes are stored offsite. Backups to the cloud, a hard drive, or an optical drive are more commonplace today, but they still require a full recovery of the workload, which can take several hours (or days).
Snapshots can be taken more frequently than tape backups, resulting in a lower RPO. Recovery time from a snapshot depends on the underlying storage architecture and on whether data must be copied during the restore process. Because a snapshot is usually stored locally, it is good for rewinding data to an earlier state but does not protect against loss of the underlying storage.
Asynchronous replication, at either the volume level or the application level, can help to recover data with an RPO and RTO of minutes or even seconds.
Mirroring is the process of synchronous volume replication across storage devices or availability zones. Mirroring has an RPO and RTO of almost zero because no data is lost and no data copy is involved in the restoration process.
The diagram below demonstrates various storage services that can be used to recover the data and shows how they fare in terms of RPO and RTO.
Figure 1: Comparison of RPO and RTO for different storage services
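The comparison above can be turned into a simple selection rule: given an application's maximum tolerable RPO and RTO, keep only the services that fit within both limits. The sketch below is illustrative only. The `DataService` type, the `services_meeting` function, and every number in the table are assumptions chosen to match the orders of magnitude described in the text, not measurements of any product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataService:
    name: str
    rpo: timedelta  # assumed worst-case window of lost data
    rto: timedelta  # assumed worst-case time to restore service


# Illustrative figures only (assumptions): rough orders of magnitude
# suggested by the text, not vendor-published numbers.
SERVICES = [
    DataService("tape backup", rpo=timedelta(days=1), rto=timedelta(days=2)),
    DataService("cloud or disk backup", rpo=timedelta(days=1), rto=timedelta(hours=6)),
    DataService("snapshot", rpo=timedelta(hours=1), rto=timedelta(minutes=10)),
    DataService("asynchronous replication", rpo=timedelta(minutes=5), rto=timedelta(minutes=5)),
    DataService("mirroring", rpo=timedelta(0), rto=timedelta(seconds=30)),
]


def services_meeting(max_rpo, max_rto):
    """Return the names of services whose RPO and RTO fit within both limits."""
    return [s.name for s in SERVICES if s.rpo <= max_rpo and s.rto <= max_rto]


if __name__ == "__main__":
    # An application that tolerates 15 minutes of data loss and downtime.
    print(services_meeting(timedelta(minutes=15), timedelta(minutes=15)))
```

A real evaluation would also account for the type of failure: a local snapshot, for example, may have a short RTO yet offer no protection if the node holding it is lost.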
Diamanti Storage Services
Diamanti Enterprise Kubernetes Platform provides both a Kubernetes control plane and a single pane of glass for storage management. It offers data protection, backup and restore, disaster recovery, inbuilt high availability, and application mobility across clusters, providing a resilient and powerful platform for containerized applications – including database workloads. Diamanti’s unique storage and networking architecture helps deliver these services, at 1 million IOPS per server and less than 100 microseconds of latency with guaranteed Quality of Service (QoS), right out of the box.
Diamanti supports the following enterprise-level data services:
- Mirroring of volumes within a cluster
- Snapshot of volumes within a cluster
- Backup of volumes to remote backup targets
- Replication of volumes to another cluster
Mirroring
Diamanti’s storage architecture allows volumes to be created with multiple mirrors to ensure the high availability of data. The mirrors are spread across different nodes in the cluster, which protects against data loss if a node or drive fails. Every write to a mirrored volume is synchronously written to all of its mirrors. A new mirror can be added to an existing volume; doing so triggers data synchronization so that the new mirror holds the same data as the existing volume. If the node hosting a stateful application fails, the application can fail over to another node without any data loss and with minimal downtime. This ensures high availability of data to the application despite node failures in the cluster.
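The behavior described above (synchronous writes to every mirror, resynchronization when a mirror is added, and continued reads after a node fails) can be sketched with a toy in-memory model. This is a minimal sketch, not the platform's implementation: the `MirroredVolume` class and its method names are hypothetical, and a block is modeled as a dictionary entry.

```python
class MirroredVolume:
    """Toy model of a volume whose writes go synchronously to every mirror."""

    def __init__(self, mirror_nodes):
        # One block map (block index -> bytes) per node that hosts a mirror.
        self.mirrors = {node: {} for node in mirror_nodes}

    def write(self, block, data):
        # The write completes only after every mirror has applied it.
        for blocks in self.mirrors.values():
            blocks[block] = data

    def add_mirror(self, node):
        # A new mirror is first synchronized from an existing mirror.
        source = next(iter(self.mirrors.values()))
        self.mirrors[node] = dict(source)

    def fail_node(self, node):
        # Losing a node removes only the mirror hosted on that node.
        del self.mirrors[node]

    def read(self, block):
        # Any surviving mirror can serve the read.
        if not self.mirrors:
            raise RuntimeError("no surviving mirrors")
        return next(iter(self.mirrors.values()))[block]


if __name__ == "__main__":
    vol = MirroredVolume(["node-1", "node-2"])
    vol.write(0, b"orders")
    vol.fail_node("node-1")
    print(vol.read(0))  # still readable from the mirror on node-2
```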
Figure 2: Mirroring on a Diamanti Cluster
The Diamanti platform also supports multi-zone mirroring, where a volume’s mirrors are spread across different availability zones. Each zone represents a failure domain. Diamanti’s volume placement strategy allows stateful applications to tolerate the loss of a zone without any impact on the data. If one of the zones fails, the application can fail over to another zone.
Figure 3: Mirroring across multiple AZs
Snapshot
A snapshot represents a point-in-time copy of data. The Diamanti platform supports instant, zero-copy snapshot creation for a volume. Snapshots are space-optimized because they share blocks with their volume: the space required for a new snapshot depends on the changes made to the volume since the last snapshot was taken. Snapshots protect against accidental data loss. They reside on the node where the volume exists, and a local snapshot can be used to restore the volume instantly to a specific point in time. The Diamanti platform also uses snapshots as the consistent point-in-time source for backup and replication.
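To illustrate why a snapshot that shares blocks with its volume is cheap to create and fast to restore, the sketch below models a volume as a map from block index to data. It is an assumption-laden toy, not the platform's on-disk format: the `Volume` class and its methods are hypothetical, and copying the block map stands in for sharing block references without copying block data.

```python
class Volume:
    """Toy volume whose snapshots share unchanged blocks with the live data."""

    def __init__(self):
        self.blocks = {}     # block index -> bytes (live view)
        self.snapshots = []  # frozen block maps, oldest first

    def write(self, block, data):
        # Writes update only the live map; existing snapshot maps are untouched.
        self.blocks[block] = data

    def snapshot(self):
        # Copies the block map (references) only, never the block data.
        self.snapshots.append(dict(self.blocks))
        return len(self.snapshots) - 1

    def changed_since(self, snap_id):
        # Blocks that differ from the snapshot: a proxy for the extra space
        # a new snapshot would need.
        snap = self.snapshots[snap_id]
        keys = snap.keys() | self.blocks.keys()
        return sorted(k for k in keys if snap.get(k) != self.blocks.get(k))

    def restore(self, snap_id):
        # Restoring points the live view back at the snapshot's blocks.
        self.blocks = dict(self.snapshots[snap_id])


if __name__ == "__main__":
    vol = Volume()
    vol.write(0, b"v1")
    snap = vol.snapshot()
    vol.write(0, b"v2")
    print(vol.changed_since(snap))
    vol.restore(snap)
    print(vol.blocks)
```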
Figure 4: Snapshot on a Diamanti Cluster
Backup and Restore
The Diamanti platform offers a data protection solution that can back up all application volumes to remote backup targets. The backup targets can be NFS, iSCSI, or cloud object storage.
Initiating a backup for a volume doesn’t have any impact on the application’s performance or its availability. Backups are normally started by creating a snapshot of a volume and then copying the snapshot data to the backup target. These snapshots are retained in the cluster and purged periodically based on the specified backup policy.
Diamanti’s backup controller, integrated with Kubernetes, allows users to configure policies such as the backup frequency, the retention period for backups, and whether compression is enabled. These backups ensure that application data is available for restore in case of a disaster.
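The policy settings mentioned above (frequency, retention period, and compression) drive two recurring decisions: when the next backup is due and which old backups can be purged. The following sketch shows that logic under stated assumptions. `BackupPolicy`, `is_backup_due`, and `expired_backups` are hypothetical names invented for illustration; they do not reflect the actual schema or API of the backup controller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupPolicy:
    # Hypothetical field names; the real policy schema is not shown here.
    frequency: timedelta      # how often a backup should run
    retention: timedelta      # how long a backup is kept
    compression: bool = False


def is_backup_due(policy, last_backup, now):
    """A backup is due if none exists yet or the frequency interval has elapsed."""
    return last_backup is None or now - last_backup >= policy.frequency


def expired_backups(policy, backup_times, now):
    """Return the timestamps of backups older than the retention period."""
    return [t for t in backup_times if now - t > policy.retention]


if __name__ == "__main__":
    policy = BackupPolicy(frequency=timedelta(hours=6), retention=timedelta(days=7))
    now = datetime(2024, 1, 10, 12, 0)
    history = [datetime(2024, 1, 1), datetime(2024, 1, 9)]
    print(is_backup_due(policy, history[-1], now))
    print(expired_backups(policy, history, now))
```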
Figure 5: Backup on a Diamanti Cluster
Replication for Disaster Recovery
The Diamanti platform provides a Disaster Recovery (DR) solution using volume replication across multiple clusters. Volume replication periodically sends the volume data to remote clusters to protect against primary site failure. The clusters can be deployed in multiple regions for fault isolation. Since these clusters are separated by long distances, synchronous replication becomes challenging. Hence, volumes across multiple clusters are asynchronously replicated.
Volume replication is configured via a Diamanti-defined Custom Resource (CR), a Kubernetes mechanism that allows third-party functionality to integrate natively with Kubernetes APIs. Diamanti Volume Replication allows users to configure the parameters required for replication, such as the Persistent Volume Claims (PVCs), the replication schedule, and DR site information. Users can set the replication schedule based on an application’s tolerance for loss of data. This allows a stateful application to fail over to another cluster in case of a disaster.
Volume replication across clusters is built using snapshot technology. On every replication cycle, a snapshot of the volume attached to an application is created on the primary cluster. The replication agent on the primary cluster compares the snapshot data against the previous snapshot (if any). The differences are sent to the DR cluster’s replication agent, which syncs the data to the corresponding volume on the DR cluster. Once replication is complete, a snapshot of the replicated volume is taken on the DR cluster. This ensures that the volume’s snapshots on both clusters are consistent.
In case of a disaster, the application can be started instantly with the replicated volume using the NVMe SSDs on the Diamanti platform. The snapshots on the primary and DR clusters are purged periodically based on the specified replication policy.
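The replication cycle described above can be sketched as a snapshot diff followed by applying that diff on the DR side. This is a simplified model under stated assumptions, not the replication agent's actual protocol: volumes and snapshots are modeled as dictionaries from block index to bytes, and the function names `snapshot_delta`, `apply_delta`, and `replicate` are hypothetical.

```python
def snapshot_delta(previous, current):
    """Compare two snapshots (block index -> bytes) and return what changed."""
    previous = previous or {}
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    removed = [k for k in previous if k not in current]
    return changed, removed


def apply_delta(dr_volume, changed, removed):
    """Sync the DR-side volume by applying only the differences."""
    dr_volume.update(changed)
    for k in removed:
        dr_volume.pop(k, None)


def replicate(primary_volume, previous_snapshot, dr_volume):
    """Run one replication cycle and return the snapshot kept on each cluster."""
    primary_snapshot = dict(primary_volume)   # snapshot on the primary cluster
    changed, removed = snapshot_delta(previous_snapshot, primary_snapshot)
    apply_delta(dr_volume, changed, removed)  # send and apply the differences
    dr_snapshot = dict(dr_volume)             # snapshot on the DR cluster
    return primary_snapshot, dr_snapshot


if __name__ == "__main__":
    primary, dr = {0: b"a", 1: b"b"}, {}
    snap1, _ = replicate(primary, None, dr)   # first cycle sends everything
    primary[1] = b"B"
    snap2, dr_snap = replicate(primary, snap1, dr)  # later cycles send only changes
    print(dr == primary, snap2 == dr_snap)
```

Because only the differences between consecutive snapshots cross the wire, the amount of data sent per cycle tracks the application's write rate rather than the volume's size, which is what makes a short replication schedule practical in this model.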
Figure 6: Replication for Disaster Recovery
The Diamanti Enterprise Kubernetes Platform is the industry’s only hyperconverged infrastructure (HCI) that provides enterprise-class storage services for stateful containerized applications. With storage services that enable high availability (HA) and disaster recovery (DR), Diamanti provides an efficient and cost-effective failover mechanism that protects against catastrophic events and significantly reduces application downtime. Thus, the Diamanti platform not only provides high performance, isolation, and portability, but also makes it practical to run stateful applications on Kubernetes in production with built-in protection against failures.