DISASTER RECOVERY TOPOLOGIES



When a disaster strikes, an organization may lose its data, its access to data and, with them, its ability to function. Recovering from such a catastrophe is a business imperative. Techniques that focus on quantifying and mitigating risk are key, and they help in deciding what technology to use and how much to spend.

Specifically, this decision revolves around two fundamental business issues for business-critical functions and their associated applications:

  • How much data loss can be tolerated during recovery (The ideal is a high probability of zero data loss.)
  • The acceptable time within which to recover systems (The ideal is seconds.)

These decisions are not independent of one another. Data loss, and in particular loss of data integrity, can significantly increase recovery time. Historically, the emphasis has been on reducing the time required to recover systems, and it was accepted that all recovered systems would lose some data. Today, there is significant pressure on organizations to implement recovery solutions that give a very high probability of zero data loss. This has been a business imperative for the financial industry for some time, but organizations are increasingly finding that recovery is inordinately costly if data is lost, and that the corresponding recovery time is significantly longer. Recovering lost data manually after a disaster becomes more difficult as business processes are increasingly computerized. Automatic remote recovery for all major IT systems can simplify business processes and reduce costs.

Recovery decisions can no longer be made in isolation. Within an organization, the collapse of one system can quickly have a domino effect and bring down other systems. Between organizations, a quick recovery from a regional disaster may mean nothing if an organization’s customers, suppliers and partners cannot recover.

Regulatory and business mandates often include the requirement that at least one backup data site be located a significant distance from the primary data center. Well over 200 miles is usually considered best practice. Recovery solutions, however, have an important technology constraint: zero data loss cannot be achieved over long distances. The practical limit is usually less than 50 miles, or whatever latency the application will tolerate when its data is synchronously replicated. The business implication is that a single recovery node cannot both be placed at sufficient distance and achieve zero data loss. However, by introducing a three-node topology with two data recovery nodes (one at close distance and one at long distance), a very high probability of zero data loss and fast recovery times can be achieved. For most organizations, planning a data recovery system around maximizing the probability of zero data loss will be more cost effective and will provide a simpler and sounder path for future system development.

Two-Node Disaster Recovery Topologies

The predominant topology for disaster recovery is a two-node topology. There are five main disaster recovery technology options, which are described in detail below. There are four main criteria for evaluating them: cost, distance (greater distance reduces the probability of both sites being hit), probability of permanent data loss and recovery time. The table below summarizes the trade-offs between the different technologies.


Consolidated backup. Consolidated tape or disk media backup is the least expensive solution and has the greatest permanent data loss and the slowest recovery times. It is well suited for addressing limited disruptions, such as data corruption. Improved techniques such as disk-to-disk backups and virtual tape can significantly improve efficiency and reduce the time to recovery.

High-availability storage networks. A SAN solution can overcome local server failures by providing access to a standby or clustered server system to ensure continuous operation. Permanent data loss can be low but not zero. However, the distance between the data centers is very short, which increases the probability that a disaster will take out both sites. For this reason, a high-availability storage network is not regarded as a viable disaster recovery topology.

Remote point-in-time update replication. Point-in-time update replication copies the changes made to data to another building or city. Changes can be replicated at scheduled times during the day or whenever changes occur. This technology accommodates any distance requirements, as there are no latency limitations to overcome. It offers faster recovery times than tape backup, but it cannot achieve zero permanent data loss. Data recovery is measured in hours.

Asynchronous replication. Asynchronous replication has significantly lower data recovery times than point-in-time update replication. Asynchronous replication allows the primary and remote copies to be out of synchronization by a range of seconds to minutes. Permanent data loss is low but not zero. One of the challenges of asynchronous data replication has historically been integrity of data. A rapid and automatic recovery process depends on the integrity of the data that has been transmitted. Integrity relies on knowing that the data has been sent in the correct order (packets of data can arrive out of order over telecommunication services).

Modern solutions guarantee packets are time-stamped in the order they were written to disk, ensuring referential integrity at the remote site. This significantly reduces recovery times. These modern “pull” architectures utilize disk-based buffering, and the bandwidth costs are significantly lower compared to those of synchronous replication as they do not have to be configured for peaks.
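The ordering requirement described above can be sketched in a few lines. This is a minimal illustration and not any vendor's implementation: the RemoteReplica class is hypothetical, and a simple sequence number stands in for the time stamp mentioned in the text. The remote copy holds back any write that arrives early and applies writes only once every earlier write has been applied.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteReplica:
    """Hypothetical remote copy that applies writes strictly in sequence order."""
    next_seq: int = 0                             # next sequence number that may be applied
    pending: dict = field(default_factory=dict)   # writes that arrived ahead of their turn
    blocks: dict = field(default_factory=dict)    # block address -> data

    def receive(self, seq: int, block: int, data: str) -> None:
        # Park the write, then apply as many consecutive writes as are available.
        self.pending[seq] = (block, data)
        while self.next_seq in self.pending:
            addr, payload = self.pending.pop(self.next_seq)
            self.blocks[addr] = payload
            self.next_seq += 1


replica = RemoteReplica()
# Writes 0, 1, 2 were issued in that order at the primary but arrive out of order.
replica.receive(1, block=7, data="B")
replica.receive(2, block=7, data="C")
applied_before_gap_filled = dict(replica.blocks)  # nothing applied yet: write 0 is missing
replica.receive(0, block=7, data="A")             # gap filled, all three applied in order
```

Because nothing is applied until the gap is filled, the remote copy never reflects a later write without the earlier ones, which is the property a rapid, automatic recovery depends on.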

Synchronous disk replication. Synchronous replication is suitable for applications that require the fastest recovery with zero permanent data loss. All disk writes are synchronously copied to a remote site across a high-performance network before a transaction is acknowledged, eliminating any transaction loss. This technology is sensitive to network latency, which limits the practical distance between sites to typically less than 50 miles.
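The distance limit can be illustrated with a rough propagation-delay estimate. The sketch below assumes that a signal travels through optical fibre at about 200,000 km per second and counts one round trip per write; both are simplifying assumptions, and the function name is illustrative. Real links add switching, protocol and storage overheads, so actual figures will be higher.

```python
KM_PER_MILE = 1.609344
# Assumption: signal speed in optical fibre is roughly 200,000 km/s
# (about 5 microseconds per kilometre, one way).
FIBRE_KM_PER_SECOND = 200_000.0


def round_trip_ms(distance_miles: float) -> float:
    """Minimum propagation delay added to each synchronous write, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return 2 * distance_km / FIBRE_KM_PER_SECOND * 1000.0


for miles in (10, 50, 200):
    print(f"{miles:>4} miles: at least {round_trip_ms(miles):.2f} ms added per write")
```

Under these assumptions, 50 miles adds roughly 0.8 ms to every acknowledged write and 200 miles roughly 3.2 ms, before any equipment overhead. Whether that is tolerable depends on the application, which is why the text gives the limit as "whatever latency the application will tolerate".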

The table above shows that probable zero permanent data loss can only be achieved over relatively short distances with two-node recovery topologies. Zero permanent data loss cannot be achieved with a two-node topology at the distances usually required by regulation or by the business for recovery sites.

Three-Node Disaster Recovery Topologies

Three-node disaster recovery topologies combine technologies to achieve very high probabilities of zero data loss at long distances. They combine synchronous replication (local recovery node) with asynchronous replication (remote recovery node). The local recovery node can accommodate very rapid recovery with a high probability of zero permanent data loss. Testing of this environment is simplified, and IT personnel can be shared between the primary node and the local recovery node. The remote recovery node provides for recovery with low permanent data loss in the unlikely event that both the primary and local recovery nodes are impacted.

There are two predominant three-node disaster recovery topologies. These are 1) cascade three-node disaster recovery topology and 2) multitarget three-node disaster recovery topology.

Cascade Three-Node Disaster Recovery Topology

This approach is sometimes known as “multihop,” and combines the technologies to provide a high probability of zero permanent data loss for the majority of disaster scenarios over a long distance. Depending on the speed of the long-distance link between the local and remote recovery nodes, what time of day/year the primary node goes down and the complexity of the recovery process, recovery can be made at the remote node in under an hour or within a few hours.

There are two main options within this topology:

a) The local recovery node can be a minimal disk-only “bunker” whose primary function is to ensure that data can continue flowing to bring the remote recovery node completely up to date should the primary node go down. The local recovery node is often an unmanned storage site. This configuration is the most cost-effective way of providing a high probability of zero data loss at a remote recovery node with very good recovery time characteristics.

b) Less frequently, the local recovery node can be a full data center (often with fail-over and fail-back systems). This provides zero data loss and very rapid recovery for disasters at the primary node. Going forward, this configuration is less likely, as the multitarget topology discussed in the next section is cost-effective and gives better protection.

One trade-off with the cascade topology is seen in the following example. If the local recovery node goes down, the remote recovery node is frozen with the data it has received up to that point in time. The organization then has to decide whether to continue running the business's IT systems. If it does, the remote recovery node falls further behind, and if a rolling disaster takes out the primary node as well, there could be significant permanent data loss. If it instead shuts down the systems at the primary node until the local recovery node is recovered, or until a communications link can be established between the primary node and the remote recovery node, the recovery time will be longer, but the probability of permanent data loss will be minimized.
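The arithmetic behind this decision can be sketched as follows. This is a deliberately simplified model: the function name, the constant write rate and the figures in the usage lines are all hypothetical, and writes in flight at the moment the local node fails are ignored.

```python
def remote_lag_after_local_outage(writes_per_minute: int,
                                  outage_minutes: int,
                                  keep_running: bool) -> int:
    """Writes missing at the remote node if the primary were also lost
    at the end of a local-node outage (cascade topology)."""
    if not keep_running:
        # Primary systems are shut down, so no new writes exist to be lost.
        return 0
    # Primary keeps accepting writes that cannot reach the frozen remote node.
    return writes_per_minute * outage_minutes


# Hypothetical figures: 1,000 writes per minute, local node down for 30 minutes.
exposure_if_running = remote_lag_after_local_outage(1_000, 30, keep_running=True)
exposure_if_halted = remote_lag_after_local_outage(1_000, 30, keep_running=False)
```

In this model the exposure grows linearly with the length of the outage when the business keeps running, and stays at zero when it halts. The cost of halting is the downtime itself, which the model does not price.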

For organizations within a small geographical area, the cascade three-node topology makes good business sense. A disaster that takes down both the primary and local recovery sites is likely to affect most local customers. For interstate and international business, and especially for organizations that provide critical infrastructure services, this topology may not meet more exacting requirements.

Multitarget Three-Node Disaster Recovery Topology

The difference between the cascade topology and the multitarget is that in the multitarget topology, the primary data node backs up data to both nodes simultaneously. This is a recent technological capability, and very high-performance controllers are required to manage this process. This approach ensures that there is no permanent data loss if either the primary or local recovery node is lost. Either node can communicate data to the remote recovery node to ensure zero data loss.
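The structural difference between the two topologies can be expressed as a small link table. The sketch below is illustrative only: the node labels and the remote_still_fed helper are assumptions, and the local-to-remote link in the multitarget case represents the ability of either node to send data to the remote recovery node, as the text describes.

```python
# Replication links (source -> targets). Node names are illustrative labels.
CASCADE = {"primary": ["local"], "local": ["remote"]}
MULTITARGET = {"primary": ["local", "remote"], "local": ["remote"]}


def remote_still_fed(links: dict, failed: str) -> bool:
    """True if some surviving node still has a replication link to the remote node."""
    return any("remote" in targets
               for source, targets in links.items()
               if source != failed)


for name, links in (("cascade", CASCADE), ("multitarget", MULTITARGET)):
    for failed in ("primary", "local"):
        status = "still fed" if remote_still_fed(links, failed) else "frozen"
        print(f"{name:<11} with {failed} node lost: remote node is {status}")
```

Only the cascade topology with the local node lost leaves the remote node with no source of updates. In every other single-failure case a surviving node can still reach it.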

To ensure rapid recovery, the storage controller technology has to be able to resynchronize the controllers at the remote recovery node with either the primary or local node, and pass just the changed data (delta resynchronization). In the cascade topology, if the local recovery node is down, no data can be transferred to the remote recovery node, as discussed above.
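Delta resynchronization can be illustrated with a simple changed-block tracker. This is a minimal sketch assuming a hypothetical TrackedVolume class. Real storage controllers track changes in their own ways, and nothing here reflects a specific product.

```python
class TrackedVolume:
    """Hypothetical volume that records which blocks changed since the last sync."""

    def __init__(self):
        self.blocks = {}    # block address -> data
        self.dirty = set()  # addresses written since the last delta was taken

    def write(self, addr, data):
        self.blocks[addr] = data
        self.dirty.add(addr)

    def delta(self):
        """Return only the blocks changed since the last sync, then reset tracking."""
        changed = {addr: self.blocks[addr] for addr in self.dirty}
        self.dirty.clear()
        return changed


source = TrackedVolume()
remote = {}

for addr in range(1000):
    source.write(addr, "v1")
remote.update(source.delta())   # initial copy transfers all 1,000 blocks

source.write(3, "v2")
source.write(42, "v2")
resync = source.delta()         # only the two changed blocks are transferred
remote.update(resync)
```

Sending two blocks instead of re-copying all 1,000 is what keeps resynchronization fast enough to support rapid recovery after a node switch.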

The major disadvantage of the multitarget topology is the higher cost of telecommunication lines. A major advantage is that if there are backup servers in the local recovery node, there can be fail-over and fail-back between the primary and local nodes. This significantly improves recovery times and simplifies the testing of recovery procedures. In the analysis of the multitarget option in both case studies, the additional cost of backup servers at the local recovery node is assumed.

 

 
