These decisions are not independent of
one another. Data loss, and in particular loss of data integrity,
can significantly increase recovery time. Historically,
the emphasis has been on reducing the time required to recover
systems, and some data loss on recovery was accepted. Today, there
is significant pressure on organizations to implement recovery
solutions that give a very high probability of zero data
loss. Zero data loss has been a business imperative for the financial
industry for some time, but organizations generally are increasingly
finding that recovery is inordinately costly if data is lost,
and that the corresponding recovery time is significantly increased.
The business processes to recover lost data manually after
a disaster become more difficult as processes are increasingly
computerized. Automatic remote recovery for all major IT
systems can simplify business processes and reduce costs.
Recovery decisions can no longer be made
in isolation. Within an organization, the collapse of one
system can quickly have a domino effect and bring down other
systems. Between organizations, a quick recovery from a
regional disaster may mean nothing if an organization’s
customers, suppliers and partners cannot recover.
Part of these mandates is the requirement
that at least one backup data site is located a significant
distance from the primary data center. Well over 200 miles
is usually considered best practice. Recovery solutions
have an important technology constraint: zero data loss
cannot be achieved over long distances – the practical
limitation is usually less than 50 miles or whatever latency
the application will tolerate when its data is synchronously
replicated. The business implication of this fact is that
a single recovery node cannot be both far enough away to meet the
distance requirement and close enough to achieve zero data loss. However, by introducing
a three-node topology with two data recovery nodes (one
at close distance and one at long distance), a very high
probability of zero data loss and fast recovery times can
be achieved. For most organizations, planning a data recovery
system to maximize the probability of
zero data loss will be more cost-effective and will provide
a simpler and sounder path for future system development.
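The latency constraint on synchronous replication can be illustrated with a rough calculation. The sketch below is not taken from any vendor's sizing method; it assumes light travels through optical fibre at roughly 200,000 km/s (about 5 microseconds per kilometre one way) and that each write needs one round trip to the recovery node. The function name and parameters are hypothetical. Real links add switching and protocol overhead and rarely follow a straight line, so the figures are best read as lower bounds.

```python
# Rough lower bound on the latency that synchronous replication adds per write.
# Assumption: about 5 microseconds of one-way propagation delay per km of fibre.
# Switching, protocol overhead and indirect cable routes are ignored.

KM_PER_MILE = 1.609344
FIBRE_US_PER_KM = 5.0  # assumed one-way delay, microseconds per kilometre


def sync_write_penalty_ms(distance_miles: float, round_trips: int = 1) -> float:
    """Return the minimum extra latency per write, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_us = distance_km * FIBRE_US_PER_KM
    # Each round trip covers the distance twice (write out, acknowledgement back).
    return 2 * one_way_us * round_trips / 1000.0


if __name__ == "__main__":
    for miles in (10, 50, 200, 1000):
        penalty = sync_write_penalty_ms(miles)
        print(f"{miles:>5} miles: at least {penalty:.2f} ms added per write")
```

Whether a given penalty is acceptable depends on the application, which is why the limit is expressed above as whatever latency the application will tolerate.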
Two-Node Disaster Recovery Topologies
The predominant topology for disaster
recovery is a two-node topology. There are five main disaster
recovery technology options, which are described in detail
below. There are four main criteria for evaluating them:
cost, distance (greater distance reduces the probability
of both sites being hit), probability of permanent data
loss and recovery time. The table below summarizes the
trade-offs between the different technologies.

• Consolidated backup.
Consolidated tape or disk media backup is the least expensive
solution and has the greatest permanent data loss and the
slowest recovery times. It is well suited for addressing
limited disruptions, such as data corruption. Improved techniques
such as disk-to-disk backups and virtual tape can significantly
improve efficiency and reduce the time to recovery.
• High-availability storage networks.
A SAN solution can overcome local server
failures by providing access to a standby or clustered server
system to ensure continuous operation. Permanent data loss
can be low but not zero. However, the distance between the
data centers is very short, which increases the probability
that a disaster will take out both sites. For that reason it is not regarded
as a viable disaster recovery topology.
• Remote point-in-time update replication.
Point-in-time update replication copies
the changes made to data to another building or city. Changes
can be replicated at scheduled times during the day or whenever
changes occur. This technology accommodates any distance
requirements, as there are no latency limitations to overcome.
It offers faster recovery times than tape backup, but it
cannot achieve zero permanent data loss. Data recovery is
measured in hours.
• Asynchronous replication.
Asynchronous replication has significantly lower data recovery
times than point-in-time update replication. Asynchronous
replication allows the primary and remote copies to be out
of synchronization by a range of seconds to minutes. Permanent
data loss is low but not zero. One of the challenges of
asynchronous data replication has historically been integrity
of data. A rapid and automatic recovery process depends
on the integrity of the data that has been transmitted.
Integrity relies on knowing that the data has been sent
in the correct order (packets of data can arrive out of
order over telecommunication services).
Modern solutions guarantee packets are time-stamped in the
order they were written to disk, ensuring referential integrity
at the remote site. This significantly reduces recovery
times. These modern “pull” architectures utilize
disk-based buffering, and the bandwidth costs are significantly
lower compared to those of synchronous replication as they
do not have to be configured for peaks.
• Synchronous disk replication.
Synchronous replication is suitable for applications that
require the fastest recovery with zero permanent data loss.
All disk writes are synchronously copied to a remote site
across a high-performance network before a transaction is
acknowledged, eliminating any transaction loss. This technology
is sensitive to network latency, which limits the practical
distance between sites to typically less than 50 miles.
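The ordering requirement described under asynchronous replication can be sketched in a few lines. This is an illustrative model only, not a description of any particular product: it assumes each write carries a sequence number assigned in the order it was written at the primary (standing in for the time stamps mentioned above), and the class and attribute names are hypothetical. Writes that arrive early are buffered and applied only once every earlier write has been applied, so the remote copy always reflects a consistent point in the primary's history.

```python
import heapq


class OrderedApplier:
    """Applies replicated writes strictly in the order they were issued."""

    def __init__(self):
        self.next_seq = 1   # sequence number expected next
        self.pending = []   # min-heap of (seq, block, data) that arrived early
        self.volume = {}    # block id -> data; stands in for the remote disk
        self.applied = []   # sequence numbers in the order they were applied

    def receive(self, seq, block, data):
        """Accept one write from the network, in whatever order it arrives."""
        if seq < self.next_seq:
            return  # duplicate of a write that was already applied
        heapq.heappush(self.pending, (seq, block, data))
        # Drain every buffered write that is now contiguous with the applied ones.
        while self.pending and self.pending[0][0] <= self.next_seq:
            s, b, d = heapq.heappop(self.pending)
            if s < self.next_seq:
                continue  # duplicate that was still sitting in the buffer
            self.volume[b] = d
            self.applied.append(s)
            self.next_seq += 1


if __name__ == "__main__":
    remote = OrderedApplier()
    # Writes 1..4 arrive out of order over the long-distance link.
    for seq, block, data in [(2, "A", "a2"), (1, "A", "a1"),
                             (4, "B", "b4"), (3, "B", "b3")]:
        remote.receive(seq, block, data)
    print(remote.applied, remote.volume)
```

A heap keeps the buffered writes sorted by sequence number, so the applier only ever needs to look at the smallest pending entry.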
The table above shows that a high probability of zero permanent
data loss can only be achieved over relatively short distances
with two-node recovery topologies. It
cannot be achieved with a two-node topology at the distances
usually required by regulation or by the business for recovery
sites.
Three-Node Disaster Recovery Topologies
Three-node disaster recovery topologies
combine technologies to achieve very high probabilities
of zero data loss at long distances. They pair synchronous
replication (local recovery node) with asynchronous replication
(remote recovery node). The local recovery node can accommodate
very rapid recovery with a high probability of zero permanent
data loss. Testing of this environment is simplified, and
IT personnel can be shared between the primary node and
the local recovery node. The remote recovery node provides
for recovery with low permanent data loss in the
unlikely event that both the primary and local recovery
nodes are impacted.
There are two predominant three-node disaster
recovery topologies: 1) the cascade three-node disaster
recovery topology and 2) the multitarget three-node disaster
recovery topology.
Cascade Three-Node Disaster Recovery
Topology
This approach is sometimes known as “multihop,”
and combines the technologies to provide a high probability
of zero permanent data loss for the majority of disaster
scenarios over a long distance. Depending on the speed of
the long-distance link between the local and remote recovery
nodes, what time of day/year the primary node goes down
and the complexity of the recovery process, recovery can
be made at the remote node in under an hour or within a
few hours.

There are two main options within this
topology:
One trade-off with cascade topology is
seen in the following example. In the event the local recovery
node goes down, the remote recovery node will be frozen
with the data it has received at that point in time. The
organization will then have to decide whether to continue
to run the business’s IT systems. If it does, the
remote recovery node will get further behind, and in the
event of a rolling disaster taking out the primary node
as well, there could be significant permanent data loss.
If it chooses to close down the systems at the primary node
until the local recovery node is recovered or a communications
link can be established between the primary node and the
remote recovery node, the recovery time will be lengthened,
but the probability of permanent data loss will be minimized.
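The trade-off in this example can be made concrete with a toy model. The sketch below is a deliberate simplification and all names in it are hypothetical: it counts writes rather than modelling real storage, and it assumes the asynchronous hop to the remote node is fully caught up whenever the local recovery node is running. It shows how the data at risk grows if the primary keeps accepting writes while the local recovery node is down.

```python
from dataclasses import dataclass


@dataclass
class CascadeState:
    """Toy model of a cascade topology: primary -> local -> remote."""

    primary_writes: int = 0  # writes committed at the primary node
    local_writes: int = 0    # writes held at the local recovery node
    remote_writes: int = 0   # writes held at the remote recovery node
    local_up: bool = True    # whether the local recovery node is running

    def write(self, count: int = 1) -> None:
        """Commit writes at the primary; they cascade only while local is up."""
        self.primary_writes += count
        if self.local_up:
            self.local_writes = self.primary_writes  # synchronous hop
            self.remote_writes = self.local_writes   # async hop, assumed caught up

    def loss_if_primary_fails(self) -> int:
        """Writes that would be lost if the primary node failed right now."""
        local_copy = self.local_writes if self.local_up else 0
        return self.primary_writes - max(local_copy, self.remote_writes)


if __name__ == "__main__":
    state = CascadeState()
    state.write(1000)        # normal operation
    state.local_up = False   # local recovery node goes down; remote is frozen
    print("exposure after halting the primary:", state.loss_if_primary_fails())
    state.write(250)         # the business keeps running instead
    print("exposure after continuing:", state.loss_if_primary_fails())
```

Halting the primary keeps the exposure at zero at the price of downtime, while continuing to run lets the exposure grow with every write until the cascade path is restored.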
For organizations within a small geographical
area, the cascade three-node topology makes good business
sense. A disaster that takes down both the primary and local
recovery sites is likely to affect most local customers.
For interstate and international business, and especially
for organizations that provide critical infrastructure services,
this topology may not meet more exacting requirements.
Multitarget Three-Node Disaster
Recovery Topology
The difference between the cascade topology
and the multitarget topology is that in the latter,
the primary node replicates data to both recovery nodes simultaneously.
This is a recent technological capability, and very high-performance
controllers are required to manage this process. This approach
ensures that there is no permanent data loss if either the
primary or local recovery node is lost. Either node can
communicate data to the remote recovery node to ensure zero
data loss.

To ensure rapid recovery, the storage controller
technology has to be able to resynchronize the controllers
at the remote recovery node with either the primary or local
node, and pass just the changed data (delta resynchronization).
In the cascade topology, if the local recovery node is down,
no data can be transferred to the remote recovery node,
as discussed above.
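The idea behind delta resynchronization can be sketched as follows. This is a minimal illustration under stated assumptions, not how any specific storage controller implements it: the source node is assumed to track which blocks have changed since the last resynchronization, and only those blocks are sent to the remote recovery node. The names `ChangeTracker` and `resynchronize` are hypothetical.

```python
class ChangeTracker:
    """Tracks which blocks have changed since the last resynchronization."""

    def __init__(self):
        self.blocks = {}    # block id -> current data on this node
        self.dirty = set()  # block ids written since the last resync

    def write(self, block, data):
        """Store a block and remember that it has changed."""
        self.blocks[block] = data
        self.dirty.add(block)

    def delta(self):
        """Return only the changed blocks and reset the change set."""
        changed = {b: self.blocks[b] for b in sorted(self.dirty)}
        self.dirty.clear()
        return changed


def resynchronize(source, target_blocks):
    """Copy only the changed blocks to the target; return how many were sent."""
    changed = source.delta()
    target_blocks.update(changed)
    return len(changed)


if __name__ == "__main__":
    node = ChangeTracker()
    remote_copy = {}
    for i in range(1000):
        node.write(i, f"v1-{i}")
    print("initial copy sent", resynchronize(node, remote_copy), "blocks")
    node.write(42, "v2-42")  # a single block changes afterwards
    print("delta resync sent", resynchronize(node, remote_copy), "blocks")
```

Sending only the changed blocks is what keeps resynchronization fast; a full copy over the long-distance link would take far longer.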
The major disadvantage of the multitarget
topology is the higher cost of telecommunication lines.
A major advantage is that if there are backup servers in
the local recovery node, there can be failover and fail-back
between the primary and local nodes. This significantly
improves recovery times and simplifies the testing of recovery procedures.
In the analysis of the multitarget option in both case studies,
the additional cost of backup servers at the local recovery
node is assumed.