The Need for High Availability

High availability refers to the ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest. High availability is most often achieved through fault tolerance.

The degree of availability can be characterized by orders of magnitude. Unmanaged computer systems on the Internet typically fail every two weeks and take on average 10 hours to recover. These unmanaged computers give about 90 percent steady-state availability. Managed conventional systems fail several times a year, and each failure takes about two hours to repair. This gives 99 percent availability. Fault-tolerant systems fail once every few years and are repaired within a few hours, which gives 99.99 percent availability. High availability systems must have fewer failures and be designed for faster repair. Their requirements are one to three orders of magnitude more demanding than current fault-tolerant technologies. Table 1 shows the availability of typical system classes [GRAY 1991]:

System Type              Unavailability    Availability    Availability
                         (minutes/year)    (in percent)    Class
Unmanaged                50,000            90              1
Managed                  5,000             99              2
Well-managed             500               99.9            3
Fault-tolerant           50                99.99           4
High Availability        5                 99.999          5
Very High Availability   0.5               99.9999         6
Ultra Availability       0.05              99.99999        7
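
The relationship between the columns of Table 1 can be checked with a few lines of Python. This is a minimal sketch, not part of the original source: the function names are illustrative, and it assumes that the availability class is the number of leading nines, that is, the whole-number part of log10(1 / unavailability). The table rounds a year to roughly 500,000 minutes; the sketch uses the exact figure of 525,600.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Expected unavailability in minutes per year for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def availability_class(availability_percent):
    """Number of leading nines (assumed definition of the class column)."""
    unavailability = 1 - availability_percent / 100
    # A small epsilon absorbs floating-point error at exact powers of ten.
    return math.floor(-math.log10(unavailability) + 1e-6)


for percent in (90, 99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    print(percent, availability_class(percent),
          round(downtime_minutes_per_year(percent), 3))
```

For example, 99.999 percent availability corresponds to about 5.3 minutes of downtime per year, which the table rounds to 5.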


High availability requires systems designed to tolerate faults: to detect a fault, report it, and recover from it so that service continues while the faulty component is repaired offline. Beyond the usual hardware and software faults, a high availability system must tolerate other faults.

Since a computer system or a network consists of many parts, all of which usually need to be present for the whole to be operational, much planning for high availability centers on backup and failover processing and on data storage and access. For storage, a redundant array of independent disks (RAID) is one approach. A more recent approach is the storage area network (SAN). Some availability experts emphasize that, for any system to be highly available, the parts of the system should be well designed and thoroughly tested before they are used. For example, a new application program that has not been thoroughly tested is likely to become a frequent point of breakdown in a production system.
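
The effect of needing every part, and the benefit of backing a part up, can be illustrated with the standard series and parallel availability formulas. This is a hedged sketch rather than anything taken from the source: the function names are illustrative, and the formulas assume that parts fail independently of one another, which real systems only approximate.

```python
from math import prod


def series_availability(parts):
    """Availability of a system that needs every part to be up."""
    return prod(parts)


def redundant_availability(part, copies):
    """Availability of one function served by several identical copies,
    where any single working copy is enough."""
    return 1 - (1 - part) ** copies


# Ten parts, each 99.9 percent available, all required:
print(series_availability([0.999] * 10))   # roughly 0.990

# One 99 percent part duplicated, either copy sufficient:
print(redundant_availability(0.99, 2))     # roughly 0.9999
```

Under these assumptions, ten well-managed parts in series behave like a single managed system, while duplicating one managed part lifts it to the fault-tolerant class of Table 1.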

There are multiple techniques available to meet high-availability requirements. These techniques are as follows:
  1. Fault-tolerant hardware:

    The traditional approach to achieving high availability in stand-alone systems is with fault-tolerant hardware. With this technique, redundant hardware is built into the hardware platform, and the active hardware is constantly monitored for failures. When a failure is detected, switchover to the redundant hardware must occur seamlessly (i.e., no calls being handled by the component are impacted). To achieve this seamless "fail-over" to standby hardware, identical software images must be executing on both the primary and redundant hardware.

  2. Fault-tolerant software:

    Fault-tolerant software monitors the "health" of individual software elements, transferring the affected functions to a different or new process upon the detection of a problem. The switchover may be made to a different version of the software or to the same version that was originally experiencing problems. Detecting software problems is much more complex than detecting hardware problems, and while switchover can still be done in a matter of seconds, it is not necessarily instantaneous as with fault-tolerant hardware.

    Fault-tolerant software provides additional benefits. Fault-tolerant software mechanisms can be used to execute different versions of the software as the primary and backup processes, thereby providing the ability to gracefully upgrade the system's software without interrupting normal operation.

  3. Network Redundancy:

    One approach to building highly available networks is to use fault-tolerant network devices throughout the network. This is achieved by providing redundant backups within the device for each of its key components. For example, a highly fault-tolerant switch might be configured with redundant power supplies, cooling fans, switch fabrics, and switch processors, and might also provide for redundant links via interfaces that support dual PHY or multi-linked connections. Another way to build highly available networks is through redundancy in the network topology rather than primarily within the network devices themselves.

    System redundancy blends features of hardware, software and network redundancy.
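
The monitor-and-switchover pattern shared by the techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Replica and FailoverGroup classes are hypothetical, health is modeled as a simple flag rather than a real heartbeat or probe, and the check runs on each request instead of in a background monitor.

```python
class Replica:
    """Hypothetical service replica; `healthy` stands in for a real health probe."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverGroup:
    """Routes requests to the active replica and fails over to the standby."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.events = []  # fault reports, in the order they occurred

    def check(self):
        """Detect a fault on the active replica, report it, and switch over."""
        if self.active.healthy:
            return
        self.events.append(f"fault detected on {self.active.name}")
        if not self.standby.healthy:
            raise RuntimeError("no healthy replica available")
        # The failed replica becomes the standby so it can be repaired offline.
        self.active, self.standby = self.standby, self.active
        self.events.append(f"switched over to {self.active.name}")

    def handle(self, request):
        self.check()
        return self.active.handle(request)


group = FailoverGroup(Replica("primary"), Replica("standby"))
print(group.handle("req-1"))
group.active.healthy = False  # simulate a failure of the primary
print(group.handle("req-2"))
print(group.events)
```

The sketch follows the detect, report, and recover steps described earlier; a production design would also need to keep state synchronized between the replicas so that work in progress survives the switchover.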


The following is a list of links that discuss the need for high availability.

  • High Availability
  • http://www.availability.com/ (Official Website)
  • Ante Uptime - High Availability
  • Making software trustworthy
  • Sun Unix beats Windows in uptime
  • IBM makes server uptime pledge
  • Sun telco server guarantees 99.999% uptime
  • Sun Sets Agenda On Availability with New Sunup Program
  • SunUP Availability Program
  • HP, SAP unite to bolster computer performance
  • HP cuts NT crashes with Marathon deal
  • IBM helps cluster Windows NT
  • HP, Cisco, Oracle outline '5nines 5minutes' high-availability initiative
  • HP 5 nines : 5 Minutes High Availability Solutions
  • IBM Eyes Self-Fixing Computer
  • Interesting NT Stuff
  • Predicting the Crash
  • High Availability Computer Systems, J. Gray and D. P. Siewiorek, IEEE Computer 24, pp. 39-48, Sept. 1991.


  • Last updated on 4 October 2001