[Figure: a quantity (symbol lost) plotted as a function of a mean time 1/(rate), in seconds, with three other mean times fixed at 10,000 h, 2 h and 5 min.]

OS. We assume here that hardware failures
are permanent and hence require a repair or replacement action while OS failures
are cleared by a reboot. Repair or reboot takes place at rates
and b for the
hardware and OS, respectively. A node is considered down when either the OS or
the hardware has failed. The cluster is down when both nodes have failed. In case
of a hardware failure in one node and an OS failure in the other, the OS is always
recovered first.
The CTMC corresponding to this cluster system is shown in Figure 8.35. In
state 1, both nodes and their OSs are functioning properly. In state 2, one of the
nodes has a hardware failure, and in state 3, both nodes have hardware failures.
In state 4, one of the OSs has failed, while in state 5, both OSs have failed. In state
6, one node has a hardware failure while the other has an OS failure. For the
steady-state balance equations we have
, to
obtain the steady state probabilities as shown below:
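
The same kind of balance equations can also be solved numerically. The following is a minimal sketch, not the book's derivation: it builds a generator matrix for the six states described above and solves pi Q = 0 together with the normalization condition. The transition structure is an assumption read off the prose (Figure 8.35 is not reproduced here): each node fails independently, one repair or reboot is in progress at a time, and from state 6 only the OS reboot is possible. The parameter names `hw_fail`, `os_fail`, `hw_repair` and `os_reboot` and the numeric values are hypothetical.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    # Transpose Q so that each row is one balance equation, then replace
    # the last (redundant) equation with the normalization condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def cluster_generator(hw_fail, os_fail, hw_repair, os_reboot):
    """Generator matrix for the 6-state cluster model (ASSUMED transitions)."""
    rates = {
        (1, 2): 2 * hw_fail,   # either node's hardware fails
        (1, 4): 2 * os_fail,   # either node's OS fails
        (2, 1): hw_repair,     # failed hardware repaired
        (2, 3): hw_fail,       # hardware of the remaining node fails
        (2, 6): os_fail,       # OS of the remaining node fails
        (3, 2): hw_repair,     # assumption: one hardware repair at a time
        (4, 1): os_reboot,     # failed OS rebooted
        (4, 5): os_fail,       # OS of the remaining node fails
        (4, 6): hw_fail,       # hardware of the remaining node fails
        (5, 4): os_reboot,     # assumption: one reboot at a time
        (6, 2): os_reboot,     # OS is always recovered first
    }
    Q = [[0.0] * 6 for _ in range(6)]
    for (i, j), rate in rates.items():
        Q[i - 1][j - 1] = rate
    for i in range(6):
        Q[i][i] = -sum(Q[i])   # diagonal makes each row sum to zero
    return Q


# Illustrative rates per hour (hypothetical values, not from the text).
Q = cluster_generator(hw_fail=0.001, os_fail=0.01, hw_repair=0.5, os_reboot=12.0)
pi = steady_state(Q)
# The cluster is up unless both nodes are down (states 3, 5 and 6).
availability = pi[0] + pi[1] + pi[3]
print("steady-state probabilities:", pi)
print("cluster availability:", availability)
```

If the actual figure uses different rates out of states 3 and 5 (for example, two repairs in parallel), only the corresponding entries of `rates` need to change; the solver is unaffected.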



p, running on a machine (node)
that fails independently at the rate
m.
An automatic failure detection mechanism
based on polling is installed. Assume that the mean time to detect
server process failure is
and the mean time to detect machine failure is
.
Furthermore, when the machine is detected to have failed, the server
process is started on another machine, if available. The mean restart
time of a machine is
. When only the
server process is detected to have failed, it is automatically restarted on the same
machine. For details on process and machine failure detection and recovery, see
papers by Garg et al. and Huang and Kintala [GARG 1999, HUAN 1993]. The
mean restart time of the server software is
.
Typically,
p >
m.
There is a small probability 1-C that the process restart on the same machine is unsuccessful,
in which case it is restarted on another machine, if available. Such a scheme of
automatic restart after failures is also called cold replication [HUAN 1993]. The
Web server is considered available when the server process as well as the machine
it is running on are up. We calculate the steady-state availability of the server,
assuming that no further failures can occur after a failure of either the process or
the machine until it has been dealt with.
, where Pp and Pm
denote the status of the primary server process and machine, and
Sp and Sm
denote the status of the spare server process and machine, respectively.
A status of "1" indicates that the process or machine is up,
a status of "0" indicates 

m
denotes the starting of the process on a machine. This happens as a
result of a failover (after reaching state 5 from state 4), or after recovery of a machine
after a complete crash (after reaching state 5 from state 10). Note that although the
server is unavailable in states 2, 3 and 8, the failure is not observable until it is
detected.
The steady-state probabilities π_i, i = 1, 2, ..., 10 can be derived as







Figure 8.30: Two-component availability model with imperfect coverage

Figure 8.32: Two-component availability model with imperfect coverage and detection delay