Applications

The following examples are taken from Chapter 8 of Dr. Kishor Trivedi's book Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd edition, John Wiley & Sons, 2001.


Example 8.23

We now combine the deleterious effects of the detection delay and imperfect coverage, as shown in the CTMC model of Figure 8.32. States 1D and 1C are both delay states. The delay in state 1D will be of the order of seconds, while that in state 1C will be of the order of minutes.

Solving the steady-state balance equations, we obtain the steady-state probabilities:



where



Assume both states 1D and 1C are system down states; then the steady-state unavailability is



and the downtime in minutes per year is



In Figure 8.33 we have plotted $D(\delta, \beta, c)$ as a function of $1/\delta$ (in seconds) for $1/\lambda = 10{,}000$ h, $1/\mu = 2$ h, and $1/\beta = 5$ min.

Figure 8.33: Downtime due to imperfect coverage and detection delay
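
The downtime computation in Example 8.23 can be sketched numerically. The Python snippet below is a minimal sketch, not the book's derivation. Figure 8.32 is not reproduced here, so the transition structure (2 to 1D at rate 2λ, 1D to 1 at cδ, 1D to 1C at (1 − c)δ, 1C to 1 at β, 1 to 0 at λ, and repairs 1 to 2 and 0 to 1 at μ), the symbol names, the function names, and the coverage values are all assumptions made for illustration. States 0, 1D, and 1C are counted as down, and a year is taken as 8760 hours.

```python
MINUTES_PER_YEAR = 8760 * 60  # 365 days of 24 hours, in minutes


def two_component_probs(lam, mu, delta, beta, c):
    """Steady-state probabilities for the ASSUMED chain described above.

    All rates are per hour; c is the coverage probability.
    """
    # Unnormalized probabilities relative to state 2, read off the
    # balance equations of the assumed transition structure.
    weights = {
        "2": 1.0,
        "1D": 2 * lam / delta,            # delta * pi_1D = 2 * lam * pi_2
        "1C": 2 * lam * (1 - c) / beta,   # beta * pi_1C = (1 - c) * delta * pi_1D
        "1": 2 * lam / mu,                # mu * pi_1 = 2 * lam * pi_2
        "0": 2 * lam ** 2 / mu ** 2,      # mu * pi_0 = lam * pi_1
    }
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


def downtime_minutes_per_year(lam, mu, delta, beta, c):
    """Downtime with states 0, 1D, and 1C treated as system-down states."""
    pi = two_component_probs(lam, mu, delta, beta, c)
    unavailability = pi["0"] + pi["1D"] + pi["1C"]
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    # Mean time to failure 10,000 h, mean repair time 2 h, mean delay in 1C 5 min.
    lam, mu, beta = 1 / 10_000, 1 / 2, 60 / 5
    for delay_s in (1, 5, 10, 30):          # mean detection delay in seconds
        delta = 3600 / delay_s              # detection rate per hour
        for c in (0.9, 0.95, 0.99):         # hypothetical coverage values
            d = downtime_minutes_per_year(lam, mu, delta, beta, c)
            print(f"delay={delay_s:>2}s  c={c:.2f}  downtime={d:.3f} min/yr")
```

Under these assumptions the downtime grows with the mean detection delay and shrinks as the coverage approaches 1, which is the qualitative behavior the figure caption describes.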





Example 8.25

Consider a 2-node cluster where both hardware and operating system (OS) software failures may occur [HUNT 1999]. The node hardware fails at the constant rate $\lambda_H$ and the OS fails at the constant rate $\lambda_{OS}$. We assume here that hardware failures are permanent and hence require a repair or replacement action, while OS failures are cleared by a reboot. Repair or reboot takes place at rates $\mu$ and $\beta$ for the hardware and OS, respectively. A node is considered down when either the OS or the hardware has failed. The cluster is down when both nodes have failed. In case of a hardware failure in one node and an OS failure in the other, the OS is always recovered first. The CTMC corresponding to this cluster system is shown in Figure 8.35. In state 1, both nodes and their OSs are functioning properly. In state 2, one of the nodes has a hardware failure, and in state 3, both nodes have hardware failures. In state 4, one of the OSs has failed, while in state 5, both OSs have failed. In state 6, one node has a hardware failure while the other has an OS failure.

Figure 8.35: CTMC for the 2-node cluster system

For the steady-state balance equations we have



These equations can be solved, in conjunction with $\sum_{i=1}^{6} \pi_i = 1$, to obtain the steady-state probabilities as shown below:



and



where



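Since the cluster is down only when both nodes have failed, that is, in states 3, 5, and 6, the steady-state availability can be written as

$$A = \pi_1 + \pi_2 + \pi_4 = 1 - (\pi_3 + \pi_5 + \pi_6)$$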
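
As a numerical cross-check for Example 8.25, the chain can also be solved directly instead of in closed form. The Python sketch below uses only the standard library. Figure 8.35 is not reproduced here, so the transition list is an assumption inferred from the state descriptions (one repair and one reboot at a time, OS recovered first in state 6, no further failures on a node that is already down). The parameter values and the function names are likewise hypothetical.

```python
def steady_state(states, rates):
    """Solve pi Q = 0 with sum(pi) = 1 for a small irreducible CTMC.

    rates maps (source, destination) pairs to transition rates.
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Row j holds the balance equation of state j (coefficients of pi_i).
    m = [[0.0] * n for _ in range(n)]
    for (src, dst), r in rates.items():
        i, j = idx[src], idx[dst]
        m[j][i] += r   # flow into dst from src
        m[i][i] -= r   # flow out of src
    rhs = [0.0] * n
    # One balance equation is redundant; replace it by the normalization.
    m[n - 1] = [1.0] * n
    rhs[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            rhs[r] -= f * rhs[col]
            for k in range(col, n):
                m[r][k] -= f * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = rhs[r] - sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / m[r][r]
    return dict(zip(states, x))


def cluster_steady_state(lam_h, lam_os, mu, beta):
    """Steady-state probabilities for the ASSUMED 2-node cluster transitions."""
    rates = {
        (1, 2): 2 * lam_h,   # hardware failure on either node
        (1, 4): 2 * lam_os,  # OS failure on either node
        (2, 1): mu,          # hardware repair
        (2, 3): lam_h,       # second hardware failure
        (2, 6): lam_os,      # OS failure on the surviving node
        (3, 2): mu,          # assumed single repair facility
        (4, 1): beta,        # reboot
        (4, 5): lam_os,      # second OS failure
        (4, 6): lam_h,       # hardware failure on the healthy node
        (5, 4): beta,        # assumed one reboot at a time
        (6, 2): beta,        # OS is recovered first
    }
    return steady_state([1, 2, 3, 4, 5, 6], rates)


if __name__ == "__main__":
    # Hypothetical rates per hour, chosen only for illustration.
    pi = cluster_steady_state(lam_h=2e-4, lam_os=1e-3, mu=0.25, beta=6.0)
    availability = pi[1] + pi[2] + pi[4]   # cluster is up in states 1, 2, 4
    print(f"availability = {availability:.8f}")
```

Solving numerically avoids committing to a closed form, at the cost of hiding how each rate enters the result; changing an entry in the `rates` dictionary is enough to test a different assumption about the figure.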





Example 8.26 [GARG 1999]

In this example, we consider both hardware and (application) software failures. We consider Web server software that fails at the rate $\gamma_p$, running on a machine (node) that fails independently at the rate $\gamma_m$. An automatic failure detection mechanism based on polling is installed. Assume that the mean time to detect a server process failure is $\delta_p^{-1}$ and the mean time to detect a machine failure is $\delta_m^{-1}$. Furthermore, when the machine is detected to have failed, the server process is started on another machine, if available. The mean restart time of a machine is $\mu_m^{-1}$. When only the server process is detected to have failed, it is automatically restarted on the same machine. For details on process and machine failure detection and recovery, see the papers by Garg et al. and Huang and Kintala [GARG 1999, HUAN 1993]. The mean restart time of the server software is $\mu_p^{-1}$. Typically, $\mu_p > \mu_m$. There is a small probability $1 - C$ that the process restart on the same machine is unsuccessful, in which case it is restarted on another machine, if available. Such a scheme of automatic restart after failures is also called cold replication [HUAN 1993]. The Web server is considered available when the server process as well as the machine it is running on are up. We calculate the steady-state availability of the server, assuming that no further failures can occur after a failure of either the process or the machine until it has been dealt with.

Figure 8.36 shows the homogeneous CTMC for a Web server with cold replication using one spare machine. The states are labeled by four status indicators, where $P_p$ and $P_m$ denote the status of the primary server process and machine, and $S_p$ and $S_m$ denote the status of the spare server process and machine, respectively. A status of "1" indicates that the process or machine is up, a status of "0" indicates that the process or machine is down, and "0D" indicates that the process or machine has failed but the failure is yet to be detected. A status of "x" indicates that the status is of no consequence (don't care) and is used to indicate the status of the server process on the spare machine.

Figure 8.36: CTMC model for a Web server with cold replication

To simplify our discussion we relabel the states as shown in Table 8.4. The state space is denoted by $I = \{1, 2, \ldots, 10\}$. The Web server processes requests only in states 1, 6, and 7. State 6 represents failure of the spare machine. States 8 and 9 denote primary process failures when the spare machine is down. State 10 denotes the state when both the primary and spare machines are down. Whenever a machine has crashed (states 7 and 10), a more elaborate recovery, with its own rate, is required. The transition from state 5 to state 7 with rate $\mu_m$ denotes the starting of the process on a machine. This happens as a result of a failover (after reaching state 5 from state 4), or after recovery of a machine following a complete crash (after reaching state 5 from state 10). Note that although the server is unavailable in states 2, 3, and 8, the failure is not observable until it is detected. The steady-state probabilities $\pi_i$, $i = 1, 2, \ldots, 10$, can be derived as



where



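Because the Web server processes requests only in states 1, 6, and 7, the steady-state availability is given by

$$A = \pi_1 + \pi_6 + \pi_7$$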







Example 8.28 (Hierarchical Modeling)

Consider the availability model of a workstation consisting of three subsystems: a cooling subsystem with two fans, a dual power supply subsystem, and a two-CPU processing subsystem. The workstation is considered to be unavailable when one or more of the subsystems have failed. It is possible to construct a composite CTMC model for the entire workstation, but if the failures and repairs occurring in the three subsystems are independent of each other, then a hierarchical model can be constructed. A hierarchical model consists of multiple levels of models, where the lower-level models capture the detailed behavior of subsystems and the topmost level is the system-level model. Hierarchical models scale better with the number of subsystems and subsystem components than does a composite model. For our example, the top-level model consists of the series reliability block diagram shown in Figure 8.40.
The availability of the workstation is then given by

$$A = A_f \, A_{ps} \, A_p$$

Figure 8.40: Top-level model for Example 8.28

The availabilities of the cooling, power supply, and processor subsystems, that is, $A_f$, $A_{ps}$, and $A_p$, respectively, can be obtained by solving detailed lower-level models. For instance, if the two fans form a parallel redundant system, the availability of the cooling subsystem, $A_f$, can be computed using the model in Figure 8.27. Adding the subscript f to the rates, we obtain



Let us consider that when one of the power supplies fails, the other working supply can be automatically switched in. With probability $C_{ps}$ this switching is successful, and with probability $1 - C_{ps}$ the switching fails, incurring a longer reconfiguration delay. The availability of the power supply subsystem is obtained by solving the model in Figure 8.30. Adding the subscript ps to the rates and the coverage factor, we obtain



Consider that the processors have a detection delay and imperfect coverage when one of them fails, as in the two-component system in Example 8.23. The availability of the processing subsystem is then given by the solution to the model in Figure 8.32. Adding the subscript p to the rates and the coverage factor yields



For further examples of hierarchical availability models with independent subsystems see the paper by Ibe et al. [IBE 1989b], and for nearly independent subsystems, see the paper by Tomek and Trivedi [TOME 1991].
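
The top-level step of Example 8.28 can be sketched in a few lines. Because the top-level model is a series reliability block diagram and the subsystems are assumed independent, the workstation availability is the product of the subsystem availabilities. In the Python sketch below, the function name and the three subsystem values are hypothetical placeholders; in the hierarchical approach they would come from solving the lower-level models.

```python
from math import prod


def series_availability(availabilities):
    """Availability of independent subsystems connected in series."""
    values = list(availabilities)
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability out of range: {a}")
    return prod(values)


if __name__ == "__main__":
    # Hypothetical subsystem availabilities (cooling, power supply, processors).
    a_f, a_ps, a_p = 0.9999, 0.99995, 0.9998
    a = series_availability([a_f, a_ps, a_p])
    print(f"workstation availability = {a:.6f}")
```

Keeping the subsystem models separate means each one can be re-solved on its own when its rates change, and only this product needs to be recomputed.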




Appendix

Figure 8.30: Two-component availability model with imperfect coverage

Figure 8.32: Two-component availability model with imperfect coverage and detection delay





References

[GARG 1999] S. Garg, Y. Huang, C.M. Kintala, K.S. Trivedi, and S. Yajnik, "Performance and reliability evaluation of passive replication schemes in application level fault tolerance," in Proc. 29th Annual Int. Symp. Fault Tolerant Computing (FTCS), Madison, Wisconsin, pp. 15–18, June 15–18, 1999.

[HUNT 1999] S.W. Hunter and W.E. Smith, "Availability modeling and analysis of a two node cluster," Proc. 5th Int. Conf. on Information Systems, Analysis and Synthesis, Orlando, FL, Oct. 1999.

[HUAN 1993] Y. Huang and C.M. Kintala, "Software implemented fault tolerance: technologies and experience," Proc. 23rd Int. Symp. on Fault-Tolerant Computing (FTCS), Toulouse, France, 1993, pp. 2–9.

[IBE 1989b] O. Ibe, R. Howe, and K.S. Trivedi, "Approximate availability analysis of VAXCluster systems," IEEE Trans. Reliability R-38 (1), 146–152 (1989).

[TOME 1991] L.A. Tomek and K.S. Trivedi, "Fixed point iteration in availability modeling," in M. Dal Cin (ed.), Proc. Fifth Int. GI/ITG/GMA Conf. on Fault-Tolerant Comput. Syst., Springer-Verlag, Berlin, 1991, pp. 229–240.