16-01-2013, 04:50 PM
Fault Tolerance
Concepts of Fault Tolerance
Hardware, software and networks cannot be totally free from failures
Fault tolerance is a non-functional (QoS) requirement: the system must continue to operate even in the presence of faults
Fault tolerance should be achieved with minimal involvement of users or system administrators
Distributed systems can be more fault tolerant than centralized systems, but because they contain more processor hosts, individual faults are likely to occur more often
Notion of a partial failure in a distributed system
Attributes of a Dependable System
System attributes:
· Availability – system always ready for use, or probability that system is ready or available at a given time
· Reliability – property that a system can run continuously without failure for a given period of time
· Safety – property that nothing catastrophic happens when the system fails to operate correctly
· Maintainability – refers to how easily a failed system can be repaired
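Availability and reliability are often put into numbers. The sketch below uses two common textbook formulas that are not stated in the slides: steady-state availability as MTTF / (MTTF + MTTR), and reliability as exp(-λt), which assumes a constant failure rate λ. The function and parameter names are illustrative only.

```python
import math


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is ready for use.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """Probability of running without failure for t_hours.

    Assumes a constant failure rate (exponential model), which is a
    simplification and not something the slides specify.
    """
    return math.exp(-failure_rate_per_hour * t_hours)
```

Under these assumptions, shortening the repair time (better maintainability) raises availability without changing reliability, which is why the two attributes are listed separately.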
Failure in a distributed system = when a service cannot be fully provided
System failure may be partial
A single failure may affect other parts of a system (failure escalation)
Strategies to Handle Faults
Fault avoidance
Techniques aim to prevent faults from entering the system during the design stage
Fault removal
Methods attempt to find faults within a system before it enters service
Fault detection
Techniques used during service to detect faults within the operational system
Fault tolerance
Techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults.
Example: Space Shuttle
Uses 5 identical computers which can be assigned to redundant operation under program control.
During critical mission phases - boost, re-entry and landing - 4 of its 5 computers operate in an NMR (N-modular redundancy) configuration, receiving the same inputs and executing identical tasks. When a failure is detected, the computer concerned is switched out of the system, leaving a TMR (triple modular redundancy) arrangement.
The fifth computer is used to perform non-critical tasks in simplex mode, but in extreme cases it may take over critical functions. The unit has "diverse" software and could be used if a systematic fault was discovered in the other four computers.
The shuttle can tolerate up to two computer failures; after a second failure it operates as a duplex system and uses comparison and self-test techniques to survive a third fault.
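The voting idea behind NMR/TMR can be sketched as follows. This is only an illustration of majority voting with faulty units switched out, not the Shuttle's actual software; the class RedundantSet and its method names are hypothetical.

```python
from collections import Counter


class RedundantSet:
    """N identical computers fed the same input, with majority voting."""

    def __init__(self, computers):
        # computers: dict mapping a name to a function of the input
        self.active = dict(computers)

    def step(self, inputs):
        # Every active computer executes the same task on the same input.
        outputs = {name: fn(inputs) for name, fn in self.active.items()}
        value, votes = Counter(outputs.values()).most_common(1)[0]
        if votes * 2 <= len(outputs):
            # No strict majority, e.g. a duplex pair that disagrees:
            # the fault is detected but cannot be masked by voting alone.
            raise RuntimeError("outputs disagree and no majority exists")
        # Switch out every computer that disagreed with the majority.
        for name, out in outputs.items():
            if out != value:
                del self.active[name]
        return value
```

Starting with four units, one disagreement leaves three (TMR) and a second leaves two (duplex). A duplex pair can still detect a disagreement by comparison, but it needs something extra, such as self-test, to decide which unit is wrong.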
Process Groups
Organize several identical processes into a group
When a message is sent to a group, all members of the group receive it
If one process in a group fails (for whatever reason), some other process can, ideally, take over for it
The purpose of introducing groups is to allow processes to deal with collections of processes as a single abstraction.
An important design issue is how to reach agreement within a process group when one or more of its members cannot be trusted to give correct answers.
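A minimal in-memory sketch of the group abstraction is shown below. It assumes crash failures only and a flat list of members; Member, ProcessGroup and coordinator are hypothetical names, and real group communication systems also have to handle membership changes and message ordering.

```python
class Member:
    """One process in the group; alive=False models a crashed process."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []

    def deliver(self, message):
        self.inbox.append(message)


class ProcessGroup:
    """Lets a sender treat several identical processes as one destination."""

    def __init__(self, members):
        self.members = list(members)

    def send(self, message):
        # Deliver the message to every member that is still running.
        receivers = [m for m in self.members if m.alive]
        if not receivers:
            raise RuntimeError("all members of the group have failed")
        for member in receivers:
            member.deliver(message)
        return [m.name for m in receivers]

    def coordinator(self):
        # The first live member acts for the group; if it crashes,
        # the next live member takes over.
        for member in self.members:
            if member.alive:
                return member
        raise RuntimeError("all members of the group have failed")
```

The sender only ever talks to the group object, so a crash of one member is hidden as long as at least one member survives.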
Reliable Communication
Fault tolerance in a distributed system must consider communication failures.
A communication channel may exhibit crash, omission, timing, and arbitrary failures.
Reliable point-to-point communication is established by a reliable transport protocol, such as TCP.
In the client/server model, RPC/RMI semantics must be satisfied in the presence of failures.
In process group architecture or distributed replication systems, a reliable multicast/broadcast service is very important.
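One way to mask omission failures on a channel is to retransmit until an acknowledgement arrives and to let the receiver discard duplicates by sequence number. The sketch below simulates this in memory; it is a simplified illustration and not how TCP or any particular RPC system is implemented. ScriptedChannel, Receiver and send_reliably are hypothetical names, and the per-attempt "fates" list stands in for real message loss and timeouts.

```python
class Receiver:
    """Delivers each sequence number to the application at most once."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, seq, payload):
        if seq not in self.seen:  # drop duplicates caused by retransmission
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq  # acknowledgement


class ScriptedChannel:
    """Channel whose behaviour per attempt is scripted for the example.

    fates: list of "ok", "lose_msg" or "lose_ack", one consumed per attempt;
    once the list is empty every attempt succeeds.
    """

    def __init__(self, receiver, fates):
        self.receiver = receiver
        self.fates = list(fates)

    def transmit(self, seq, payload):
        fate = self.fates.pop(0) if self.fates else "ok"
        if fate == "lose_msg":
            return None  # the message never arrives
        ack = self.receiver.receive(seq, payload)
        return None if fate == "lose_ack" else ack


def send_reliably(channel, seq, payload, max_attempts=5):
    """Retransmit until acknowledged; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if channel.transmit(seq, payload) == seq:
            return attempt
    raise TimeoutError("no acknowledgement after %d attempts" % max_attempts)
```

Retransmission alone gives at-least-once delivery; the duplicate filter in the receiver is what prevents a lost acknowledgement from causing the same request to be processed twice.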
Forward Recovery (Exception)
Exceptions
System states that should not occur
Exceptions can be either
predefined (e.g. array-index out of bounds, divide by zero)
explicitly declared by the programmer
Raising an exception
When such a state is detected in the execution of the program
The action of indicating the occurrence of such a state
Exception handler
Code to be executed when an exception is raised
Declared by the programmer
For recovery action
Supported by several programming languages
Ada, ISO Modula-2, Delphi, Java, C++.
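The ideas above can be sketched in Python, which also supports exceptions although it is not in the list above. The example shows a predefined exception (ZeroDivisionError), a programmer-declared one (SensorOutOfRange, a hypothetical name), and handlers that recover by moving forward to a usable state instead of rolling back. The sensor range of 0 to 100 and the fallback value are assumptions made for illustration.

```python
class SensorOutOfRange(Exception):
    """Exception explicitly declared by the programmer."""


def read_sensor(raw):
    # Raise the exception when a state that should not occur is detected.
    if not 0 <= raw <= 100:
        raise SensorOutOfRange(raw)
    return raw


def average(readings):
    # Dividing by len([]) raises the predefined ZeroDivisionError.
    return sum(readings) / len(readings)


def safe_average(raw_values, fallback=0.0):
    valid = []
    for raw in raw_values:
        try:
            valid.append(read_sensor(raw))
        except SensorOutOfRange:
            # Handler: skip the bad reading and carry on.
            continue
    try:
        return average(valid)
    except ZeroDivisionError:
        # Handler: no valid readings, so continue with a default value.
        return fallback
```

For example, safe_average([10, 250, 30]) ignores the out-of-range reading 250 and averages the other two, while safe_average([]) returns the fallback.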