12-08-2013, 04:29 PM
FAULT TOLERANT SERVICES
Concepts of Fault Tolerance
Hardware, software and networks cannot be totally free from failures
Fault tolerance is a non-functional (QoS) requirement: the system must continue to operate even in the presence of faults
Fault tolerance should be achieved with minimal involvement of users or system administrators
Distributed systems can be more fault tolerant than centralized systems, but because they have more processor hosts, individual faults are likely to occur more frequently
Partial failure: some components of a distributed system fail while the others continue to operate
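The trade-off between more frequent individual faults and better overall tolerance can be illustrated numerically. The sketch below is not from the slides. It assumes that each host fails independently with the same probability p, and that the service is replicated so that it stays available as long as at least one host is up. The function names are hypothetical.

```python
def prob_some_host_fails(p: float, n: int) -> float:
    """Probability that at least one of n hosts fails (assumes independent failures)."""
    return 1.0 - (1.0 - p) ** n


def prob_all_hosts_fail(p: float, n: int) -> float:
    """Probability that all n hosts fail, i.e. a fully replicated service is lost."""
    return p ** n


if __name__ == "__main__":
    p = 0.01  # assumed per-host failure probability, chosen only for illustration
    for n in (1, 10, 100):
        print(n, prob_some_host_fails(p, n), prob_all_hosts_fail(p, n))
```

Under these assumptions, adding hosts makes an individual fault somewhere in the system more likely, while a total outage of a replicated service becomes less likely.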
Recovery Techniques
Once a failure has occurred, it is in many cases important to recover critical processes to a known state in order to resume processing
This problem is compounded in distributed systems
Two Approaches:
Backward recovery: use checkpointing (a global snapshot of the distributed system's state) to record the system state so that it can be restored after a failure; checkpointing is costly (performance degradation)
Forward recovery: attempt to bring the system to a new stable state from which it is possible to proceed (applied where the nature of the errors is known and a reset can be applied)
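Backward recovery can be sketched in a few lines. The example below is deliberately simplified: it checkpoints the state of a single local process, not a global snapshot of a distributed system, and the class and function names are hypothetical. The deep copy made at every checkpoint is where the performance cost shows up.

```python
import copy


class Counter:
    """Hypothetical process whose state we want to protect."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current state so it can be restored later.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        # The state is partly updated before the fault is detected.
        self.state["processed"] += 1
        if value < 0:
            raise ValueError("negative input")
        self.state["total"] += value


def run_with_backward_recovery(proc, values):
    """Checkpoint before each step and roll back if the step fails."""
    skipped = []
    for v in values:
        proc.checkpoint()
        try:
            proc.process(v)
        except ValueError:
            proc.rollback()
            skipped.append(v)
    return skipped
```

After a failed step the half-applied update to "processed" is undone, so processing resumes from a known state.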
Forward Recovery (Exception)
Exceptions
System states that should not occur
Exceptions can be either
predefined (e.g. array-index out of bounds, divide by zero)
explicitly declared by the programmer
Raising an exception
Happens when such a state is detected during the execution of the program
The action of indicating the occurrence of such a state
Exception handler
Code to be executed when an exception is raised
Declared by the programmer
For recovery action
Supported by several programming languages
Ada, ISO Modula-2, Delphi, Java, C++.
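The terms above (predefined exception, programmer-declared exception, raising, handler) can be sketched as follows. Python is used here only for brevity; it is not one of the languages listed above, although its exception mechanism follows the same model. The names SensorRangeError, read_scaled and read_with_recovery, the valid range of 0 to 100, and the default value are all assumptions made for illustration.

```python
class SensorRangeError(Exception):
    """Exception explicitly declared by the programmer."""


def read_scaled(readings, index, divisor):
    # Predefined exceptions: IndexError (index out of bounds) and
    # ZeroDivisionError (divide by zero) are raised by the language itself.
    value = readings[index] / divisor
    if not 0.0 <= value <= 100.0:
        # Raising an exception: indicate that a state that should not occur was detected.
        raise SensorRangeError(f"value {value} is out of range")
    return value


def read_with_recovery(readings, index, divisor, default=0.0):
    try:
        return read_scaled(readings, index, divisor)
    except (IndexError, ZeroDivisionError, SensorRangeError):
        # Exception handler: forward recovery moves to a new, known-safe
        # state (an assumed default value) instead of rolling back.
        return default
```

Because the kinds of error are known in advance, the handler can apply a fixed reset and let the program proceed.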