Abstract: As computers play an ever larger role in society, their dependability is becoming increasingly important. Computing systems need to remain operational in spite of hardware failures in crucial areas such as medicine and space. This report introduces terminology, fault-tolerance concepts and techniques, mainly from the hardware and software point of view. Different ways of achieving fault-tolerance with redundancy are also studied.
Keywords: Terminology, concepts, techniques, redundancy
1. Introduction:
Reliability and availability have become increasingly important in today’s computer-dependent world. In many applications where computers are used, outages or malfunctions can be expensive, or even disastrous. Consider a computer system in a nuclear plant malfunctioning, or the computer systems in a space shuttle rebooting just as the shuttle is about to land. Closer to everyday life are telecommunications switching systems and bank transaction systems. To keep such systems operating correctly, we need fault-tolerant computers: computers that tolerate faults by detecting failures and isolating defective modules so that the system as a whole can continue to operate correctly.
Reliability techniques have also become of increasing interest for general-purpose computer systems. One trend contributing to this is that computers now have to operate in harsher environments than the clean computer rooms of the past, with their stable climate and filtered air. Computers have moved out into industrial environments, with wide temperature ranges, dust, humidity and unstable power supplies. These factors lead to computer failures.
Second, the users have changed. With an increasing number of users, the typical user knows less about the proper operation of the system, so computers have to be able to tolerate more. Third, service costs have increased relative to hardware costs. The average machine used to be very expensive; today computers are cheap, and users operate them themselves because they cannot afford frequent calls for field service. The fourth and last trend is larger systems: as systems become larger, there are more components that can fail. To keep reliability at an acceptable level, designs therefore have to tolerate the faults that result from component failures.
The main causes of equipment outages, which make fault-tolerance techniques necessary, can be split into:
• Environment: Facility failures, e.g. dust, fire in the machine room, cooling problems, earthquakes or sabotage.
• Operations: Procedures and activities of normal system administration, system configuration and system operation. This can be installation of a new operating system (requires booting of the machine), or installation of new application programs (which requires exit and restart of programs in use).
• Maintenance: This does not include software maintenance, but could be hardware upgrading.
• Hardware: Hardware device faults.
• Software: Faults in the software.
• Process: Outages due to something else, e.g. a strike.
The concept of fault tolerance is an amalgamation of various approaches to reliability assurance by means of testing, diagnosis, and redundancy in machine organization and operation. It emerged in the late 1960s and reached maturity with the formation of the IEEE Computer Society Technical Committee on Fault-Tolerant Computing in 1969 and the First International Symposium on Fault-Tolerant Computing in 1971. The Symposium has been held annually since then, and it has become the major international forum for the discussion of current experience and new ideas in system design, redundancy techniques, system modeling and analysis, testing and diagnosis methods, and other related areas.
2. Terminology:
A fault-tolerant computing system can be precisely defined as a system with the built-in capability to preserve the continued correct execution of its programs and input/output (I/O) functions in the presence of a certain set of operational faults. An operational fault is an unspecified change in the value of one or more logic variables in the hardware of the system. It is the immediate consequence of a physical failure event. The event may be a permanent component failure, a temporary or intermittent component malfunction, or externally originating interference with the operation of the system.
In other words, when a system or module is designed, its behavior is specified. When it is in service, we can observe its behavior. When the observed behavior differs from the specified behavior, we call it a failure. A failure occurs because of an error, caused by a fault. The time between the occurrence of an error and the resulting failure is the error latency. Consider, for example, a division function that does not check the value of its divisor argument b: that missing check is a fault, and it results in a latent error in the division function. If the function is called with a zero value for b, that is an error. When the division is actually executed, we have a program failure. The chain can be shown as:
Fault → Error → (error latency) → Failure → Detect
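A minimal Python sketch of the division example above (the function name is hypothetical; the fault is the missing check on b):

def divide(a, b):
    # Fault: the function does not check whether b is zero.
    return a / b

print(divide(10, 2))   # correct behavior: 5.0
print(divide(10, 0))   # the latent error becomes effective: ZeroDivisionError, an observed program failure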
Faults can be hard or soft (soft faults are also called transient). A module with a hard fault will not function correctly: it will continue to fail with high probability until it is repaired. A module with a soft fault appears to be repaired after the failure. A hard fault could be a burnt-out component in a device, which will certainly not fix itself; a soft fault could be electrical noise interfering with the computer.
Module reliability measures the time from an initial instant to the next failure event. This reliability is statistically quantified as the mean time to failure (MTTF). The average time it takes to repair a module after the detection of a failure is called the mean time to repair (MTTR). From these we get the module availability, which is the ratio of service accomplishment to elapsed time:
Availability = MTTF / (MTTF + MTTR)
In order to achieve a reliable, highly available system, two very different approaches can be used: fault-avoidance and fault-tolerance. While fault-avoidance is the prevention of fault occurrences by construction, fault-tolerance is the use of redundancy to avoid failures due to faults. Fault-avoidance is difficult, and close to impossible in large and complex systems. This makes fault-tolerance the only realistic alternative for the classes of systems we are studying:
• General-purpose computer systems: General-purpose computers are at the high-end of the commercial market, employing fault-tolerance techniques to improve general reliability.
• High-availability computer systems: Systems designed for availability class 5 or higher. Here many applications require very high availability but can tolerate an occasional error or very short delays, while error recovery is taking place. Hardware designs for these systems are often considerably less expensive than those used for ultra-dependable real-time computers. Computers of this type often use duplex design. Example applications are telephone switching and transaction processing.
• Long-life systems: Systems designed for operating for a very long time without any chance of repair. Long-life systems are typical mobile systems where on-site repair is difficult, or maybe impossible. Examples are unmanned spacecraft systems like satellites or space exploration vehicles. These systems differ from other fault-tolerant systems discussed earlier by having redundancy not only in the electrical systems, but also in mechanical parts. They are also required to achieve correct operation over long periods of time.
• Critical-computation systems: Systems doing some critical work where faulty computations can jeopardize human life or have high economic impact. The best examples can be the computers in a space shuttle, nuclear plant or air traffic control system, where malfunction can be extremely disastrous. Applications such as spacecraft require computers to operate for long periods of time without external repair. Typical requirements are a probability of 95% that the computer will operate correctly for 5–10 years. Machines of this type must use hardware in a very efficient fashion, and they are typically constrained to low power, weight, and volume.
3. Fault Tolerance Techniques:
Validation:
It is used to reduce errors during the construction process. There are many ways to do this; one is to develop a model of the system in a formal language and use a validation program to validate it. Error correction reduces failures by using redundancy to tolerate faults. Latent error processing tries to detect and repair latent errors before they become effective. An example is preventive maintenance.
Effective error processing tries to correct the error after it becomes effective. This can be done by masking or recovery. An example of masking is error correcting codes. Recovery denies the requested service, and sets the module to an error-free state.
We have two forms of recovery: backward and forward recovery. Backward recovery returns to a previous correct state. This can be checkpoint-restart, which means that the state is stored at regular intervals; at restart time the last stored state is loaded and execution resumes from it.
With forward recovery, a new correct state is constructed, by re-sending a message or re-reading a disk page.
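As a small illustration, the following Python sketch shows backward recovery via checkpoint-restart; the checkpoint file name and the worker loop are hypothetical and only stand in for a real application.

import os
import pickle

CHECKPOINT = "state.ckpt"   # hypothetical checkpoint file

def save_checkpoint(state):
    # Store the state at regular intervals so it can be rolled back to later.
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    # At restart time, load the last stored state and resume from it.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_item": 0, "total": 0}   # initial state

def run(items):
    state = load_checkpoint()             # backward recovery: return to the previous correct state
    for i in range(state["next_item"], len(items)):
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(state)            # checkpoint after every processed item
    return state["total"]

print(run([1, 2, 3, 4]))   # if the process crashes and restarts, finished work is not redone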
Four different types of redundancy are used in fault-tolerant systems, as follows:
• Hardware redundancy:
It consists of the components that are introduced in order to provide fault-tolerance. The techniques of introducing hardware redundancy have been classified on the basis of terminal activity of modules into two categories:
Static redundancy
The static redundancy method is also known as masking redundancy, as the redundant components are employed to mask the effect of hardware failures within a given hardware module, and the outputs of the module remain unaffected as long as the protection is effective. It is applicable against both transient and permanent faults.
Two forms of static redundancy have been applied in the space program: replication of individual electronic components and triple modular redundancy (TMR). The use of static redundancy is based on the assumption that failures of the redundant copies are independent. For this reason, use of static redundancy is difficult to justify within integrated circuit packages, in which many failure phenomena are likely to affect several adjacent components.
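A minimal Python sketch of TMR-style masking: three redundant copies of a module compute the same result and a majority voter masks a single faulty output. The module functions here are hypothetical stand-ins.

from collections import Counter

def majority_vote(outputs):
    # Return the value produced by the majority of the replicated modules.
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many faulty modules")
    return value

module_a = lambda x: x * x
module_b = lambda x: x * x + 1   # hypothetical faulty copy
module_c = lambda x: x * x

print(majority_vote([m(5) for m in (module_a, module_b, module_c)]))   # 25: the faulty output is masked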
Dynamic redundancy
This approach is based on the fact that fault-caused errors or error signals appear at the outputs of a module. It is implemented in two steps: the presence of a fault is detected, and a subsequent recovery action either eliminates the fault or corrects the error. When human assistance is entirely bypassed, dynamic redundancy (usually supported by software and time-redundancy techniques) provides self-repair of a computer system. Limited, i.e., human-controlled, use of dynamic redundancy techniques in computer hardware is quite extensive.
• Information redundancy:
It is the addition of extra information to data, to allow error detection and correction. This is typically error-detecting codes, error-correcting codes (ECC), and self-checking circuits.
Error-Detection (and Correction) Codes
Codes are used in most modern computers for memory error detection. The simplest is the parity code, which does not require much additional hardware. Another, more advanced code is the m-of-n code, which requires code words to be n bits long and to contain exactly m ones. Cyclic and checksum codes are also common. When an operation is completed, the resulting code word is checked to make sure it is valid. Codes can also be error-correcting: data encoded with error-correcting codes (ECC) can contain errors, but the encoding contains enough redundancy to recover the data.
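A minimal Python sketch of a single even-parity bit, the kind of simple error-detecting code described above (an illustration over an 8-bit word, not any specific memory design):

def add_parity(bits):
    # Even parity: append one bit so the total number of ones is even.
    return bits + [sum(bits) % 2]

def check_parity(word):
    # A single flipped bit makes the number of ones odd, so the error is detected.
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1, 0, 0, 1, 0])
print(check_parity(word))   # True: the stored word is consistent

word[3] ^= 1                # a single-bit fault in memory
print(check_parity(word))   # False: the error is detected (but not correctable)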
Consistency Checking
This is a verification that results are reasonable. Examples are range checks, address checks, and arithmetic-operation checks.
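A small Python sketch of a consistency (range) check, under the hypothetical assumption that a sensor reading must lie between 0 and 100:

def read_sensor_checked(raw_value, low=0.0, high=100.0):
    # Range check: a value outside the specified bounds is treated as an error.
    if not (low <= raw_value <= high):
        raise ValueError(f"inconsistent reading {raw_value}: outside [{low}, {high}]")
    return raw_value

print(read_sensor_checked(42.5))   # passes the consistency check
# read_sensor_checked(250.0) would raise, signalling a detected error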
Self-Checking Logic
Failure in a comparator element at the top of the hierarchy can be disastrous (the checking-the-checker problem). This single point of failure can be eliminated through self-checking and fail-safe logic design. A circuit is said to be self-checking if it has the ability to automatically detect the existence of a fault, without the need for any externally applied stimulus. When the circuit is fault-free and presented with a valid input code word, it should produce a correct output code word. If a fault exists, however, the circuit should produce an invalid output code word so that the existence of the fault can be detected.
• Software redundancy:
Software redundancy is the most challenging problem in fault-tolerance. As mentioned earlier, today’s hardware is relatively reliable compared to the software. In the process of correcting a programming error, new errors are likely to be created. Software development is also a more complex and immature art than hardware design. It is said that perfect software is possible, that it is just a matter of time and money. This might be true, but for a large and complex software system there is never enough of either. We have two major software fault-tolerance techniques:
• N-version programming: Write the program N times, then operate all N programs in parallel, and take a majority vote for each answer. This is an analogy to the N-plexing of hardware modules.
• Transactions: Write the program as a transaction, with a consistency check at the end; if the check fails, restart. If the fault was transient, it should work the second time.
The big disadvantage of N-version programming is its cost. It is expensive, repair is not trivial, and it is also difficult to maintain. To get a majority, we need at least 3 versions. Programmers tend to make the same kinds of mistakes, so there is a certain risk of the same error appearing in a majority of the versions.
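A minimal Python sketch of N-version programming with N = 3. The three "independently developed" versions are toy stand-ins, and version_2 contains a deliberate bug.

from collections import Counter

# Three independently written versions of the same specification: compute x squared.
def version_1(x): return x * x
def version_2(x): return x ** 2 if x >= 0 else -(x ** 2)   # buggy for negative inputs
def version_3(x): return x * x

def n_version(x, versions=(version_1, version_2, version_3)):
    answers = [v(x) for v in versions]
    value, count = Counter(answers).most_common(1)[0]
    if count < 2:   # a majority of the 3 versions must agree
        raise RuntimeError("no majority among versions")
    return value

print(n_version(-4))   # 16: the buggy version is outvoted by the other two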
• Time redundancy:
Hardware and information redundancy require extra hardware. This can be avoided by performing an operation several times in the same module and checking the results, instead of performing it in parallel on several modules and comparing their outputs. This reduces the amount of hardware at the expense of additional time, and is especially suitable if faults are mostly transient.
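A small Python sketch of time redundancy: the same computation is repeated on the same (hypothetical) module and the results are compared, so a transient fault that corrupts one run is detected.

def with_time_redundancy(operation, x, repeats=2):
    # Execute the same operation several times on the same hardware and compare the results.
    results = [operation(x) for _ in range(repeats)]
    if len(set(results)) != 1:
        raise RuntimeError("transient fault suspected: repeated runs disagree")
    return results[0]

print(with_time_redundancy(lambda v: v * 3, 7))   # 21 when no transient fault occurs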
4. Issues Faced by Fault Tolerance Researchers:
Much of today’s work is concentrated on incremental improvement, maturing the basic understanding in specific technology focus areas, and expanding the applicability of fault-tolerance technologies. There is no overriding technological approach that dominates or helps focus research, and the community is attacking individual pieces of the technology, problems, and solutions.
The topics below represent a set of active research areas in fault tolerance. They range from fundamental fault-tolerance design and implementation issues, such as checkpoint restart and distributed algorithms, to fault-avoidance approaches such as formal methods.
Checkpoint Restart: Checkpoint-restart strategies are backward-recovery techniques that save the state of a system so that operation can resume from a well-defined state. This is a classic fault-tolerance issue that is being addressed in the context of increasingly complex and distributed systems. Most current research involves issues relating to reliable, high-performance checkpointing in distributed systems.
Distributed Algorithms: Unlike uniprocessor-based systems, distributed systems present a new set of challenges to achieving dependability. Clocks of the processors often must be synchronized, data must survive failures of individual processors, nodes must fail in controlled ways, and the communications between the processors must be reliable. Techniques such as interactive consistency, fail-stop processors, and reliable transport mechanisms have been developed to deal with these challenges. Many of the mechanisms are expensive to implement, and researchers continue to look for more efficient algorithms.
Fault Tolerance in Human Computer Interaction (HCI)
Issues and perspectives in fault-tolerance research are expanding from internal software state issues to considerations of interfaces between systems and between humans and computers. This broadened perspective includes considerations of the environment and its effect on software and overall system dependability.
Fault Injection
Fault injection is a technique for evaluating the dependability of a system. Its purpose is to test the ability of a system to detect and recover from faults. As a result of running fault-injection experiments, designers are able to determine the fault coverage of their system. Fault injection involves seeding the system with faults under controlled conditions and observing its behavior.
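A minimal Python sketch of software-implemented fault injection: random single-bit flips are seeded into a data word under controlled conditions, and the coverage of a simple even-parity detector is estimated. Both the target and the detector are illustrative assumptions, not a real tool.

import random

def inject_bit_flip(word):
    # Seed a fault: flip one randomly chosen bit of the data word.
    faulty = list(word)
    faulty[random.randrange(len(faulty))] ^= 1
    return faulty

def parity_ok(word):
    return sum(word) % 2 == 0   # even-parity check acting as the detector under test

random.seed(0)
golden = [1, 0, 1, 1, 0, 0, 1, 0, 0]   # code word with even parity
trials = 1000
detected = sum(not parity_ok(inject_bit_flip(golden)) for _ in range(trials))
print(f"estimated fault coverage: {detected / trials:.0%}")   # single-bit flips are always caught here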
Measurement and Interpretation
A significant difficulty in evaluating fault-tolerance and fault-avoidance mechanisms is the lack of real-world data. Without this data it is impossible to determine the efficacy of a method in a deployed system. Such data exist but are often proprietary. Some researchers have obtained access to such data and have been able to tune techniques and improve systems as a result.
Reliability Modeling
Reliability modeling is the modeling of faults and errors with the intention of predicting future behavior. Traditional hardware reliability models are empirically based and reflect physical characteristics of hardware failures, e.g., random failure models. In contrast, software and system failure models lack the physical data to guide reliability modeling and require reliance on usage data. Usage data is highly problematic both to collect and, because of the dependency of software reliability on its use, to generalize across systems.
Recent research in reliability models for predicting the future behavior of a system has focused on extending models, including complex and distributed systems, and addressing software reliability and its impact on overall system reliability.
Innovative Applications
This research involves applying fault tolerance in new ways to solve problems that have not been associated with fault tolerance in earlier times. It is based on abstracting the results of fault tolerance research and mapping these to other problem spaces. Examples include the use of fault tolerance to enable dependable system upgrade.
Sigma Algorithm
The Sigma algorithm solves the fault-tolerant mutual exclusion problem in dynamic systems where the set of processes may be large and change dynamically. Processes may crash, and a crashed process may lose all of its state information on recovery. The Sigma algorithm includes new messaging mechanisms to tolerate process crashes and memory losses, and it requires no extra cost for process recovery.
The complete Sigma algorithm implements the specification of fault-tolerant mutual exclusion (FTME). Each client maintains a state variable timestamp, which obtains values from a GetTimestamp() routine that generates unique and monotonically increasing numbers. We define a request to be a pair (ci, ti), where ci is a client id and ti is the timestamp from ci. There is a predetermined total order among all such requests. Thus, for any two requests (c, t) and (c’, t’), we can write (c, t) < (c’, t’) and say that (c, t) is earlier than (c’, t’) according to this predetermined order. A simple choice of such an order is to order requests by timestamp value, with client ids as the tiebreaker. Further requirements are imposed on the order when the algorithm must support lockout freedom. Each server maintains a queue ReqQ of client requests, and a special request (cowner, towner) that it currently supports. The basic flow of the algorithm is: (a) a client sends a request to the servers to enter its critical section (lines 2--5); (b) each server responds to the request with the request it currently supports (line 35); (c) the client that receives supporting responses from enough servers enters its critical section (lines 11--12); (d) when a client exits its critical section, it sends a RELEASE message to the servers (line 22); and (e) when a server receives a RELEASE message, it removes the corresponding request, selects the earliest request in its request queue as the new request it supports, and sends a RESPONSE message to the new client it now supports.
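A small Python sketch of the request ordering used above: a request (ci, ti) is ordered by timestamp value with the client id as tiebreaker. GetTimestamp() is represented by a simple counter, which is a hypothetical stand-in.

import itertools

_counter = itertools.count(1)

def get_timestamp():
    # Stand-in for GetTimestamp(): unique, monotonically increasing numbers.
    return next(_counter)

def earlier(req_a, req_b):
    # A request is the pair (client_id, timestamp); order by timestamp,
    # breaking ties with the client id.
    ca, ta = req_a
    cb, tb = req_b
    return (ta, ca) < (tb, cb)

r1 = ("client-7", get_timestamp())
r2 = ("client-3", get_timestamp())
print(earlier(r1, r2))   # True: r1 was issued first, so it is the earlier request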
5. Validation of Fault-Tolerance:
One of the most difficult tasks in the design of a fault-tolerant machine is to verify that it will meet its reliability requirements. This requires designing a number of models that specify the structure and behavior of the design. It is then necessary to determine, by analytic studies and fault simulations, how well the fault-tolerance mechanisms work. The results (error rates, fault rates, latencies, and coverages) are then used in reliability prediction models.
A number of probabilistic models have been developed using Markov and semi-Markov processes to predict the reliability of fault-tolerant machines as a function of time. These models have been implemented in several computer-aided design tools. Some of the better known tools are:
HARP—Hybrid Automated Reliability Predictor (Duke)
SAVE—System Availability Estimator (IBM)
SHARPE—Symbolic Hierarchical Automated Reliability and Performance Evaluator (Duke)
UltraSAN -- (University of Illinois, UIUC)
DEPEND -- (UIUC)
SURF-2 -- Laboratoire D'analyse Et D'architecture Des Systemes (LAAS)
Recently there has been a great deal of research in experimental testing by fault-insertion to aid in assessing the reliability of dependable systems. Among the fault-injection tools that have been developed to evaluate fault tolerant systems are: i) FTAPE (UIUC), ii) Ballista (CMU), and iii) MEFISTO (LAAS).
6. Conclusion and Future Work:
We have discussed various fault-tolerance techniques and concluded that there is a need for a more efficient and reliable technique that is also cheaper than the existing ones. As new technologies are developed and new applications arise, new fault-tolerance approaches are also needed. In the early days of fault-tolerant computing, it was possible to craft specific hardware and software solutions from the ground up, but now chips contain complex, highly integrated functions, and hardware and software must be crafted to meet a variety of standards to be economically viable. Thus a great deal of current research focuses on implementing fault tolerance using COTS (Commercial-Off-The-Shelf) technology. Future research can further explore the MPI (Message Passing Interface) architecture to develop reliable and less costly fault-tolerance techniques.
Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk holds encoded information so that data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors. Fault-tolerance techniques are expected to become increasingly important in deep sub-micron VLSI devices, to combat increasing noise problems and to improve yield by tolerating defects that are likely to occur on very large, complex chips.
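As a closing illustration, a minimal Python sketch of the RAID-style redundancy mentioned above: an XOR parity block is kept for each stripe, so the block on any single failed disk can be reconstructed from the surviving disks (a simplified three-data-disk layout, not any specific RAID product).

from functools import reduce

def parity(blocks):
    # The redundant block is the bytewise XOR of the data blocks in the stripe.
    return bytes(reduce(lambda a, b: a ^ b, byte_group) for byte_group in zip(*blocks))

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
p = parity(stripe)                     # stored on the redundant (parity) disk

# Disk 1 fails: its block is reconstructed by XOR-ing the parity block with the survivors.
reconstructed = parity([stripe[0], stripe[2], p])
print(reconstructed == stripe[1])      # True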