Safety-Critical Systems Design

Introduction

Embedded systems are like normal desktop systems in that they have functional requirements, that is, functions that the system is expected to perform, such as moving a robot arm. One of the ways embedded systems differ from desktop systems is that they also have significant quality of service (QoS) requirements. In real-time embedded systems, for example, timeliness and predictability are significant QoS requirements; in "hard" real-time systems, missing even a single deadline constitutes a system failure. Other QoS requirements include the reliability and safety of the system in potentially harsh environments. Systems ranging from microwave ovens to automotive "drive-by-wire" electronics to avionics to nuclear power plants all have very significant safety and reliability requirements. Failures in such systems can lead to deaths ranging from a single person to potentially millions.
In spite of the seriousness and severity of these requirements, safety and reliability as applied to electronic and software systems are not part of a normal undergraduate or graduate curriculum. In fact, most engineers do not even know the correct terms of safety and reliability engineering, let alone the meanings of those terms. Ask for the meaning of the terms "reliable" and "safe" and you will get many different answers.

The Therac-25 Story

The most widely publicized software-related safety failure occurred in a radiation therapy treatment device, the Therac-25. Released to the market by Atomic Energy of Canada Limited (AECL) in 1982, it used software to enhance its usability and lower its production cost, providing real benefit to its users. However, through a compounding of process, design, and implementation failures, software defects caused massive radiation overdoses to six patients, killing three and contributing directly to the death of a fourth. The history of the Therac-25 is detailed in [1], which concludes that merely fixing the identified defects in the code did not make the device safer. Safety continued to elude its developers despite efforts to remove the bugs by modifying the code.

Other Stories

Although the Therac-25 is one of the best-known software-related safety failures, other examples abound. The first Space Shuttle launch was delayed two days because the backup computer could not be initialized correctly; the error was discovered 20 minutes before the scheduled launch. The Patriot missiles deployed in Saudi Arabia failed because of clock drift; their effectiveness in stopping missiles was downgraded from 95% to 13%. Software flaws in the Aegis tracking system on the USS Vincennes contributed to the ship shooting down an Iranian airliner at the cost of 290 lives.

Safety is NOT Reliability!

Reliability is a measure of the "up-time" or "availability" of a system. It is normally measured with Mean Time Between Failures, or MTBF. MTBF is a statistical measure of the probability of failure and is useful when applied to stochastic failure modes. Electrical engineers are familiar with the "bathtub" curve, which shows the failure rate of electronic components over time. There is an initial high failure rate that rapidly drops to a low level and remains low for a long time. After a long time, the failure rate rises rapidly back to initial or higher levels, giving the characteristic "bathtub" shape. This is why, for example, electronic components and systems undergo a burn-in process. The high temperature increases the probability of early failure, accelerating the front end of the bathtub curve. In other words, the components that are going to fail early do so even earlier (during the burn-in). The remaining components or systems fall into the low-failure basin of the bathtub curve and so have a much higher average life expectancy. Safety, by contrast, is freedom from accidents and unacceptable risk of harm: a system can be highly reliable yet unsafe, while an unplugged device can be perfectly safe yet entirely unavailable.
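
As a rough illustration, the bathtub shape can be modeled as the sum of three Weibull hazard terms: a shape parameter below 1 gives infant mortality, a shape of exactly 1 gives a constant random-failure rate (for which MTBF is simply the reciprocal of the rate), and a shape above 1 gives wear-out. The sketch below is a minimal example; all parameter values are illustrative assumptions, not real component data.

    /* Sketch: bathtub-shaped hazard rate modeled as the sum of three
     * Weibull hazard terms -- infant mortality (shape < 1), useful life
     * (shape = 1, i.e. constant rate), and wear-out (shape > 1).
     * All parameter values below are illustrative, not from real data. */
    #include <stdio.h>
    #include <math.h>

    /* Weibull hazard: h(t) = (k/lambda) * (t/lambda)^(k-1) */
    static double weibull_hazard(double t, double k, double lambda)
    {
        return (k / lambda) * pow(t / lambda, k - 1.0);
    }

    int main(void)
    {
        for (double t = 100.0; t <= 10000.0; t += 900.0) {
            double h = weibull_hazard(t, 0.5, 1000.0)   /* early failures  */
                     + weibull_hazard(t, 1.0, 5000.0)   /* random failures */
                     + weibull_hazard(t, 5.0, 9000.0);  /* wear-out        */
            printf("t = %6.0f h   hazard = %.6f failures/h\n", t, h);
        }
        return 0;
    }

Printing the hazard over time shows the high initial rate falling into the flat basin and then climbing again, which is exactly the shape the burn-in process exploits.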

Safety is a System Issue

Many systems present hazards. Note, however, that safety is a system issue. An identified hazard can be removed, or its associated risk reduced, in many ways. For example, consider a radiation therapy device: it has the hazard that it may over-irradiate the patient. An electrical interlock that activates when the beam is either too intense or lasts too long is one design approach to reducing the risk. The interlock could involve a mechanical barrier or an electric switch. Alternatively, the software could use redundant heterogeneous computational engines (verifying the dosage with a different algorithm) before permitting the dose to be administered. The point is, the entire system is either safe or not. Not the software. Not the electronics. Not the mechanics. Each of these affects system safety, of course, but it is ultimately the interaction of all these elements that produces the hazard as well as the risk reduction.
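
To make the heterogeneous-redundancy idea concrete, the sketch below shows one hypothetical way a dose could be verified by two independently written algorithms before the beam is enabled. The function names, the calibration constants, and the tolerance are all illustrative assumptions, and the two algorithms are reduced to stubs.

    /* Sketch of heterogeneous redundancy for a hypothetical radiation
     * therapy controller: the dose is computed twice, by two independently
     * written algorithms, and the beam is enabled only if the results agree
     * within a tolerance. All names and values here are assumptions. */
    #include <stdio.h>
    #include <math.h>
    #include <stdbool.h>

    #define DOSE_TOLERANCE_GY 0.01  /* max allowed disagreement, in grays */

    /* Primary algorithm: table-driven dose computation (stub). */
    static double compute_dose_primary(double beam_ma, double time_s)
    {
        return beam_ma * time_s * 0.0042;   /* placeholder calibration */
    }

    /* Secondary algorithm: independently derived physics model (stub). */
    static double compute_dose_secondary(double beam_ma, double time_s)
    {
        return (beam_ma * 0.0042) * time_s; /* placeholder, separate code path */
    }

    static bool dose_is_safe(double beam_ma, double time_s, double prescribed_gy)
    {
        double d1 = compute_dose_primary(beam_ma, time_s);
        double d2 = compute_dose_secondary(beam_ma, time_s);

        /* Both channels must agree with each other AND with the prescription. */
        return fabs(d1 - d2) <= DOSE_TOLERANCE_GY
            && fabs(d1 - prescribed_gy) <= DOSE_TOLERANCE_GY;
    }

    int main(void)
    {
        if (dose_is_safe(25.0, 2.0, 0.21))
            puts("dose verified by both channels: beam may be enabled");
        else
            puts("channel disagreement: beam inhibited, raise alarm");
        return 0;
    }

The safety comes from the disagreement check, not from either algorithm alone: a systematic defect in one computation is caught because the other was developed independently.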

Hazard Analysis

The first step in developing safe systems is to determine the hazards of the system. A hazard, remember, is a condition that could allow a mishap to occur in the presence of other, non-fault conditions. In a patient ventilator, one hazard is that the patient will not be ventilated, resulting in hypoxia and death. In an ECG monitor, electrocution is a hazard. A microwave oven can emit dangerous radiation, literally cooking the user (always a bad thing for repeat business). Typically, embedded systems have many hazards because they have the potential to expose people to dangerously high levels of energy.
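
Hazard analysis results are often recorded in tabular form, pairing each hazard with the faults that can lead to it and the design measures that control it. The sketch below shows one hypothetical way to capture such records in code for a patient ventilator; the columns loosely follow the structure of a failure modes and effects analysis, and every entry is an illustrative assumption.

    /* Sketch of a machine-readable hazard analysis record. Entries are
     * illustrative, for a hypothetical patient ventilator. */
    #include <stdio.h>

    struct hazard_entry {
        const char *hazard;      /* the dangerous condition */
        const char *fault;       /* a fault that can lead to it */
        const char *control;     /* design measure that controls it */
        double tolerance_time_s; /* how long the hazard may safely persist */
    };

    static const struct hazard_entry hazard_table[] = {
        { "patient not ventilated", "CPU lockup",
          "independent watchdog forces safe valve state", 5.0 },
        { "patient not ventilated", "loss of mains power",
          "battery backup plus audible alarm",            5.0 },
        { "airway overpressure",    "stuck exhalation valve",
          "mechanical pressure relief valve",             0.5 },
    };

    int main(void)
    {
        for (size_t i = 0; i < sizeof hazard_table / sizeof hazard_table[0]; i++)
            printf("%-24s | %-22s | %-45s | %4.1f s\n",
                   hazard_table[i].hazard, hazard_table[i].fault,
                   hazard_table[i].control, hazard_table[i].tolerance_time_s);
        return 0;
    }

The tolerance time matters later in the design: any fault-detection mechanism must detect the fault and reach the safe state well within that window.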

Single Point Failures

Devices ought to be safe when there are no faults and the device is being used properly. To be considered "safe," however, the device must also not lead to an incident in the presence of any single-point fault, regardless of the likelihood of that fault. That is, a fault in any single component, or any single fault condition, should not lead to an accident. For example, consider software safety measures on a single CPU for a patient ventilator. What happens if the CPU locks up, the CPU crystal breaks, or the power is lost?
There are many ways in which these things can happen. The electrical supply of many hospital ORs is flaky at best, ranging down to 85 volts in some locales. Lightning can strike the power line, causing an electrical surge. Electrosurgical equipment (basically arc welders used to cauterize surgical incisions) also creates an extremely electrically noisy environment. Even in the absence of external factors, CPUs themselves fail. They can fail by sticking in a metastable state, which, although unlikely, does occur. They can fail because the component reaches its end of life. Bonding wires can break or come loose. There can even be CPU design flaws -- remember the Pentium floating-point bug?
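
One common control for CPU lockup is an independent hardware watchdog that the main loop must service periodically; if servicing stops, the watchdog forces a reset and the actuators fall back to a safe state. The sketch below shows the basic pattern, assuming a hypothetical memory-mapped watchdog register and service key; real MCUs differ, and a safety-relevant watchdog is typically external to the CPU precisely so it survives the CPU's failure.

    /* Sketch of a keyed watchdog pattern guarding against a single-point
     * CPU failure such as lockup. The register address and service key
     * are illustrative assumptions, not a real MCU's memory map. */
    #include <stdint.h>
    #include <stdbool.h>

    #define WDT_SERVICE_REG  (*(volatile uint32_t *)0x40001000u) /* hypothetical */
    #define WDT_SERVICE_KEY  0xA5C3u                             /* hypothetical */

    static void watchdog_service(void)
    {
        /* Writing the key restarts the hardware countdown. A stuck or
         * looping CPU stops writing it, and the watchdog forces a reset. */
        WDT_SERVICE_REG = WDT_SERVICE_KEY;
    }

    static bool read_sensors(void)    { return true; } /* stubs for the sketch */
    static void compute_outputs(void) { }
    static void drive_actuators(void) { }

    int main(void)
    {
        for (;;) {
            if (read_sensors()) {
                compute_outputs();
                drive_actuators();
            }
            /* Service the watchdog exactly once per loop, only after the
             * control cycle has completed, so a hung step is detected. */
            watchdog_service();
        }
    }

Note that the watchdog is serviced from exactly one place, after the whole control cycle; sprinkling service calls throughout the code (or servicing from a timer interrupt) can keep the watchdog happy even while the application is hung.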

Safe Designs

So, what other practical steps can you take to make your embedded system safe? The key, again, is to perform the hazard analysis and then verify that your design provides the means to control each identified hazard. Two fundamental design architectures address hazards: single channel protected designs and dual channel designs.

Single Channel Protected Designs (SCPD)

A channel is a static path of data and control that takes some information and produces some output. Any failure of any component of the channel is a failure of the entire channel. In an SCPD architecture, a single channel controls the process, but protective measures, such as reasonableness checks and monitoring, are built into that channel so that faults are detected and the system can be driven to a fail-safe state.
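
The sketch below illustrates the single-channel-protected idea for a hypothetical heater controller: one control channel, with a reasonableness check embedded in the channel itself, falling back to a fail-safe "heater off" state on any implausible sensor value. The names, limits, and fail-safe action are all illustrative assumptions.

    /* Sketch of a Single Channel Protected Design: one control channel
     * with protective checks embedded in the channel. Values are
     * illustrative assumptions for a hypothetical heater controller. */
    #include <stdio.h>
    #include <stdbool.h>

    #define TEMP_MIN_PLAUSIBLE_C  -20.0 /* below this, sensor presumed broken */
    #define TEMP_MAX_PLAUSIBLE_C  150.0 /* above this, sensor presumed broken */
    #define TEMP_SETPOINT_C        37.0

    static void heater_set(bool on) { printf("heater %s\n", on ? "ON" : "OFF"); }

    static void enter_fail_safe(void)
    {
        heater_set(false);              /* the de-energized state is the safe one */
        puts("fail-safe: heater off, alarm raised");
    }

    static void control_step(double temp_c)
    {
        /* Protection inside the single channel: reasonableness check. */
        if (temp_c < TEMP_MIN_PLAUSIBLE_C || temp_c > TEMP_MAX_PLAUSIBLE_C) {
            enter_fail_safe();
            return;
        }
        heater_set(temp_c < TEMP_SETPOINT_C); /* normal bang-bang control */
    }

    int main(void)
    {
        control_step(35.2);  /* normal: heater on  */
        control_step(38.4);  /* normal: heater off */
        control_step(900.0); /* implausible reading: fail-safe */
        return 0;
    }

An SCPD only works when the process has an accessible fail-safe state (here, heater off); if no such state exists, a dual channel design is needed instead.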

Design Patterns for Reliability and Safety

A pattern is a generalized solution to a common problem. A pattern is instantiated and customized for the particular problem at hand, and it provides a means of capturing design knowledge by codifying the best practices of experienced designers. Single channel protected and dual channel designs are types of safety patterns.
There are several design patterns that address both reliability and safety. As discussed in [3], such patterns are primarily architectural because they affect most aspects of a system.
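
As one concrete illustration of the dual channel idea, the sketch below uses two replicated channels and a comparator that releases the actuation command only when the channels agree; on disagreement the system enters its fail-safe state. All names and the tolerance are illustrative assumptions. Replicated channels of this kind protect against random hardware faults; catching systematic software faults requires heterogeneous channels, as in the dose-verification example earlier.

    /* Sketch of one dual channel variant: two replicated channels plus a
     * comparator. Names and values are illustrative assumptions. */
    #include <stdio.h>
    #include <math.h>
    #include <stdbool.h>

    #define AGREEMENT_TOLERANCE 0.001

    static double channel_a_output(double input) { return input * 0.5; }
    static double channel_b_output(double input) { return input * 0.5; }

    static bool comparator(double a, double b, double *out)
    {
        if (fabs(a - b) <= AGREEMENT_TOLERANCE) {
            *out = a;
            return true;    /* channels agree: release the command */
        }
        return false;       /* disagreement: inhibit the output */
    }

    int main(void)
    {
        double cmd;
        if (comparator(channel_a_output(10.0), channel_b_output(10.0), &cmd))
            printf("actuate with command %.3f\n", cmd);
        else
            puts("channels disagree: entering fail-safe state");
        return 0;
    }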