Self-Healing in Modern Operating Systems
Introduction:
Every IT manager, system administrator, and developer is fighting the monster of computing complexity. The worst possible situation to be in is trying to identify, root-cause, and resolve a problem in today's complex stack. While we need no reminder of the cost of complexity to the industry, it is worth asking: How much of the problem is still open research, and how much is a lack of execution or priority on the part of vendors? Is more progress being made in hardware or in software? And how useful a solution can we expect given the software we have today, versus software that must be modified or rewritten? Designers of operating systems and other system-level components that sit between hardware, applications, and administrators must play a vital role in facilitating a real leap forward in self-healing computer systems.
THE ROLE OF THE OPERATING SYSTEM
There are three basic forces we can bring to bear on improving the availability of computing services: improve the reliability and resiliency of the individual components (hardware or software); introduce redundancy to cope with component failures; and predictively avoid failures or reduce the time required to recover. Yet some important trends in the industry are largely at odds with the desire to increase component
reliability and redundancy. First, there is the growing use of commodity hardware components from disparate sources to build cheaper systems.
Similarly, modern software stacks are being constructed from commodity components as well—in many cases, from open source or off-the-shelf
components of widely varying quality. Second, the desire to increase redundancy is often at odds with the need to reduce the cost, management
difficulty, and complexity of the solution while maximizing its overall performance. So while improvements in these first two areas are
important to any overall solution, the usefulness of self-healing systems is fundamentally about making significant progress in the third area:
Reducing recovery time and implementing systems that can diagnose, react to, and even predict failures. Even a basic system or blade will soon
have multiple processor cores per die and multiple hardware threads per core: Sun, AMD, IBM, and Intel are all hard at work here. Even a
service deployed on a single system is of increasing depth and complexity: multiple threads per process, multiple processes per component, and
a variety of components from different authors stacked on top of each other.
In this emerging world, each system will be able to do more useful work by supporting more application services with more memory, compute
power, and I/O. One approach is simply to make an individual system the unit of recovery: if anything fails, either restart the whole thing or fail over to another system providing redundancy. Unfortunately, with the increasing physical resources available to each system, this approach is
inherently wasteful: Why restart a whole system if you can disable a particular processor core, restart an individual application, or refrain from
using a bit of your spacious memory or a particular I/O path until a repair is truly needed? Fine-grained recovery provides more effective use of
computing resources and dollars for system administrators and reduces downtime for end users.
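To make fine-grained recovery concrete, consider how little machinery it takes to disable a single processor core at runtime. The sketch below assumes the Linux sysfs CPU-hotplug interface (the path is Linux-specific; Solaris exposes the equivalent capability through psradm, for example) and must run with root privileges:

/* cpu_offline.c - take a single CPU core out of service without a
 * reboot.  Sketch assuming the Linux sysfs hotplug interface; other
 * systems expose equivalent controls through their own tools. */
#include <stdio.h>
#include <stdlib.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);

    FILE *f = fopen(path, "w");          /* requires root privileges */
    if (f == NULL)
        return -1;

    int rc = (fprintf(f, "%d\n", online) > 0) ? 0 : -1;
    fclose(f);
    return rc;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <cpu-id>\n", argv[0]);
        return 1;
    }
    int cpu = atoi(argv[1]);
    if (set_cpu_online(cpu, 0) != 0) {   /* 0 = offline, 1 = online */
        perror("set_cpu_online");
        return 1;
    }
    printf("cpu%d taken offline; the rest of the system keeps running\n",
           cpu);
    return 0;
}

The point of the sketch is the granularity: one core is retired while every other resource on the system continues doing useful work.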
The operating system provides threads as a programming primitive that permits applications to scale transparently and perform better as multiple processors, multiple cores per die, or more hardware threads per core are added. It also provides virtual memory as a programming abstraction that allows applications to scale transparently with available physical memory resources. Operating systems now need to provide the new abstractions that will enable self-healing activity, or graceful degradation in service, without requiring developers to rewrite applications or administrators to purchase expensive hardware that tries to work around the operating system instead of with it. A key requirement of these abstractions, however, is that they enable self-healing systems to make diagnoses and take actions that, like those of your human doctor, begin by doing no harm. The operating system and self-healing software can implement intelligent self-diagnosis and self-healing only if they understand significantly more about hardware/software dependencies than they have historically, and more about the relationships and dependencies in the software stack deployed above.
To see why, let’s consider a simple example. The kernel can detect the failure of any running process by handling various types of exceptions
and deciding to terminate the process or pass an exception along to it (in UNIX terms, it can send the process a signal such as SIGSEGV or
SIGBUS). These exceptions usually cause the process to terminate by default, but can also be intercepted by more intelligent applications to try
to clean up or save data before dying. Historically, such signals indicated a programming error; the errant process attempted to read from an
unmapped or misaligned address (for example, dereferencing a null or bogus pointer). On a modern system such as Solaris, however, the system can also detect, and attempt to recover from, an exception where a process accessed memory with an underlying double-bit ECC (error-correcting code) error: one that can be detected but not corrected by the typical ECC code protecting your DRAM. Similar scenarios can occur with errors in
the processor core itself and its L1 and L2 caches, with varying degrees of recovery possible depending on the capabilities of the underlying
hardware.
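To make this delivery path concrete, here is a minimal sketch of a process that arranges to be told about such an error instead of dying silently. It assumes Linux-style machine-check delivery, where the kernel raises SIGBUS with si_code set to BUS_MCEERR_AR or BUS_MCEERR_AO and si_addr pointing at the failing location; Solaris conveys analogous information through its own siginfo codes:

/* sigbus_demo.c - observe an uncorrectable memory error instead of
 * dying silently.  Assumes Linux-style delivery (BUS_MCEERR_*). */
#define _GNU_SOURCE   /* exposes BUS_MCEERR_* on glibc */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* Use only async-signal-safe calls here: no printf, no malloc,
     * and above all no access to the data being processed when the
     * fault hit - it may live in the same poisoned page. */
    const char *msg =
        (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO)
            ? "hardware memory error - exiting cleanly\n"
            : "SIGBUS from a programming error\n";
    write(STDERR_FILENO, msg, strlen(msg));
    _exit(74);   /* distinct exit code a restarter can recognize */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    pause();   /* the real application's work would go here */
    return 0;
}

The design point is that the handler distinguishes a hardware-induced SIGBUS from a plain programming error, which is exactly the distinction the scenarios below turn on.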
Now the big question: Is it still safe to signal or terminate the process? And if the process is terminated, can it restart? How? What else might be
affected? How much of this do we need to explain to the administrator? In other words, what is a self-healing system supposed to do?
To illustrate the complexity of the problem, let’s consider a few quick scenarios related to only the first question. If the affected memory region
is not shared and the process has no relationship to other processes on the system, then we can terminate the process and simply stop using the
affected physical page of memory rather than returning it to use in the kernel’s free page list. As observed earlier, however, most processes
aren’t like that anymore. If you simply terminate the process, a portion of a multiprocess application may suddenly go missing, causing the application to deadlock or misbehave. Or the process may be providing some type of service to other processes (e.g., a
name service, a database back end), which would cause a cascading failure in other applications. If a signal is sent to the process, there may also
be trouble; if the signal handler contains code that accesses the same bad piece of memory while trying to print a message or save data while
cleaning up, the same error can recur. Any signal that results in a core dump could prove confusing to administrators—the software in this
example is an innocent victim of a hardware problem, and no one should waste time attempting to debug this particular core file or application
software code.
Finally, if the error is in a shared memory region, things get even worse. Many modern multiprocess applications contain their own restart
capability, wherein a parent process monitors and restarts its children. If a child process died from touching shared memory and was restarted,
the new child might well immediately touch the same location again and repeatedly die, again causing application failure or serious degradation
of service. From these few scenarios, one might be tempted to say, “Pass enough information to the application to let it decide what to do,” and punt the problem to application developers or administrators. This leads back to one of the original questions: Is the next generation of self-healing technology going to require everyone to rewrite their applications? Handling that kind of signal sounds complicated, prone to bugs, and not particularly portable.
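That restart loop is easy to sketch, and so is the obvious mitigation: the monitoring parent should throttle restarts and escalate after repeated failures instead of respawning blindly. In this sketch the child body and the policy numbers are hypothetical placeholders:

/* restart_throttle.c - a monitoring parent that restarts a child but
 * gives up and escalates if the child keeps dying in quick succession,
 * e.g., because each incarnation touches the same bad page. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_FAILURES   3      /* hypothetical policy: three deaths...  */
#define WINDOW_SECONDS 60     /* ...within one minute means "broken"   */

static void child_work(void)
{
    /* Placeholder for the real service; aborting at once stands in
     * for a child that faults on the same shared page every time. */
    abort();
}

int main(void)
{
    int failures = 0;
    time_t window_start = time(NULL);

    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            child_work();
            _exit(0);
        }

        int status;
        waitpid(pid, &status, 0);

        time_t now = time(NULL);
        if (now - window_start > WINDOW_SECONDS) {
            window_start = now;   /* open a fresh accounting window */
            failures = 0;
        }
        if (++failures >= MAX_FAILURES) {
            /* Do not loop forever: report a diagnosable problem. */
            fprintf(stderr, "service failing repeatedly; escalating "
                            "to the fault manager or administrator\n");
            return 1;
        }
        sleep(1);   /* crude back-off between restarts */
    }
}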
These scenarios emphasize the need for new abstractions in the system that can solve these problems without continuing to overburden developers and administrators. They also demonstrate that we need to make progress in two distinct areas: the ability to implement rapid, intelligent responses to errors detected in the system as they happen, and the ability to asynchronously diagnose observed errors back to their underlying problems. Once a self-healing system has diagnosed a faulty hardware component or a broken application, it can use this knowledge to trigger actions such as disabling a component, failing over to a redundant resource, or notifying a human administrator that a repair or patch is needed.
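As an illustration of the first area, a rapid response made in place, a handler can retire a poisoned page from its own address space rather than re-touch it: mapping a fresh anonymous page over the bad one lets the retried access read zeros instead of re-consuming the error. This only works for data the application can regenerate (a cache, say), and mmap is not on POSIX's async-signal-safe list, so take this as a sketch of the idea under the same Linux-style delivery assumption as the earlier example, not a hardened implementation:

/* page_retire_demo.c - sketch: replace a poisoned page with a fresh
 * zero page so the faulting access can be retried.  Only sensible
 * for data the application can regenerate. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long g_pagesize;   /* cached in main: sysconf isn't signal-safe */

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t mask = (uintptr_t)g_pagesize - 1;
    void *page = (void *)((uintptr_t)info->si_addr & ~mask);

    /* Map a fresh, zero-filled page over the poisoned one; the kernel
     * backs it with a different physical page, so the interrupted
     * access succeeds when the handler returns.  (mmap is not formally
     * async-signal-safe - illustration only.) */
    if (mmap(page, (size_t)g_pagesize, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
        _exit(74);   /* give up and let a restarter take over */
}

int main(void)
{
    g_pagesize = sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    pause();   /* real work here; on suitably built kernels an error
                  can be injected for testing with madvise(MADV_HWPOISON) */
    return 0;
}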
SELF-HEALING SYSTEM MODEL
A self-healing system is one that:
• Replaces traditional error messages with robust error detection, handling, and correction that produces telemetry for automated diagnosis (a minimal telemetry record is sketched after this list)
• Provides automated diagnosis and response from the error telemetry for hardware and software entities
• Provides recursive fine-grained restart of services based upon knowledge of their dependencies
• Presents simplified administrative interactions for diagnosed problems and their effects on services and resources
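The telemetry in the first point is best pictured as a structured event rather than a free-form log message. The field names below are hypothetical; a real implementation (Solaris, for instance, uses a richer self-describing event protocol) carries considerably more detail:

/* A hypothetical error-telemetry event: structured data a fault
 * manager can consume, replacing a human-oriented console message. */
#include <stdint.h>
#include <time.h>

enum err_class {
    ERR_CPU_CACHE,        /* e.g., an L1/L2 cache line error       */
    ERR_MEM_UE,           /* uncorrectable (double-bit) ECC error  */
    ERR_IO_PATH,          /* I/O device or path error              */
    ERR_SVC_FATAL         /* software service died unexpectedly    */
};

struct err_event {
    enum err_class  class;      /* what kind of error was detected    */
    struct timespec when;       /* high-resolution detection time     */
    uint64_t        resource;   /* affected resource id: CPU id,
                                   physical address, device instance  */
    uint32_t        detector;   /* which detector reported the error  */
    uint32_t        seq;        /* sequence number for correlation    */
};

Because every field is machine-readable, a diagnosis engine can correlate events by time, resource, and detector instead of parsing prose.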
To make tangible progress on these problems, an extensible fault manager can be implemented that receives telemetry from system components, including the kernel, and passes it to self-healing diagnosis and response software, along with a service manager that maintains descriptions of the services on the system and their interdependencies and can implement intelligent automated restart. The service manager can also describe its expectations for the recovery of a set of managed processes to the kernel. This model extends to any system, or to hierarchical networked compositions of systems. A fault manager is itself a service that receives the incoming error telemetry observed by the system and uses appropriate algorithms or expert-system rules to attempt to diagnose these errors automatically back to an underlying problem, such as a hardware fault or a likely defect in an application. A service manager manages the various application services running on the system and uses their dependencies to implement orderly startup, shutdown, and restart. So while the operating system’s fault manager and service manager deal with local resources such as CPUs, memory, I/O devices, and single-system services, the same concepts apply when designing self-healing features for a rack of blades or a networked data center, where a fault manager would track the list of known service outages and a service manager would observe and manage the highest-level set of services offered to the network or data center as a whole.
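The service manager's dependency-aware restart can be sketched as a walk over a dependency graph: restarting a service also restarts everything that depends on it, in order. The service names and structures here are hypothetical, and a production implementation would treat the graph as a DAG and visit each node only once:

/* A hypothetical service graph with recursive, dependency-ordered
 * restart: restarting a service also restarts its dependents. */
#include <stdio.h>

#define MAX_DEPENDENTS 8

struct service {
    const char     *name;
    int             running;
    struct service *dependents[MAX_DEPENDENTS];  /* services needing us */
    int             ndependents;
};

static void restart(struct service *s)
{
    printf("stopping  %s\n", s->name);
    s->running = 0;
    printf("starting  %s\n", s->name);
    s->running = 1;

    /* Anything that depends on this service is restarted after it,
     * so it never runs against a dead dependency. */
    for (int i = 0; i < s->ndependents; i++)
        restart(s->dependents[i]);
}

int main(void)
{
    struct service db  = { .name = "database",   .running = 1 };
    struct service app = { .name = "app-server", .running = 1 };
    struct service web = { .name = "web-front",  .running = 1 };

    db.dependents[db.ndependents++]   = &app;    /* app needs db  */
    app.dependents[app.ndependents++] = &web;    /* web needs app */

    restart(&db);   /* restarts database, then app-server, then web-front */
    return 0;
}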
FAULT MANAGER AND DIAGNOSIS
In this self-healing design, the fault manager is responsible for implementing asynchronous automated diagnoses of problems from error
symptoms. It then uses the results of each diagnosis to trigger an automated response, such as offlining a CPU, device, region of memory, or service, or communicating with a human administrator or higher-level management software. The fault manager therefore manages the list of
problems on the system and exports this as its abstraction to human administrators and higher-level management applications, rather than the
individual underlying error messages it has received. While we believe this new abstraction layer will significantly reduce complexity, it also
forms the basis of a major change from the traditional error model.
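The diagnosis-to-response step can be pictured as a small dispatch table; the problem classes and actions below are hypothetical stand-ins for a real fault manager's policy engine:

/* A hypothetical diagnosis-to-response dispatch: the fault manager
 * maps each diagnosed problem class to an automated action. */
#include <stdio.h>

enum problem { FAULT_CPU, FAULT_MEM_PAGE, DEFECT_SERVICE, UNKNOWN };

static void respond(enum problem p, unsigned long resource)
{
    switch (p) {
    case FAULT_CPU:
        printf("offlining cpu %lu; work migrates to remaining cores\n",
               resource);
        break;
    case FAULT_MEM_PAGE:
        printf("retiring physical page 0x%lx from further use\n",
               resource);
        break;
    case DEFECT_SERVICE:
        printf("restarting service %lu and flagging it for a patch\n",
               resource);
        break;
    default:
        printf("no automated response; notifying the administrator\n");
        break;
    }
}

int main(void)
{
    respond(FAULT_MEM_PAGE, 0x7f3a2000UL);  /* example diagnosed problem */
    return 0;
}

Administrators and management software then interact with the list of diagnosed problems and the responses taken, not with the raw stream of error messages underneath.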