28-08-2017, 04:17 PM
In software systems, the term self-healing describes any application, service, or system that can detect that it is not functioning properly and, without human intervention, make the changes needed to restore itself to its normal or designed state. Self-healing means making the system capable of making its own decisions by continually checking and optimizing its state and automatically adapting to changing conditions. The goal is a fault-tolerant, responsive system that can respond to changes in demand and recover from failures.
Self-healing systems can be divided into three levels, depending on the size and type of resources we are monitoring and acting on. These levels are as follows.
• Application level
• System level
• Hardware level
We will explore each of these three levels separately.
Self-healing at the application level
Application-level healing is the ability of an application, or service, to heal itself internally. Traditionally, we are accustomed to catching problems through exceptions and, in most cases, logging them for further examination. When such an exception occurs, we tend to ignore it and move on (after logging it), as if nothing had happened, hoping for the best in the future. In other cases, we tend to stop the application if an exception of a certain type occurs. An example would be a connection to a database. If the connection cannot be established when the application starts, we often stop the entire process. If we are a little more experienced, we might retry the connection. Hopefully those attempts are limited; otherwise we can easily end up in an endless loop, unless the connection failure is temporary and the database comes back online soon after. Over time, we have gained better ways to deal with problems within applications. One of them is Akka. Its use of supervisors, and the design patterns it promotes, allow us to create internally self-healing applications and services. Akka is not the only one. Many other libraries and frameworks allow us to create fault-tolerant applications capable of recovering from potentially disastrous circumstances. Since we are trying to stay agnostic to programming languages, I will leave it to you, dear reader, to research ways to self-heal your applications internally. Note that self-healing in this context refers to the application's internal logic and does not cover, for example, recovering a process that has failed. In addition, if we adopt a microservices architecture, we can quickly end up with services written in different languages, using different frameworks, and so on. It is really up to the developers of each service to design it in a way that lets it heal itself and recover from failures.
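The limited retry mentioned above can be sketched roughly as follows. Python is used only for illustration. The function name connect_with_retry, its parameters, and the choice of exponential backoff are assumptions made for this example and do not come from any particular library. The connect argument stands in for whatever call opens your database connection.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    `connect` is a hypothetical zero-argument callable that returns a
    connection or raises ConnectionError.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            if attempt < max_attempts:
                # Wait longer after each failure: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Give up instead of looping forever; the caller decides what happens next.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Because the number of attempts is capped, a database that never comes back produces a clear error instead of an endless loop, while a short outage is absorbed by the retries.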
Self-healing at the system level
Unlike application-level healing, which depends on the programming language and the design patterns we apply internally, system-level self-healing can be generalized and applied to all services and applications, regardless of their internals. This is the kind of self-healing we can design at the level of the whole system. While many things can happen at the system level, the two most commonly monitored aspects are process failures and response time. If a process fails, we need to redeploy the service or restart the process. On the other hand, if the response time is not adequate, we need to scale up or scale down, depending on whether we have reached the upper or the lower response time limit. Recovering from process failures is often not enough. While such actions can restore our system to the desired state, human intervention is often still necessary. We need to investigate the cause of the failure and correct the design of the service or fix a bug. That is, self-healing often goes hand in hand with investigating the causes of the failure. The system recovers automatically and we (humans) try to learn from those failures and improve the system as a whole. For that reason, some kind of notification is also required. In both cases (failure and increased traffic), the system needs to be monitored and needs to take some action.
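As a minimal sketch of the decision logic described above, assuming the monitoring data has already been collected by some other tool: the ServiceStatus fields, the action names, and the threshold values below are all made up for this example, and notify is a hypothetical callback (e-mail, chat message, and so on).

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    running: bool            # is the process alive?
    response_time_ms: float  # recent average response time
    instances: int           # number of instances currently running


def decide_action(status, lower_ms=100.0, upper_ms=500.0, min_instances=1):
    """Map the two monitored aspects to one corrective action."""
    if not status.running:
        return "restart"
    if status.response_time_ms > upper_ms:
        return "scale_up"
    if status.response_time_ms < lower_ms and status.instances > min_instances:
        return "scale_down"
    return "none"


def heal(status, notify):
    """Pick an action and tell a human about failures so the cause gets investigated."""
    action = decide_action(status)
    if action == "restart":
        notify("process failed and is being restarted; cause still needs investigation")
    return action
```

The function only decides what to do. Actually restarting or scaling a service depends on the platform in use and is left out on purpose.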
Self-healing at the hardware level
In fact, there is no such thing as self-healing hardware. We cannot have a process that automatically heals failed memory, repairs a broken hard disk, fixes a faulty CPU, and so on. What healing really means at this level is the redistribution of services from an unhealthy node to one of the healthy ones. As with the system level, we need to periodically check the status of the different hardware components and act accordingly. In fact, most of the healing triggered by hardware problems will happen at the system level. If the hardware does not work properly, the service will likely fail and will therefore be fixed by system-level healing. Hardware-level healing is more closely related to the preventive types of checks we will discuss shortly.
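The redistribution idea can be illustrated with a small sketch. It assumes we already know which nodes passed their health checks. The function redistribute, the node and service names, and the "least loaded node first" rule are assumptions for this example. A real scheduler would also consider things like memory and CPU.

```python
def redistribute(placement, healthy):
    """Move services off unhealthy nodes onto the healthy ones.

    placement: dict mapping node name -> list of service names.
    healthy:   set of node names that passed their health checks.
    Returns a new placement that contains only healthy nodes.
    """
    targets = {node: list(placement.get(node, [])) for node in sorted(healthy)}
    if not targets:
        raise RuntimeError("no healthy nodes available")
    for node, services in placement.items():
        if node in healthy:
            continue
        for service in services:
            # Pick the healthy node running the fewest services (ties broken by name).
            least_loaded = min(targets, key=lambda n: (len(targets[n]), n))
            targets[least_loaded].append(service)
    return targets
```

For example, with "api" on node-1, "db" and "cache" on a failed node-2, and an empty node-3, "db" would move to node-3 and "cache" to node-1.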