01-02-2013, 04:37 PM
Capturing, indexing, clustering, and retrieving system history
ABSTRACT
When complex software systems misbehave, whether through a partial failure, a violation of an established service-level objective (SLO), or some other unexpected response to workload, understanding the likely causes of the problem can speed repair. While a variety of problems can be solved by simple mechanisms such as rebooting [3], many cannot. These include problems in which a misallocation or shortage of resources leads to a persistent performance problem or other anomaly that can be addressed only by a nontrivial configuration change.
Understanding and documenting the likely causes of such problems is difficult because they often emerge from the behavior of a collection of low-level metrics such as CPU load and disk I/O rates, and therefore simple "rules of thumb" focusing on a single metric are usually misleading. Furthermore, today there is no systematic way to leverage past diagnostic efforts when a problem arises, even though such efforts may be expensive and are on the critical path of continued system operation. To that end, we would like to be able to recognize and retrieve similar problem instances from the past. If the problem was previously resolved, we can try to justify the diagnosis and perhaps even apply the repair actions. Even if the problem remained unresolved, we could gather statistics regarding the frequency or even periodicity of its recurrence, accumulating the information needed to prioritize or escalate diagnosis and repair efforts. To do these things, we must be able to extract from the system an indexable description that both distills the essential system state associated with the problem and can be formally manipulated to facilitate automated clustering and similarity-based search. Meeting these requirements would enable matching an observed behavior against a database of previously observed ones, both for retrieval and for determining whether the problem is a recurrent one.
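To make the idea of similarity-based retrieval a little more concrete, here is a minimal Python sketch. The abstract does not say how the indexable description is actually constructed, so everything here is an assumption for illustration only: the fixed-length metric vectors, the use of cosine similarity, the metric names, the incident labels, and the function names are all hypothetical.

```python
import math

# Hypothetical metric order for every description vector (an assumption,
# not something taken from the abstract).
METRICS = ("cpu_load", "disk_io_rate", "net_latency")


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve_similar(query, history, top_k=3):
    """Return the top_k (score, label) pairs from history, best match first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up database of previously observed problem descriptions.
history = [
    ("incident-A", [0.90, 0.10, 0.20]),
    ("incident-B", [0.10, 0.80, 0.70]),
    ("incident-C", [0.85, 0.20, 0.10]),
]

# A newly observed behavior to match against the database.
query = [0.88, 0.15, 0.15]
matches = retrieve_similar(query, history, top_k=2)

for score, label in matches:
    print(f"{label}: similarity {score:.3f}")
```

A real system would presumably need a richer description than raw metric values, since the abstract itself notes that single metrics are usually misleading. The sketch only shows the retrieval step once some vector description exists.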