01-02-2013, 04:38 PM
Detecting large-scale system problems by mining console logs
ABSTRACT
Today’s large-scale Internet services run in large server clusters in data centers and cloud computing environments. These system architectures enable highly scalable Internet services at a relatively low cost. However, detecting and diagnosing problems in such systems brings new challenges for both system developers and operators. One significant problem is that as the system scales, the amount of information operators need to process goes far beyond the level that can be handled manually, and thus there is a huge demand for automatic processing of monitoring data. Much work has been done on automatic problem detection and diagnosis in such systems.
Researchers and operators have been using all kinds of monitoring data, from the simplest numerical metrics such as resource utilization counts (Lakhina et al., 2004; Cohen et al., 2005; Bodik et al., 2010) to system events (Hellerstein et al., 2002; Ma & Hellerstein, 2001) to more detailed tracing such as execution paths (Chen et al., 2002; Chen & Brewer, 2004). However, console logs, the debugging information built into almost every piece of software, are rarely studied by either operators or the research community. Since the dawn of programming, developers have used everything from printf to complex logging and monitoring libraries (Fonseca et al., 2007; Gulcu, 2002) to record program variable values, trace execution, report runtime statistics, and even print out full-sentence messages designed to be read by a human, usually the developer.
However, modern large-scale services usually combine large open-source components authored by hundreds of developers, and the people scouring the logs (part integrator, part developer, part operator, and charged with fixing the problem) are usually not the people who chose what to log or why.
Furthermore, even in well-tested code, many operational problems depend on the deployment and runtime environment and cannot be easily reproduced by the developer. Thus, it is unavoidable that people other than the original developers need to examine logs from time to time when diagnosing problems.
Our goal is to provide them with better tools to extract value from the console logs. As logs are too large to examine manually and too unstructured to analyze automatically, operators typically create ad hoc scripts to search for keywords such as “error” or “critical,” but this has been shown to be insufficient for determining problems (Jiang et al., 2009; Oliner & Stearley, 2007). Rule-based processing (Prewett, 2003) is an improvement, but the operators’ lack of detailed knowledge about specific components and their interactions makes it difficult to write rules that pick out the most relevant sets of events for problem detection.
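To make the limitation of keyword search concrete, here is a minimal sketch of the kind of ad hoc script described above. The log lines, the keyword list, and the function name keyword_scan are all invented for illustration; they are not taken from any real system or from the work being summarized.

```python
import re

# Hypothetical keywords an operator might search for.
KEYWORDS = re.compile(r"\b(error|critical)\b", re.IGNORECASE)


def keyword_scan(lines):
    """Return the lines that contain any of the keywords."""
    return [line for line in lines if KEYWORDS.search(line)]


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    "INFO  starting block transfer id=17",
    "INFO  retry succeeded after transient error id=17",
    "INFO  starting block transfer id=18",
    "ERROR disk write failed id=19",
]

flagged = keyword_scan(SAMPLE_LOG)
for line in flagged:
    print(line)
```

In this made-up log, the scan flags a benign retry message simply because it contains the word "error", alongside the genuine disk failure. It says nothing about transfer 18, which starts but never reports an outcome, since no keyword marks a message that is missing.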