Just-In-Time Analytics on Large File Systems
Abstract
As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be quickly answered by hierarchical file systems such as ext3 and NTFS. Existing pre-processing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent use of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate, i.e., statistically accurate, answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
Introduction
Today a file system with billions of files, millions of directories, and petabytes of storage is no longer an exception [29]. As file systems grow, users and administrators are increasingly keen to perform complex queries [37, 47], such as "How many files have been updated since ten days ago?" and "Which are the top five largest files that belong to John?". The first is an example of aggregate queries, which provide a high-level summary of all or part of the file system, while the second is an example of top-k queries, which locate the k files and/or directories that have the highest scores according to a scoring function.
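
To make the two query classes concrete, the hypothetical snippet below expresses both over an in-memory list of file records. The record layout, paths, and values are illustrative only and do not come from the paper; materializing such records for a real file system is exactly the crawling cost Glance is designed to avoid.

    import time

    # Hypothetical file records: (path, owner, size in bytes, mtime).
    files = [
        ("/home/john/a.bin", "john", 4_200_000_000, time.time() - 86400),
        ("/home/john/b.log", "john", 1_000, time.time() - 20 * 86400),
        ("/var/data/c.dat", "alice", 9_500_000_000, time.time() - 3600),
    ]

    # Aggregate query: how many files have been updated in the last ten days?
    ten_days_ago = time.time() - 10 * 86400
    updated = sum(1 for _, _, _, mtime in files if mtime >= ten_days_ago)

    # Top-k query: the five largest files owned by John; the scoring
    # function here is simply file size.
    top5 = sorted((f for f in files if f[1] == "john"),
                  key=lambda f: f[2], reverse=True)[:5]
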
Aggregate Query Processing
In this section, we develop FS_Agg, our algorithm for processing aggregate queries. We first describe FS_Agg_Basic, a vanilla algorithm which illustrates our main idea of unbiased aggregate estimation through a random descent process within a file system. Then, we describe two ideas that make the vanilla algorithm practical over very large file systems: high-level crawling leverages the special properties of a file system to reduce the standard error of estimation, and breadth-first implementation improves both the accuracy and the efficiency of query processing. Finally, we combine all three techniques to produce FS_Agg.
FS_Agg_Basic
A Random Descent Process: In general, the folder organization of a file system can be considered a tree or a directed acyclic graph (DAG), depending on whether the file system allows hard links to the same file. The random descent process we are about to discuss can be applied to both cases with little change. For ease of understanding, we first focus on the case of a tree-like folder structure, and then discuss a simple extension to DAGs at the end of this subsection.
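
The excerpt names the random descent but does not spell out its estimator, so the sketch below uses the standard Horvitz-Thompson construction: each descent walks one root-to-leaf path, choosing a subdirectory uniformly at random at every level, and weights the files it sees by the inverse of the probability of reaching them. The function name and the use of os.scandir are our own choices; treat this as a minimal sketch under those assumptions, not Glance's implementation.

    import os
    import random

    def random_descent_estimate(root):
        """Return one unbiased estimate of the total number of files
        under `root`, assuming a tree-shaped (hard-link-free) hierarchy."""
        estimate = 0.0
        inverse_prob = 1.0   # 1 / Pr[reaching the current directory]
        current = root
        while True:
            subdirs, n_files = [], 0
            with os.scandir(current) as entries:
                for e in entries:
                    if e.is_dir(follow_symlinks=False):
                        subdirs.append(e.path)
                    else:
                        n_files += 1
            # Files counted here are reached with probability 1/inverse_prob,
            # so weighting them by inverse_prob keeps the estimate unbiased.
            estimate += n_files * inverse_prob
            if not subdirs:
                return estimate
            # Descend into one uniformly chosen subdirectory.
            inverse_prob *= len(subdirs)
            current = random.choice(subdirs)

    # Averaging many independent descents reduces variance:
    # est = sum(random_descent_estimate("/data") for _ in range(1000)) / 1000

Each descent touches only one directory per level, so its cost is proportional to the depth of the tree rather than its size; averaging many independent descents drives the standard error down.
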
Detailed Design
The design of FS_TopK is built upon the hypothesis that the highest scores estimated in Step 2, when compared with the lower bound estimated in Step 1, can prune a large portion of the tree, significantly reducing the overhead of crawling in Step 3. In the following, we first describe the estimation of the lower bound and the highest scores in Steps 1 and 2, and then discuss the validity of this hypothesis for various types of scoring functions.
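
Before turning to the estimations themselves, the following sketch shows how the three steps fit together, assuming the hypothesis holds. The two estimator functions are placeholders for the order-statistics machinery described next; their names and signatures are ours, not the paper's.

    def topk_with_pruning(k, subtrees, estimate_kth_largest,
                          estimate_subtree_max, crawl, score):
        # Step 1: estimate a lower bound on the k-th largest score overall.
        lower_bound = estimate_kth_largest(k)
        # Step 2: keep only subtrees whose estimated best score could
        # still place a file in the top k; prune the rest.
        survivors = [t for t in subtrees
                     if estimate_subtree_max(t) >= lower_bound]
        # Step 3: crawl the surviving subtrees exhaustively and rank.
        candidates = [f for t in survivors for f in crawl(t)]
        return sorted(candidates, key=score, reverse=True)[:k]

If the Step 2 estimates run too low, pruning can drop true top-k files, which is why the validity of the hypothesis for a given scoring function matters.
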
Both estimations in the two steps can be made from the order statistics [20] of the files retrieved by the random descent process in FS_Agg. The reason is that both estimations are essentially on the order statistics of the population (i.e., all files in the system): the lower bound in Step 1 is the k-th largest order statistic of all files, while the highest scores are the largest order statistics of the subtrees. We refer readers to [20] for details of how the order statistics of a sample can be used to estimate those of the population, and how accurate such an estimation is.
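
As a rough illustration of the idea (one textbook construction, not necessarily the estimator of [20] or of Glance): with a uniform random sample of size n from a population of size N, sample quantiles track population quantiles, so the population's k-th largest value lies near the sample's ceil(kn/N)-th largest value.

    import math

    def estimate_kth_largest(sample, population_size, k):
        """Estimate the k-th largest score in the population from a
        uniform random sample of scores."""
        n = len(sample)
        # Rank of the corresponding sample order statistic (at least 1).
        r = max(1, math.ceil(k * n / population_size))
        return sorted(sample, reverse=True)[r - 1]

    # Example: with 1,000 sampled file sizes out of ~10,000,000 files,
    # the 5th largest overall maps to roughly the sample's largest value.
    # lb = estimate_kth_largest(sampled_sizes, 10_000_000, 5)

In Glance, the population size itself would have to come from an FS_Agg estimate, and files retrieved by random descent are not uniformly sampled, so their scores would need to be reweighted by the descent probabilities; the sketch above ignores both refinements.
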