Just-In-Time Analytics on Large File Systems
Abstract
As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be quickly answered by hierarchical file systems such as ext3 and NTFS. Existing pre-processing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent use of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate, i.e., statistically accurate, answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
Introduction
Today a file system with billions of files, millions of directories, and petabytes of storage is no longer an exception [29]. As file systems grow, users and administrators are increasingly keen to perform complex queries [37, 47], such as "How many files have been updated since ten days ago?" and "Which are the top five largest files that belong to John?". The first is an example of aggregate queries, which provide a high-level summary of all or part of the file system, while the second is an example of top-k queries, which locate the k files and/or directories that have the highest scores according to a scoring function.
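
To make the two query classes concrete, the hypothetical snippet below expresses both over an in-memory list of file records. The record layout, paths, and values are illustrative only and do not come from the paper; materializing such records for a real file system is exactly the crawling cost Glance is designed to avoid.

    import time

    # Hypothetical file records: (path, owner, size in bytes, mtime).
    files = [
        ("/home/john/a.bin", "john", 4_200_000_000, time.time() - 86400),
        ("/home/john/b.log", "john", 1_000, time.time() - 20 * 86400),
        ("/var/data/c.dat", "alice", 9_500_000_000, time.time() - 3600),
    ]

    # Aggregate query: how many files have been updated in the last ten days?
    ten_days_ago = time.time() - 10 * 86400
    updated = sum(1 for _, _, _, mtime in files if mtime >= ten_days_ago)

    # Top-k query: the five largest files owned by John; the scoring
    # function here is simply file size.
    top5 = sorted((f for f in files if f[1] == "john"),
                  key=lambda f: f[2], reverse=True)[:5]
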
Aggregate Query Processing
In this section, we develop FS_Agg, our algorithm for processing aggregate queries. We first describe FS_Agg_Basic, a vanilla algorithm which illustrates our main idea of unbiased aggregate estimation through a random descent process within a file system. Then, we describe two ideas that make the vanilla algorithm practical over very large file systems: high-level crawling leverages the special properties of a file system to reduce the standard error of estimation, and breadth-first implementation improves both the accuracy and the efficiency of query processing. Finally, we combine all three techniques to produce FS_Agg.
FS_Agg_Basic
A Random Descent Process: In general, the folder organization of a file system can be considered a tree or a directed acyclic graph (DAG), depending on whether the file system allows hard links to the same file. The random descent process we are about to discuss can be applied to both cases with little change. For ease of understanding, we first focus on the case of a tree-like folder structure, and then discuss a simple extension to DAGs at the end of this subsection.
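
The excerpt names the random descent but does not spell out its estimator, so the sketch below uses the standard Horvitz-Thompson construction: each descent walks one root-to-leaf path, choosing a subdirectory uniformly at random at every level, and weights the files it sees by the inverse of the probability of reaching them. The function name and the use of os.scandir are our own choices; treat this as a minimal sketch under those assumptions, not Glance's implementation.

    import os
    import random

    def random_descent_estimate(root):
        """Return one unbiased estimate of the total number of files
        under `root`, assuming a tree-shaped (hard-link-free) hierarchy."""
        estimate = 0.0
        inverse_prob = 1.0   # 1 / Pr[reaching the current directory]
        current = root
        while True:
            subdirs, n_files = [], 0
            with os.scandir(current) as entries:
                for e in entries:
                    if e.is_dir(follow_symlinks=False):
                        subdirs.append(e.path)
                    else:
                        n_files += 1
            # Files counted here are reached with probability 1/inverse_prob,
            # so weighting them by inverse_prob keeps the estimate unbiased.
            estimate += n_files * inverse_prob
            if not subdirs:
                return estimate
            # Descend into one uniformly chosen subdirectory.
            inverse_prob *= len(subdirs)
            current = random.choice(subdirs)

    # Averaging many independent descents reduces variance:
    # est = sum(random_descent_estimate("/data") for _ in range(1000)) / 1000

Each descent touches only one directory per level, so its cost is proportional to the depth of the tree rather than its size; averaging many independent descents drives the standard error down.
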
Detailed Design
The design of FS_TopK is built upon the hypothesis that the highest scores estimated in Step 2, when compared with the lower bound estimated in Step 1, can prune a large portion of the tree, significantly reducing the overhead of crawling in Step 3. In the following, we first describe the estimation of the lower bound and the highest scores in Steps 1 and 2, and then discuss the validity of this hypothesis for various types of scoring functions.
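
Before turning to the estimations themselves, the following sketch shows how the three steps fit together, assuming the hypothesis holds. The two estimator functions are placeholders for the order-statistics machinery described next; their names and signatures are ours, not the paper's.

    def topk_with_pruning(k, subtrees, estimate_kth_largest,
                          estimate_subtree_max, crawl, score):
        # Step 1: estimate a lower bound on the k-th largest score overall.
        lower_bound = estimate_kth_largest(k)
        # Step 2: keep only subtrees whose estimated best score could
        # still place a file in the top k; prune the rest.
        survivors = [t for t in subtrees
                     if estimate_subtree_max(t) >= lower_bound]
        # Step 3: crawl the surviving subtrees exhaustively and rank.
        candidates = [f for t in survivors for f in crawl(t)]
        return sorted(candidates, key=score, reverse=True)[:k]

If the Step 2 estimates run too low, pruning can drop true top-k files, which is why the validity of the hypothesis for a given scoring function matters.
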
Both estimations in the two steps can be made from the order statistics [20] of the files retrieved by the random descent process in FS_Agg. The reason is that both estimations are essentially on the order statistics of the population (i.e., all files in the system): the lower bound in Step 1 is the k-th largest order statistic of all files, while the highest scores are the largest order statistics of the subtrees. We refer readers to [20] for details of how the order statistics of a sample can be used to estimate those of the population, and how accurate such an estimation is.
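
As a rough illustration of the idea (one textbook construction, not necessarily the estimator of [20] or of Glance): with a uniform random sample of size n from a population of size N, sample quantiles track population quantiles, so the population's k-th largest value lies near the sample's ceil(kn/N)-th largest value.

    import math

    def estimate_kth_largest(sample, population_size, k):
        """Estimate the k-th largest score in the population from a
        uniform random sample of scores."""
        n = len(sample)
        # Rank of the corresponding sample order statistic (at least 1).
        r = max(1, math.ceil(k * n / population_size))
        return sorted(sample, reverse=True)[r - 1]

    # Example: with 1,000 sampled file sizes out of ~10,000,000 files,
    # the 5th largest overall maps to roughly the sample's largest value.
    # lb = estimate_kth_largest(sampled_sizes, 10_000_000, 5)

In Glance, the population size itself would have to come from an FS_Agg estimate, and files retrieved by random descent are not uniformly sampled, so their scores would need to be reweighted by the descent probabilities; the sketch above ignores both refinements.
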