24-06-2013, 03:29 PM
An Online Data Access Prediction and Optimization Approach for Distributed Systems
Abstract
Current scientific applications produce large amounts of data, whose processing, handling, and analysis require
large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of
data-intensive applications by optimizing data accesses. To achieve this goal, distributed storage systems have
considered techniques such as data replication, migration, distribution, and access parallelism. However, the main drawback of those
studies is that they do not take application behavior into account when optimizing data access. This limitation motivated this
paper, which applies strategies to support the online prediction of application behavior in order to optimize data access operations on
distributed systems, without requiring any information on past executions. To accomplish this goal, the approach organizes
application behaviors as time series and then analyzes and classifies those series according to their properties. Knowing these
properties, the approach selects modeling techniques to represent the series and perform predictions, which are later used to optimize
data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-CERN
project and widely employed by the scientific community. Experiments confirm that this new approach reduces application execution
time by about 50 percent, especially when handling large amounts of data.
INTRODUCTION
SCIENTIFIC applications tend to produce and handle large
amounts of data. Examples of such applications
include the Pan-STARRS project,1 which captures about
2.5 petabytes (PB) of data per year, and the Large Hadron
Collider (LHC) project,2 which generates from 50 to 100 PB
of data every year. Those applications tend to rely on
distributed computing tools to meet their high-performance
and storage requirements. Such tools have
great potential for solving a vast class of complex
problems; however, they are still limited in terms of
manipulating large amounts of data [1].
RELATED WORKS
This section presents related work on the prediction of
application behavior and on data access optimization approaches.
Prediction of Application Behavior
Several studies have been considering statistical techniques
to analyze data and construct probabilistic models to
characterize and predict application workloads in distributed
environments. Those models have been employed to
assist fault diagnosis, resource allocation, and system
performance optimizations.
As one of the first works in this area, Devarakonda and
Iyer [21] propose a statistical approach to predict the
consumption of CPU, file system I/O, and memory. Their
study models the behaviors of processes using automata
and stores those models in databases. When a new process
arrives at the system, the approach checks whether any
automaton is capable of representing it. If so, that automaton
is used to estimate the resource requirements of the new
process. Trace-driven experiments confirm a strong correlation
between past and subsequent executions.
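The automaton-based lookup described above can be sketched roughly as follows. All names, states, and resource figures here are hypothetical illustrations, not the actual models or interface from Devarakonda and Iyer's system:

```python
# Hypothetical sketch of an automaton-based resource predictor: each stored
# model maps a process identifier to a transition table over usage states.
# States, processes, and numbers are illustrative only.

class ResourceAutomaton:
    def __init__(self, transitions):
        # transitions: state -> (next_state, expected resource usage)
        self.transitions = transitions

    def predict(self, state):
        """Return (next_state, expected usage) or None if state is unknown."""
        return self.transitions.get(state)

# "Database" of models built from previously observed processes.
model_db = {
    "gzip": ResourceAutomaton({
        "start":   ("reading", {"cpu": 0.2, "io_mb": 50}),
        "reading": ("compute", {"cpu": 0.9, "io_mb": 5}),
        "compute": ("done",    {"cpu": 0.1, "io_mb": 10}),
    }),
}

def estimate_requirements(process_name, state="start"):
    # When a new process arrives, check whether a stored automaton matches it.
    automaton = model_db.get(process_name)
    if automaton is None:
        return None  # no past model: the scheduler falls back to defaults
    return automaton.predict(state)

print(estimate_requirements("gzip"))   # ('reading', {'cpu': 0.2, 'io_mb': 50})
print(estimate_requirements("sort"))   # None
```

The key design point is the fallback: a process with no matching automaton is simply scheduled without a prediction, so the predictor never blocks execution.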
TIME SERIES ANALYSIS
By modeling the outputs produced by real-world systems,
we can study and, therefore, understand how they work and
behave under different circumstances. This is especially
useful for predicting behavior and,
consequently, supporting decision making, which is particularly
required in certain application domains. Here, we are
interested in predicting read and write operations in an attempt
to optimize data access in distributed environments.
Outputs produced by real-world systems present a strong
temporal dependency, i.e., adjacent observations are dependent
[33]. This dependency strongly reduces the modeling
accuracy of conventional techniques. To overcome
this problem, a dedicated area, called Time Series
Analysis [34], was developed, in which data are commonly organized in
terms of variables and their observations over time.
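The effect of temporal dependency can be seen in a toy experiment: a minimal sketch (not from the paper) in which a series with dependent adjacent observations is modeled far better by a lagged regression than by a model that ignores time, such as the plain mean:

```python
import numpy as np

# Toy illustration: a series with strong temporal dependency (AR(1))
# is modeled far better by regressing on the previous observation
# than by an i.i.d.-style model that ignores time.
rng = np.random.default_rng(0)
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()  # adjacent observations dependent

# Fit x[t] ~ phi * x[t-1] by ordinary least squares.
phi = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])

mean_mse = np.mean((x[1:] - x[1:].mean()) ** 2)  # time-unaware model
ar_mse = np.mean((x[1:] - phi * x[:-1]) ** 2)    # time-series model

print(f"estimated phi = {phi:.2f}")  # close to the true coefficient 0.8
print(f"AR(1) error / mean-model error = {ar_mse / mean_mse:.2f}")
```

The lagged model recovers the dependency coefficient and cuts the prediction error substantially, which is exactly why time-series techniques are preferred here over conventional ones.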
Data Access Optimization
Several studies have attempted to improve data
access in distributed environments. Such works mainly
focus on data replication, distribution, and consistency.
In this context, Oldfield and Kotz [27] proposed the Armada
framework to execute, control, and monitor applications.
Armada builds graph structures to represent processing
and data flows. These graphs, which capture process-versus-data
dependencies, support decisions on moving data toward
processes, thus reducing execution time. Experiments
compare applications running on a traditional environment
and on Armada. In that scenario, Armada improves
network throughput by around 40 percent.
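The idea of using a process-versus-data dependency graph to decide where data should live can be sketched as follows. This is an illustrative simplification, not Armada's actual algorithm; node names, data set names, and volumes are hypothetical:

```python
# Hypothetical sketch of graph-driven data placement: given edges recording
# which process nodes read which data sets (and how much), place each data
# set on the node whose processes consume it most, minimizing remote reads.
# All names and numbers are illustrative.

reads = {  # (process_node, data_set) -> MB read per run
    ("node_a", "catalog"): 800,
    ("node_b", "catalog"): 100,
    ("node_b", "events"):  600,
}

def best_placement(reads):
    # Aggregate read volume per (data_set, node) pair.
    volume = {}
    for (node, ds), mb in reads.items():
        volume[(ds, node)] = volume.get((ds, node), 0) + mb
    # Place each data set on its heaviest consumer.
    placement = {}
    for ds in {d for _, d in reads}:
        candidates = {n: v for (d, n), v in volume.items() if d == ds}
        placement[ds] = max(candidates, key=candidates.get)
    return placement

print(best_placement(reads))  # {'catalog': 'node_a', 'events': 'node_b'}
```

Moving data toward its dominant consumers converts remote transfers into local reads, which is the mechanism behind the throughput gains reported for Armada.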
Classifying Data Sets: Assessing the Time
Series Generation Process
In this paper, we organize process behavior as time series
and, afterwards, predict observations which serve as input to a
data access optimization approach. First of all, however, we
need to understand the generation process of those time
series, which supports the selection of adequate modeling
techniques. To accomplish that, we employ the
approach by Ishii et al. [20] on the three data sets to assess
their stochasticity, linearity, and stationarity. After defining
such properties for every data set, we then select the most
adequate model to represent each one. The Recurrence Plot [41]
was used to evaluate stochasticity, the White Neural
Network [35] to evaluate linearity, and the Space-Time
Separation Plot [47] together with the Autocorrelation Function
(ACF) [48] to evaluate stationarity.
CONCLUDING REMARKS
This paper has presented a data access optimization
approach that uses predictive techniques for distributed
computing environments. Our main objective is to minimize
application execution time by optimizing data
accesses and, therefore, improving decisions on replication,
migration, and consistency. To that end, data access operations
are transformed into time series. By modeling those
series, we can understand the behavior of applications and,
therefore, predict future observations. Such predictions
support taking decisions in advance. This modeling, however,
depends on specific aspects of each time series,
such as its stochasticity, linearity, and stationarity.