02-06-2014, 11:15 AM
A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
A Local-Optimisation based Datasets Storage Strategy
A Motivating Example and Problem Analysis
Important Concepts and Cost Model of Datasets Storage in the Cloud
Simulation and Evaluation
Problem Analysis
Which datasets should be stored?
Data challenge: scientific data is expected to double every year over the next decade and beyond [Szalay et al., Nature, 2006].
Scientific workflows are very complex and there are dependencies among datasets.
Furthermore, a single scientist can no longer decide the storage status of a dataset.
Datasets should be stored based on the trade-off between computation cost and storage cost.
A cost-effective datasets storage strategy is needed.
Attributes of a Dataset in DDG
A dataset d_i in a DDG has the attributes <x_i, y_i, f_i, v_i, provSet_i, CostR_i>:
x_i ($) denotes the cost of generating dataset d_i from its direct predecessors.
y_i ($/t) denotes the cost of storing dataset d_i in the system per time unit.
f_i (Boolean) is a flag denoting whether dataset d_i is stored in or deleted from the system.
v_i (Hz) denotes the usage frequency, i.e. how often d_i is used.
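The attributes above can be sketched as a small record type, together with a helper that evaluates the total cost rate of a given storage strategy. This is a minimal illustration only. It assumes a linear DDG (each dataset depends only on the one before it) and assumes that a deleted dataset costs its regeneration cost from the nearest stored predecessor multiplied by its usage frequency. That rule is not spelled out in the text above. The names `Dataset` and `total_cost_rate` are hypothetical, and provSet and CostR are left out because they are not defined here.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    x: float  # generation cost from direct predecessors ($)
    y: float  # storage cost per time unit ($/t)
    f: bool   # True if the dataset is stored, False if deleted
    v: float  # usage frequency (uses per time unit)


def total_cost_rate(chain):
    """Total cost per time unit of a linear chain under its current flags.

    Assumed rule: a stored dataset costs y; a deleted dataset costs
    (regeneration cost from the nearest stored predecessor) * v.
    """
    total = 0.0
    regen_cost = 0.0  # accumulated x since the last stored dataset
    for d in chain:
        if d.f:
            total += d.y
            regen_cost = 0.0  # later datasets can start from this one
        else:
            regen_cost += d.x
            total += regen_cost * d.v
    return total


# Illustrative values only: delete the first and last datasets, keep the middle.
chain = [
    Dataset(x=8.0, y=1.0, f=False, v=0.0625),
    Dataset(x=16.0, y=0.5, f=True, v=0.25),
    Dataset(x=4.0, y=2.0, f=False, v=0.25),
]
rate = total_cost_rate(chain)
```

Flipping the `f` flags and re-evaluating shows the trade-off directly: storing a rarely used dataset wastes storage cost, while deleting a frequently used one multiplies its regeneration cost.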
CTT-SP Algorithm
Goal: find the minimum-cost storage strategy for a DDG.
Philosophy of the algorithm:
Construct a Cost Transitive Tournament (CTT) based on the DDG.
In the CTT, the paths (from the start dataset to the end dataset) have a one-to-one mapping to the storage strategies of the DDG.
The length of each path equals the total cost rate of the corresponding storage strategy.
The Shortest Path (SP) represents the minimum cost storage strategy.
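The idea can be sketched for the simplest case, a linear DDG. The sketch below is an illustration under stated assumptions, not the authors' implementation. It adds a virtual start node and a virtual end node. An edge from node i to node j means "store both endpoints and delete everything in between". The edge weight is assumed to be the storage cost rate of the target plus, for each deleted dataset in between, its regeneration cost from node i times its usage frequency. Because the graph is acyclic, the shortest path is found here with a simple dynamic programme in node order rather than a general shortest-path routine. All names (`Dataset`, `edge_weight`, `min_cost_strategy`) are hypothetical.

```python
from dataclasses import dataclass
from math import inf


@dataclass
class Dataset:
    x: float  # generation cost from direct predecessors ($)
    y: float  # storage cost per time unit ($/t)
    v: float  # usage frequency (uses per time unit)


def edge_weight(ds, i, j):
    """Weight of CTT edge i -> j (assumed definition).

    Nodes: 0 is the virtual start, 1..n are ds[0..n-1], n+1 is the virtual end.
    The edge means: node j is stored, every dataset strictly between i and j
    is deleted and regenerated from node i when used.
    """
    n = len(ds)
    weight = ds[j - 1].y if j <= n else 0.0  # the virtual end costs nothing
    regen_cost = 0.0
    for k in range(i + 1, j):
        regen_cost += ds[k - 1].x
        weight += regen_cost * ds[k - 1].v
    return weight


def min_cost_strategy(ds):
    """Return (minimum total cost rate, indices into ds of stored datasets)."""
    n = len(ds)
    best = [inf] * (n + 2)   # best[j]: shortest distance from start to node j
    prev = [None] * (n + 2)  # predecessor of node j on that shortest path
    best[0] = 0.0
    for j in range(1, n + 2):
        for i in range(j):
            candidate = best[i] + edge_weight(ds, i, j)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i
    # Walk back from the virtual end; interior nodes are the stored datasets.
    stored = []
    node = prev[n + 1]
    while node:  # stops at the virtual start (node 0)
        stored.append(node - 1)
        node = prev[node]
    return best[n + 1], sorted(stored)


# Illustrative values only.
datasets = [
    Dataset(x=8.0, y=1.0, v=0.0625),
    Dataset(x=16.0, y=0.5, v=0.25),
    Dataset(x=4.0, y=2.0, v=0.25),
]
cost, stored = min_cost_strategy(datasets)
```

With these values the shortest path passes through the second dataset only, so the sketch keeps that dataset and regenerates the other two on demand. Every other subset of stored datasets corresponds to a longer path.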