Optimal File-Bundle Caching Algorithms for Data-Grids
ABSTRACT
The file-bundle caching problem arises frequently in scientific applications where jobs need to process several files
simultaneously. Consider a host system in a data-grid that maintains a staging disk, or disk cache, for servicing jobs'
file requests. In this environment, a job can be serviced only if all of its requested files are present in the disk cache.
Files must therefore be admitted into the cache, or replaced, in file-bundles, i.e., sets of files that must all be processed
simultaneously. In this paper we show that traditional caching algorithms based on file-popularity measures do not perform
well in such caching environments, since they are insensitive to inter-file dependencies and may hold irrelevant
combinations of files in the cache. We present and analyze a new caching algorithm that maximizes job throughput and
minimizes data-replacement costs for such data-grid hosts. We tested the new algorithm using a disk-cache simulation model
under a wide range of conditions, such as file-request distributions, relative cache size, and file-size distribution. In all
these cases, the results show significant improvement over traditional caching algorithms.
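The all-or-nothing service rule described above can be sketched as follows. This is a minimal illustration of the bundle-level admission constraint, not the authors' algorithm; the class and method names are hypothetical.

```python
# Sketch of the file-bundle admission rule: a job is serviceable only if
# ALL files in its bundle are cached, so admission is all-or-nothing.
# Names (BundleCache, admit, can_service) are illustrative, not from the paper.

class BundleCache:
    """Disk cache that admits files in whole bundles."""

    def __init__(self, capacity):
        self.capacity = capacity  # total cache size, in file-size units
        self.cached = {}          # file name -> file size

    def used(self):
        return sum(self.cached.values())

    def can_service(self, bundle):
        # A job can run only if every file of its bundle is present.
        return all(f in self.cached for f in bundle)

    def admit(self, bundle):
        """Stage an entire bundle atomically; reject it if it does not fit."""
        missing = {f: s for f, s in bundle.items() if f not in self.cached}
        if self.used() + sum(missing.values()) > self.capacity:
            return False  # the paper's policy would evict whole bundles here
        self.cached.update(missing)
        return True

cache = BundleCache(capacity=10)
job = {"a.dat": 4, "b.dat": 3}                 # bundle: both files needed together
cache.admit(job)
print(cache.can_service(job))                  # whole bundle staged -> serviceable
print(cache.can_service({"a.dat": 4, "c.dat": 5}))  # c.dat absent -> not serviceable
```

A per-file popularity policy could cache `a.dat` from one bundle and `c.dat` from another, filling the cache with a combination that serves no job at all; this is the insensitivity to inter-file dependencies that the paper addresses.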