05-11-2012, 12:28 PM
Scalable and Parallel Boosting with MapReduce
Abstract
MapReduce is a framework for processing large data sets with distributed computing on clusters of computers. MapReduce libraries have been written in many programming languages. Here we propose two novel parallel boosting algorithms built on MapReduce, AdaBoost.PL (Parallel AdaBoost) and LogitBoost.PL (Parallel LogitBoost), which allow multiple computing nodes to participate simultaneously in constructing a boosted classifier.
Given the recent overwhelming growth of large-scale data, faster processing algorithms with good performance have become a pressing need. Our algorithms can induce boosted models whose generalization performance is close to that of the respective baseline classifier. By exploiting their parallel architecture, both algorithms gain significant speedup. Moreover, the algorithms do not require individual computing nodes to communicate with each other, to share their data, or to share the knowledge derived from their data; hence they also preserve the privacy of the computation. We used the MapReduce framework to implement our algorithms and experimented on a variety of synthetic and real-world data sets to demonstrate their performance in terms of classification accuracy, speedup and scaleup.
Proposed System
We propose two novel parallel boosting algorithms, AdaBoost.PL (Parallel AdaBoost) and LogitBoost.PL (Parallel LogitBoost). These algorithms achieve parallelization in both time and space with a minimal amount of communication between the computing nodes.
AdaBoost, short for Adaptive Boosting, is a machine learning algorithm (also called a meta-algorithm) that is used in conjunction with many other learning algorithms to improve their performance.
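As a rough illustration of how AdaBoost combines weak learners, here is a minimal sketch of the sequential (non-parallel) algorithm. It is not the AdaBoost.PL implementation from this project. The one-dimensional threshold stumps, the toy data, and all function names (train_stump, stump_predict, adaboost, predict) are assumptions made for the example; labels are assumed to be -1 or +1.

```python
import math

def stump_predict(x, thr, polarity):
    # A decision stump: predicts `polarity` at or above the threshold, else the opposite.
    return polarity if x >= thr else -polarity

def train_stump(xs, ys, weights):
    # Exhaustively pick the threshold/polarity with the lowest weighted error.
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []                    # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this weak learner
        ensemble.append((alpha, thr, pol))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Final classifier: sign of the alpha-weighted vote of all stumps.
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set no single stump is correct on every point, but the weighted vote of three stumps is, which is the sense in which boosting improves the base learner.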
LogitBoost is an influential boosting algorithm based on additive logistic regression. In each iteration it computes a working response and a weight for each data point.
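The working response and weights mentioned above can be sketched as follows. This assumes the standard two-class LogitBoost formulation with labels in {0, 1} and p = 1 / (1 + exp(-2F)), which the post itself does not spell out; the function name and the small floor on the weight are choices made for this example only.

```python
import math

def working_response_and_weights(labels, scores):
    """Return a list of (z, w) pairs for one LogitBoost iteration.

    labels: class labels in {0, 1}
    scores: current additive model output F(x) for each data point
    """
    stats = []
    for y, f in zip(labels, scores):
        p = 1.0 / (1.0 + math.exp(-2.0 * f))   # current probability estimate
        w = max(p * (1.0 - p), 1e-12)          # weight, floored for numerical stability
        z = (y - p) / w                        # working response
        stats.append((z, w))
    return stats

# Before any boosting round the model output is 0 for every point.
labels = [1, 0, 1]
scores = [0.0, 0.0, 0.0]
stats = working_response_and_weights(labels, scores)
```

A weak regressor would then be fitted to the responses z by weighted least squares using the weights w, and added to F.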
The Map function is applied in parallel to every key/value pair in the input dataset and produces a list of pairs for each call.
The Reduce function is then applied in parallel to each group of pairs that share the same key, and in turn produces a collection of values in the same domain.
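The Map and Reduce steps described above can be sketched in a few lines of plain Python. This is a single-process simulation for illustration only, not a real MapReduce library; run_mapreduce, map_fn and reduce_fn are hypothetical names, and word counting is used simply because it is the usual minimal example.

```python
from collections import defaultdict

def map_fn(key, value):
    # Called once per input pair; emits a list of (word, 1) pairs.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Called once per group; returns a collection of values for that key.
    return [sum(values)]

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each call is independent, so a cluster could run them in parallel.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # shuffle: group values by key
    # Reduce phase: each group is independent as well.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

records = [(0, "map reduce map"), (1, "reduce boost")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

Because neither function depends on state outside its own arguments, a framework is free to spread the calls across many nodes.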
Finally, we improve the scalability of the MapReduce method and explore the possibility of using multi-resolution boosting models to reduce the number of iterations.
Existing System
The existing MapReduce method has several limitations. The execution time of previous algorithms used with MapReduce is too high. When data is structured into traditional database tables and columns, the SQL for processing that data is less clear.
The existing boosting method is inherently sequential, so it is not easy to achieve scalability for boosting or to parallelize it.