Abstract
Data analysis and predictive analytics today are driven by large-scale distributed deployments of complex pipelines, guiding data cleaning, model training and evaluation. A wide range of systems and tools provide the basic abstractions for building such complex pipelines for offline data processing; however, there is an increasing demand for support for incremental models over unbounded streaming data. In this work, we focus on the problem of modelling such a pipeline framework and providing algorithms that build on top of basic abstractions fundamental to stream processing. We design a streaming machine learning pipeline as a series of stages, such as model building, concept drift detection and continuous evaluation. We build our prototype on Apache Flink, a distributed data processing system with streaming capabilities, along with a state-of-the-art implementation of a variation of the Vertical Hoeffding Tree (VHT), a distributed decision tree classification algorithm, as a proof of concept. Furthermore, we compare our version of the VHT with the current state-of-the-art implementations on distributed data processing systems in terms of performance and accuracy. Our experimental results on real-world data sets show significant performance benefits of our pipeline while maintaining a low classification error. We believe that this pipeline framework can offer a good baseline for a full-fledged implementation of various streaming algorithms that can work in parallel.
Introduction
Over the last years, the size of available data sets has increased enormously, due to the continuous growth of the Internet and social media (Facebook, Twitter, etc.), as well as the high adoption rate of devices capable of creating information (e.g. smartphones). Facebook clusters, for example, handle 500 TB of new data every day, and 500 million tweets are sent on Twitter every day. In the United States alone, the digital universe is expected to grow by 25% each year until 2020, reaching by then the enormous amount of 6.6 zettabytes, or around 7 billion terabytes. The volume of these structured or unstructured data sets, referred to by the "buzz" term Big Data, is so high that existing data mining methods are incapable of dealing with them. Data mining is the process of extracting useful information from complex big data sets with the use of machine learning techniques, such as classification, pattern recognition and clustering, in order to identify valuable and understandable patterns/models in the data sets. By learning a model that captures the patterns of a data set, predictive data mining techniques can make predictions for previously unseen data.
Problem
Despite the great benefits of incremental machine learning, a number of big challenges arise as well, the main one being the redesign of the existing learning algorithms for the streaming model. Adding to those challenges, a distributed execution of incremental machine learning algorithms raises the problem of distributing the algorithm in an efficient way, so that the learning process scales out without sacrificing the accuracy of the learned model. Apache Flink is an open source platform for large-scale data processing. This master thesis builds its contributions on Apache Flink, by porting the abstractions used in batch ML algorithms that can also be used in the Apache Flink Streaming Application Programming Interface (API), and by extending the existing pipeline with additional abstractions that are needed for incremental machine learning.
Purpose
This master thesis presents the abstractions that were defined for modeling an ML pipeline with the Apache Flink Streaming API. The pipeline targets facilitating the implementation of ML algorithms on the Apache Flink Streaming API.
Contributions
Within the scope of this master thesis, the following contributions were achieved:
• A literature study of existing systems and programming models which allow the implementation of incremental ML algorithms. The most important systems are presented in Section 2.5.
• Part of the implementation of an adapter for integrating the Apache SAMOA streaming machine learning framework with Apache Flink Streaming.
• The design of the basic abstractions needed to model incremental ML pipelines in a descriptive manner.
• The implementation of our ML pipeline prototype, along with an implementation of a scalable decision tree algorithm.
• The evaluation of the performance of the implemented prototypes and algorithms against the scalable decision tree algorithm implemented on Apache SAMOA. The evaluation showed that our solution scaled well and maintained an acceptable accuracy over time.
• The implementation will be contributed to the open source project Apache Flink [8], and can thus be freely accessed.
Methodology
The quantitative research method was used for this master thesis; experiments and tests of the incremental machine learning algorithms were conducted over different platforms for the evaluation of the algorithms. The philosophical assumption of this degree project was positivism, as the evaluation of the designed framework and the implemented algorithm was based on experimental results and was unbiased by the opinions of the researcher. Moreover, the experimental research method was used in order to investigate how changes between the different versions of the algorithms affect the results. The factors affecting the experimental results are:
• Implementation of the algorithms
• Datasets used: The size as well as the nature of the datasets that are used in incremental classification algorithms are highly important.
• Parallelism: Number of available slots for parallel processing of the data
To effectively compare the different versions of the algorithm, a set of metrics was identified and used. Specifically, the variables considered for the evaluation of the experiments were:
• Prequential Error: The prequential classification error of the algorithm
• Execution time of the algorithms in different platforms and with respect to different data sets.
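The prequential error above follows the test-then-train scheme: every arriving instance is first used to test the current model and only then to train it. A minimal sketch of this evaluation loop is given below; `MajorityClassLearner` is a hypothetical toy stand-in for a real incremental model such as a Hoeffding Tree, introduced here purely for illustration.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen
    so far (None before any training). A stand-in for a real model."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def train(self, x, y):
        self.counts[y] += 1


def prequential_error(stream, model):
    """Test-then-train evaluation: each (x, y) instance is first used
    to test the model, then to train it; returns the running error."""
    errors, history = 0, []
    for i, (x, y) in enumerate(stream, start=1):
        if model.predict(x) != y:
            errors += 1
        model.train(x, y)
        history.append(errors / i)  # cumulative prequential error
    return history
```

Because every instance is tested before it is learned, the prequential error needs no held-out test set, which is what makes it suitable for unbounded streams.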
The validity of the measurements from our algorithms has been ensured by cross-checking the implementation of the algorithms. Moreover, the results were compared to the results of the same algorithms executed on Apache SAMOA, a different platform. In order to avoid possible oscillation due to the random seed used by the data generators, the experiments were run multiple times for each data set, which revealed a low variance between different runs, thus confirming the reliability of the experiments. The programming languages used within the scope of this degree project, as well as the testbed, are described in this report. Consequently, the replicability of our results can be ensured. Additionally, all datasets used in the experiments are publicly available on the web and are also described in Appendix A, as is the paper describing the implemented algorithm. Taking all this into consideration, we can safely assume that our results could be replicated by other researchers.
Delimitations
One of the issues that could affect the study of an incremental machine learning algorithm is the arrival rate of the observed data points. As stated in, when the arrival rate is higher than the maximum processing rate of an application, the quantity of unused data grows without bounds as time progresses. As a result, more data points may be needed for higher accuracy. Sampling the stream of incoming events could be one solution to a high arrival rate, as a sampling technique could provide data points to the algorithm at a normal rate. The effect of the arrival rate of the input stream is not considered in the evaluation of the implemented algorithms.
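One simple sampling technique of the kind alluded to above is Bernoulli sampling: each arriving element is kept independently with a fixed probability, which lowers the effective arrival rate seen by the learner. This is a hedged sketch, not part of the thesis implementation; the function name and seed are illustrative.

```python
import random


def downsample(stream, keep_prob, rng=None):
    """Bernoulli sampling: keep each incoming element independently
    with probability keep_prob, reducing the effective arrival rate
    by roughly that factor. Memory use is O(1)."""
    rng = rng or random.Random(42)  # fixed seed only for repeatability
    for item in stream:
        if rng.random() < keep_prob:
            yield item
```

Note that uniform downsampling trades accuracy for throughput: the learner sees fewer instances, which is exactly why the delimitation above excludes arrival-rate effects from the evaluation.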
Outline
Section 2 presents all the necessary theory that will introduce the reader to the basic concepts needed to understand the systems used in this thesis. Moreover, related work in the area is presented. Section 3 presents our model for ML pipelines on the Apache Flink Streaming API. Moreover, the prototyping of the pipeline model is presented in detail. Section 4 presents all the algorithms which were implemented within the scope of this master thesis in order to evaluate the modeled pipeline. In Section 4.1, the details of the VHT, the ML algorithm implemented with the pipeline modeled by this master thesis, are presented. Moreover, one more variation of the Hoeffding Tree (HT) algorithm is presented and the main differences from the first implementation are highlighted. Section 6 presents the evaluation pipeline of the two implemented variations of the HT algorithm, as well as the experimental setup for the evaluation. In addition, the experimental results of the evaluation are presented in the same section. Lastly, in Section 7, the conclusion is given and potential future work is discussed.
Theoretical Background
Parallel Data Processing
Nowadays, the size of available data sets has increased enormously; hence there exists an essential need for processing massive volumes of data in an efficient manner. Parallel data processing enables large data sets to be processed in parallel, thus increasing the throughput of the algorithms.
In the literature, three main categories of parallelization are encountered [10, 11, 12]:
• Data Parallelism: The same operation is performed in parallel on subsets of the data set. Data-parallel algorithms are parallelized with respect to the data.
• Task Parallelism: Different operations are performed in parallel on the same data set
• Hybrid Parallelism: A pipeline of tasks executed concurrently
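The data-parallel case above can be sketched in a few lines: the same operation is applied independently to each partition of a pre-split data set. The partitioning and the `normalize` operation are illustrative assumptions, not part of any framework discussed in this thesis.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(chunk):
    """The same operation, applied independently to one partition:
    scale a chunk so its maximum becomes 1.0."""
    m = max(chunk)
    return [x / m for x in chunk]


# A data set already split into partitions (one per worker)
data = [[1, 2, 4], [10, 20, 40], [3, 6, 12]]

# Data parallelism: identical work on each partition, in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(normalize, data))
```

Task parallelism would instead run *different* functions over the same data, and hybrid parallelism would chain such stages into a concurrently executing pipeline.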
Data Parallel Processing Models
The Message Passing Interface (MPI) is one of the first widely used standards for parallel data processing. MPI is a language-independent message-passing interface standard which can be used for developing parallel applications that distribute the workload across numerous machines of a cluster. An MPI program is loaded and executed on every processor of the cluster, facilitating the communication of the collaborating processes through functions like "MPI_Bcast", "MPI_Reduce", etc. One of the drawbacks of the MPI standard is that it does not offer any fault tolerance mechanisms. MapReduce is a programming model, first proposed in 2004 by Google, which among other things offers a fault tolerance mechanism, thus allowing people with little experience in parallel computing to implement efficient applications. In MapReduce the user expresses a problem with the use of two functions:
• Map: Takes the input in the form of key/value pairs (K, V), processes them, and produces intermediate results in the form of (K, V) pairs.
• Reduce: Receives all intermediate results that share the same key and merges them together in order to produce a final result per key.
There exist numerous frameworks which implement the MapReduce programming model, and all of them facilitate the use of distributed file systems like the Hadoop Distributed File System (HDFS). Besides being a broadly used programming model for big data processing, MapReduce forms the basis of many big data batch and stream processing engines, such as Apache Hadoop, Apache Flink, Apache Spark and Apache Storm.
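The Map/Reduce contract described above can be illustrated with the canonical word-count example, reduced to a single machine. The function names are illustrative, not part of any framework's API; the grouping step stands in for the shuffle phase a real engine would perform across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    # Map: emit an intermediate (word, 1) pair for every word
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    # Reduce: merge all values sharing one key into a final result
    return key, sum(values)


def mapreduce(records):
    # Apply Map to every input record
    intermediate = chain.from_iterable(map_fn(r) for r in records)
    # "Shuffle": group intermediate (K, V) pairs by key, as the
    # framework would do between the Map and Reduce phases
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Apply Reduce once per distinct key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

In a real engine the map and reduce invocations run on different machines and the shuffle moves data over the network; the user-visible contract, however, is just these two functions.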
Data Streams
The space and time assumptions of traditional data mining techniques make them inappropriate for mining big data, which cannot fit in main memory, under low processing latency and unbounded processing restrictions. Incremental machine learning algorithms are based on data streams, which have very different characteristics from static data sets.
Machine Learning Algorithms
Over the last years, ML algorithms have been used for a wide range of services and systems, such as recommendation systems, spam detection and more. Machine learning algorithms use a data set as input and try to learn a model in order to make data-driven predictions for other, unseen data. Thus, with the use of ML algorithms, computers can learn concepts/models without being explicitly programmed. There exist mainly two ways of categorizing an ML algorithm, depending either on the format of the input data set or on the format of the output of the learned model. With respect to the type of information that a machine learning algorithm needs in order to learn a model, ML algorithms can be categorized as follows:
• Supervised: For each observation of the input data set x1, x2, ..., xn, an associated response value y1, y2, ..., yn is also provided to the algorithm. In that way, the algorithm learns a model which tries to relate the response to the input observations. Examples of supervised ML algorithms are classification and regression algorithms.
• Unsupervised: For each observation of the input data set x1, x2, ..., xn, no associated response value is provided to the algorithm. In that way, the algorithm learns a model which tries to find relations between observations. Clustering algorithms are an example of unsupervised ML algorithms.
Depending on the prediction output of an ML algorithm, there exist a number of categories, the main ones being the following:
• Classification: Classification algorithms are supervised ML algorithms that learn a model which predicts classes (labels) for unseen data points. Examples of classification algorithms are the Decision Tree (DT) and Logistic Regression.
• Regression: Regression algorithms are supervised ML algorithms that learn a model which predicts a continuous value for unseen data points. Linear Regression and Regression DTs are examples of regression algorithms.
• Clustering: Clustering algorithms are unsupervised ML algorithms which identify clusters in the input data points and assign each data point to a cluster. k-Means and k-Medoids are examples of clustering algorithms.
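As a concrete illustration of the clustering category, the following is a minimal k-means sketch on one-dimensional data. It is a simplification for exposition only: the initialization (first k sorted points) is naive, and real implementations use better seeding and a convergence check.

```python
def kmeans_1d(points, k=2, iters=10):
    """Minimal k-means on 1-D data: alternately assign each point to
    its nearest centroid, then recompute each centroid as the mean of
    its cluster. Runs a fixed number of iterations for simplicity."""
    centroids = sorted(points)[:k]  # naive init: first k sorted points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by absolute distance
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # update step: centroid becomes the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

No labels are ever supplied: the structure (two groups of nearby points) is discovered from the observations alone, which is the defining trait of the unsupervised category above.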
Streaming Machine Learning
Data Parallelization
Nowadays, most applications involve large-scale data sets, resulting in the need for parallel algorithms and techniques in order to handle them efficiently. As we already mentioned in Section 2.1, there are different methods of parallelizing a problem, with data parallelism being the most commonly used among current parallel models.
There exist two types of data parallelism, Horizontal Data Parallelism and Vertical Data Parallelism:
• Horizontal Data Parallelism. In Horizontal Data Parallelism, the system splits the data set horizontally, as depicted in Figure 3a, based on the data set's size; in the special case of a distributed machine learning algorithm, each parallel processor will process a subset of the instances of the data.
• Vertical Data Parallelism. In Vertical Data Parallelism, the system splits the data set vertically, as depicted in Figure 3b, based on some characteristics of the data. In a machine learning algorithm, the data set is split with respect to the attributes of the instances, so each parallel processor will process a number of attributes of the input data points.
Depending on the problem, different parallelization techniques can be used to parallelize an algorithm.
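The two partitioning schemes can be sketched as follows, with the data set represented as a list of instances (rows) and each instance as a list of attribute values. The round-robin assignment of rows and columns to workers is an illustrative choice; real systems may use range or hash partitioning instead.

```python
def horizontal_split(instances, n_workers):
    """Horizontal parallelism: each worker receives a subset of the
    rows (whole instances) of the data set."""
    return [instances[w::n_workers] for w in range(n_workers)]


def vertical_split(instances, n_workers):
    """Vertical parallelism: each worker receives a subset of the
    columns (attributes) of every instance."""
    n_attrs = len(instances[0])
    return [[[row[a] for a in range(w, n_attrs, n_workers)]
             for row in instances]
            for w in range(n_workers)]
```

In the horizontal case every worker sees complete instances but only some of them; in the vertical case (the scheme the VHT builds on) every worker sees all instances but only some of their attributes.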