17-10-2016, 02:35 PM
1459434873-SeminaronBigDataBayesianNetworkLearningApproach.pptx
Introduction
A Bayesian Network (BN) is a probabilistic graphical model that provides theoretically sound mechanisms for processing uncertain information and representing relations among variables.
BNs have been applied in a wide range of domains such as Health Care, Education, Finance, Environment, Bioinformatics, Telecommunication, and Information Technology.
With the abundant data resources available today, learning BNs from Big Data can uncover valuable business insights and bring potential revenue to the many domains that deal with Big Data.
We introduce a data partition approach that makes Bayesian network learning scalable; the same approach can also be applied to many other machine learning techniques to make them scalable and Big Data ready.
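As a rough sketch of the data-partition idea (the chunk count and row layout below are illustrative assumptions, not the paper's actual implementation), a large data set is split into near-equal partitions so each can be handed to an independent local learner:

```python
# Sketch: split a large dataset into near-equal partitions so that each
# partition can be processed by an independent local learner in parallel.
# The number of partitions and the toy rows are illustrative assumptions.

def partition(rows, n_parts):
    """Split a list of data rows into n_parts near-equal partitions."""
    size, rem = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        parts.append(rows[start:end])
        start = end
    return parts

rows = [{"x": i % 2, "y": i % 3} for i in range(10)]
parts = partition(rows, 3)
print([len(p) for p in parts])  # → [4, 3, 3]
```

Every row lands in exactly one partition, which is what lets the local learning steps run independently before their results are merged.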
Existing System
Several Distributed Data Parallel (DDP) patterns, such as Map, Reduce, Match, CoGroup, and Cross, have been identified to easily build efficient and scalable data parallel analysis and analytics applications.
Each DDP pattern executes user-defined functions (UDF) in parallel over input data sets.
Since each DDP execution engine defines its own API for how UDFs should be implemented, an application implemented for one engine may be difficult to run on another engine.
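A minimal sketch of the Map and Reduce DDP patterns, using Python's standard thread pool rather than any specific DDP engine's API (which, as noted above, differs from engine to engine); the UDF bodies here are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# User-defined functions (UDFs): the Map UDF runs independently on each
# input record; the Reduce UDF folds the mapped results together.
def map_udf(record):
    return record * record          # example: square each value

def reduce_udf(acc, value):
    return acc + value              # example: sum the squares

data = list(range(5))
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_udf, data))   # Map pattern: parallel UDF calls
result = reduce(reduce_udf, mapped, 0)       # Reduce pattern: fold results
print(result)                                # → 30  (0+1+4+9+16)
```

The point of the pattern is the contract, not the executor: only the two UDFs are application-specific, so a DDP engine can parallelize and distribute the rest.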
Arising Problems
How can we effectively pre-process Big Data to evaluate its quality and reduce the size if necessary?
How can we design a workflow capable of taking gigabyte-scale data sets and learning BNs with decent accuracy?
How can we provide easy scalability support to BN learning algorithms?
These three questions are the main motivations for this research, which led to the creation of a novel workflow: the Scalable Bayesian Network Learning (SBNL) workflow.
Proposed System: Scalable Bayesian Network Learning (SBNL) workflow
The SBNL workflow has three research components that contribute to the current literature:
Intelligent Big Data pre-processing through a proposed data quality score, called Arc S, that measures and ensures data quality and data faithfulness.
A new weight-based ensemble algorithm (Max-Min Hill Climbing) that learns a BN structure from an ensemble of local results.
A user-friendly approach to building and running scalable Big Data machine learning applications via Kepler scientific workflows built on top of DDP patterns and engines.
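The weight-based ensemble step can be sketched as arc voting: each local learner returns a set of directed arcs plus a quality weight for its partition (standing in for a score such as Arc S), and arcs whose weighted vote clears a threshold survive into the global structure. The voting rule, threshold, and toy local results below are assumptions for illustration, not the algorithm's published details:

```python
# Hypothetical sketch of a weight-based ensemble over local BN structures.
# Each local result is (arc_set, weight); an arc is kept if its weighted
# vote reaches at least half of the total weight. All values illustrative.

def ensemble_arcs(local_results, threshold=0.5):
    """local_results: list of (set of (parent, child) arcs, weight) pairs."""
    total = sum(w for _, w in local_results)
    votes = {}
    for arcs, w in local_results:
        for arc in arcs:
            votes[arc] = votes.get(arc, 0.0) + w
    return {arc for arc, v in votes.items() if v / total >= threshold}

local_results = [
    ({("A", "B"), ("B", "C")}, 0.9),   # high-quality partition
    ({("A", "B")},             0.6),
    ({("B", "C"), ("C", "A")}, 0.5),   # lower-quality partition
]
print(sorted(ensemble_arcs(local_results)))  # → [('A', 'B'), ('B', 'C')]
```

Weighting by partition quality lets a noisy partition's spurious arc (here ("C", "A")) be outvoted by the better partitions, which is the intuition behind combining pre-processing scores with ensemble learning.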
Conclusion
By combining machine learning, distributed computing, and workflow techniques, the Scalable Bayesian Network Learning (SBNL) workflow has been designed.
An illustration has been provided as to how the Kepler scientific workflow system can easily provide scalability to Bayesian network learning.
SBNL obtains significant performance gains when applied to distributed environments while maintaining the same learning accuracy, making it an ideal workflow for Big Data Bayesian network learning.