ABSTRACT :
Processing XML queries over big XML data using MapReduce has been studied in recent years. However, existing works focus on partitioning XML documents and distributing XML fragments to different compute nodes, which may introduce high overhead when XML fragments are transferred from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which uses only the related inverted lists to process queries in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique that reduces the computing space in our framework and improves the performance of query processing. Finally, we conduct experiments to validate our algorithms.

The key challenges of this project are to improve the efficiency of resource utilization and to support current market decisions by analyzing data, which helps industries increase their revenue. The analysis is done using big data technologies such as Hadoop (HDFS), MapReduce, and Hive. The project includes several modules: collecting XML data from sources, converting it to structured data, and loading it into external tables in Hive. The business intelligence tool QlikView then uses the data for analytics and visualization.
1. Introduction :
1.1 Context
The increasing amount of data generated by different applications, sensors, and devices, and the increasing attention to the value of that data, mark the beginning of the era of big data. From big IT companies to SMEs, from computer science researchers to social scientists, nowadays everyone is talking about big data and its impact on business, technology, and science research.
There is no doubt that effectively managing big data is the first step for any further analysis and utilization of big data. How to manage such data, considering its large size and diverse formats, is a new challenge to the database community. On one hand, researchers are keen to find a more elastic and more reliable solution than the traditional distributed database system [15] to store big relational (structured) data and offer SQL-like query ability; on the other hand, for emerging datasets in heterogeneous models, new platforms and databases for semi-structured data, text data, graph data, etc. were designed, aiming to provide more efficient access to such data at large scale.
Gradually, the research attempts converged on a distributed data processing framework, MapReduce [10]. The MapReduce programming model simplifies parallel data processing by offering two interfaces: map and reduce. With these two interfaces, a number of data warehouses and database systems [3][13][7] have been built on top of the MapReduce framework.
1.2 Motivation
Recently, researchers started looking into the possibility of managing big XML data in a more elastic distributed environment, such as Hadoop [1], using MapReduce. Inspired by XML-enabled relational database systems, big XML data can be stored and processed with relational storage and operators. However, shredding XML data of big size into relational tables is extremely expensive. Furthermore, with relational storage, each XML query must be processed by several θ-joins among tables. The cost of joins is still the bottleneck for Hadoop-based database systems.
Most recent research attempts leverage the idea of XML partitioning and query decomposition adopted from distributed XML databases [16][12]. Similar to the join operation in relational databases, an XML query may require linking two or more arbitrary elements across the whole XML document. Thus, to process XML queries in a distributed system, transferring fragmented data from one node to another is unavoidable. In a static environment like a distributed XML database system, proper indexing techniques can help to optimally distribute data fragments and the workload. However, in an elastic distributed environment such as Hadoop, each copy of an XML fragment will probably be transferred to undeterminable different nodes for processing. In other words, it is difficult to optimize data distribution in a MapReduce framework, so the existing approaches may suffer from high I/O and network transmission cost.
1.3 Contribution
In this paper, we study the parallelization of structural join based XML query processing algorithms using MapReduce. We do not distribute the whole big XML document to a computing cluster; instead, we distribute the inverted lists for each type of document node to be queried. As mentioned, since the size of the inverted lists used to process a query is much smaller than the size of the whole XML document, our approach potentially reduces the cost of cross-node data transfer. Our contributions can be summarized as follows:
The problem of parallelizing structural joins for XML query processing using MapReduce is studied. To the best of our knowledge, this is the first work that discusses distributing inverted list labels rather than raw XML documents for parallel structural joins.
A polynomial-based workload distribution algorithm is designed in the Map phase, which can balance the workload of each Reduce task.
An optimization technique is proposed to avoid emitting nodes to reducers where they would not contribute to structural join results.
We conduct experiments to validate our algorithms.
1.4 Organization
The rest of the paper is organized as follows. In Section 2, we introduce the background knowledge and revisit related work. In Section 3, we present the map and reduce functions in our framework to process XML queries in parallel. In Section 4, an optimization algorithm is proposed so that unnecessary emitting can be pruned in mappers. Section 5 shows the experimental study to validate the proposed algorithms. Finally, we conclude this paper in Section 6.
2. Background and Related Work
2.1 MapReduce
MapReduce is a computational model that originated in functional programming languages and was introduced to computer clusters for parallel data processing [10]. It simplifies the programmer's implementation of parallel data processing by offering two user-defined functions, map and reduce.
The map function takes a set of key/value pairs as input. After a MapReduce job is submitted to the system, the map tasks (normally referred to as mappers) are started on certain compute nodes. Each map task executes the user-implemented map function over every key/value input pair. The output of the map function is another set of key/value pairs, which are temporarily stored in the local file systems and sorted by key. When all map tasks have completed, the system notifies the reduce tasks (referred to as reducers) to start executing.
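As a concrete illustration, the map/reduce contract above can be simulated in plain Java without a Hadoop installation. The word-count logic, the class name, and the in-memory "shuffle" below are illustrative assumptions, not part of this project:

```java
import java.util.*;

// A minimal in-memory illustration of the map/reduce contract:
// map emits key/value pairs, the framework groups values by key
// (the shuffle), and reduce aggregates each group.
public class MapReduceSketch {

    // map: one input record -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> total count for that word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": run map over all records, shuffle (group by key), reduce.
    public static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real job, Hadoop performs the shuffle across nodes and persists intermediate pairs; here a sorted map stands in for it.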
2.2 XML Query Processing
There are different approaches to process XML queries. Initially, XML documents were shredded into relational tables and queries were translated into SQL statements to query the database (e.g., [17]). This is the so-called relational approach.
However, in the relational approach, an XML query requires several table joins, and most of them are expensive θ-joins. Later, researchers proposed to process XML queries in their native form.
To process a query, only those inverted lists related to the queried node types are loaded and scanned; all other document nodes are ignored. Finally, structural joins are performed over the inverted lists to find the query answers.
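A minimal sketch of such a structural join, assuming the common region encoding in which each node carries a (start, end, level) label and node a is an ancestor of node d iff a.start < d.start and d.end < a.end. The labels, class names, and the simple sorted-scan join are illustrative; production algorithms such as stack-based structural joins do this in a single merge pass:

```java
import java.util.*;

// Sketch of an ancestor-descendant structural join over two inverted
// lists of region-encoded labels, both sorted by start position.
public class StructuralJoinSketch {

    // Region-encoded label: a node spans [start, end] at a given depth.
    static final class Label {
        final int start, end, level;
        Label(int start, int end, int level) {
            this.start = start; this.end = end; this.level = level;
        }
        @Override public String toString() { return "(" + start + "," + end + ")"; }
    }

    // a is an ancestor of d iff a's region strictly contains d's region.
    static boolean contains(Label a, Label d) {
        return a.start < d.start && d.end < a.end;
    }

    // For each ancestor candidate, scan descendants until they start
    // past the ancestor's region; the sort order allows early exit.
    static List<String> join(List<Label> ancestors, List<Label> descendants) {
        List<String> results = new ArrayList<>();
        for (Label a : ancestors) {
            for (Label d : descendants) {
                if (d.start > a.end) break; // sorted: no later d can be inside a
                if (contains(a, d)) results.add(a + "/" + d);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Label> as = Arrays.asList(new Label(1, 10, 1), new Label(11, 14, 1));
        List<Label> ds = Arrays.asList(new Label(2, 3, 2), new Label(5, 6, 2),
                                       new Label(12, 13, 2), new Label(15, 16, 2));
        System.out.println(join(as, ds));
    }
}
```

The point of the inverted-list design is visible here: the join touches only the labels of the two queried node types, never the rest of the document.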
2.3 XML Query Processing using MapReduce
To process XML queries using MapReduce, we need to decompose a big XML document, distribute the portions to different sites, and execute the query processing algorithm in a parallel manner at the different sites. Obviously, the relational approach is not suitable, because transforming a big XML document into relational tables can be extremely time consuming, and θ-joins among relational tables are expensive using MapReduce.
Recently, several works have been proposed to implement native XML query processing algorithms using MapReduce. In [9], the authors proposed a distributed algorithm for Boolean XPath query evaluation using MapReduce. By collecting the Boolean evaluation results from a computer cluster, they proposed a centralized algorithm to finally process a general XPath query. In fact, they did not use the distributed computing environment to generate the final result, and it is still unclear whether the centralized step would be the bottleneck of the algorithm when the data is huge. In [8], a Hadoop-based system was designed to process XML queries using the structural join approach. There are two phases of MapReduce jobs in the system. In the first phase, an XML document is shredded into blocks and scanned against input queries.
In fact, most existing works are based on XML document shredding, which is inspired by document partitioning in distributed XML databases [16][12]. However, in the MapReduce framework, the distributed computing environment is assumed to be dynamic and elastic, which makes XML fragment distribution difficult to optimize. Hence, in such a setting, fragmented data (from either the original document or intermediate results) may need to be transferred to undeterminable different sites, which leads to high I/O and network transmission cost. Motivated by the inverted-list based structural join algorithms, in this paper we propose an approach that distributes inverted lists rather than a raw XML document, so that the size of the fragmented data for I/O and network transmission can be greatly reduced.
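The distribution idea can be sketched in plain Java: the map phase emits region-encoded labels from the inverted lists, keyed by a partition id, so that each reducer can run a structural join on its slice. The simple range partitioning below is only a stand-in, since the paper's polynomial-based scheme is not fully specified in this excerpt, and all names are illustrative:

```java
import java.util.*;

// Sketch of distributing inverted-list labels (here reduced to start
// positions) instead of XML fragments.
public class LabelDistributionSketch {

    // Assign a label's start position to one of numPartitions contiguous
    // ranges over [0, maxStart]. NOTE: a real scheme must also replicate
    // ancestor labels to every partition their region overlaps, so that
    // cross-partition ancestor-descendant pairs are not lost; that step
    // is omitted here for brevity.
    static int partitionOf(int start, int maxStart, int numPartitions) {
        int width = (maxStart + numPartitions) / numPartitions;
        return Math.min(start / width, numPartitions - 1);
    }

    // "Map phase": group start positions by partition key, standing in
    // for emitting (partition, label) pairs to reducers.
    static Map<Integer, List<Integer>> distribute(List<Integer> starts,
                                                  int maxStart, int numPartitions) {
        Map<Integer, List<Integer>> byPartition = new TreeMap<>();
        for (int s : starts) {
            int p = partitionOf(s, maxStart, numPartitions);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(s);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(distribute(Arrays.asList(0, 5, 9), 10, 3));
    }
}
```

The payload shipped per node is a list of small integer labels rather than XML text, which is the source of the claimed I/O and network savings.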
3. Roles and Responsibilities:
Install and configure the Apache Hadoop ecosystem, which includes HDFS storage, MapReduce, and Hive.
Develop MapReduce applications for parsing data from XML format into structured format.
Load the structured data into the Hive warehouse for analysis and transformation as per the requirements.
Develop cron jobs to automate the job run process.
Work with a business intelligence tool, QlikView, where the incoming structured data is sliced and diced for analysis and visualizations such as KPIs, charts, and calendars. Work on one end-to-end solution in the Hadoop environment.
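The XML-to-structured-format parsing step can be sketched with the JDK's StAX parser. The element names (record, id, name), the tab delimiter, and the in-memory driver are illustrative assumptions, since the real MapReduce job depends on the source feeds' schema:

```java
import javax.xml.stream.*;
import java.io.StringReader;
import java.util.*;

// Flatten simple <record><field>value</field>...</record> XML into
// tab-delimited rows, as a plain-Java stand-in for the parsing job.
public class XmlToRows {

    public static List<String> parse(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> rows = new ArrayList<>();
            List<String> fields = null;          // non-null while inside a <record>
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (r.getLocalName().equals("record")) fields = new ArrayList<>();
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (r.getLocalName().equals("record")) {
                            rows.add(String.join("\t", fields)); // one delimited row per record
                            fields = null;
                        } else if (fields != null) {
                            fields.add(text.toString().trim());
                        }
                        break;
                }
            }
            return rows;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("bad XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<data><record><id>1</id><name>x</name></record>"
                               + "<record><id>2</id><name>y</name></record></data>"));
    }
}
```

Each returned row is ready for a tab-delimited Hive external table (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t').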
3.1 Technologies: Hadoop-1.2.1, HDFS, MapReduce, Hive-0.11.0, Core Java, Eclipse, QlikView, JDBC connector, and Ubuntu.
4. HIVE :
A system for managing and querying structured data built on top of Hadoop
Map-Reduce for execution
HDFS for storage
Metadata in an RDBMS
Key Building Principles:
SQL as a familiar data warehousing tool
Extensibility – Types, Functions, Formats, Scripts
Scalability and Performance
Interoperability
4.1 HIVE PROCESS :
Pros :
Superior in availability/scalability/manageability
Efficiency is not that great, but more hardware can compensate
Partial Availability/resilience/scale more important than ACID
Cons: Programmability and Metadata
Map-reduce is hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
5.1 Hadoop Clusters :
Used to log data from web servers
Clusters collocated with the web servers
Network is the biggest bottleneck
Typical cluster has about 50 nodes.
Stats:
~ 25TB/day of raw data logged
99% of the time data is available within 20 seconds
5.2 Hadoop/Hive cluster
8400 cores
Raw Storage capacity ~ 12.5PB
8 cores + 12 TB per node
32 GB RAM per node
Two level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch
2 clusters
One for ad hoc users
One for strict SLA jobs
Statistics per day:
12 TB of compressed new data added per day
135TB of compressed data scanned per day
7500+ Hive jobs per day
80K compute hours per day
Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
Most jobs are Hive jobs
5.3 Types of Applications:
Reporting
Eg: Daily/Weekly aggregations of impression/click counts
Measures of user engagement
Microstrategy reports
Ad hoc Analysis
Eg: how many group admins broken down by state/country
Machine Learning (Assembling training data)
Ad Optimization
Eg: User Engagement as a function of user attributes
Many others
7. HIVE QUERY LANGUAGE
SQL
Sub-queries in from clause
Equi-joins (including Outer joins)
Multi-table Insert
Multi-group-by
Embedding Custom Map/Reduce in SQL
Sampling
Primitive Types
integer types, float, string, boolean
Nestable Collections
array<any-type> and map<primitive-type, any-type>
User-defined types
Structures with attributes which can be of any-type
OPTIMIZATIONS:
Joins try to reduce the number of map/reduce jobs needed.
Memory efficient joins by streaming largest tables.
Map Joins
User specified small tables stored in hash tables on the mapper
No reducer needed
Map side partial aggregations
Hash-based aggregates
Serialized key/values in hash tables
~90% speed improvement on a query like
SELECT count(1) FROM t;
Load balancing for data skew
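The map-side partial aggregation listed above can be sketched in plain Java. The class and key names are illustrative; in Hive this corresponds to the hash aggregates a mapper keeps before the shuffle:

```java
import java.util.*;

// Sketch of map-side partial (hash-based) aggregation: each mapper
// pre-aggregates counts in a local hash table and emits one pair per
// distinct key instead of one pair per input row, shrinking the data
// that must be shuffled to reducers.
public class PartialAggSketch {

    // Without partial aggregation a mapper would emit one (key, 1)
    // per input row; here it emits one (key, partialCount) per key.
    static Map<String, Long> mapSideAggregate(List<String> rows) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : rows) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // The reducer then just sums the partial counts for one key.
    static long reduceCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) total += p;
        return total;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "a", "b", "a", "b");
        Map<String, Long> m = mapSideAggregate(rows);
        System.out.println(m.size() + " pairs emitted instead of " + rows.size());
    }
}
```

This is also why skewed keys matter: a hot key still produces many partial pairs across mappers, which the load-balancing item above addresses.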
8. HIVE - OPEN AND EXTENSIBLE :
Different on-disk storage(file) formats
Text File, Sequence File, …
Different serialization formats and data types
LazySimpleSerDe, ThriftSerDe …
User-provided map/reduce scripts
In any language, use stdin/stdout to transfer data …
User-defined Functions
Substr, Trim, From_unixtime …
User-defined Aggregation Functions
Sum, Average …
User-defined Table Functions
Explode …
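The shape of a user-defined function can be sketched in plain Java. A real Hive UDF extends org.apache.hadoop.hive.ql.exec.UDF and exposes an evaluate() method; the Hive dependency is omitted here so the sketch stays self-contained, and the logic only mirrors the common case of the trim and substr built-ins:

```java
// Dependency-free sketch of simple Hive-style UDF logic.
public class StringUdfSketch {

    // Mirrors the common case of Hive's trim(string).
    public String evaluateTrim(String s) {
        return s == null ? null : s.trim();
    }

    // Mirrors the common case of Hive's substr(string, pos, len),
    // where pos is 1-based (negative positions are not handled here).
    public String evaluateSubstr(String s, int pos, int len) {
        if (s == null || pos < 1 || pos > s.length() || len <= 0) return null;
        int from = pos - 1;
        int to = Math.min(from + len, s.length());
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        StringUdfSketch u = new StringUdfSketch();
        System.out.println(u.evaluateTrim("  hadoop ") + " " + u.evaluateSubstr("hadoop", 2, 3));
    }
}
```

Packaged as a real UDF (with the Hive jar on the classpath), such a class would be registered with CREATE TEMPORARY FUNCTION and called directly from HiveQL.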