ABSTRACT :
Processing XML queries over big XML data using MapReduce has been studied in recent years. However, existing works focus on partitioning XML documents and distributing XML fragments to different compute nodes, which may introduce high overhead when XML fragments are transferred from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which uses only the related inverted lists to process queries in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique that reduces the computing space in our framework and improves the performance of query processing. Finally, we conduct experiments to validate our algorithms.

The key challenges of this project are to improve the efficiency of resource utilization and to support current market decisions by analyzing data, which helps industries increase their revenue. The analysis is done using big data technologies such as Hadoop (HDFS), MapReduce, and Hive. The project includes several modules: collecting XML data from sources, converting it to structured data, and loading it into external tables in Hive. The business intelligence tool QlikView then uses the data for analytics and visualization.
1. Introduction :
1.1 Context
The increasing amount of data generated by different applications, sensors, and devices, and the increasing attention to the value of that data, mark the beginning of the era of big data. From big IT companies to SMEs, from computer science researchers to social scientists, nowadays everyone is talking about big data and its impact on business, technology, and science research.
There is no doubt that effectively managing big data is the first step for any further analysis and utilization of big data. How to manage such data, considering its large size and diverse formats, is a new challenge to the database community. On one hand, researchers are keen to find a more elastic and more reliable solution than the traditional distributed database system [15] to store big relational (structured) data and offer SQL-like query ability; on the other hand, for emerging datasets in heterogeneous models, new platforms and databases for semi-structured data, text data, graph data, etc. were designed, aiming to provide more efficient access to such data at large scale.
Gradually, the research attempts converged on a distributed data processing framework, MapReduce [10]. The MapReduce programming model simplifies parallel data processing by offering two interfaces: map and reduce. With these two interfaces, a number of data warehouses and database systems [3][13][7] have been built on top of the MapReduce framework.
1.2 Motivation
Recently, researchers started looking into the possibility of managing big XML data in a more elastic distributed environment, such as Hadoop [1], using MapReduce. Inspired by XML-enabled relational database systems, big XML data can be stored and processed with relational storage and operators. However, shredding XML data of big size into relational tables is extremely expensive. Furthermore, with relational storage, each XML query must be processed by several θ-joins among tables. The cost of joins is still the bottleneck for Hadoop-based database systems.
Most recent research attempts leverage the idea of XML partitioning and query decomposition adopted from distributed XML databases [16][12]. Similar to the join operation in relational databases, an XML query may require linking two or more arbitrary elements across the whole XML document. Thus, to process XML queries in a distributed system, transferring fragmented data from one node to another is unavoidable. In a static environment like a distributed XML database system, proper indexing techniques can help to optimally distribute data fragments and the workload. However, in an elastic distributed environment such as Hadoop, each copy of an XML fragment will probably be transferred to undeterminable different nodes for processing. In other words, it is difficult to optimize data distribution in a MapReduce framework, so the existing approaches may suffer from high I/O and network transmission cost.
1.3 Contribution
In this paper, we study the parallelization of structural join based XML query processing algorithms using MapReduce. We do not distribute the whole big XML document to a computing cluster; instead, we distribute the inverted lists for each type of document node to be queried. As mentioned, since the size of the inverted lists used to process a query is much smaller than the size of the whole XML document, our approach potentially reduces the cost of cross-node data transfer. Our contributions can be summarized as follows:
The problem of parallelizing structural joins for XML query processing using MapReduce is studied. To the best of our knowledge, this is the first work that discusses distributing inverted list labels rather than raw XML documents for parallel structural joins.
A polynomial-based workload distribution algorithm is designed in the Map phase, which can balance the workload of each Reduce task.
An optimization technique is proposed to avoid emitting nodes to reducers where they would not contribute to structural join results.
We conduct experiments to validate our algorithms.
1.4 Organization
The rest of the paper is organized as follows. In Section 2, we introduce the background knowledge and revisit related work. In Section 3, we present the map and reduce functions in our framework to process XML queries in parallel. In Section 4, an optimization algorithm is proposed so that unnecessary emitting can be pruned in mappers. Section 5 shows the experimental study to validate the proposed algorithms. Finally, we conclude this paper in Section 6.
2. Background and Related Work
2.1 MapReduce
MapReduce is a computational model that originated in functional programming languages and was introduced to computer clusters for parallel data processing [10]. It simplifies the programmer's implementation of parallel data processing by offering two user-defined functions, map and reduce.
The map function takes a set of key/value pairs as input. After a MapReduce job is submitted to the system, the map tasks (normally referred to as mappers) are started on certain compute nodes. Each map task executes the user-implemented map function over every key/value input pair. The output of the map function is another set of key/value pairs, which are temporarily stored in the local file systems and sorted by key. When all map tasks have completed, the system notifies the reduce tasks (referred to as reducers) to start executing.
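As a concrete illustration, the map/reduce contract above can be simulated in plain Java without a Hadoop installation. The word-count logic, the class name, and the in-memory "shuffle" below are illustrative assumptions, not part of this project:

```java
import java.util.*;

// A minimal in-memory illustration of the map/reduce contract:
// map emits key/value pairs, the framework groups values by key
// (the shuffle), and reduce aggregates each group.
public class MapReduceSketch {

    // map: one input record -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> total count for that word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": run map over all records, shuffle (group by key), reduce.
    public static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real job, Hadoop performs the shuffle across nodes and persists intermediate pairs; here a sorted map stands in for it.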
2.2 XML Query Processing
There are different approaches to process XML queries. Initially, XML documents were shredded into relational tables and queries were translated into SQL statements to query the database (e.g., [17]). This is the so-called relational approach.
However, in the relational approach, an XML query requires several table joins, and most of them are expensive θ-joins. Later, researchers proposed to process XML queries in their native form.
To process a query, only those inverted lists related to the queried node types are loaded and scanned; all other document nodes are ignored. Finally, structural joins are performed over the inverted lists to find the query answers.
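A minimal sketch of such a structural join, assuming the common region encoding in which each node carries a (start, end, level) label and node a is an ancestor of node d iff a.start < d.start and d.end < a.end. The labels, class names, and the simple sorted-scan join are illustrative; production algorithms such as stack-based structural joins do this in a single merge pass:

```java
import java.util.*;

// Sketch of an ancestor-descendant structural join over two inverted
// lists of region-encoded labels, both sorted by start position.
public class StructuralJoinSketch {

    // Region-encoded label: a node spans [start, end] at a given depth.
    static final class Label {
        final int start, end, level;
        Label(int start, int end, int level) {
            this.start = start; this.end = end; this.level = level;
        }
        @Override public String toString() { return "(" + start + "," + end + ")"; }
    }

    // a is an ancestor of d iff a's region strictly contains d's region.
    static boolean contains(Label a, Label d) {
        return a.start < d.start && d.end < a.end;
    }

    // For each ancestor candidate, scan descendants until they start
    // past the ancestor's region; the sort order allows early exit.
    static List<String> join(List<Label> ancestors, List<Label> descendants) {
        List<String> results = new ArrayList<>();
        for (Label a : ancestors) {
            for (Label d : descendants) {
                if (d.start > a.end) break; // sorted: no later d can be inside a
                if (contains(a, d)) results.add(a + "/" + d);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Label> as = Arrays.asList(new Label(1, 10, 1), new Label(11, 14, 1));
        List<Label> ds = Arrays.asList(new Label(2, 3, 2), new Label(5, 6, 2),
                                       new Label(12, 13, 2), new Label(15, 16, 2));
        System.out.println(join(as, ds));
    }
}
```

The point of the inverted-list design is visible here: the join touches only the labels of the two queried node types, never the rest of the document.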
2.3 XML Query Processing using MapReduce
To process XML queries using MapReduce, we need to decompose a big XML document, distribute the portions to different sites, and execute the query processing algorithm in a parallel manner at the different sites. Obviously, the relational approach is not suitable, because transforming a big XML document into relational tables can be extremely time consuming, and θ-joins among relational tables are expensive using MapReduce.
Recently, several works have been proposed to implement native XML query processing algorithms using MapReduce. In [9], the authors proposed a distributed algorithm for Boolean XPath query evaluation using MapReduce. By collecting the Boolean evaluation results from a computer cluster, they proposed a centralized algorithm to finally process a general XPath query. In fact, they did not use the distributed computing environment to generate the final result, and it is still unclear whether the centralized step would be the bottleneck of the algorithm when the data is huge. In [8], a Hadoop-based system was designed to process XML queries using the structural join approach. There are two phases of MapReduce jobs in the system. In the first phase, an XML document is shredded into blocks and scanned against input queries.
In fact, most existing works are based on XML document shredding, which is inspired by document partitioning in distributed XML databases [16][12]. However, in the MapReduce framework, the distributed computing environment is assumed to be dynamic and elastic, which makes XML fragment distribution difficult to optimize. Hence, in such a setting, fragmented data (from either the original document or intermediate results) may need to be transferred to undeterminable different sites, which leads to high I/O and network transmission cost. Motivated by the inverted-list based structural join algorithms, in this paper we propose an approach that distributes inverted lists rather than a raw XML document, so that the size of the fragmented data for I/O and network transmission can be greatly reduced.
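The distribution idea can be sketched in plain Java: the map phase emits region-encoded labels from the inverted lists, keyed by a partition id, so that each reducer can run a structural join on its slice. The simple range partitioning below is only a stand-in, since the paper's polynomial-based scheme is not fully specified in this excerpt, and all names are illustrative:

```java
import java.util.*;

// Sketch of distributing inverted-list labels (here reduced to start
// positions) instead of XML fragments.
public class LabelDistributionSketch {

    // Assign a label's start position to one of numPartitions contiguous
    // ranges over [0, maxStart]. NOTE: a real scheme must also replicate
    // ancestor labels to every partition their region overlaps, so that
    // cross-partition ancestor-descendant pairs are not lost; that step
    // is omitted here for brevity.
    static int partitionOf(int start, int maxStart, int numPartitions) {
        int width = (maxStart + numPartitions) / numPartitions;
        return Math.min(start / width, numPartitions - 1);
    }

    // "Map phase": group start positions by partition key, standing in
    // for emitting (partition, label) pairs to reducers.
    static Map<Integer, List<Integer>> distribute(List<Integer> starts,
                                                  int maxStart, int numPartitions) {
        Map<Integer, List<Integer>> byPartition = new TreeMap<>();
        for (int s : starts) {
            int p = partitionOf(s, maxStart, numPartitions);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(s);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(distribute(Arrays.asList(0, 5, 9), 10, 3));
    }
}
```

The payload shipped per node is a list of small integer labels rather than XML text, which is the source of the claimed I/O and network savings.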
3. Roles and Responsibilities:
Install and configure the Apache Hadoop ecosystem, which includes HDFS storage, MapReduce, and Hive.
Develop MapReduce applications for parsing data from XML format into structured format.
Load the structured data into the Hive warehouse for analysis and transformation as per the requirements.
Develop cron jobs to automate the job run process.
Work with a business intelligence tool, QlikView, where the incoming structured data is sliced and diced for analysis and visualizations such as KPIs, charts, and calendars. Work on one end-to-end solution in the Hadoop environment.
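The XML-to-structured-format parsing step can be sketched with the JDK's StAX parser. The element names (record, id, name), the tab delimiter, and the in-memory driver are illustrative assumptions, since the real MapReduce job depends on the source feeds' schema:

```java
import javax.xml.stream.*;
import java.io.StringReader;
import java.util.*;

// Flatten simple <record><field>value</field>...</record> XML into
// tab-delimited rows, as a plain-Java stand-in for the parsing job.
public class XmlToRows {

    public static List<String> parse(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> rows = new ArrayList<>();
            List<String> fields = null;          // non-null while inside a <record>
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (r.getLocalName().equals("record")) fields = new ArrayList<>();
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (r.getLocalName().equals("record")) {
                            rows.add(String.join("\t", fields)); // one delimited row per record
                            fields = null;
                        } else if (fields != null) {
                            fields.add(text.toString().trim());
                        }
                        break;
                }
            }
            return rows;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("bad XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<data><record><id>1</id><name>x</name></record>"
                               + "<record><id>2</id><name>y</name></record></data>"));
    }
}
```

Each returned row is ready for a tab-delimited Hive external table (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t').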
3.1 Technologies: Hadoop-1.2.1, HDFS, MapReduce, Hive-0.11.0, Core Java, Eclipse, QlikView, JDBC connector, and Ubuntu.
4. HIVE :
A system for managing and querying structured data built on top of Hadoop
Map-Reduce for execution
HDFS for storage
Metadata in an RDBMS
Key Building Principles:
SQL as a familiar data warehousing tool
Extensibility – Types, Functions, Formats, Scripts
Scalability and Performance
Interoperability
4.1 HIVE PROCESS :
Pros :
Superior in availability/scalability/manageability
Efficiency is not that great, but more hardware can compensate
Partial Availability/resilience/scale more important than ACID
Cons: Programmability and Metadata
Map-reduce is hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
5.1 Hadoop Clusters :
Used to log data from web servers
Clusters collocated with the web servers
Network is the biggest bottleneck
Typical cluster has about 50 nodes.
Stats:
~ 25TB/day of raw data logged
99% of the time data is available within 20 seconds
5.2 Hadoop/Hive cluster
8400 cores
Raw Storage capacity ~ 12.5PB
8 cores + 12 TB per node
32 GB RAM per node
Two level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch
2 clusters
One for ad hoc users
One for strict SLA jobs
Statistics per day:
12 TB of compressed new data added per day
135TB of compressed data scanned per day
7500+ Hive jobs per day
80K compute hours per day
Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
Most jobs are Hive jobs
5.3 Types of Applications:
Reporting
Eg: Daily/Weekly aggregations of impression/click counts
Measures of user engagement
Microstrategy reports
Ad hoc Analysis
Eg: how many group admins broken down by state/country
Machine Learning (Assembling training data)
Ad Optimization
Eg: User Engagement as a function of user attributes
Many others
7. HIVE QUERY LANGUAGE
SQL
Sub-queries in from clause
Equi-joins (including Outer joins)
Multi-table Insert
Multi-group-by
Embedding Custom Map/Reduce in SQL
Sampling
Primitive Types
integer types, float, string, boolean
Nestable Collections
array<any-type> and map<primitive-type, any-type>
User-defined types
Structures with attributes which can be of any-type
OPTIMIZATIONS:
Joins try to reduce the number of map/reduce jobs needed.
Memory efficient joins by streaming largest tables.
Map Joins
User specified small tables stored in hash tables on the mapper
No reducer needed
Map side partial aggregations
Hash-based aggregates
Serialized key/values in hash tables
~90% speed improvement on a query like
SELECT count(1) FROM t;
Load balancing for data skew
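The map-side partial aggregation listed above can be sketched in plain Java. The class and key names are illustrative; in Hive this corresponds to the hash aggregates a mapper keeps before the shuffle:

```java
import java.util.*;

// Sketch of map-side partial (hash-based) aggregation: each mapper
// pre-aggregates counts in a local hash table and emits one pair per
// distinct key instead of one pair per input row, shrinking the data
// that must be shuffled to reducers.
public class PartialAggSketch {

    // Without partial aggregation a mapper would emit one (key, 1)
    // per input row; here it emits one (key, partialCount) per key.
    static Map<String, Long> mapSideAggregate(List<String> rows) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : rows) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // The reducer then just sums the partial counts for one key.
    static long reduceCounts(List<Long> partials) {
        long total = 0;
        for (long p : partials) total += p;
        return total;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "a", "b", "a", "b");
        Map<String, Long> m = mapSideAggregate(rows);
        System.out.println(m.size() + " pairs emitted instead of " + rows.size());
    }
}
```

This is also why skewed keys matter: a hot key still produces many partial pairs across mappers, which the load-balancing item above addresses.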
8. HIVE - OPEN AND EXTENSIBLE :
Different on-disk storage(file) formats
Text File, Sequence File, …
Different serialization formats and data types
LazySimpleSerDe, ThriftSerDe …
User-provided map/reduce scripts
In any language, use stdin/stdout to transfer data …
User-defined Functions
Substr, Trim, From_unixtime …
User-defined Aggregation Functions
Sum, Average …
User-defined Table Functions
Explode …
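The shape of a user-defined function can be sketched in plain Java. A real Hive UDF extends org.apache.hadoop.hive.ql.exec.UDF and exposes an evaluate() method; the Hive dependency is omitted here so the sketch stays self-contained, and the logic only mirrors the common case of the trim and substr built-ins:

```java
// Dependency-free sketch of simple Hive-style UDF logic.
public class StringUdfSketch {

    // Mirrors the common case of Hive's trim(string).
    public String evaluateTrim(String s) {
        return s == null ? null : s.trim();
    }

    // Mirrors the common case of Hive's substr(string, pos, len),
    // where pos is 1-based (negative positions are not handled here).
    public String evaluateSubstr(String s, int pos, int len) {
        if (s == null || pos < 1 || pos > s.length() || len <= 0) return null;
        int from = pos - 1;
        int to = Math.min(from + len, s.length());
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        StringUdfSketch u = new StringUdfSketch();
        System.out.println(u.evaluateTrim("  hadoop ") + " " + u.evaluateSubstr("hadoop", 2, 3));
    }
}
```

Packaged as a real UDF (with the Hive jar on the classpath), such a class would be registered with CREATE TEMPORARY FUNCTION and called directly from HiveQL.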