Big Data Analysis inside a DBMS tightly coupled with different File Systems Perspectives
Abstract: The DBMS is considered an important analytics platform for Big Data because of the high functionality of its query languages, indexes, and flexible schemas, which maximize scalability and parallelism. RDBMSs remain the first-order data management technology, while many non-DBMS tools such as statistical languages, generalized data mining techniques, and large-scale parallel systems serve as the main technologies for Big Data analytics. There has been a lot of research on DBMS support in MapReduce. The only technology that directly exploits a DBMS for Big Data analytics in the MapReduce framework is HadoopDB. Although it takes advantage of both DBMS and MapReduce technologies, HadoopDB has the limitation that sharability of the entire data is not supported, because it stores the data across multiple nodes in a shared-nothing manner. Large-scale systems have to build on the base of HadoopDB and MapReduce, so the DBMS seems not to be a good technology for analyzing Big Data, despite being a fast and reliable data repository that handles SQL queries. HadoopDB cannot efficiently process queries that need inter- and intra-node communication: it must either reload the whole data set to handle some queries or cannot handle certain complex queries at all. In my research effort I propose an NFS-integrated DBMS, in which a DBMS is tightly coupled with a networked file system (NFS), through which sharability of the entire data can be achieved. In my research I explain networked analytics on large databases inside a DBMS. Although DBMSs cannot replace parallel systems such as MapReduce for web-scale textual data analysis, the two technologies influence each other. To process Big Data analytics in parallel, I implement the MapReduce framework on top of the NFS-integrated DBMS, and I also propose the notion of networked mapping for optimization. I will show that the limitations of HadoopDB are overcome by two strengths: (1) faster query processing, since the data does not need to be reloaded, and (2) support for more complex query types. I conclude with a proposal of long-term research issues, taking into consideration current trends in Big Data analysis research.
I. INTRODUCTION
Hadoop is an open-source technology that implements Google's approach to large-scale data processing and storage. Hadoop uses a programming model called MapReduce, whose map and reduce operations originate in functional programming languages such as LISP, but it does not require all data to be loaded into memory. The term Big Data describes the exponential growth and availability of structured and unstructured data. Traditional databases and other analytical techniques cannot process Big Data because of its sheer volume. Better decisions can be taken when extremely large volumes of data are available for more accurate analyses; this enables confident decision making, which in turn generates operational efficiencies, cost feasibility, and better risk management.
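To illustrate the functional roots of the MapReduce model mentioned above, here is a minimal Python sketch (my own illustration, not part of the original paper) using the built-in map and functools.reduce, the same primitives that LISP-style languages provide:

from functools import reduce

# A tiny illustration of the functional map/reduce primitives that
# inspired the MapReduce model: square a list of numbers, then sum them.
numbers = [1, 2, 3, 4, 5]

squared = map(lambda x: x * x, numbers)      # "map" transforms each element
total = reduce(lambda a, b: a + b, squared)  # "reduce" folds them into one value

print(total)  # 55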
Researchers and academicians are challenged by analyzing Big Data, which requires special analytical skills. Big Data analysis uncovers patterns, correlations, and other knowledge that improve decision making, and it helps us recognize which data is most useful for future needs.
The Hadoop MapReduce technique analyzes Big Data. Due to its high scalability, fine-grained fault tolerance, and easy algorithm implementation, it has emerged as the prevailing paradigm for large-scale data analysis. MapReduce refers to two tasks: a map and its reduction.
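As a rough single-machine sketch of these two tasks (my own illustration; real Hadoop jobs distribute these steps across a cluster), a word count can be expressed as a map phase emitting (word, 1) pairs, a shuffle grouping the pairs by key, and a reduce phase summing the counts:

from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big systems", "big data analytics"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 3, 'data': 2, 'needs': 1, 'systems': 1, 'analytics': 1}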
HDFS, the Hadoop Distributed File System, runs on commodity hardware. HDFS is designed for very large data sets (TB, PB, or even ZB in scale) and provides high-throughput access to data for knowledge and information sharing.
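As a hedged sketch of how data typically enters HDFS (my own example, assuming a configured Hadoop client is on the PATH; the file name sales.csv and the path /data are placeholders), the standard hdfs dfs commands can be driven from Python:

import subprocess

# Copy a local file into HDFS and list the directory through the hdfs CLI.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "sales.csv", "/data/sales.csv"], check=True)
listing = subprocess.run(["hdfs", "dfs", "-ls", "/data"],
                         capture_output=True, text=True, check=True)
print(listing.stdout)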
II. OBJECTIVES AND CHALLENGES
The objective of multi-dimensional data analysis is the efficient prediction of future observations and the capability to relate outcomes to given inputs for scientific purposes. Owing to their large samples, Big Data serve two additional goals across distinct subpopulations: exploring heterogeneity and commonality. Big Data satisfy two basic requirements: uncovering the structure of each subpopulation, which is traditionally not feasible when the subpopulation's sample size is comparatively small, and extracting features that are common across subpopulations.
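As a hedged illustration of these two requirements (my own sketch using scikit-learn's KMeans on synthetic data; the paper does not prescribe a specific method), clustering can expose subpopulation structure, including a deliberately small subpopulation, after which features can be compared across the groups:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three subpopulations with different centers,
# one of them deliberately small.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(500, 2)),
    rng.normal(loc=(5.0, 5.0), scale=1.0, size=(500, 2)),
    rng.normal(loc=(0.0, 5.0), scale=1.0, size=(30, 2)),  # small subpopulation
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-subpopulation structure: the mean feature vector of each cluster.
for k in range(3):
    print(k, X[labels == k].mean(axis=0))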
Making fast yet accurate decisions is a challenge: the enormous data volume makes accessing the desired level of metadata a cumbersome job, and the challenge grows as the degree of granularity increases. To explore voluminous Big Data in real time, we can use enhanced memory, parallel processing, or load the data entirely into memory. Clustering the data can also meet this demand for speed and accuracy by making the data groups small enough to visualize effectively.
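A minimal sketch of the parallel-processing option (my own illustration; the chunk count and the summarize function are placeholder assumptions) uses Python's multiprocessing to scan partitions of a large array in parallel and combine only small partial aggregates:

from multiprocessing import Pool
import numpy as np

def summarize(chunk):
    # Reduce one partition to a small summary so only aggregates
    # travel back to the parent process.
    return chunk.sum(), len(chunk)

if __name__ == "__main__":
    data = np.random.rand(10_000_000)
    chunks = np.array_split(data, 8)  # one partition per worker
    with Pool(processes=8) as pool:
        partials = pool.map(summarize, chunks)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print(total / count)  # global mean recovered from partial aggregates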
If the data is inaccurate or out of date, its value is suspect and compromises decision making, even when access is fast. This challenge can be overcome when Big Data analysis is combined with profound data visualization techniques. Graphical representations can reveal trends and outliers faster than relational tables of numeric and textual data; analysts can easily spot points of interest simply by looking at a chart produced by graphical drawing techniques and methods.
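To illustrate why a chart reveals outliers faster than a numeric table, here is a short sketch (my own example, assuming matplotlib is available; the data and the three-sigma threshold are illustrative choices) that flags and highlights points far from the mean:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=100.0, scale=10.0, size=1000)
values[::250] += 80.0  # inject a few obvious outliers

# Flag anything more than three standard deviations from the mean.
outliers = np.abs(values - values.mean()) > 3 * values.std()

plt.scatter(np.arange(len(values)), values, s=8, label="data")
plt.scatter(np.where(outliers)[0], values[outliers], color="red", label="outliers")
plt.legend()
plt.show()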