07-10-2016, 09:43 AM
Abstract
Big data is the term for data sets so large and complicated that they become difficult to process using traditional data management tools or processing applications. This paper reviews the most recent progress on big data and big data networking. We have grouped the reported efforts into four general categories. First, efforts related to classic big data technology such as storage, Software-Defined Networking, data transportation and analytics are reported. Second, important aspects of big data in cloud computing such as resource management and performance optimization are introduced. Lastly, we introduce interesting benchmarks and progress in both search engines and mobile networking. Following a detailed summary and analysis, we discuss limitations of the surveyed works as well as possible future research directions.
1. Introduction
Big data is the term for data sets so large and complicated that they become difficult to process using traditional data management tools or processing applications. This paper reviews recent progress on big data, big data networking and relevant topics.
According to [Bakshi12], the size of digital data in 2011 was roughly 1.8 zettabytes (1.8 trillion gigabytes), and the supporting networking infrastructure will have to manage 50 times more information by 2020. Specifically, considerations of efficiency, economics and privacy should be carefully planned when adding new big data building blocks to existing data and networking infrastructure [Bakshi12].
In addition to the big data challenges induced by traditional data generation, consumption, and analytics at a much larger scale, newly emerged characteristics of big data have shown important trends in the mobility of data, faster data access and consumption, as well as ecosystem capabilities [Cisco11]. Fig. 1 illustrates a general big data network model with MapReduce. Distinct applications in the cloud have placed demanding requirements on the acquisition, transportation and analytics of structured and unstructured data.
In this paper, we pay close attention to recent progress made on big data and big data networking. We divide the relevant efforts into representative categories while offering our own independent assessment. To be specific, topics covered in this paper include: recent progress on classic big data networking technologies, e.g., Hadoop and MapReduce, big data technologies in cloud computing, big data benchmarking projects, and mobile big data networking.
2. Related Work
This section reviews recent progress and efforts in big data networking. We cover these topics in 4 categories: classic big data networking technology, big data in cloud computing, data engineering and benchmarking approaches, and mobile big data networking. All covered works were reported between 2011 and 2013.
In classic big data research, the following works reported progress in big data networking. [Madden12] discusses challenges and opportunities for databases in the presence of big data. [Girola11] introduces virtualization planning and cloud computing methods in IBM data center networking. [Keim13] depicts interesting methodologies in big data virtualization. From a platform architecture perspective, [Ferguson12] reports their progress on accelerating big data analytics. As a recent effort, [Dittrich12] introduces contributions on optimizing big data processing efficiency in Hadoop and MapReduce.
A number of new and interesting big data methodologies have been reported. [Monga12] introduces their efforts on Software-Defined Networking for big-data science, with architectural models for campus environments and, more importantly, for Wide Area Networks. [Herodotou11] proposed a self-tuning system for big data analytics. RCFile, a fast and space-efficient data placement structure for MapReduce-based warehouses, was proposed in [He11]. An efficient in-network aggregation method for big data applications was introduced in [Costa12], which considerably reduces the volume of data transported. [Brunet12] reported their Gaia Hadoop solution with an emphasis on identifying potential challenges. An interesting application of big data to Kinect training was discussed in [Budiu12]. [Wang12] introduces their efforts on run-time network programming in big data applications. A recent case study on bursting data in Transportation SDN was introduced in [Sadasivarao13]. Efforts on optimizing interactions with big data analytics were reported in [Fisher12]. [Begoli12] presented their design principles for efficient knowledge discovery. Radoop, based on RapidMiner and Hadoop, has attracted attention in data analytics [Prekopcsák11]. General considerations for big data architecture and data management have been reported in [Bakshi12].
Remarkable progress in big data networking has also been reported in the area of cloud computing. [Agrawal11] reported the current state of, and potential future opportunities for, big data and cloud computing. Resource management and allocation in multi-cluster clouds were introduced in [Lakew13]. A dataflow-based performance analysis tool for big data clouds, i.e., Hitune, was presented in [Dai11]. Interesting case studies on big data processing in cloud computing environments were depicted in [Ji12]. [Lu11] presented a framework for cloud-based large-scale data analytics and visualization; a case study on climate data of various scales was introduced too. A recent online cost-minimization approach was depicted in [Zhang13]. Specifically for reducing the cooling energy cost of big data analytics clouds, a data-centric approach was introduced in [Kaushik12].
In addition to methodologies, a few interesting data engineering and benchmarking efforts have been reported for big data. [Gao13] introduces a big data benchmark project based on open-source data interfaces of web search engines. [Laurila12] presents a mobile data collection challenge initiated by Nokia, which represents an important step towards mobile big data networking.
Mobile networking is becoming an increasingly important counterpart of the traditional Internet and of big data, and big data benchmarking has a valuable impact on the research community. [Shekhar12] reported spatial big-data challenges at the intersection of mobility and cloud computing. A recent effort on mining large-scale smartphone data for personality studies has been presented in [Chittaranjan13]. From the perspective of big data applications, [Silberstein11] introduced challenges in social applications, while [Zaslavsky13] presented an interesting as-a-service application of big data.
3. Efforts in Classic Big Data Networking
In addition to traditional big data technologies such as Hadoop, MapReduce and NoSQL, notable progress has been made in the past two years on big data networking in many other areas. We summarize it in 4 categories: storage and warehouse, data transportation, Software-Defined Networking and big data analytics.
3.1 Storage and Warehouse
Data storage is the basis for big data networking. Representative technologies are relational databases, Not Only SQL (NoSQL) databases and data warehouses.
An in-depth review of state-of-the-art database technologies in the area of big data was presented in [Madden12]. The author claimed that although considerable progress has been made in database research, much remains to be done. First, handling high-rate streaming data in relational models remains an open problem. Second, statistical analysis and machine learning algorithms for big data need to be more robust and easier to use. Lastly, and more importantly, an ecosystem-like mechanism should be built around the devised big data algorithms so that data management and usage can evolve on top of them.
Another important aspect of big data databases is the data placement structure. The authors of [He11] argue that traditional data placement structures such as row-stores, column-stores and hybrid-stores are no longer suitable for large-scale data analysis using MapReduce on distributed systems. Instead, they propose RCFile (Record Columnar File) and its implementation in Hadoop, which offers fast data loading, fast query processing, efficient storage space utilization, and strong adaptability to dynamic workload patterns [He11]. The basic idea of RCFile is depicted in Fig. 2.
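Since Fig. 2 is not reproduced here, the following is a minimal Python sketch of the record-columnar idea as we understand it: a table is first split horizontally into row groups, and each row group is then stored column by column. The function names, the toy table and the group size are illustrative assumptions, not the actual RCFile format or the Hadoop implementation in [He11]; compression and on-disk metadata are omitted.

```python
from typing import Dict, List, Sequence

Row = Sequence[object]


def to_row_groups(rows: List[Row], group_size: int) -> List[List[Row]]:
    """Horizontally partition the table into fixed-size row groups."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def columnarize(group: List[Row], columns: Sequence[str]) -> Dict[str, list]:
    """Vertically partition one row group: each column is stored contiguously."""
    return {name: [row[idx] for row in group] for idx, name in enumerate(columns)}


def read_column(groups: List[Dict[str, list]], name: str) -> list:
    """Scan one column across all row groups without touching other columns."""
    values: list = []
    for group in groups:
        values.extend(group[name])
    return values


# Hypothetical toy table (illustrative data only).
columns = ("user", "country", "bytes")
table = [
    ("u1", "CH", 120),
    ("u2", "US", 300),
    ("u3", "CH", 50),
    ("u4", "FI", 75),
    ("u5", "US", 10),
]

row_groups = [columnarize(g, columns) for g in to_row_groups(table, group_size=2)]
total_bytes = sum(read_column(row_groups, "bytes"))
```

In this sketch all fields of a record stay inside one row group, which keeps record reconstruction local, while a query over a single column such as `bytes` only reads that column's values in each group.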
3.2 Software-Defined Network
Software-Defined Networking (SDN), as the transportation medium of big data, also plays a critical role in big data networking. We next review progress in this regard.
[Monga12] introduces their efforts on Software-Defined Networking for big-data science, with architectural models spanning from campus to WAN. To bypass traditional performance hotspots in a typical campus network, the authors built on the SC11 SCinet Research Sandbox demonstrator with SDN for the sake of a scalable architectural approach. The proposed work is shown to be simple and, more importantly, adaptable to the network framework. Overall, the method in this work is incremental, but we are glad to see that its system validation supports yet another SDN design.
Run-time network programming is useful for big data networks that require frequent reconfiguration. [Wang12] introduces their efforts on run-time network programming in big data applications. Specifically, the authors combined an SDN controller with optical switching to realize close collaboration between network control and potential applications. Joint optimizations of network performance as well as network utilization have been explored. Analysis shows that, at a relatively small configuration overhead, the proposed integration offers great potential for optimizing application performance. The systematic design and evaluation in this work are inspiring.
Bursting data transportation is yet another important aspect of SDN data exchange, as it promises smaller transportation delays. A recent case study on bursting data in Transportation SDN was introduced in [Sadasivarao13]. The authors proposed an SDN-enabled optical transportation architecture which meshes seamlessly with data centers. A case study with an OpenFlow-enabled optical vSwitch managing a small optical transport network was reported. The authors argue that their extension and the inherent programmability brought by SDN are substantial in real-world applications. However, the general impact has to be further validated in larger deployments.
In sum, real-world case studies on SDN as well as on run-time programming and bursting data transportation have been reported, and they all showed promising advances compared to existing approaches. SDN is benefiting from these advances.
An efficient in-network aggregation method, Camdoop, for big data applications was introduced in [Costa12], which considerably reduces the volume of data transported. Instead of increasing network bandwidth, the authors focused on decreasing the traffic by pushing aggregation from the edge into the network. Implemented on CamCube, which uses a direct-connect topology (i.e., servers connected directly to other servers), Camdoop specifically exploits the property that CamCube servers forward traffic in order to perform in-network aggregation. Case studies showed that Camdoop significantly reduces network traffic while maintaining comparable performance, compared with a reference based on Hadoop and Dryad/DryadLINQ.
However, similar to the tradeoffs in other in-network aggregation approaches, Camdoop also loses end-data accuracy, because it does not transport all the generated data. Moreover, in-depth comparisons against more advanced approaches, rather than the single reported reference, are needed.
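The traffic argument behind in-network aggregation can be illustrated with a small Python sketch. It compares relaying every record unchanged to the final reducer against merging partial results at each intermediate server before forwarding. The tree topology, server names and per-server counts are hypothetical; this is not Camdoop's implementation and does not model CamCube's topology.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical aggregation tree (child -> parent); "root" is the final reducer.
PARENT = {"s1": "a", "s2": "a", "s3": "b", "s4": "b", "a": "root", "b": "root"}

# Hypothetical per-server partial results, e.g. counts from local map tasks.
LOCAL = {
    "s1": Counter({"x": 2, "y": 1}),
    "s2": Counter({"x": 1, "z": 4}),
    "s3": Counter({"y": 3}),
    "s4": Counter({"x": 5, "y": 1}),
}


def children_of(node: str) -> List[str]:
    return sorted(child for child, parent in PARENT.items() if parent == node)


def forward_raw(node: str) -> Tuple[List[Tuple[str, int]], int]:
    """Baseline: intermediate servers relay records unchanged.

    Returns the records this node sends upward and the number of
    (key, value) records that crossed links inside its subtree.
    """
    records = list(LOCAL.get(node, Counter()).items())
    transfers = 0
    for child in children_of(node):
        child_records, child_transfers = forward_raw(child)
        transfers += child_transfers + len(child_records)
        records.extend(child_records)
    return records, transfers


def forward_aggregated(node: str) -> Tuple[Counter, int]:
    """In-network aggregation: merge received partial results before forwarding."""
    merged = Counter(LOCAL.get(node, Counter()))
    transfers = 0
    for child in children_of(node):
        child_merged, child_transfers = forward_aggregated(child)
        transfers += child_transfers + len(child_merged)
        merged.update(child_merged)  # sums counts key by key
    return merged, transfers


raw_records, raw_transfers = forward_raw("root")
agg_totals, agg_transfers = forward_aggregated("root")

# The baseline aggregates only once, at the root.
raw_totals: Counter = Counter()
for key, value in raw_records:
    raw_totals[key] += value
```

With this toy input the baseline moves 14 records across links and the aggregated variant moves 12, while both produce the same per-key totals. The saving grows with the amount of key overlap between servers. Note that the root only ever sees merged values in the second variant, not the individual per-server records.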
3.3 Analytics
Collection and transportation of big data share a common goal: analyzing the data for insights and better application guidance. We review recent progress in big data analytics below.
As a recent effort, [Dittrich12] gives a tutorial on optimizing big data processing efficiency in Hadoop and MapReduce. To be specific, the authors focus on introducing different data management techniques, e.g., job optimization and physical data organization such as data layouts and indexes. A comprehensive comparison between Hadoop MapReduce and parallel DBMSs was given. From an architecture perspective, [Ferguson12] reported their progress on accelerating big data analytics. This work introduces IBM's efforts in architecting their big data platforms to meet the requirement that one new analytical ecosystem can support the entire spectrum of big data analytics. The reported technology utilizes Hadoop and the IBM Smart Analytic System with a built-in NoSQL graph store.
[Herodotou11] proposed Starfish, a self-tuning system for big data analytics. The focus of this work is to close the knowledge gap between new users and the sophisticated configuration of Hadoop and its default MapReduce layer. Moreover, Starfish can adapt to user needs and system workloads for better performance. Starfish builds on ideas from self-tuning databases. Nevertheless, it is not clear how well Starfish can react to high-rate streaming data.
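For readers less familiar with the MapReduce layer that the works in this section tune and optimize, the following single-process Python sketch shows the map, shuffle and reduce steps on a word count. The helper names (`map_reduce`, `word_mapper`, `sum_reducer`) are illustrative assumptions; this is not the Hadoop API and it ignores distribution, fault tolerance and data layout.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            grouped[key].append(value)      # shuffle: group values by key
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Sum all values emitted for one key."""
    return sum(values)


lines = ["big data networking", "big data analytics", "data"]
counts = map_reduce(lines, word_mapper, sum_reducer)
```

In a real deployment the map and reduce tasks run on different machines and the shuffle moves data over the network, which is why job configuration and physical data organization have such a large effect on performance.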
4. Progress of Big Data in Cloud Computing
Cloud computing, as an important application environment for big data, has attracted tremendous attention from the research community. Remarkable progress in big data networking has also been reported in this area. In this section, we introduce big data research issues and solutions related to cloud computing. Specifically, we are interested in the following topics: opportunities and challenges of big data networking in cloud computing, cloud resource management for big data, and performance optimization of big data in cloud computing.
4.2 Performance Optimization
Performance optimization is yet another classic and important topic in cloud computing because appropriate optimization techniques will provide better application experiences with comparable or even less system resource consumption, compared to non-optimized cases.
A dataflow-based performance analysis tool for big data clouds, i.e., Hitune, was presented in [Dai11]. Hitune is shown to be effective in assisting users with Hadoop performance analysis and system parameter tuning. Limitations of existing approaches, such as Hadoop logs and metrics, were also compared and discussed. A few interesting case studies on big data processing in cloud computing environments were depicted in [Ji12]. The efforts of the Fujitsu laboratory are based on data stores and complex event processing, as well as workflow description in distributed data processing.
A recent online cost-minimization approach was depicted in [Zhang13]. The proposed work specifically focused on real-time cost minimization for uploading massive and dynamic data onto the cloud. The two online algorithms achieved competitive cost reduction ratios. However, the proposed methods were only evaluated at a limited scale. They need to be further evaluated at larger and more competitive scales, e.g., data streaming applications with larger topologies.
In sum, Hitune and the Fujitsu laboratory approaches have focused on improving user experience by using fundamental big data techniques such as event processing and workflow description. Tools and case studies like these are informative and offer more choices to users. Moreover, online cost minimization, as another promising direction, has been shown to be effective in big data applications. We expect many more scalable and efficient algorithms to be proposed in the near future.
5. Big Data Benchmarks and Mobile Networking
In this section, we briefly review two important counterparts of big data networking research: benchmarks and mobile networking with big data considerations. The discussed works represent not only dedicated efforts but also possible popular trends in big data networking research.
5.1 Big Data Benchmarks
Big data benchmarks play a crucial role in these data-centric research areas, because the scientific collection and organization of informative data will provide important ground truth for verifying further methodologies.
The authors of [Gao13] present BigDataBench, an interesting big data benchmark project based on open-source data interfaces of web search engines. Search engines have become the entry point to the whole Internet. Hence, an insightful collection of informative data sets is not only valuable but also hard to obtain due to privacy regulations. The reported work has involved Internet giants such as Baidu, Sogou, Facebook, Yahoo and Huawei, and preliminary results have been shown. To be specific, the data collection techniques in this work are based on open-source search engine solutions and anonymous Web access logs. Two interesting case studies have been presented. We have reason to be positive about this benchmark effort considering the big names involved.
[Laurila12] also presents a mobile data collection challenge initiated by Nokia, which represents an important step towards mobile big data networking. To be specific, the authors reviewed the Lausanne Data Collection Campaign (LDCC), which produced a unique and longitudinal smartphone data set that acts as the basis of the Mobile Data Challenge (MDC). Privacy as well as challenging and scalable data collection and usage have been the emphasis of this benchmark study.
In sum, remarkable benchmarking efforts have been initiated in both the traditional Internet and mobile networking. With an emphasis on privacy-respecting and scalable information collection, the discussed benchmark projects represent a promising step for big data and big data networking research in the long run. However, we also expect that more insights and inspiring observations can be extracted from these large-scale studies.
5.2 Mobile Networking
Mobile networking is becoming an increasingly important counterpart of the traditional Internet and of big data. Mobile networks keep growing due to the release of hundreds of thousands of cell phones and tablets. Moreover, the evolution of cellular networks has enabled mobile devices to be connected quickly and reliably.
A number of big data efforts have also been reported regarding mobile networking. [Laurila12] introduced an important mobile big data collection project. [Shekhar12] reported spatial big-data challenges at the intersection of mobility and cloud computing. The observation that spatial localization of mobile data is critical is certainly valid and valuable. One interesting extension of the work in [Shekhar12] would be to study the daily behavior of users based on their usage of mobile maps on cellphones or GPS devices, for which Apple Maps and Google Maps on cellphones are two important representatives.
A recent effort on mining large-scale smartphone data for personality studies has been presented in [Chittaranjan13]. Although this work only covers basic aspects of big data, it is still worth reading because it simultaneously considers both personality study and large-scale data.
In sum, mobile networking is in fact an important counterpart of the traditional Internet. More importantly, benchmarks and case studies have reflected the usefulness of studying mobile big data. Moreover, considering the speed and reliability requirements of mobile networking, effective interaction between the cloud and end users (i.e., closed-loop control/interaction) might be another interesting research direction.
6. Summary
In this work, we have reviewed in depth recent efforts dedicated to big data and big data networking. We have reviewed the progress in fundamental big data technologies such as storage and warehousing, SDN, transportation and analytics. Important aspects of big data networking in cloud computing, such as new challenges and opportunities, resource management and performance optimization, are also introduced and discussed with independent viewpoints. Last but not least, we have also reported important efforts in big data benchmarking and mobile networking, which represent foundations of big data research and promising trends, respectively.
To sum up, we conclude that promising progress has been made in the area of big data and big data networking, but much remains to be done. Almost all proposed approaches are evaluated at a limited scale, for which the reported benchmarking projects can act as a helpful complement by enabling larger-scale evaluations. Moreover, software-oriented studies also need to systematically explore cross-layer, cross-platform tradeoffs and optimizations.