04-11-2016, 04:01 PM
1464791016-CloudComputingJun28.ppt (Size: 2.21 MB / Downloads: 98)
Outline of the talk
Introduction to cloud context
Technology context: multi-core, virtualization, 64-bit processors, parallel computing models, big-data storages…
Cloud models: IaaS (Amazon AWS), PaaS (Microsoft Azure), SaaS (Google App Engine)
Demonstration of cloud capabilities
Cloud models
Data and Computing models: MapReduce
Graph processing using Amazon Elastic MapReduce
A case-study of real business application of the cloud
Questions and Answers
Speakers’ Background in cloud computing
Bina:
Has two current NSF (US National Science Foundation) awards related to cloud computing:
2009-2012: Data-Intensive computing education: CCLI Phase 2: $250K
2010-2012: Cloud-enabled Evolutionary Genetics Testbed: OCI-CI-TEAM: $250K
Faculty at the CSE department at University at Buffalo.
Kumar:
Principal Consultant at CTG
Currently heading a large semantic technology business initiative that leverages cloud computing
Adjunct Professor at School of Management, University at Buffalo.
Introduction: A Golden Era in Computing
Cloud Concepts, Enabling-technologies, and Models: The Cloud Context
Evolution of Internet Computing
Challenges
Alignment with the needs of the business / user / non-computer specialists / community and society
Need to address the scalability issue: large scale data, high performance computing, automation, response time, rapid prototyping, and rapid time to production
Need to effectively address (i) ever shortening cycle of obsolescence, (ii) heterogeneity and (iii) rapid changes in requirements
Transform data from diverse sources into intelligence and deliver intelligence to right people/user/systems
What about providing all this in a cost-effective manner?
Enter the cloud
Cloud computing is Internet-based computing, whereby shared resources, software and information are provided to computers and other devices on-demand, like the electricity grid.
Cloud computing is the culmination of numerous attempts at large-scale computing with seamless access to virtually limitless resources:
on-demand computing, utility computing, ubiquitous computing, autonomic computing, platform computing, edge computing, elastic computing, grid computing, …
“Grid Technology: a slide from my presentation to industry (2005)
Emerging enabling technology.
Natural evolution of distributed systems and the Internet.
Middleware supporting network of systems to facilitate sharing, standardization and openness.
Infrastructure and application model dealing with sharing of compute cycles, data, storage and other resources.
Publicized by prominent industries as on-demand computing, utility computing, etc.
Move towards delivering “computing” to masses similar to other utilities (electricity and voice communication).”
Now,
It is a changed world now…
Explosive growth in applications: biomedical informatics, space exploration, business analytics, web 2.0 social networking: YouTube, Facebook
Extreme scale content generation: e-science and e-business data deluge
Extraordinary rate of digital content consumption: digital gluttony: Apple iPhone, iPad, Amazon Kindle
Exponential growth in compute capabilities: multi-core, storage, bandwidth, virtual machines (virtualization)
Very short cycle of obsolescence in technologies: Windows Vista to Windows 7; Java versions; C#; Python
Newer architectures: web services, persistence models, distributed file systems/repositories (Google, Hadoop), multi-core, wireless and mobile
Diverse knowledge and skill levels of the workforce
You simply cannot manage this complex situation with your traditional IT infrastructure.
The answer: cloud computing?
Typical requirements and models:
platform (PaaS),
software (SaaS),
infrastructure (IaaS),
Services-based application programming interface (API)
A cloud computing environment can provide one or more of these requirements for a cost
Pay as you go model of business
When using a public cloud, the model is more like renting a property than owning one.
An organization could also maintain a private cloud and/or use both.
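The rent-versus-own trade-off above can be made concrete with simple arithmetic. The sketch below compares total cost under the two models; all prices and the workload profile are hypothetical placeholders, not actual provider rates.

```python
# Illustrative-only comparison of pay-as-you-go (rent) vs. owning capacity.
# All prices below are hypothetical placeholders, not real provider rates.

def rental_cost(hourly_rate, hours_per_month, months):
    """Total cost of renting cloud capacity on demand."""
    return hourly_rate * hours_per_month * months

def ownership_cost(purchase_price, monthly_upkeep, months):
    """Total cost of buying and operating equivalent hardware."""
    return purchase_price + monthly_upkeep * months

# A bursty workload that needs capacity only 100 hours a month
# favors renting over a three-year horizon.
rent = rental_cost(hourly_rate=0.50, hours_per_month=100, months=36)
own = ownership_cost(purchase_price=5000, monthly_upkeep=100, months=36)
print(f"rent: ${rent:.2f}, own: ${own:.2f}")
```

For a workload that runs around the clock, the same arithmetic can tip the other way, which is why the deck notes that an organization may also maintain a private cloud or use both.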
Enabling Technologies
Common Features of Cloud Providers
Windows Azure
Enterprise-level on-demand capacity builder
Fabric of cycles and storage available on-request for a cost
You have to use the Azure API to work with the infrastructure offered by Microsoft
Significant features: web role, worker role, blob storage, table and drive storage
Amazon EC2
Amazon EC2 is one large, complex web service.
EC2 provides an API for instantiating computing instances with any of the supported operating systems.
Computations are packaged and launched as Amazon Machine Images (AMIs).
Signature features: S3, Cloud Management Console, MapReduce Cloud, Amazon Machine Image (AMI)
Excellent distribution, load balancing, cloud monitoring tools
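As a rough sketch of what "an API for instantiating computing instances" looks like in practice, the fragment below builds a launch request for EC2 using the boto3 client library. The AMI ID, instance type, and region are hypothetical placeholders, and the actual API call is left commented out because it requires real AWS credentials.

```python
# Sketch of launching an EC2 instance through the AWS API (boto3).
# The AMI ID and instance type are placeholders for illustration only.
launch_params = {
    "ImageId": "ami-12345678",   # hypothetical AMI ID
    "InstanceType": "t2.micro",
    "MinCount": 1,               # launch at least one instance...
    "MaxCount": 1,               # ...and at most one
}

# The real call needs AWS credentials, so it is commented out here:
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# response = ec2.run_instances(**launch_params)

print("would launch:", launch_params["InstanceType"])
```

Once launched, instances appear in the Cloud Management Console mentioned above, where they can be monitored and terminated.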
Google App Engine
This is a web-based development environment that offers a one-stop facility for the design, development, and deployment of applications in Java, Go, and Python.
Google offers reliability, availability, and scalability on par with Google’s own applications
Interface is software programming based
Comprehensive programming platform irrespective of application size (small or large)
Signature features: templates and appspot, excellent monitoring and management console
Demos
Amazon AWS: EC2 & S3 (among the many infrastructure services)
Linux machine
Windows machine
A three-tier enterprise application
Google App Engine
Eclipse plug-in for GAE
Development and deployment of an application
Windows Azure
Storage: blob store/container
MS Visual Studio Azure development and production environment
Cloud Programming Models
The Context: Big-data
Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
We are in a knowledge economy.
Data is an important asset to any organization
Discovery of knowledge; Enabling discovery; annotation of data
Complex computational models
No single environment is good enough: need elastic, on-demand capacities
We are looking at newer
Programming models, and
Supporting algorithms and data structures.
Google File System
The Internet introduced a new challenge in the form of web logs and web-crawler data at large (“peta”) scale.
But observe that this type of data has a uniquely different characteristic from transactional or “customer order” data: it is “write once, read many” (WORM):
Privacy protected healthcare and patient information;
Historical financial data;
Other historical data
Google exploited this characteristic in its Google File System (GFS)
What is Hadoop?
At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and others at Yahoo! re-implemented the GFS design and called their version the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
This is open source and distributed by Apache.
Fault tolerance
Failure is the norm rather than the exception.
An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.
Since there are a huge number of components and each component has a non-trivial probability of failure, there is always some component that is non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
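The "always some component is non-functional" claim follows from basic probability: with N independent components, each failing in a given window with probability p, the chance that at least one is down is 1 - (1 - p)^N. The numbers below are illustrative, not measurements of any real cluster.

```python
# Probability that at least one of n independent components has failed,
# given a per-component failure probability p: 1 - (1 - p)^n.
# The figures used here are illustrative, not real cluster measurements.

def prob_some_failure(n_components, p_fail):
    """Probability that at least one of n components has failed."""
    return 1 - (1 - p_fail) ** n_components

# Even a very reliable component (0.1% failure chance) makes some
# failure nearly certain across a few thousand machines.
print(round(prob_some_failure(3000, 0.001), 4))
```

This is exactly why detection and automatic recovery are architectural goals rather than afterthoughts in HDFS.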
HDFS Architecture
Hadoop Distributed File System
What is MapReduce?
MapReduce is a programming model Google has used successfully in processing its “big-data” sets (~20 petabytes per day)
A map function extracts some intelligence from raw data.
A reduce function aggregates the data output by the map according to user-specified criteria.
Users specify the computation in terms of a map and a reduce function,
Underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
Underlying system also handles machine failures, efficient communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107–113.
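The map/reduce division of labor described above can be sketched in a few lines. The classic example is word count: the map function emits (word, 1) pairs from raw text, and the reduce function sums the counts for each word. This is a single-machine sketch only; a real framework such as Hadoop would shuffle the intermediate pairs across a cluster and handle machine failures, as the slides note.

```python
# Single-machine sketch of the map/reduce word-count example.
# A real framework distributes the shuffle step; here we group in memory.
from collections import defaultdict

def map_fn(document):
    """Map: extract (word, 1) pairs from a raw line of text."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_fn(word, counts):
    """Reduce: aggregate all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(documents):
    # Map phase: apply map_fn to every input record.
    intermediate = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            intermediate[word].append(count)
    # Reduce phase: aggregate the grouped values per key.
    return dict(reduce_fn(w, c) for w, c in intermediate.items())

counts = map_reduce(["the cloud", "the grid and the cloud"])
print(counts)
```

The user writes only map_fn and reduce_fn; as the slides say, the underlying runtime parallelizes the rest across the cluster.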