25-10-2016, 02:17 PM
Big Data Everywhere!
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/Credit Card transactions
Social Network
How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
The Earthscope
The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msnid/44363598/ns/techn...metOdQ--uI)
Types of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
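The single-scan constraint can be illustrated with a one-pass computation. The sketch below uses Welford's online update to obtain the mean and sample variance without storing the stream; the function name `running_mean_variance` and the sample readings are made up for illustration.

```python
def running_mean_variance(stream):
    """Consume an iterable exactly once; return (count, mean, sample variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# An iterator can be read only once, which mimics a stream.
readings = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, mean, variance = running_mean_variance(readings)
print(count, mean, variance)
```

Only three numbers are kept in memory regardless of how long the stream is.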
What to do with these data?
Aggregation and Statistics
Data warehouse and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
Statistics 101
Random Sample and Statistics
Population: the set or universe of all entities under study.
However, looking at the entire population may not be feasible, or may be too expensive.
Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.
Statistic
Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \dots, S_n) \to \mathbb{R}$.
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
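As a small illustration of these definitions, the sketch below treats the sample mean as an estimator of the population mean. The population is synthetic (normal with mean 50 and standard deviation 10), and the sizes and seed are arbitrary assumptions chosen only to make the example reproducible.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic "population"; in practice it is too large or costly to scan in full.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Draw a random sample without replacement and apply the statistic to it.
sample = random.sample(population, 500)
point_estimate = statistics.fmean(sample)      # value of the estimator on this sample
population_mean = statistics.fmean(population)  # the parameter being estimated

print(f"point estimate:  {point_estimate:.2f}")
print(f"population mean: {population_mean:.2f}")
```

Here the function "take the mean of the sample" is the estimator, and the number it returns for this particular sample is the point estimate.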
What is Data Mining?
Discovery of useful, possibly unexpected, patterns in data
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Collaborative Filtering [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
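The training/test workflow can be sketched in a few lines. The example below is only an illustration: the records are synthetic two-attribute points with class labels "A" and "B", the 70/30 split is an arbitrary choice, and a 1-nearest-neighbor rule stands in for whatever model is actually used.

```python
import random


def nearest_neighbor_predict(train, point):
    """Return the class of the training record closest to `point`."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    _, label = min(train, key=lambda record: squared_distance(record[0], point))
    return label


random.seed(1)

# Synthetic records: (attributes, class). Class A clusters near (0, 0), B near (5, 5).
records = []
for _ in range(100):
    records.append(((random.gauss(0, 1), random.gauss(0, 1)), "A"))
    records.append(((random.gauss(5, 1), random.gauss(5, 1)), "B"))

# Divide the data set: 70% to build the model, 30% held out to validate it.
random.shuffle(records)
split = int(0.7 * len(records))
train_set, test_set = records[:split], records[split:]

# Accuracy is measured only on records the model has not seen.
correct = sum(nearest_neighbor_predict(train_set, x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(f"test accuracy: {accuracy:.2f}")
```

Measuring accuracy on the held-out records, rather than on the training set, is what gives an estimate of performance on previously unseen data.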
Collaborative Filtering
Goal: predict what movies/books/… a person may be interested in, on the basis of
Past preferences of the person
Other people with similar past preferences
The preferences of such people for a new movie/book/…
One approach based on repeated clustering
Cluster people on the basis of preferences for movies
Then cluster movies on the basis of being liked by the same clusters of people
Again cluster people based on their preferences for (the newly created clusters of) movies
Repeat the above until equilibrium
The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
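A minimal sketch of the idea follows. Note that it does not implement the repeated-clustering approach described above; it uses a simpler neighborhood method in which a missing rating is predicted as a similarity-weighted average of other people's ratings. The users, movies, and ratings are invented for illustration.

```python
import math

# Hypothetical ratings on a 1-5 scale; "ann" has not rated "Dune".
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "cara": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both people have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item):
    """Weight each other person's rating of `item` by similarity to `user`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


prediction = predict_rating("ann", "Dune")
print(prediction)
```

Because ann's past preferences resemble bob's far more than cara's, the prediction is pulled toward bob's rating of the new movie.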
Other Types of Mining
Text mining: application of data mining to textual documents
cluster Web pages to find related pages
cluster pages a user has visited to organize their visit history
classify Web pages automatically into a Web directory
Graph Mining:
Deals with graph data
Data Streams
What are Data Streams?
Continuous streams
Huge, Fast, and Changing
Why Data Streams?
The arrival rate of streams and the sheer volume of data exceed our capacity to store them.
“Real-time” processing
Window Models
Landmark window (Entire Data Stream)
Sliding Window
Damped Window
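The three window models can be compared by computing a running mean under each. This is a rough sketch under stated assumptions: the function names are made up, the sliding window is taken to hold the last `size` items, and the damped window is modeled as exponential decay with factor `decay`, which is one common interpretation.

```python
from collections import deque


def entire_stream_means(stream):
    """Window over the entire stream: every item seen so far counts equally."""
    total = 0.0
    for count, x in enumerate(stream, start=1):
        total += x
        yield total / count


def sliding_window_means(stream, size):
    """Sliding window: only the most recent `size` items count."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for x in stream:
        window.append(x)
        yield sum(window) / len(window)


def damped_window_means(stream, decay):
    """Damped window: older items count less, by a factor `decay` per step."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x in stream:
        weighted_sum = decay * weighted_sum + x
        weight_total = decay * weight_total + 1.0
        yield weighted_sum / weight_total


values = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
entire = list(entire_stream_means(values))
sliding = list(sliding_window_means(values, size=3))
damped = list(damped_window_means(values, decay=0.5))
print(entire[-1], sliding[-1], damped[-1])
```

After the stream jumps from 1 to 10, the sliding window forgets the old values completely once they leave the window, the damped window forgets them gradually, and the whole-stream mean never forgets them.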
Mining Data Streams