25-08-2017, 09:32 PM
Clustering
What is clustering?
A way of grouping together data samples that are similar in some way, according to criteria that you pick
A form of unsupervised learning – you generally don’t have examples demonstrating how the data should be grouped together
So, it’s a method of data exploration – a way of looking for patterns or structure in the data that are of interest
Why cluster?
Cluster genes = rows
Measure expression at multiple time-points, different conditions, etc.
Similar expression patterns may suggest similar functions of genes
Cluster samples = columns
Expression levels of thousands of genes for each tumor sample
Similar expression patterns may suggest biological relationship among samples
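A small sketch of the rows-versus-columns distinction, assuming the data is an expression matrix with one row per gene and one column per sample. The matrix values and the names `expression`, `gene_profiles` and `sample_profiles` are made up for illustration:

```python
# Hypothetical toy expression matrix: 3 genes (rows) x 4 samples (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],  # gene_a
    [1.1, 2.1, 2.9, 4.2],  # gene_b
    [5.0, 1.0, 0.5, 0.1],  # gene_c
]

def transpose(matrix):
    """Swap rows and columns so that each sample becomes a row."""
    return [list(column) for column in zip(*matrix)]

# Clustering genes: each profile is one row of the matrix.
gene_profiles = expression

# Clustering samples: each profile is one column of the matrix.
sample_profiles = transpose(expression)
```

Either list of profiles can then be handed to the same clustering procedure; only the orientation of the matrix changes.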
Choosing (dis)similarity measures – a critical step in clustering
Recall that the goal is to group together “similar” data – but what does this mean?
No single answer – it depends on what we want to find or emphasize in the data; this is one reason why clustering is an “art”
The similarity measure is often more important than the clustering algorithm used – don’t overlook this choice!
(Dis)similarity measures
Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures (I’ll give an example of how to convert between them in a few slides…)
Jagota defines a dissimilarity measure as a function f(x,y) such that f(x,y) > f(w,z) if and only if x is less similar to y than w is to z
This is always a pair-wise measure
Think of x, y, w, and z as gene expression profiles (rows or columns)
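To make the pair-wise definition concrete, here is a minimal sketch of two dissimilarity functions on expression profiles. Euclidean distance is already a dissimilarity. For correlation, which is a similarity, the sketch uses 1 - r, one common conversion; this is an assumption for illustration and not necessarily the conversion the slides have in mind. The function names are hypothetical:

```python
import math

def euclidean(x, y):
    """Euclidean distance: larger values mean the profiles are less similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient (a similarity, from -1 to 1).

    Assumes neither profile is constant; a constant profile would
    cause a division by zero here.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_dissimilarity(x, y):
    """Assumed conversion from similarity to dissimilarity: 1 - r."""
    return 1.0 - pearson(x, y)

# Toy profiles: y rises with x, z falls as w rises.
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, z = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
print(correlation_dissimilarity(x, y), correlation_dissimilarity(w, z))
```

Note that the two measures can disagree: x and y above are perfectly correlated but not close in Euclidean terms, which is why the choice of measure matters.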
Missing Values
A common problem with microarray data
One approach with Euclidean distance or Pearson linear correlation (PLC) is simply to ignore missing values (i.e., pretend the data has fewer dimensions)
There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values – better to use these if possible
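The simple "ignore missing values" approach can be sketched as follows, assuming missing measurements are stored as `None`. The function name `euclidean_ignoring_missing` is hypothetical, and this sketch does not attempt the more sophisticated estimation approaches:

```python
import math

def euclidean_ignoring_missing(x, y):
    """Euclidean distance over only the dimensions observed in both profiles.

    Missing values are assumed to be represented as None.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no dimension is observed in both profiles")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

# The second dimension is dropped because it is missing in the first profile.
distance = euclidean_ignoring_missing([0.0, None, 0.0], [3.0, 7.0, 4.0])
```

One caveat with this shortcut: distances computed over different numbers of dimensions are not directly comparable, so pairs with many missing values tend to look closer than they are.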
K-means Clustering Issues
Random initialization means that you may get different clusters each time
Data points are assigned to only one cluster (hard assignment)
Implicit assumptions about the “shapes” of clusters
You have to pick the number of clusters…
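A minimal k-means sketch that makes these issues visible, assuming 2-D points and Euclidean distance. The function names, the toy `points` and the `seed` and `iterations` parameters are all hypothetical choices for illustration, not a reference implementation:

```python
import math
import random

def assign(points, centroids):
    """Hard assignment: each point gets the index of its single nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, seed, iterations=20):
    """Plain k-means; k must be chosen by the caller."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centroids), centroids

# Toy data: two visually separate groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for seed in (0, 1, 2):
    labels, _ = kmeans(points, k=2, seed=seed)
    print(seed, labels)
```

Different seeds start from different centroids, so the cluster numbering (and on harder data the clusters themselves) may change between runs. Assigning each point to its nearest centroid is also where the implicit assumption about cluster shape comes from: it favors compact, roughly round groups.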