Machine Learning - XIII. Clustering (Week 8)
http://blog.csdn.net/pipisorry/article/details/44677097
Study notes on Andrew Ng's Machine Learning course
Clustering
{an unsupervised learning setting, where we learn from unlabeled data instead of labeled data}
Unsupervised Learning: Introduction
A typical supervised problem: given a labeled training set, the goal is to find the decision boundary that separates the positive examples from the negative examples.
The supervised learning problem, in this case, is: given a set of labels, fit a hypothesis to it.
Unsupervised learning: give an unlabeled training set to an algorithm and simply ask the algorithm to find some structure in the data for us.
One type of structure we might have an algorithm find is that the points group into two separate clusters; an algorithm that finds clusters like the ones circled is called a clustering algorithm.
Applications of clustering algorithms
Social network analysis: using information about who you email most frequently, and who those people in turn email most frequently, to find coherent groups of people. For example, you might want to find coherent groups of friends in a social network.
K-Means Algorithm (a clustering algorithm)
In the clustering problem we are given an unlabeled data set, and we would like an algorithm to automatically group the data into coherent subsets, or coherent clusters, for us.
Graphical walkthrough of the K-means steps
Randomly initialize two points, called the cluster centroids.
Note: two cluster centroids, because we want to group the data into two clusters.
K-means is an iterative algorithm that does two things: first a cluster assignment step, and second a move-centroid step.
Cluster assignment step: go through each of the examples and, depending on whether it is closer to the red cluster centroid or the blue cluster centroid, assign each data point to one of the two cluster centroids.
Move centroid step: move each cluster centroid to the average of the points colored the same colour (i.e. the points assigned to it).
The effect of iterating these two steps:
Converged: if you keep running additional iterations of K-means, the cluster centroids will not change any further, and the colours of the points will not change any further. At this point K-means has converged, and it has done a pretty good job of finding the two clusters in this data.
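The two alternating steps above can be sketched in NumPy. This is a minimal illustration, assuming squared Euclidean distance as in the lecture; the function and variable names (`kmeans`, `n_iters`, `seed`) are my own, not from the course, and the empty-cluster case is not handled here (it is discussed separately below).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate the cluster assignment step and the
    move-centroid step until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K cluster centroids as K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: c[i] = index of the centroid closest
        # to x[i], measured by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: each centroid moves to the mean of the
        # points currently assigned to it (assumes no cluster is empty).
        new_centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):  # converged: nothing moved
            break
        centroids = new_centroids
    return c, centroids
```

At convergence, each returned centroid is exactly the mean of the points assigned to it, which is the invariant the move-centroid step enforces.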
The K-means algorithm
Note:
1. For the unsupervised learning version of K-means, the convention is that x^(i) is an n-dimensional vector (x^(i) ∈ R^n) rather than an (n+1)-dimensional vector; we drop the x0 = 1 convention.
2. Cluster assignment step: find the index k that minimizes the distance between example x^(i) and cluster centroid μ_k, and set it as c^(i); by convention, the distance between x^(i) and the cluster centroid is written as the squared distance.
3. What if there is a cluster centroid with zero points assigned to it? (This can happen, although in practice it is not that common.)
You can simply eliminate that cluster centroid, ending up with K-1 clusters;
or, if you really need K clusters, just randomly reinitialize that cluster centroid.
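The empty-cluster handling above can be sketched as a guarded move-centroid step. The setup data and names here are illustrative, not from the course: the second centroid is deliberately placed far from all points so that it receives zero assignments and gets reinitialized to a random data point (the lecture's second option; dropping it to get K-1 clusters would also be valid).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
centroids = np.array([[0.0, 0.0], [100.0, 100.0]])  # second centroid far away

# Cluster assignment step: every point lands in cluster 0.
dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
c = dists.argmin(axis=1)

# Move-centroid step with a guard for empty clusters.
for k in range(len(centroids)):
    members = X[c == k]
    if len(members) == 0:
        # Empty cluster: reinitialize to a random data point instead of
        # averaging zero points (which would produce NaN).
        centroids[k] = X[rng.integers(len(X))]
    else:
        centroids[k] = members.mean(axis=0)
```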
K-means for non-separated clusters
K-means is also applied to data sets that do not consist of several well-separated clusters. An example is market segmentation: given examples of different people's heights and weights, design t-shirts in three sizes, small, medium, and large.
Note: even though the data beforehand didn't seem to have 3 well-separated clusters, K-means will still separate the data into multiple clusters.
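The t-shirt example can be sketched on synthetic data. The heights/weights below are made up for illustration and are not from the course; K-means with K=3 still partitions the non-separated cloud, and ordering the resulting centroids by mean height gives a rough small/medium/large segmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic market-segmentation data: one unseparated cloud of people.
heights = rng.normal(170, 10, 300)   # cm
weights = rng.normal(70, 12, 300)    # kg
X = np.column_stack([heights, weights])

K = 3
centroids = X[rng.choice(len(X), K, replace=False)]
for _ in range(50):
    # Cluster assignment step, then move-centroid step (with a guard that
    # leaves a centroid in place if its cluster ever empties out).
    c = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    centroids = np.array([X[c == k].mean(0) if (c == k).any() else centroids[k]
                          for k in range(K)])

# Order clusters by mean height: roughly "small / medium / large" sizes.
order = centroids[:, 0].argsort()
```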
Optimization Objective
Most supervised learning algorithms have an optimization objective, i.e. some cost function that the algorithm tries to minimize.
Knowing the optimization objective of K-means is useful to us for two purposes:
first, it helps us debug the learning algorithm and make sure that K-means is running correctly;
second, we can use it to help K-means find better clusters and avoid local optima.
Variable definitions:
Optimization objective
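The slide content is not reproduced here, but the variable definitions and the objective can be restated in the course's standard notation (reconstructed from the lecture, not copied from the original figure):

```latex
% c^{(i)}       : index of the cluster to which example x^{(i)} is currently assigned
% \mu_k         : cluster centroid k, with \mu_k \in \mathbb{R}^n
% \mu_{c^{(i)}} : centroid of the cluster to which x^{(i)} has been assigned
J\bigl(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\bigr)
  = \frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)} - \mu_{c^{(i)}} \bigr\rVert^2
\qquad
\min_{\substack{c^{(1)},\dots,c^{(m)} \\ \mu_1,\dots,\mu_K}}
  J\bigl(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\bigr)
```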
Note: This cost function is sometimes also called the distortion function, or the distortion of the K-means algorithm.
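The distortion is straightforward to compute given an assignment and a set of centroids. A minimal sketch; the function name `distortion` and argument names are my own, not from the lecture:

```python
import numpy as np

def distortion(X, c, centroids):
    """Distortion (cost) J: the mean squared distance from each example
    x^(i) to the centroid mu_{c^(i)} of the cluster it is assigned to."""
    return ((X - centroids[c]) ** 2).sum(axis=1).mean()
```

For example, two points at (0, 0) and (2, 0) both assigned to a single centroid at (1, 0) each sit at squared distance 1, so J = 1.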
Analysis of the optimization objective within the K-means algorithm
Note:
1. The cluster assignment step doesn't change the cluster centroids; it picks the values of c^(1), c^(2), up to c^(m) that minimize the cost function (the distortion function) J. It is possible to prove that mathematically.
2. The second step of K-means, again without proof, can be shown mathematically to choose the values of μ that minimize J: that is what the move-centroid step does.
3. So K-means takes the two sets of variables and partitions them into two halves:
it first minimizes J with respect to the variables c, then minimizes J with respect to the variables μ, and then it keeps on iterating.
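This alternating-minimization view implies that neither half-step can increase J, so the distortion traced over the iterations should be non-increasing; this is exactly the debugging check the objective makes possible. The sketch below verifies it numerically on synthetic data (data, seeds, and names are illustrative, not from the course):

```python
import numpy as np

def distortion(X, c, mu):
    # Mean squared distance from each point to its assigned centroid.
    return ((X - mu[c]) ** 2).sum(axis=1).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = 3
mu = X[rng.choice(len(X), K, replace=False)]
c = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)

# Record J after every half-step: the move-centroid step minimizes J over
# mu with c fixed, and the assignment step minimizes J over c with mu fixed,
# so the recorded sequence should never go up.
history = [distortion(X, c, mu)]
for _ in range(10):
    mu = np.array([X[c == k].mean(0) if (c == k).any() else mu[k]
                   for k in range(K)])                       # minimize over mu
    history.append(distortion(X, c, mu))
    c = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)     # minimize over c
    history.append(distortion(X, c, mu))
```

If the distortion ever increases in a real run, that signals a bug in the implementation.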
Random Initialization
Choosing the Number of Clusters