Bisecting k-means¾ÛÀàËã·¨£¬¼´¶þ·Ök¾ùÖµËã·¨£¬ËüÊÇk-means¾ÛÀàËã·¨µÄÒ»¸ö±äÌ壬Ö÷ÒªÊÇΪÁ˸Ľøk-meansËã·¨Ëæ»úÑ¡Ôñ³õʼÖÊÐĵÄËæ»úÐÔÔì³É¾ÛÀà½á¹û²»È·¶¨ÐÔµÄÎÊÌ⣬¶øBisecting
k-meansËã·¨ÊÜËæ»úÑ¡Ôñ³õʼÖÊÐĵÄÓ°Ïì±È½ÏС¡£
Ê×ÏÈ£¬ÎÒÃÇ¿¼ÂÇÔÚÅ·¼¸ÀïµÂ¿Õ¼äÖУ¬ºâÁ¿´ØµÄÖÊÁ¿Í¨³£Ê¹ÓÃÈç϶ÈÁ¿£ºÎó²îƽ·½ºÍ£¨Sum
of the Squared Error£¬¼ò³ÆSSE£©£¬Ò²¾ÍÊÇÒª¼ÆËãÖ´ÐоÛÀà·ÖÎöºó£¬¶Ôÿ¸öµã¶¼Òª¼ÆËãÒ»¸öÎó²îÖµ£¬¼´·ÇÖÊÐĵ㵽×î½üµÄÖÊÐĵľàÀë¡£ÄÇô£¬¼ÈȻÿ¸ö·ÇÖÊÐĵ㶼ÒѾÊôÓÚij¸ö´Ø£¬Ò²¾ÍÊÇÒª¼ÆËãÿ¸ö·ÇÖÊÐĵ㵽ÆäËùÔڴصÄÖÊÐĵľàÀ룬×îºó½«ÕâЩ¾àÀëÖµÏà¼ÓÇóºÍ£¬×÷ΪSSEÈ¥ÆÀ¹ÀÒ»¸ö¾ÛÀàµÄÖÊÁ¿ÈçºÎ¡£ÎÒÃǵÄ×îÖÕÄ¿±êÊÇ£¬Ê¹µÃ×îÖÕµÄSSEÄܹ»×îС£¬Ò²¾ÍÊÇÒ»¸ö×îС»¯Ä¿±êSSEµÄÎÊÌâ¡£ÔÚnάŷ¼¸ÀïµÂ¿Õ¼ä£¬SSEÐÎʽ»¯µØ¶¨Ò壬¼ÆË㹫ʽÈçÏ£º

Bisecting k-means¾ÛÀàËã·¨µÄ»ù±¾Ë¼ÏëÊÇ£¬Í¨¹ýÒýÈë¾Ö²¿¶þ·ÖÊÔÑ飬ÿ´ÎÊÔÑ鶼ͨ¹ý¶þ·Ö¾ßÓÐ×î´óSSEÖµµÄÒ»¸ö´Ø£¬¶þ·ÖÕâ¸ö´ØÒÔºóµÃµ½µÄ2¸ö×Ó´Ø£¬Ñ¡Ôñ2¸ö×ӴصÄ×ÜSSE×îСµÄ»®·Ö·½·¨£¬ÕâÑùÄܹ»±£Ö¤Ã¿´Î¶þ·ÖµÃµ½µÄ2¸ö´ØÊDZȽÏÓŵģ¨Ò²¿ÉÄÜÊÇ×îÓŵģ©£¬Ò²¾ÍÊÇÕâ2¸ö´ØµÄ»®·Ö¿ÉÄÜÊǾֲ¿×îÓŵģ¬È¡¾öÓÚÊÔÑéµÄ´ÎÊý¡£
Bisecting k-means¾ÛÀàËã·¨µÄ¾ßÌåÖ´Ðйý³Ì£¬ÃèÊöÈçÏÂËùʾ£º
1¡¢³õʼʱ£¬½«´ý¾ÛÀàÊý¾Ý¼¯D×÷Ϊһ¸ö´ØC0£¬¼´C={C0}£¬ÊäÈë²ÎÊýΪ£º¶þ·ÖÊÔÑé´ÎÊým¡¢k-means¾ÛÀàµÄ»ù±¾²ÎÊý£»
2¡¢È¡CÖоßÓÐ×î´óSSEµÄ´ØCp£¬½øÐжþ·ÖÊÔÑém´Î£ºµ÷ÓÃk-means¾ÛÀàËã·¨£¬È¡k=2£¬½«Cp·ÖΪ2¸ö´Ø£ºCi1¡¢Ci2£¬Ò»¹²µÃµ½m¸ö¶þ·Ö½á¹û¼¯ºÏB={B1,B2,¡,Bm}£¬ÆäÖУ¬Bi={Ci1,Ci2}£¬ÕâÀïCi1ºÍCi2Ϊÿһ´Î¶þ·ÖÊÔÑéµÃµ½µÄ2¸ö´Ø£»
3¡¢¼ÆËãÉÏÒ»²½¶þ·Ö½á¹û¼¯ºÏBÖУ¬Ã¿Ò»¸ö»®·Ö·½·¨µÃµ½µÄ2¸ö´ØµÄ×ÜSSEÖµ£¬Ñ¡Ôñ¾ßÓÐ×îС×ÜSSEµÄ¶þ·Ö·½·¨µÃµ½µÄ½á¹û£ºBj={Cj1,Cj2}£¬²¢½«´ØCj1¡¢Cj2¼ÓÈëµ½¼¯ºÏC£¬²¢½«Cp´ÓCÖÐÒÆ³ý£»
4¡¢Öظ´²½Öè2ºÍ3£¬Ö±µ½µÃµ½k¸ö´Ø£¬¼´¼¯ºÏCÖÐÓÐk¸ö´Ø¡£
¾ÛÀàË㷨ʵÏÖ
»ùÓÚÉÏÃæÃèÊöµÄ¾ÛÀàÖ´Ðйý³Ì£¬Ê¹ÓÃJavaʵÏÖBisecting k-means¾ÛÀ࣬´úÂëÈçÏÂËùʾ£º
@Override public void clustering() { // parse sample files final List<Point2D> allPoints = Lists.newArrayList(); FileUtils.read2DPointsFromFiles(allPoints, "[\t,;\\s]+", inputFiles);
// ´ÓÎļþÖжÁÈ¡¶þÎ¬×ø±êµã£¬¼ÓÈëµ½¼¯ºÏallPointsÖÐ final int bisectingK = 2; int bisectingIterations = 0; int maxInterations = 20; List<Point2D> points = allPoints; final Map<CenterPoint,
Set<ClusterPoint<Point2D>>> clusteringPoints = Maps.newConcurrentMap();
// ×îÖյľÛÀà½á¹û¼¯ºÏ while(clusteringPoints.size() <= k) { // µ±µÃµ½k¸ö´Ø£¬ÔòËã·¨ÖÕÖ¹ LOG.info("Start bisecting iterations: #" + (++bisectingIterations) + ",
bisectingK=" + bisectingK + ",maxMovingPointRate=" + maxMovingPointRate + ", maxInterations=" + maxInterations + ", parallism=" + parallism); // for k=bisectingK, execute k-means clustering // bisecting trials KMeansClustering bestBisectingKmeans = null; double minTotalSSE = Double.MAX_VALUE; for (int i = 0; i < m; i++) {
// Ö´Ðжþ·ÖÊÔÑ飺µ÷ÓÃk-means¾ÛÀàËã·¨£¬½«ÊäÈëµÄµã¼¯½øÐжþ·Ö£¬µÃµ½2¸ö´Ø£¬ÊÔÑéÖ´ÐÐm´Î final KMeansClustering kmeans = new KMeansClustering(bisectingK, maxMovingPointRate,
maxInterations, parallism); kmeans.initialize(points); // the clustering result should have 2 clusters kmeans.clustering(); double currentTotalSSE = computeTotalSSE(kmeans.getCenterPointSet(),
kmeans.getClusteringResult()); // ¼ÆËãÒ»´Î¶þ·ÖÊÔÑéÖÐ×ܵÄSSEµÄÖµ if(bestBisectingKmeans == null) { bestBisectingKmeans = kmeans; minTotalSSE = currentTotalSSE; } else { if(currentTotalSSE < minTotalSSE) {
// ¼Ç¼×ÜSSE×îСµÄ¶þ·Ö¾ÛÀ࣬ͨ¹ýkmeans±£´æ¶þ·Ö½á¹û bestBisectingKmeans = kmeans; minTotalSSE = currentTotalSSE; } } LOG.info("Bisecting trial <<" + i + ">> : minTotalSSE=" + minTotalSSE + ",
currentTotalSSE=" + currentTotalSSE); } LOG.info("Best biscting: minTotalSSE=" + minTotalSSE); // merge cluster points for choosing cluster bisected again int id = generateNewClusterId(clusteringPoints.keySet());
// ÿ´ÎÖ´ÐÐk-means¾ÛÀ࣬¶¼¶àÁËÒ»¸ö´Ø£¬Îª¶à³öµÄÕâ¸ö´Ø·ÖÅäÒ»¸ö±àºÅ Set<CenterPoint> bisectedCentroids = bestBisectingKmeans.getCenterPointSet();
// ¶þ·ÖµÃµ½µÄ2¸ö´ØµÄÖÊÐĵļ¯ºÏ merge(clusteringPoints, id, bisectedCentroids,
bestBisectingKmeans.getClusteringResult().getClusteredPoints());
// ½«¶þ·ÖµÃµ½µÄ2¸ö´ØµÄ¼¯ºÏ£¬ºÏ²¢¼ÓÈëµ½×îÖÕ½á¹ûµÄ¼¯ºÏÖÐ if(clusteringPoints.size() == k) { // ÒѾµÃµ½k¸ö´Ø£¬Ëã·¨ÖÕÖ¹ break; } // compute cluster to be bisected ClusterInfo cluster = chooseClusterToBisect(clusteringPoints); // remove centroid from collected clusters map clusteringPoints.remove(cluster.centroidToBisect); LOG.info("Cluster to be bisected: " + cluster); points = Lists.newArrayList(); for(ClusterPoint<Point2D> cp : cluster.clusterPointsToBisect) { points.add(cp.getPoint()); } LOG.info("Finish bisecting iterations: #" + bisectingIterations + ",
clusterSize=" + clusteringPoints.size()); } // finally transform to result format Iterator<Entry<CenterPoint,
Set<ClusterPoint<Point2D>>>> iter = clusteringPoints.entrySet().iterator(); while(iter.hasNext()) { // ¹¹Ôì×îÖÕÊä³ö½á¹ûµÄÊý¾Ý½á¹¹ Entry<CenterPoint, Set<ClusterPoint<Point2D>>> entry = iter.next(); clusteredPoints.put(entry.getKey().getId(), entry.getValue()); centroidSet.add(entry.getKey()); } } |
ÉÏÃæ£¬ÎÒÃǵ÷ÓÃchooseClusterToBisect·½·¨ÇøÊµÏÖ¶Ô¾ßÓÐ×î´óµÄSSEµÄ´Ø½øÐжþ·Ö£¬¾ßÌåÑ¡ÔñµÄʵÏÖ¹ý³Ì£¬´úÂëÈçÏÂËùʾ£º
private ClusterInfo chooseClusterToBisect(Map<CenterPoint,
Set<ClusterPoint<Point2D>>> clusteringPoints) { double maxSSE = 0.0; int clusterIdWithMaxSSE = -1; CenterPoint centroidToBisect = null; Set<ClusterPoint<Point2D>> clusterToBisect = null; Iterator<Entry<CenterPoint,
Set<ClusterPoint<Point2D>>>> iter = clusteringPoints.entrySet().iterator(); while(iter.hasNext()) { Entry<CenterPoint, Set<ClusterPoint<Point2D>>> entry = iter.next(); CenterPoint centroid = entry.getKey(); Set<ClusterPoint<Point2D>> cpSet = entry.getValue(); double sse = computeSSE(centroid, cpSet); // ¼ÆËãÒ»¸ö´ØµÄSSEÖµ if(sse > maxSSE) { maxSSE = sse; clusterIdWithMaxSSE = centroid.getId(); centroidToBisect = centroid; clusterToBisect = cpSet; } } return new ClusterInfo(clusterIdWithMaxSSE, centroidToBisect, clusterToBisect, maxSSE);
// ½«´ý·ÖÁѵĴصÄÐÅÏ¢±£´æÔÚClusterInfo¶ÔÏóÖÐ }
private double computeSSE(CenterPoint centroid,
Set<ClusterPoint<Point2D>> cpSet)
{
// ¼ÆËãij¸ö´ØµÄSSE
double sse = 0.0;
for(ClusterPoint<Point2D> cp : cpSet) {
// update cluster id for ClusterPoint object
cp.setClusterId(centroid.getId());
double distance = MetricUtils.euclideanDistance(cp.getPoint(),
centroid);
sse += distance * distance;
}
return sse;
} |
ÔÚ¶þ·ÖÊÔÑé¹ý³ÌÖУ¬ÒòΪÿ´Î¶þ·Ö¶¼Éú³É2¸öеĴأ¬Í¨¹ý¼ÆËãÕâ2¸ö´ØµÄ×ÜSSEµÄÖµ£¬Í¨¹ýµü´ú¼ÆË㣬ÕÒµ½Ò»¸ö¾Ö²¿×îСµÄ×ÜSSE¶ÔÓ¦µÄ2¸ö´ØµÄ»®·Ö·½·¨£¬È»ºó½«¾ÛÀàÉú³ÉµÄ2´Ø¼ÓÈëµ½×îÖյĴؼ¯ºÏÖУ¬×ÜSSEµÄ¼ÆËã·¨·½·¨ÔÚcomputeTotalSSE·½·¨ÖУ¬ÈçÏÂËùʾ£º
¾ÛÀàЧ¹û¶Ô±È
ÏÂÃæ£¬ÎÒÃǶԱÈһϣ¬k-meansËã·¨ÓëBisecting k-means¶à´ÎÔËÐеľÛÀà½á¹û¶Ô±È£¬ÈçÏÂͼËùʾ£º

private double computeTotalSSE(Set<CenterPoint> centroids, ClusteringResult<Point2D> clusteringResult) { double sse = 0.0; for(CenterPoint center : centroids) { // ¼ÆËã2¸ö´ØµÄ×ÜSSEÖµ int clusterId = center.getId(); for(ClusterPoint<Point2D> p : clusteringResult.getClusteredPoints().get(clusterId)) { double distance = MetricUtils.euclideanDistance(p.getPoint(), center); sse += distance * distance; } } return sse; } |
ÉÏͼÖУ¬µÚÒ»ÅÅÊÇÖ´ÐÐk-means¾ÛÀàµÃµ½µÄЧ¹ûͼ£¬µÚ¶þÅÅÊÇÖ´ÐÐBisecting k-means¾ÛÀàµÃµ½µÄЧ¹ûͼ£¬ÆäÖбàºÅΪ9999µÄµãΪÖÊÐĵ㡣µÚ¶þÅÅЧ¹ûͼ¿´Ëƶà´Î¼ÆËãÖÊÐÄûÓб仯£¬µ«ÊÇʵ¼ÊÊDZ仯µÄ£¬Ö»ÊÇÖÊÐĵãµÄλÖñȽÏÎȶ¨£¬ÀýÈ磬ÏÂÃæÊÇÁ½×éÖÊÐĵ㣺
462.9642857142857,303.5,7 172.0,279.625,8 236.54285714285714,136.25714285714287,10 105.125,65.9375,11 75.91304347826087,185.7826086956522,12 56.03333333333333,299.53333333333336,13 273.5,117.5,14 415.5952380952381,68.71428571428571,15 329.04,308.68,16 374.35,200.55,17
172.0,279.625,5
462.9642857142857,303.5,8
105.125,65.9375,9
236.54285714285714,136.25714285714287,10
273.6666666666667,124.44444444444444,11
415.5952380952381,68.71428571428571,12
75.91304347826087,185.7826086956522,13
56.03333333333333,299.53333333333336,14
379.57894736842104,201.6315789473684,15
329.04,308.68,16 |
ͬk-meansËã·¨Ò»Ñù£¬Bisecting k-meansËã·¨²»ÊÊÓÃÓÚ·ÇÇòÐδصľÛÀ࣬¶øÇÒ²»Í¬³ß´çºÍÃܶȵÄÀàÐ͵Ĵأ¬Ò²²»Ì«Êʺϡ£ |