How do I determine the optimal number of clusters in a K-means cluster analysis?

2013-03-17

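I am running a K-means cluster analysis on a large sample. I tried solutions with 2 to 6 clusters, but I am not sure which number of clusters is best.
Many people say to use the significance test from an analysis of variance (ANOVA).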

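However, I came across a paper that uses a different approach, a kappa agreement test, and I do not understand how to carry it out. I hope an expert can explain it!
PS: Roughly, the paper randomly splits the sample into two equal halves, A and B, then runs K-means on A, which seems to produce some kind of distance;
it then runs K-means on B using that distance, and also runs K-means on B directly. The two resulting classifications of B are compared with a kappa agreement test.
Finally, the kappa coefficients for the 2- to 6-cluster solutions are compared, and the number of clusters with the highest kappa is taken as optimal.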

All replies

602dxz

2013-3-17 18:59:03

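Choose whichever solution makes sense in terms of theory and experience. This is an answer you have to give yourself; software cannot give you the best answer. Statistical tests are at most a reference. People usually run variance tests between and within clusters, as well as a discriminant analysis check, but the results generally do not differ much.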
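The between-cluster versus within-cluster variance check mentioned above comes down to a one-way ANOVA F ratio computed per variable, with the cluster labels as the grouping factor. Below is a minimal sketch with NumPy; `anova_f` is a hypothetical helper name and the six values are made up for illustration. Because K-means builds clusters to be as far apart as possible, a large F here is better read as descriptive than as a strict significance test.

```python
import numpy as np


def anova_f(values, labels):
    """One-way ANOVA F ratio for one variable, grouped by cluster label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    n, k = len(values), len(groups)
    grand_mean = values.mean()
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the cases around their own cluster mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Made-up example: one variable, two clusters of three cases each.
    x = [1, 2, 3, 7, 8, 9]
    cluster = [0, 0, 0, 1, 1, 1]
    print(anova_f(x, cluster))  # (54 / 1) / (4 / 4) = 54.0
```

In practice the labels would come from each K-means run (2 to 6 clusters), and the F ratio would be computed for every clustering variable.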

602dxz

2013-3-17 20:23:05

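How many clusters to use depends mainly on your own experience, intuition, and theory. Statistical tests can only give you a rough reference; the most commonly used are between-cluster versus within-cluster variance tests and discriminant analysis checks. Whatever number of clusters you choose, it is enough that you can justify it coherently. There is no method that determines the optimal number of K-means clusters purely by quantitative and statistical means.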

mayshen008

2013-3-17 20:47:36

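602dxz posted on 2013-3-17 20:23
How many clusters to use depends mainly on your own experience, intuition, and theory. Statistical tests can only give you a rough reference; the most commonly used are between-cluster and within-cluster ...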

I still have questions. Let me copy the relevant passage from the paper for you; could you please help explain it again:
“Next, the raw data consisting of 396 cases was randomly split into two data sets, A and B, each containing 198 cases. The K-means cluster procedure was administrated with the two sets of data.
With the possible cluster solution n (n=2,3...5,or 6), Data A were utilized to generate the distances between initial clusters by the K-means procedure.
The distance generated then was used with Data B computed by K-means analysis. Data B were computed in an unconstrained manner using the same procedure that was used for Data A.
Then a constrained computation using the cluster distances acquired in Data A was determined.
This procedure essentially provided a cross-validation for Data B. For a given n, the constrained solution clustered the cases in Data B according to the cluster distance generated from Data A, while the unconstrained solution was free of restrictions. Accordingly, Kappa co-efficiencies(the chance corrected coefficients of agreement) were calculated for the two solutions of Data B cases.
For each n, the optional n with the maximal Kappa was chosen as candidate N for the entire data for the final cluster analysis .”
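
One possible reading of the quoted procedure, sketched in Python with NumPy only. This is an interpretation, not the paper's actual implementation. It assumes that the "distances" generated from Data A are the cluster centers, that the constrained solution assigns each case in Data B to its nearest Data A center, and that the cluster labels of the two solutions are matched by the relabeling with the highest raw agreement before kappa is computed. All function names (`assign`, `kmeans`, `aligned_kappa`, `split_half_kappa`, `demo_data`) and the synthetic data are hypothetical.

```python
import itertools
import numpy as np


def assign(X, centers):
    """Label each row of X with the index of its nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, k, rng, n_init=10, n_iter=100):
    """Lloyd's algorithm with random restarts; returns (centers, labels)."""
    best = None
    for _restart in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            labels = assign(X, centers)
            # Recompute each center; keep the old one if a cluster is empty.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        labels = assign(X, centers)
        inertia = ((X - centers[labels]) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, centers, labels)
    return best[1], best[2]


def aligned_kappa(a, b, k):
    """Cohen's kappa between labelings a and b after matching label names."""
    table = np.zeros((k, k))
    np.add.at(table, (a, b), 1)  # k x k cross-tabulation of the two labelings
    table /= table.sum()
    rows, cols = table.sum(axis=1), table.sum(axis=0)
    idx = np.arange(k)
    # Cluster numbering is arbitrary, so pick the relabeling of b that
    # agrees with a most often (brute force is fine for k <= 6).
    perm = list(max(itertools.permutations(range(k)),
                    key=lambda p: table[idx, list(p)].sum()))
    p_obs = table[idx, perm].sum()          # observed agreement
    p_exp = (rows * cols[perm]).sum()       # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


def split_half_kappa(X, k, seed=0):
    """Kappa between the constrained and unconstrained solutions for half B."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    half = len(X) // 2
    A, B = X[order[:half]], X[order[half:]]
    centers_A, _ = kmeans(A, k, rng)
    constrained = assign(B, centers_A)      # B classified with A's centers
    _, unconstrained = kmeans(B, k, rng)    # B clustered on its own
    return aligned_kappa(constrained, unconstrained, k)


def demo_data(seed=1):
    """Synthetic stand-in data: 396 cases drawn from three separated groups."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
    return np.vstack([rng.normal(m, 1.0, size=(132, 2)) for m in means])


if __name__ == "__main__":
    X = demo_data()
    for k in range(2, 7):
        print(k, round(split_half_kappa(X, k), 3))
```

The candidate number of clusters would be the k with the highest kappa. With real data, the variables would normally be standardized first, and since a single random split can be noisy, repeating the split with several seeds may give a steadier picture.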