2017-04-18
GMM

Attachment: Spark GMM-master.zip (10.79 KB)



Gaussian Mixture Model Implementation in Pyspark

The GMM algorithm models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector, a covariance matrix and a mixture weight. The probability that each point belongs to each cluster is computed along with the cluster statistics.
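For reference, the model described above can be written out as follows. The symbols (K components, weights w_k, means \mu_k, covariances \Sigma_k, memberships \gamma_{ik}) are notation chosen here for illustration and do not come from the library itself.

```latex
% Mixture density: a weighted sum of K Gaussian components
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1

% Probability that point x_i belongs to component k (its "responsibility")
\gamma_{ik} = \frac{w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
```

Each row of the responsibility matrix sums to 1, and a hard cluster label for a point is typically taken as the component with the largest \gamma_{ik}.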

This distributed implementation of GMM in PySpark estimates the parameters using the Expectation-Maximization (EM) algorithm and considers only a diagonal covariance matrix for each component.
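To make the EM procedure concrete, here is a minimal single-machine NumPy sketch of EM for a diagonal-covariance mixture. It is not the library's distributed code: the function names, the random initialization, the small variance floor and the use of the log-likelihood change as the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def log_gaussian_diag(X, means, variances):
    """Log density of every row of X under every diagonal Gaussian; shape (n, k)."""
    diff = X[:, None, :] - means[None, :, :]                  # (n, k, d)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (k,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (n, k)
    return -0.5 * (log_det[None, :] + mahal)

def em_diag_gmm(X, k, n_iter=100, ct=1e-3, seed=0):
    """Illustrative EM loop; n_iter and ct mirror the parameter names in the post."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumed initialization: k distinct data points as means, global variance
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        log_p = np.log(weights)[None, :] + log_gaussian_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1)          # (n,)
        resp = np.exp(log_p - log_norm[:, None])               # rows sum to 1
        # M-step: re-estimate weights, means and per-dimension variances
        nk = resp.sum(axis=0) + 1e-12                          # guard against empty components
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, 1e-6)
        # Stop when the log-likelihood changes by less than ct
        ll = log_norm.sum()
        if abs(ll - prev_ll) < ct:
            break
        prev_ll = ll
    # resp holds the responsibilities from the last E-step
    return weights, means, variances, resp
```

Because the covariance is diagonal, only d variances per component are stored and updated instead of a full d-by-d matrix, which keeps the per-component statistics small enough to aggregate cheaply across partitions.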

How to Run

There are two ways to run this code.

  • Using the library in your Python program.

You can train the GMM model by invoking the function GMMModel.trainGMM(data, k, n_iter, ct), where

      data is an RDD of dense or sparse vectors,
      k is the number of components/clusters,
      n_iter is the number of iterations (default 100),
      ct is the convergence threshold (default 1e-3).

To use this library in your program, download GMMModel.py and GMMClustering.py and add them as Python files along with your own user code:

```
wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/GMMModel.py
wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/GMMClustering.py

./bin/spark-submit --master <master> --py-files GMMModel.py,GMMClustering.py \
    <your-program.py> <input_file> <num_of_clusters> \
    [--n_iter <num_of_iterations>] [--ct <convergence_threshold>]
```

The returned object "model" has the following attributes: **model.Means, model.Covars, model.Weights**. To get the cluster labels and the responsibility matrix (membership values):

    responsibility_matrix, cluster_labels = GMMModel.resultPredict(model, data)
  • Running the example GMM clustering script.

    If you'd like to run our example program directly, also download the PyGMM.py file and invoke it with spark-submit.


    ```
    wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/PyGMM.py
    ./bin/spark-submit --master <master> --py-files GMMModel.py,GMMClustering.py \
        PyGMM.py <input_file> <num_of_clusters> \
        [--n_iter <num_of_iterations>] [--ct <convergence_threshold>]
    ```

    where master is your Spark master URL and the input file should contain comma-separated numeric values. Make sure you enter the full path to the downloaded files.
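A user program for the first option might look roughly like the sketch below. It assumes that GMMModel.py exposes a class named GMMModel (the post only shows the calls GMMModel.trainGMM and GMMModel.resultPredict), that `sc` is an already created SparkContext, and that the input file holds one comma-separated numeric row per line. The helper names parse_line and train_and_predict are hypothetical.

```python
def parse_line(line):
    """Turn one comma-separated line such as '1.0,2.5,3' into a list of floats."""
    return [float(value) for value in line.split(",")]

def train_and_predict(sc, input_file, k, n_iter=100, ct=1e-3):
    """Hypothetical wrapper around the calls quoted in the post."""
    # Imports are local so that parse_line stays usable without Spark installed.
    from pyspark.mllib.linalg import Vectors
    # Assumption: the downloaded GMMModel.py defines a class called GMMModel.
    from GMMModel import GMMModel

    # Build an RDD of dense vectors from the comma-separated input file
    data = sc.textFile(input_file) \
             .map(lambda line: Vectors.dense(parse_line(line))) \
             .cache()

    # Train the model; argument order follows trainGMM(data, k, n_iter, ct)
    model = GMMModel.trainGMM(data, k, n_iter, ct)

    # Membership values and hard cluster labels, as described in the post
    responsibility_matrix, cluster_labels = GMMModel.resultPredict(model, data)
    return model, responsibility_matrix, cluster_labels
```

The script would then be submitted with spark-submit and the two library files passed through --py-files, after which model.Means, model.Covars and model.Weights can be inspected on the returned object.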
All replies
2017-4-18 07:05:49
Thanks to the original poster for sharing!