2017-02-25


Actually, this is about two R versions (standard and improved), a Python version, and a Perl version of a new machine learning technique recently published here. We asked for help to translate the original Perl script to Python and R, and finally decided to work with Naveenkumar Ramaraju, who is currently pursuing a master's in Data Science at Indiana University. So the Python and R versions are from him.

We believe that this code comparison and translation will be very valuable to anyone learning Python or R with the purpose of applying it to data science and machine learning.





The code

The source code is easy to read and has deliberately been made longer than needed to provide enough detail, avoid complicated iterations, and facilitate maintenance. The main output file is hdt-out2.txt. The input data set is HDT-data3.txt. You need to read this article (see section 4 after clicking, it has been updated) to check out what the code is trying to accomplish. In short, it is an algorithm to classify blog posts as popular or not based on extracted features (mostly keywords in the title).
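As a rough illustration of what "features extracted from the title" can mean, here is a minimal Python sketch. It is not the published algorithm: the keyword list, the function name title_features, and the sample title are assumptions made for this example only, and the real features are the ones defined in the article.

```python
import re

# Hypothetical keyword list; the actual features are defined in the article.
KEYWORDS = ("python", "r", "data science", "machine learning")


def title_features(title, keywords=KEYWORDS):
    """Return a tuple of 0/1 flags, one per keyword, for a blog post title."""
    text = title.lower()
    flags = []
    for kw in keywords:
        # Match whole words or phrases only, so "r" does not match inside "perl".
        pattern = r"\b" + re.escape(kw) + r"\b"
        flags.append(1 if re.search(pattern, text) else 0)
    return tuple(flags)


example = title_features("Machine Learning in R and Python")
```

Each post is reduced to a small tuple of flags, which can then serve as a key for grouping posts that share the same combination of keywords.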

The code has been written in Perl, R and Python. Perl and Python run faster than R. Click on the relevant link below to access the source code, available as a text file. The code, originally written in Perl, was translated to Python and R by Naveenkumar Ramaraju.

For those learning Python or R, this is a great opportunity.

Note regarding the R implementation

Required library: hash (base R has no built-in hash or dictionary type that works without an import). You can use either of the script files below.

  • The standard version is a literal translation of the Perl code, keeping the same variable names wherever possible.
  • The improved version uses functions, more data frames, and a more R-like approach, which reduces running time (about 30% faster) and uses fewer lines of code. Variable names differ from the Perl version, and the output file uses a comma (,) as the delimiter between IDs.
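For readers coming from Python, the role played by the hash library in R is filled by the built-in dictionary. The sketch below is only an assumption about how such a structure might be used here: the (post_id, key) records, the function names, and the "key, tab, IDs" line layout are hypothetical. Only the comma between IDs comes from the note above.

```python
from collections import defaultdict


def group_ids_by_key(records):
    """Map each key to the list of post IDs that share it.

    `records` is an iterable of (post_id, key) pairs; both are hypothetical
    stand-ins for whatever the scripts actually store.
    """
    groups = defaultdict(list)
    for post_id, key in records:
        groups[key].append(str(post_id))
    return dict(groups)


def format_line(key, ids, delimiter=","):
    """Render one output line: the key, a tab, then the delimited IDs."""
    return f"{key}\t{delimiter.join(ids)}"


sample = [(101, "python"), (102, "r"), (103, "python")]
groups = group_ids_by_key(sample)
lines = [format_line(k, v) for k, v in sorted(groups.items())]
```

Keeping the delimiter as a parameter makes it easy to switch between a comma and any other separator when comparing output files.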

Instructions to run: Place the R file and HDT-data3.txt (the input file) in the root folder of your R environment. Execute the '.R' file in RStudio, or from the command line:

  > Rscript HDT_improved.R

R is known to be slow at text parsing. The code could be optimized further by using data frames, provided all inputs are either within double quotes or not quoted at all.
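The quoting remark concerns the R version. For comparison, Python's standard csv module accepts quoted and unquoted fields in the same file. The sketch below assumes a tab-separated layout with a header row; the column names and sample rows are made up for illustration, since the actual layout of HDT-data3.txt is not described in this post.

```python
import csv
import io

# Hypothetical sample: tab-separated rows where only some fields are quoted.
SAMPLE = 'Title\tViews\n"Python, R and Perl"\t120\nPlain title\t45\n'


def read_rows(handle, delimiter="\t"):
    """Parse delimited text, accepting quoted and unquoted fields alike."""
    reader = csv.DictReader(handle, delimiter=delimiter, quotechar='"')
    return [dict(row) for row in reader]


rows = read_rows(io.StringIO(SAMPLE))
```

When reading a real file instead of an in-memory string, open it with newline="" as the csv documentation recommends.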



All replies
2017-2-25 09:16:22
niubility
2017-2-25 11:16:21
kankan