全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 LATEX论坛
9516 102
2017-02-25

本帖隐藏的内容

Are you considering making a move from R to Python? Here are the libraries you need to know, how they stack up to their R contemporaries, and why you should learn them.




This post originally appeared on the Yhat blog. Yhat is a Brooklyn based company whose goal is to make data science applicable for developers, data scientists, and businesses alike. Yhat provides a software platform for deploying and managing predictive algorithms as REST APIs, while eliminating the painful engineering obstacles associated with production environments like testing, versioning, scaling and security.


Why the switch?


One of my favorite parts of machine learning in Python is that it got the benefit of observing the R community and then emulating the best parts of it. I'm a big believer that a language is only as helpful as its libraries. So in this post I'm going to go over some critical packages that I use almost every time I work in R, and their counterpart(s) in Python.

glm, knn, randomForest, e1071 -> scikit-learn


One thing that is a blessing and a curse in R is that the machine learning algorithms are generally segmented by package. Meaning instead of having a single (or set) of ML libraries that each implement some common algorithms, each algorithm gets its own package. It's sort of nice because you can find very esoteric, cutting edge implementations of algorithms, but it can be a pain for day-to-day use where you might be switching between algorithms. This pain is something that Python's scikit-learnsolves really well. scikit-learn provides a common set of ML algorithms all under the same API. It makes switching between LogisticRegression and GradientBoostingMachines a one-liner.

reshape/reshape2, plyr/dplyr -> pandas


This was actually the subject of one of our first posts. pandas took the best parts of data munging in R and turned it into a Python package. This includes its own implementation of a data frame along with ways to modify and restructure it. Basically it took the best parts of reshape/reshape2 and plyr/dplyrand Pythonified it!

ggplot2 -> ggplot + seaborn + bokeh


One thing that R still does better than Python is plotting. Hands down, R is better in just about every facet. Even so, Python plotting has matured though it's a fractured community. If you like the ggplot-style syntax, then look no further than Yhat's own ggplot. If you're after super statistical and technical plots then reach for seaborn. And if you're in the market for some super slick, great looking interactive plots then try out bokeh.

stringr -> nothing


String manipulation in "base R" is nearly as unintuitive as it is silly. Any time I'm working with strings in R I do 2 things (in order):

  • briefly nod in appreciation to New Zealand for producing Hadley Wickham
  • import stringr


Much obliged, New Zealand

stringr is an absolute lifesaver. It's well written, performant (at least I think so), and easy to install (don't overlook this last item. if people can't install your software, there's no sense in making it).

Ok so stringr appreciation monologue complete. So the good news for you is that Python is so great for string manipulation, you don't really need a string library! It has a fantastic built-in regular expressions library, re, and a built-in string meta-libarary appropriately called string. So lucky for you, Python comes with all string-related batteries included!

RStudio -> Rodeo


To many users, RStudio is synonymous with R. And why not? It's a great IDE for data analysis in R. Historically speaking, there haven't been a lot of comparable options for Python. Of course this is no longer the case. We released the very first version of Rodeo just over a year ago and released the 2.0 for Windows, OSX, and Linux about a month ago.

"Ever since we've used RStudio, we've been looking for an IDE like it for Python. We went through IDEs such as Sublime Text and Spyder, none of which suited our likings. We searched and found Rodeo and couldn't have been more pleased with the IDE." -Stephen Hsu, University of California, Berkeley


[size=+1]Download Rodeo!

Knitr -> Jupyter


Knitr is a great way to create reproducible and highly visual analysis using R. It's been a staple in RStudio for a while now. In the Python world, the most analagous package is Jupyter. Jupyter notebooks provide an interactive environment for programming in Python (and other languages) that focuses on reproducibility and visualization--it even has a plugin for R!

sqldf -> pandasql


sqldf is a great way for SQL users to comfortably manipulate data in R. I myself used it when I first started learning R. Way back when, Yhat actually built a similar package for Python called pandasql. Same concept: write SQL queries against your data frames, get data frames back! Fast-forward 3 years and pandasql has over 256 stars on GitHub :). Not bad for a library with only 358 lines of code!





二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

全部回复
2017-2-25 09:19:31
good        book
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 09:27:16
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 10:12:08
谢谢分享
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 11:05:09
kankan
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 12:19:43
好书,支持一下
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

点击查看更多内容…
相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群