全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 LATEX论坛
1391 1
2016-01-09


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Building off other industry-specific posts, I want to use healthcare data to demonstrate the use of R packages. The data can be downloaded here. To read the .CSV file in R you might read the post how to import data in R. Packages in R are stored in libraries and often are pre-installed, but reaching the next level of skill requires being able to know when to use new packages and what they contain. With that let’s get to our example.

gsub

When working with vectors and strings, especially in cleaning up data, gsub makes cleaning data much simpler. In my healthcare data, I wanted to convert dollar values to integers (ie. $21,000 to 21000), and I used gsub as seen below.

Reading the data in R from CSV file. I am naming the dataset “hosp”.

hosp <- read.csv("Payment_and_value_of_care_-_Hospital.csv")

In the code below I will remove hospitals without estimates

hospay<-hosp[hosp$Payment.category !="Not Available" & hosp$Payment.category !="Number of Cases Too Small",]

Now its time to remove the dollar signs and commas in estimate values

hospay$Payment <- as.numeric(gsub("[$,]","",hospay$Payment))hospay$Lower.estimate <- as.numeric(gsub("[$,]", "", hospay$Lower.estimate))hospay$Higher.estimate <- as.numeric(gsub("[$,]", "", hospay$Lower.estimate))head(hospay$Payment)[1] 13469 12863 12308 12222 21376 14740reshape2

In looking at the data, I wanted to focus on the Payment estimate. So I used the melt() function that is part of reshape2. Melt allows pivot-table style capabilities to restructure data without losing values.

library(reshape2)hosp_mel<-melt(data=hospay,id=c(2,5,9,11), measure=as.numeric(c(13)), value.name='Estimate') names(hosp_melt)[1] "Hospital.name"        "State"                "Payment.measure.name" "Payment.category"     "variable"             "Estimate" sqldf

With my data melted, I wanted to get the average estimate for heart attack patients by state. This is a classic SQL query, so bringing in sqldf allows for that.

library(sqldf)names(hosp_melt) [3] <- "paymentmeasurename"hosp_est <- sqldf("select State, avg(Estimate) as Estimate from hosp_melt where paymentmeasurename = 'Payment for heart attack patients' group by State")head(hosp_est)   State  Estimate1     AK  20987.602     AL  21850.323     AR  21758.004     AZ  22690.625     CA  22707.456     CO  21795.30

If you have any question feel free to leave a comment below.










To leave a comment for the author, please follow the link and comment on their blog: DataScience+.



R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data,R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL,Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

全部回复
2016-1-9 17:25:00
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群