2015-11-02

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example.

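For our example we create a data frame. The issue is: I am working in the Pacific time zone on Saturday October 31st 2015, and I have some time data that I want to work with that is in an Asian time zone. (A caveat on the zone name used below: under the POSIX sign convention, "Etc/GMT+8" actually denotes a zone eight hours behind UTC, not ahead of it; a UTC+8 zone would be written "Etc/GMT-8". The attribute-loss problem shown here is the same either way.)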

print(date())
## [1] "Sat Oct 31 08:14:38 2015"
d <- data.frame(group='x',
  time=as.POSIXct(strptime('2006/10/01 09:00:00',
    format='%Y/%m/%d %H:%M:%S',
    tz="Etc/GMT+8"),tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d$time)
## [1] "2006-10-01 09:00:00 GMT+8"
str(d$time)
##  POSIXct[1:1], format: "2006-10-01 09:00:00"
print(unclass(d$time))
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"

Suppose I try to aggregate the data to find the earliest time for each group. Here I hit a problem: aggregate() loses the time zone and gives a bad answer.

d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
##   group                time
## 1     x 2006-10-01 10:00:00
print(d2$time)
## [1] "2006-10-01 10:00:00 PDT"

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is that a POSIXct time is essentially a numeric array carrying around its time zone as an attribute. Most base R code has problems if there are extra attributes on a numeric array, so base R code tends to have a habit of dropping attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
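The "number plus attribute" structure described above can be inspected directly. The following is a minimal sketch in base R, not taken from the original article; the variable names t1, t2, and t3 are illustrative, and it assumes the "Etc/GMT+8" zone is available in the system time zone database.

```r
# One instant, stored with an explicit display time zone.
t1 <- as.POSIXct("2006-10-01 09:00:00", tz = "Etc/GMT+8")

# The underlying data is just seconds since 1970-01-01 00:00:00 UTC.
secs <- as.numeric(t1)
print(secs)

# Class and time zone both live in the attribute list.
print(names(attributes(t1)))

# Dropping only the "tzone" attribute leaves the instant unchanged,
# but formatting now falls back to the session's local time zone.
t2 <- t1
attr(t2, "tzone") <- NULL
print(identical(as.numeric(t1), as.numeric(t2)))
print(format(t2, usetz = TRUE))

# Putting the attribute back restores the original display.
t3 <- t2
attr(t3, "tzone") <- "Etc/GMT+8"
print(format(t3, usetz = TRUE))
```

What t2 prints depends on the time zone of the machine running the code, which is exactly why a silently dropped attribute is hard to notice on one machine and wrong on another.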

dplyr gets the right answer.

library('dplyr')
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
by_group = group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
## Source: local data frame [1 x 2]
## 
##   group           min(time)
## 1     x 2006-10-01 09:00:00
print(d3[[2]])
## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d4$time)
## [1] "2006-10-01 09:00:00 GMT+8"
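If neither package is available, one possible base R stopgap is to copy the time zone attribute back onto the aggregated column by hand. This is only a sketch under assumptions and is not from the original article: it assumes aggregate() returns a column that still has class POSIXct (as in the output shown earlier), and behavior may differ across R versions.

```r
# Rebuild the one-row example data frame from the article.
d <- data.frame(group = 'x',
                time = as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8'))

# Aggregate as before; the result may or may not keep the time zone.
d2 <- aggregate(time ~ group, data = d, FUN = min)

# Copy the time zone attribute from the input column onto the result.
# This changes only how the instant is displayed, not the instant itself.
attr(d2$time, 'tzone') <- attr(d$time, 'tzone')
print(d2$time)
```

The manual step is easy to forget, so this is a patch rather than a recommendation.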
All replies
2015-11-2 08:23:29
Thanks for sharing.
2015-11-2 08:30:17
......
2015-11-2 21:32:46
Thanks for sharing~~~