Don’t use stats::aggregate()

2015-11-02

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example.

For our example we create a data frame. The issue is: I am working in the Pacific time zone on Saturday October 31st 2015, and I have some time data that I want to work with that is in an Asian time zone.

    print(date())
    ## [1] "Sat Oct 31 08:14:38 2015"
    d <- data.frame(group='x',
                    time=as.POSIXct(strptime('2006/10/01 09:00:00',
                                             format='%Y/%m/%d %H:%M:%S',
                                             tz="Etc/GMT+8"),
                                    tz="Etc/GMT+8")) # I'd like to say UTC+8 or CST
    print(d)
    ##   group                time
    ## 1     x 2006-10-01 09:00:00
    print(d$time)
    ## [1] "2006-10-01 09:00:00 GMT+8"
    str(d$time)
    ##  POSIXct[1:1], format: "2006-10-01 09:00:00"
    print(unclass(d$time))
    ## [1] 1159722000
    ## attr(,"tzone")
    ## [1] "Etc/GMT+8"
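One thing worth double-checking about this setup: in the POSIX-style `Etc/GMT±N` zone names the sign is inverted relative to everyday usage, so `Etc/GMT+8` denotes a zone eight hours behind UTC rather than ahead of it. A minimal sketch, assuming the time zone database on your system includes the `Etc/` zones, that converts the example time to UTC to see which instant it actually denotes (the variable names `t` and `utc_view` are illustrative only):

```r
# Build the same time value used in the example data frame.
t <- as.POSIXct(strptime('2006/10/01 09:00:00',
                         format = '%Y/%m/%d %H:%M:%S',
                         tz = "Etc/GMT+8"),
                tz = "Etc/GMT+8")

# Render the same instant in UTC; only the display changes, not the value.
utc_view <- format(t, "%Y-%m-%d %H:%M:%S", tz = "UTC", usetz = TRUE)
print(utc_view)
## [1] "2006-10-01 17:00:00 UTC"
```

This does not change the point of the example: whatever zone label is attached, it should survive the aggregation.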

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem: aggregate() loses the time zone and gives a bad answer.

    d2 <- aggregate(time~group,data=d,FUN=min)
    print(d2)
    ##   group                time
    ## 1     x 2006-10-01 10:00:00
    print(d2$time)
    ## [1] "2006-10-01 10:00:00 PDT"
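A quick way to see what happened, sketched under the assumption that `d` is built as in the example: the underlying number of seconds is untouched, and only the `tzone` attribute can differ. Whether `aggregate()` drops the attribute may depend on the R version, so the sketch checks rather than assumes. It then reattaches the original zone, which is a hand-rolled workaround shown for illustration and not something taken from the `aggregate()` documentation.

```r
d <- data.frame(group = 'x',
                time = as.POSIXct('2006-10-01 09:00:00', tz = "Etc/GMT+8"))
d2 <- aggregate(time ~ group, data = d, FUN = min)

# The instant itself survives: same seconds since the epoch.
same_instant <- as.numeric(d2$time) == as.numeric(d$time)
print(same_instant)

# Compare the time zone attribute before and after aggregation.
print(attr(d$time, "tzone"))
print(attr(d2$time, "tzone"))   # may be NULL or "" if the zone was dropped

# Workaround: copy the zone label back onto the aggregated column.
attr(d2$time, "tzone") <- attr(d$time, "tzone")
repaired <- format(d2$time, "%Y-%m-%d %H:%M:%S")
print(repaired)
```

Having to patch the result by hand like this is exactly the kind of step that is easy to forget.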

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is that a POSIXct time is essentially a numeric array carrying around its time zone as an attribute. Most base R code has problems if there are extra attributes on a numeric array, so R stats code tends to have a habit of dropping attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
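To make the representation concrete, here is a small sketch in base R (the variable names are illustrative only). It shows that a POSIXct value is a number of seconds plus `class` and `tzone` attributes, that an attribute-stripping operation such as `unlist()` leaves only the number, and that the time can be rebuilt if the zone is supplied again.

```r
t <- as.POSIXct('2006-10-01 09:00:00', tz = "Etc/GMT+8")

# Under the hood: seconds since 1970-01-01 UTC, plus attributes.
print(names(attributes(t)))   # includes "class" and "tzone"
secs <- as.numeric(t)         # plain number, attributes gone
print(secs)

# unlist() on a list of POSIXct values drops the attributes too.
stripped <- unlist(list(t))
print(class(stripped))

# Rebuild the time by supplying the origin and the zone explicitly.
rebuilt <- as.POSIXct(secs, origin = "1970-01-01", tz = "Etc/GMT+8")
print(format(rebuilt, "%Y-%m-%d %H:%M:%S"))
```

Once the zone label is gone, nothing in the bare number says how it should be displayed, so printing falls back to the session's local time zone.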

dplyr gets the right answer.

    library('dplyr')
    ## 
    ## Attaching package: 'dplyr'
    ## 
    ## The following object is masked from 'package:stats':
    ## 
    ##     filter
    ## 
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    by_group = group_by(d,group)
    d3 <- summarize(by_group,min(time))
    print(d3)
    ## Source: local data frame [1 x 2]
    ## 
    ##   group           min(time)
    ## 1     x 2006-10-01 09:00:00
    print(d3[[2]])
    ## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

    library('plyr')
    ## -------------------------------------------------------------------------
    ## You have loaded plyr after dplyr - this is likely to cause problems.
    ## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
    ## library(plyr); library(dplyr)
    ## -------------------------------------------------------------------------
    ## 
    ## Attaching package: 'plyr'
    ## 
    ## The following objects are masked from 'package:dplyr':
    ## 
    ##     arrange, count, desc, failwith, id, mutate, rename, summarise,
    ##     summarize
    d4 <- ddply(d,.(group),summarize,time=min(time))
    print(d4)
    ##   group                time
    ## 1     x 2006-10-01 09:00:00
    print(d4$time)
    ## [1] "2006-10-01 09:00:00 GMT+8"