全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 LATEX论坛
1193 0
2016-01-09


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Hello everyone! In this article, I will show you how you can use tidyr for data manipulation. tidyr is a package by Hadley Wickham that makes it easy to tidy your data. It is often used in conjunction withdplyr. Data is said to be tidy when each column represents a variable, and each row represents an observation.

I will demonstrate the usage of the following four functions from the tidyr package:

  • gather – converts wide data to longer format. It is analogous to the melt function fromreshape2.
  • spread – converts long data to wider format. It is analogous to the cast function fromreshape2.
  • unite – combines two or more columns into a single column.
  • separate – splits one column into two or more columns.

I will use the mtcars dataset from the datasets library. If you are not familiar with it, this is what it looks like:

library(tidyr) library(dplyr) head(mtcars)                    mpg cyl disp  hp drat    wt  qsec vs am gear carb Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let us include the names of the cars in a column called car for easier manipulation.

mtcars$car <- rownames(mtcars) mtcars <- mtcars[, c(12, 1:11)]gather

gather takes the form (from the help file):

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

where ... is the specification of the columns to gather.

We can replicate what melt does as follows:

mtcarsNew <- mtcars %>% gather(attribute, value, -car) head(mtcarsNew)                 car attribute value 1         Mazda RX4       mpg  21.0 2     Mazda RX4 Wag       mpg  21.0 3        Datsun 710       mpg  22.8 4    Hornet 4 Drive       mpg  21.4 5 Hornet Sportabout       mpg  18.7 6           Valiant       mpg  18.1  tail(mtcarsNew)                car attribute value 347  Porsche 914-2      carb     2 348   Lotus Europa      carb     2 349 Ford Pantera L      carb     4 350   Ferrari Dino      carb     6 351  Maserati Bora      carb     8 352     Volvo 142E      carb     2

As you can see, it gathers all the columns except car and places their name and value into theattritube and value column respectively.

The great thing about tidyr is that you can gather only certain columns and leave the others alone. If we want to gather all the columns from mpg to gear and leave the carb and car columns as they are, we can do it as follows:

mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear) head(mtcarsNew)                 car carb attribute value 1         Mazda RX4    4       mpg  21.0 2     Mazda RX4 Wag    4       mpg  21.0 3        Datsun 710    1       mpg  22.8 4    Hornet 4 Drive    1       mpg  21.4 5 Hornet Sportabout    2       mpg  18.7 6           Valiant    1       mpg  18.1spread

spread takes the form(from the help file):

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)

We can replicate what cast does as follows:

mtcarsSpread <- mtcarsNew %>% spread(attribute, value) head(mtcarsSpread)                  car carb  mpg cyl disp  hp drat    wt  qsec vs am gear 1        AMC Javelin    2 15.2   8  304 150 3.15 3.435 17.30  0  0    3 2 Cadillac Fleetwood    4 10.4   8  472 205 2.93 5.250 17.98  0  0    3 3         Camaro Z28    4 13.3   8  350 245 3.73 3.840 15.41  0  0    3 4  Chrysler Imperial    4 14.7   8  440 230 3.23 5.345 17.42  0  0    3 5         Datsun 710    1 22.8   4  108  93 3.85 2.320 18.61  1  1    4 6   Dodge Challenger    2 15.5   8  318 150 2.76 3.520 16.87  0  0    3unite

unite takes the form (from the help file):

unite(data, col, ..., sep = "_", remove = TRUE)

where ... represents the columns to unite and col represents the column to add.

Let us create some fake data:

set.seed(1) date <- as.Date('2016-01-01') + 0:14 hour <- sample(1:24, 15) min <- sample(1:60, 15) second <- sample(1:60, 15) event <- sample(letters, 15) data <- data.frame(date, hour, min, second, event) data          date hour min second event 1  2016-01-01    7  30     29     u 2  2016-01-02    9  43     36     a 3  2016-01-03   13  58     60     l 4  2016-01-04   20  22     11     q 5  2016-01-05    5  44     47     p 6  2016-01-06   18  52     37     k 7  2016-01-07   19  12     43     r 8  2016-01-08   12  35      6     i 9  2016-01-09   11   7     38     e 10 2016-01-10    1  14     21     b 11 2016-01-11    3  20     42     w 12 2016-01-12   14   1     32     t 13 2016-01-13   23  19     52     h 14 2016-01-14   21  41     26     s 15 2016-01-15    8  16     25     o

Now, let us combine the date, hour, min, and second columns into a new column called datetime. Usually, datetime in R is of the form Year-Month-Day Hour:Min:Second.

dataNew <- data %>%   unite(datehour, date, hour, sep = ' ') %>%   unite(datetime, datehour, min, second, sep = ':') dataNew               datetime event 1   2016-01-01 7:30:29     u 2   2016-01-02 9:43:36     a 3  2016-01-03 13:58:60     l 4  2016-01-04 20:22:11     q 5   2016-01-05 5:44:47     p 6  2016-01-06 18:52:37     k 7  2016-01-07 19:12:43     r 8   2016-01-08 12:35:6     i 9   2016-01-09 11:7:38     e 10  2016-01-10 1:14:21     b 11  2016-01-11 3:20:42     w 12  2016-01-12 14:1:32     t 13 2016-01-13 23:19:52     h 14 2016-01-14 21:41:26     s 15  2016-01-15 8:16:25     oseparate

separate takes the form (from the help file):

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,   convert = FALSE, extra = "warn", fill = "warn", ...)

We can get back the original data we created using separate as follows:

data1 <- dataNew %>%    separate(datetime, c('date', 'time'), sep = ' ') %>%    separate(time, c('hour', 'min', 'second'), sep = ':') data1          date hour min second event 1  2016-01-01   07  30     29     u 2  2016-01-02   09  43     36     a 3  2016-01-03   13  59     00     l 4  2016-01-04   20  22     11     q 5  2016-01-05   05  44     47     p 6  2016-01-06   18  52     37     k 7  2016-01-07   19  12     43     r 8  2016-01-08   12  35     06     i 9  2016-01-09   11  07     38     e 10 2016-01-10   01  14     21     b 11 2016-01-11   03  20     42     w 12 2016-01-12   14  01     32     t 13 2016-01-13   23  19     52     h 14 2016-01-14   21  41     26     s 15 2016-01-15   08  16     25     o

It first splits the datetime column into date and time, and then splits time into hour, min, andsecond.

That brings us to the end of the article. If you have questions or feedback, feel free to leave a comment or reach out to me on Twitter.












二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群