The Split-Apply-Combine Strategy using R

ReneeBK

1479

收藏 2015-02-06

The Split-Apply-Combine Strategy using R

Jaynal Abedin

Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we could compare this with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently.

The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.

To explain the split-apply-combine strategy, we will use Fisher's iris data. This dataset contains the measurements in centimeters of these variables: sepal length and width, and petal length and width, for 50 flowers from each of the three species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica. We want to calculate the mean of each variable and for each species separately. This can be done in different ways using a loop or without using one.

Split-apply-combine without a loop

In this section, we will see an example of the split-apply-combine strategy without using a loop. The steps are as follows:

Split the iris dataset into three parts.
Remove the species name variable from the data.
Calculate the mean of each variable for the three different parts separately.
Combine the output into a single data frame.

The code for this is as follows:

复制代码

Split-apply-combine with a loop

The following example will calculate the same statistics as in the previous section, but this time we will perform this task using a loop. The steps are similar but the code is different. In each iteration, we will split the data for each species and calculate the mean for each variable and then combine the output into a single data frame, as shown in the following code:

复制代码

An important fact to note in the split-apply-combine strategy is that each piece should be independent of the other. If the calculation in one piece is somehow dependent on the other, the split-apply-combine strategy will not work. This strategy is not applicable in running an average type of operation, where a current average is dependent on the previous one. This strategy is only applicable when the big problem can be broken up into smaller manageable pieces and we can perform the desired operation on each piece independently. For running average calculations, the split-apply-combine strategy is not suitable; we can use a loop instead. But if processing speed is a concern, we can write the code in some lower-level language such as C or Fortran.

Reference

Data Manipulation with R

By: Jaynal Abedin
Publisher: Packt Publishing
Pub. Date: January 15, 2014

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

扫码加我 拉你入群

分享

扫码加好友，拉您进群

扫码加我拉你入群