 
The reason we care is this: by making the computer work harder (performing many calculations simultaneously) we wait less time for our experiments and can run more of them. This is especially important in data science (which we often do using the R analysis platform), as we routinely need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.
Typically, to get the computer to work harder the analyst, programmer, or library designer must themselves work a bit harder to arrange calculations in a parallel-friendly manner. In the best circumstances somebody has already done this for you, for example inside a pre-parallelized library or algorithm implementation.
In addition to having a task ready to "parallelize" you need a facility willing to work on it in a parallel manner. Examples include the multiple cores on your own machine, multiple machines on a shared network, and larger systems such as MPI clusters.
Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.
In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R.
First you must have a problem that is amenable to parallelism. The most obvious such problems involve repetition (what the cognoscenti call "embarrassingly parallel"): for example, fitting many variations of a model, scoring many independent rows of data, or repeated resampling schemes such as cross-validation and the bootstrap.
So we are going to assume we have a task in mind that has already been cut up into a number of simple repetitions. Note: this conceptual cutting up is not necessarily always easy, but it is the step needed to start the process.
Here is our example task: fitting a predictive model on a small data set. We load the data set and some definitions into our workspace:
d <- iris   # let "d" refer to one of R's built-in data sets
vars <- c('Sepal.Length','Sepal.Width','Petal.Length')
yName <- 'Species'
yLevels <- sort(unique(as.character(d[[yName]])))
print(yLevels)
## [1] "setosa"     "versicolor" "virginica"

(We are using the convention that any line starting with "##" is a printed result from the R command above it.)
We encounter a small modeling issue: the variable we are trying to predict takes on three levels. The modeling technique we were going to use (glm(family='binomial')) is not specialized to predict “multinomial outcomes” (though other libraries are). So we decide to solve this using a “one versus rest” strategy and build a series of classifiers: each separating one target from the rest of the outcomes. This is where we see a task that is an obvious candidate for parallelization. Let’s wrap building a single target model into a function for neatness:
fitOneTargetModel <- function(yName, yLevel, vars, data) {
  formula <- paste('(', yName, '=="', yLevel, '") ~ ',
                   paste(vars, collapse=' + '), sep='')
  glm(as.formula(formula), family=binomial, data=data)
}

Then the usual "serial" way to build all the models would look like the following:
for(yLevel in yLevels) {
  print("*****")
  print(yLevel)
  print(fitOneTargetModel(yName, yLevel, vars, d))
}

Or we could wrap our procedure into a new single-argument function (this pattern is called "currying") and then use R's elegant lapply() notation:
worker <- function(yLevel) {
  fitOneTargetModel(yName, yLevel, vars, d)
}
models <- lapply(yLevels, worker)
names(models) <- yLevels
print(models)

The advantage of the lapply() notation is that it emphasizes the independent nature of each calculation, exactly the sort of isolation we need to parallelize our calculations. Think of the for-loop as accidentally over-specifying the calculation by introducing a needless order or sequence of operations.
Re-organizing our calculation functionally has gotten us ready to use a parallel library and perform the calculation in parallel. First we set up a parallel cluster to do the work:
# Start up a parallel cluster
parallelCluster <- parallel::makeCluster(parallel::detectCores())
print(parallelCluster)
## socket cluster with 4 nodes on host 'localhost'

Notice we created a "socket cluster." The socket cluster is a crude-seeming "coarse-grained parallel" cluster that is extremely flexible. A socket cluster is crude in that it is fairly slow to send jobs to (so you want to send work in big "coarse" chunks), but amazingly flexible in that it can be implemented as any of: multiple cores on a single machine, multiple cores on multiple machines on a shared network, or on top of other systems such as an MPI cluster.
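As an aside, and purely as a hypothetical illustration (not part of our running example), the same parallel::makeCluster() call can build a socket cluster that spans machines by passing a vector of host names instead of a core count. Here "remote1" is a made-up host name; such a host would need to be reachable (typically over ssh) and have R installed:

# Not run: a hypothetical multi-machine socket cluster.
# Two workers on this machine, one on a machine named "remote1".
# parallelCluster <- parallel::makeCluster(c('localhost', 'localhost', 'remote1'))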
At this point we would expect code like the example below to work (tryCatch() is used only to capture and print any error).
tryCatch(
  models <- parallel::parLapply(parallelCluster,
                                yLevels, worker),
  error = function(e) print(e)
)
## <simpleError in checkForRemoteErrors(val):
##   3 nodes produced errors; first error:
##   could not find function "fitOneTargetModel">

Instead of results, we got the error: could not find function "fitOneTargetModel".
The issue is: on a socket cluster, arguments to parallel::parLapply are copied to each processing node over a communications socket. However, the entirety of the current execution environment (in our case the so-called "global environment") is not copied over (or copied back; only values are returned). So our function worker(), when transported to the parallel nodes, must have an altered lexical closure (as it cannot point back to our execution environment), and it appears this new lexical closure no longer contains references to the needed values yName, vars, d and fitOneTargetModel. This is unfortunate, but makes sense. R uses entire execution environments to implement the concept of lexical closures, and R has no way of knowing which values in a given environment are actually needed by a given function.
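As a quick check of this explanation (assuming worker() was defined at the top level, as above), we can inspect where worker() looks for unbound names, and see that it is exactly the global environment the socket cluster does not ship:

# where does worker() look for fitOneTargetModel, yName, vars and d?
environment(worker)
## <environment: R_GlobalEnv>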
So we know what is wrong; how do we fix it? We fix it by using an environment that is not the global environment to transport the values we need. The easiest way to do this is to use our own lexical closure. To do this we wrap the entire process inside a function (so we are executing in a controlled environment). The code that works is as follows:
# build the single argument function we are going to pass to parallel
mkWorker <- function(yName, vars, d) {
  # make sure each of the three values we need passed
  # are available in this environment
  force(yName)
  force(vars)
  force(d)
  # define any and every function our worker function
  # needs in this environment
  fitOneTargetModel <- function(yName, yLevel, vars, data) {
    formula <- paste('(', yName, '=="', yLevel, '") ~ ',
                     paste(vars, collapse=' + '), sep='')
    glm(as.formula(formula), family=binomial, data=data)
  }
  # Finally: define and return our worker function.
  # The function worker's "lexical closure"
  # (where it looks for unbound variables)
  # is mkWorker's activation/execution environment
  # and not the usual global environment.
  # The parallel library is willing to transport
  # this environment (which it does not
  # do for the global environment).
  worker <- function(yLevel) {
    fitOneTargetModel(yName, yLevel, vars, d)
  }
  return(worker)
}

models <- parallel::parLapply(parallelCluster, yLevels,
                              mkWorker(yName, vars, d))
names(models) <- yLevels
print(models)

The above works because we forced the values we needed into the new execution environment and defined the function we were going to use directly in that environment. Obviously it is incredibly tedious and wasteful to have to re-define every function we need every time we need it (though we could also have passed in the wrapper as we did with the other values). A more versatile pattern is to use a helper function we supply called "bindToEnv" to do some of the work. With bindToEnv the code looks like the following.
source('bindToEnv.R')  # Download from: http://winvector.github.io/Parallel/bindToEnv.R

# build the single argument function we are going to pass to parallel
mkWorker <- function() {
  bindToEnv(objNames=c('yName','vars','d','fitOneTargetModel'))
  function(yLevel) {
    fitOneTargetModel(yName, yLevel, vars, d)
  }
}

models <- parallel::parLapply(parallelCluster, yLevels,
                              mkWorker())
names(models) <- yLevels
print(models)

The above pattern is concise and works very well. A few caveats to remember: each worker gets its own copies of the values (so changes made on the workers do not propagate back to your session), sending large objects over the socket is slow (so only ship what you need), and you should shut the cluster down when you are finished with it.
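For reference, the bindToEnv helper can be downloaded from the URL shown above. A minimal sketch of the idea (our own approximation, not necessarily identical to the downloadable version) is: copy each named object from the calling environment into a target environment, and re-point any copied functions at that environment so they can find the copied values:

# Sketch of a bindToEnv-style helper (approximation of the downloadable version):
# copy named objects into bindTargetEnv and make copied functions look up values there.
bindToEnv <- function(bindTargetEnv = parent.frame(), objNames) {
  for (var in objNames) {
    val <- get(var, envir = parent.frame())
    if (is.function(val)) {
      # re-point the function's lexical closure at the target environment
      environment(val) <- bindTargetEnv
    }
    assign(var, val, envir = bindTargetEnv)
  }
}

And, as mentioned in the caveats, remember to release the cluster once the work is done:

# Shut down the cluster when finished
if (!is.null(parallelCluster)) {
  parallel::stopCluster(parallelCluster)
  parallelCluster <- NULL
}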