 
The reason we care is this: by making the computer work harder (performing many calculations simultaneously) we wait less time for our experiments and can run more of them. This is especially important in data science (which we often do using the R analysis platform), as we routinely need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.
Typically, to get the computer to work harder the analyst, programmer, or library designer must themselves work a bit harder to arrange calculations in a parallel-friendly manner. In the best circumstances somebody has already done this for you, for example inside a pre-parallelized library or algorithm implementation.
In addition to having a task ready to "parallelize" you need a facility willing to work on it in a parallel manner. Examples include the multiple cores on your own machine, multiple machines on a shared network, and larger systems such as MPI clusters.
Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.
In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R.
First you must have a problem that is amenable to parallelism. The most obvious such problems involve repetition (what the cognoscenti call "embarrassingly parallel"): for example, fitting many variations of a model, scoring many independent rows of data, or repeated resampling schemes such as cross-validation and the bootstrap.
So we are going to assume we have a task in mind that has already been cut up into a number of simple repetitions. Note: this conceptual cutting up is not necessarily always easy, but it is the step needed to start the process.
Here is our example task: fitting a predictive model on a small data set. We load the data set and some definitions into our workspace:
d <- iris   # let "d" refer to one of R's built-in data sets
vars <- c('Sepal.Length','Sepal.Width','Petal.Length')
yName <- 'Species'
yLevels <- sort(unique(as.character(d[[yName]])))
print(yLevels)
## [1] "setosa"     "versicolor" "virginica"

(We are using the convention that any line starting with "##" is a printed result from the R command above it.)
We encounter a small modeling issue: the variable we are trying to predict takes on three levels. The modeling technique we were going to use (glm(family='binomial')) is not specialized to predict “multinomial outcomes” (though other libraries are). So we decide to solve this using a “one versus rest” strategy and build a series of classifiers: each separating one target from the rest of the outcomes. This is where we see a task that is an obvious candidate for parallelization. Let’s wrap building a single target model into a function for neatness:
fitOneTargetModel <- function(yName, yLevel, vars, data) {
  formula <- paste('(', yName, '=="', yLevel, '") ~ ',
                   paste(vars, collapse=' + '), sep='')
  glm(as.formula(formula), family=binomial, data=data)
}

Then the usual "serial" way to build all the models would look like the following:
for(yLevel in yLevels) {
  print("*****")
  print(yLevel)
  print(fitOneTargetModel(yName, yLevel, vars, d))
}

Or we could wrap our procedure into a new single-argument function (this pattern is called "currying") and then use R's elegant lapply() notation:
worker <- function(yLevel) {
  fitOneTargetModel(yName, yLevel, vars, d)
}
models <- lapply(yLevels, worker)
names(models) <- yLevels
print(models)

The advantage of the lapply() notation is that it emphasizes the independent nature of each calculation, exactly the sort of isolation we need to parallelize our calculations. Think of the for-loop as accidentally over-specifying the calculation by introducing a needless order or sequence of operations.
Re-organizing our calculation functionally has gotten us ready to use a parallel library and perform the calculation in parallel. First we set up a parallel cluster to do the work:
# Start up a parallel cluster
parallelCluster <- parallel::makeCluster(parallel::detectCores())
print(parallelCluster)
## socket cluster with 4 nodes on host 'localhost'

Notice we created a "socket cluster." The socket cluster is a crude-seeming "coarse-grained parallel" cluster that is extremely flexible. A socket cluster is crude in that it is fairly slow to send jobs to (so you want to send work in big "coarse" chunks), but amazingly flexible in that it can be implemented as any of: multiple cores on a single machine, multiple cores on multiple machines on a shared network, or on top of other systems such as an MPI cluster.
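As an aside, and purely as a hypothetical illustration (not part of our running example), the same parallel::makeCluster() call can build a socket cluster that spans machines by passing a vector of host names instead of a core count. Here "remote1" is a made-up host name; such a host would need to be reachable (typically over ssh) and have R installed:

# Not run: a hypothetical multi-machine socket cluster.
# Two workers on this machine, one on a machine named "remote1".
# parallelCluster <- parallel::makeCluster(c('localhost', 'localhost', 'remote1'))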
At this point we would expect code like the example below to work (tryCatch() is used only to capture and print any error).
tryCatch(
  models <- parallel::parLapply(parallelCluster,
                                yLevels, worker),
  error = function(e) print(e)
)
## <simpleError in checkForRemoteErrors(val):
##   3 nodes produced errors; first error:
##   could not find function "fitOneTargetModel">

Instead of results, we got the error: could not find function "fitOneTargetModel".
The issue is: on a socket cluster, arguments to parallel::parLapply are copied to each processing node over a communications socket. However, the entirety of the current execution environment (in our case the so-called "global environment") is not copied over (or copied back; only values are returned). So our function worker(), when transported to the parallel nodes, must have an altered lexical closure (as it cannot point back to our execution environment), and it appears this new lexical closure no longer contains references to the needed values yName, vars, d and fitOneTargetModel. This is unfortunate, but makes sense. R uses entire execution environments to implement the concept of lexical closures, and R has no way of knowing which values in a given environment are actually needed by a given function.
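As a quick check of this explanation (assuming worker() was defined at the top level, as above), we can inspect where worker() looks for unbound names, and see that it is exactly the global environment the socket cluster does not ship:

# where does worker() look for fitOneTargetModel, yName, vars and d?
environment(worker)
## <environment: R_GlobalEnv>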
So we know what is wrong; how do we fix it? We fix it by using an environment that is not the global environment to transport the values we need. The easiest way to do this is to use our own lexical closure. To do this we wrap the entire process inside a function (so we are executing in a controlled environment). The code that works is as follows:
# build the single argument function we are going to pass to parallel
mkWorker <- function(yName, vars, d) {
  # make sure each of the three values we need passed
  # are available in this environment
  force(yName)
  force(vars)
  force(d)
  # define any and every function our worker function
  # needs in this environment
  fitOneTargetModel <- function(yName, yLevel, vars, data) {
    formula <- paste('(', yName, '=="', yLevel, '") ~ ',
                     paste(vars, collapse=' + '), sep='')
    glm(as.formula(formula), family=binomial, data=data)
  }
  # Finally: define and return our worker function.
  # The function worker's "lexical closure"
  # (where it looks for unbound variables)
  # is mkWorker's activation/execution environment
  # and not the usual global environment.
  # The parallel library is willing to transport
  # this environment (which it does not
  # do for the global environment).
  worker <- function(yLevel) {
    fitOneTargetModel(yName, yLevel, vars, d)
  }
  return(worker)
}

models <- parallel::parLapply(parallelCluster, yLevels,
                              mkWorker(yName, vars, d))
names(models) <- yLevels
print(models)

The above works because we forced the values we needed into the new execution environment and defined the function we were going to use directly in that environment. Obviously it is incredibly tedious and wasteful to have to re-define every function we need every time we need it (though we could also have passed in the wrapper as we did with the other values). A more versatile pattern is to use a helper function we supply called "bindToEnv" to do some of the work. With bindToEnv the code looks like the following.
source('bindToEnv.R')  # Download from: http://winvector.github.io/Parallel/bindToEnv.R

# build the single argument function we are going to pass to parallel
mkWorker <- function() {
  bindToEnv(objNames=c('yName','vars','d','fitOneTargetModel'))
  function(yLevel) {
    fitOneTargetModel(yName, yLevel, vars, d)
  }
}

models <- parallel::parLapply(parallelCluster, yLevels,
                              mkWorker())
names(models) <- yLevels
print(models)

The above pattern is concise and works very well. A few caveats to remember: each worker gets its own copies of the values (so changes made on the workers do not propagate back to your session), sending large objects over the socket is slow (so only ship what you need), and you should shut the cluster down when you are finished with it.
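For reference, the bindToEnv helper can be downloaded from the URL shown above. A minimal sketch of the idea (our own approximation, not necessarily identical to the downloadable version) is: copy each named object from the calling environment into a target environment, and re-point any copied functions at that environment so they can find the copied values:

# Sketch of a bindToEnv-style helper (approximation of the downloadable version):
# copy named objects into bindTargetEnv and make copied functions look up values there.
bindToEnv <- function(bindTargetEnv = parent.frame(), objNames) {
  for (var in objNames) {
    val <- get(var, envir = parent.frame())
    if (is.function(val)) {
      # re-point the function's lexical closure at the target environment
      environment(val) <- bindTargetEnv
    }
    assign(var, val, envir = bindTargetEnv)
  }
}

And, as mentioned in the caveats, remember to release the cluster once the work is done:

# Shut down the cluster when finished
if (!is.null(parallelCluster)) {
  parallel::stopCluster(parallelCluster)
  parallelCluster <- NULL
}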