Using foreach and iterators for manual parallel execution

مقالة
01/03/2018

Important

This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?

Although RevoScaleR performs parallel execution automatically, you can manage parallel execution yourself using the open-source foreach package. This package provides a looping structure for R script. When you need to loop through repeated operations, and you have multiple processors or nodes to work with, you can use foreach in your script to execute a for loop in parallel. Developed by Microsoft, foreach is an open-source package that is bundled with Machine Learning Server but is also available on the Comprehensive R Archive Network, CRAN.

Tip

One common approach to parallelization is to see if the iterations within a loop can be performed independently, and if so, you can try to run the iterations concurrently rather than sequentially.

About the foreach package

The foreach package is a set of tools that allow you to run virtually anything that can be expressed as a for-loop as a set of parallel tasks. One scenario is to run multiple simulations in parallel. As a simple example, consider the case of simulating 10000 coin flips, which can be done by sampling with replacement from the vector c(H, T). To run this simulation 10 times sequentially, use foreach with the %do% operator:

> library(foreach)
> foreach(i = 1:10) %do% sample(c("H", "T"), 10000, replace = TRUE)

Comparing the foreach output with that of a similar for loop shows one obvious difference: foreach returns a list containing the value returned by each computation. A for loop, by contrast, returns only the value of its last computation, and relies on user-defined side effects to do its work.

We can parallelize the operation immediately by replacing %do% with %dopar%:

> foreach(i = 1:10) %dopar% sample(c("H", "T"), 10000, replace = TRUE)

However, if we run this example, we see the following warning:

Warning message:
executing %dopar% sequentially: no parallel backend registered

To actually run in parallel, we need to have a “parallel backend” for foreach. Parallel backends are discussed in the next section.

Parallel Backends

In order for loops coded with foreach to run in parallel, you must register a parallel backend to manage the execution of the loop. Any type of mechanism for running code in parallel could potentially have a parallel backend written for it. Currently, Machine Learning Server includes the doParallel backend; this uses the parallel package of R 2.14.0 or later to run jobs in parallel, using either of the component parallelization methods incorporated into the parallel package: SNOW-like functionality using socket connections, or multicore-like functionality using forking (on Linux only).

The doParallel package is a parallel backend for foreach that is intended for parallel processing on a single computer with multiple cores or processors.

Additional parallel backends are available from CRAN:

doMPI for use with the Rmpi package
doRedis for use with the rredis package
doMC provides access to the multicore functionality of the parallel package
doSNOW for use with the now superseded SNOW package.

To use a parallel backend, you must first register it. Once a parallel backend is registered, calls to %dopar% run in parallel using the mechanisms provided by the parallel backend. However, the details of registering the parallel backends differ, so we consider them separately.

Using the doParallel parallel backend

The parallel package of R 2.14.0 and later combines elements of snow and multicore; doParallel similarly combines elements of both doSNOW and doMC. You can register doParallel with a cluster, as with doSNOW, or with a number of cores, as with doMC. For example, here we create a cluster and register it:

> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)

Once you’ve registered the parallel backend, you’re ready to run foreach code in parallel. For example, to see how long it takes to run 10,000 bootstrap iterations in parallel on all available cores, you can run the following code:

> x <- iris[which(iris[,5] != "setosa"), c(1, 5)]
> trials <- 10000
> ptime <- system.time({
+     r <- foreach(icount(trials), .combine = cbind) %dopar% {
+         ind <- sample(100, 100, replace = TRUE)
+         result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
+         coefficients(result1)
+    }
+ })[3]
> ptime

Getting information about the parallel backend

To find out how many workers foreach is going to use, you can use the getDoParWorkers function:

> getDoParWorkers()

This is a useful sanity check that you’re actually running in parallel. If you haven’t registered a parallel backend, or if your machine only has one core, getDoParWorkers will return 1. In either case, don’t expect a speed improvement.

The getDoParWorkers function is also useful when you want the number of tasks to be equal to the number of workers. You may want to pass this value to an iterator constructor, for example.

You can also get the name and version of the currently registered backend:

> getDoParName()
> getDoParVersion()

Nesting Calls to foreach

An important feature of foreach is nesting operator %:%. Like the %do% and %dopar% operators, it is a binary operator, but it operates on two foreach objects. It also returns a foreach object, which is essentially a special merger of its operands.

Let’s say that we want to perform a Monte Carlo simulation using a function called sim. The sim function takes two arguments, and we want to call it with all combinations of the values that are stored in the vectors avec and bvec. The following doubly-nested for loop does that. For testing purposes, the sim function is defined to return $10 a + b$ (although an operation this trivial is not worth executing in parallel):

sim <- function(a, b) 10 * a + b
avec <- 1:2
bvec <- 1:4

x <- matrix(0, length(avec), length(bvec))
for (j in 1:length(bvec)) {
        for (i in 1:length(avec)) {
                x[i, j] <- sim(avec[i], bvec[j])
                }
}
x

In this case, it makes sense to store the results in a matrix, so we create one of the proper size called x, and assign the return value of sim to the appropriate element of x each time through the inner loop.

When using foreach, we don’t create a matrix and assign values into it. Instead, the inner loop returns the columns of the result matrix as vectors, which are combined in the outer loop into a matrix. Here’s how to do that using the %:% operator:

x <-
    foreach(b = bvec, .combine = 'cbind') %:%
        foreach(a = avec, .combine = 'c') %do% {
            sim(a, b)
        }
x

This is structured very much like the nested for loop. The outer foreach is iterating over the values in “bvec”, passing them to the inner foreach, which iterates over the values in “avec” for each value of “bvec”. Thus, the “sim” function is called in the same way in both cases. The code is slightly cleaner in this version, and has the advantage of being easily parallelized.

When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is to parallelize the outer loop. This results in larger individual tasks, and larger tasks can often be performed more efficiently than smaller tasks. However, if the outer loop doesn’t have many iterations and the tasks are already large, parallelizing the outer loop results in a small number of huge tasks, which may not allow you to use all of your processors, and can also result in load-balancing problems. You could parallelize an inner loop instead, but that could be inefficient because you’re repeatedly waiting for all the results to be returned every time through the outer loop. And if the tasks and number of iterations vary in size, then it’s really hard to know which loop to parallelize.

But in our Monte Carlo example, all of the tasks are completely independent of each other, and so they can all be executed in parallel. You really want to think of the loops as specifying a single stream of tasks. You just need to be careful to process all of the results correctly, depending on which iteration of the inner loop they came from.

That is exactly what the %:% operator does: it turns multiple foreach loops into a single loop. That is why there is only one %do% operator in the example above. And when we parallelize that nested foreach loop by changing the %do% into a %dopar%, we are creating a single stream of tasks that can all be executed in parallel:

x <-
    foreach(b = bvec, .combine = 'cbind') %:%
        foreach(a = avec, .combine = 'c') %dopar% {
            sim(a, b)
        }
x

Of course, we’ll actually only run as many tasks in parallel as we have processors, but the parallel backend takes care of all that. The point is that the %:% operator makes it easy to specify the stream of tasks to be executed, and the .combine argument to foreach allows us to specify how the results should be processed. The backend handles executing the tasks in parallel.

For more on nested foreach calls, see the vignette “Nesting foreach Loops” in the foreach package.

Using Iterators

An iterator is a special type of object that generalizes the notion of a looping variable. When passed as an argument to a function that knows what to do with it, the iterator supplies a sequence of values. The iterator also maintains information about its state, in particular its current index.

The iterators package includes a number of functions for creating iterators, the simplest of which is iter, which takes virtually any R object and turns it into an iterator object. The simplest function that operates on iterators is the nextElem function, which when given an iterator, returns the next value of the iterator. For example, here we create an iterator object from the sequence 1 to 10, and then use nextElem to iterate through the values:

> i1 <- iter(1:10)
> nextElem(i1)
[1] 1
> nextElem(i1)
[2] 2

You can create iterators from matrices and data frames, using the by argument to specify whether to iterate by row or column:

> istate <- iter(state.x77, by = 'row')
> nextElem(istate)
            Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
    Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
> nextElem(istate)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
    Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432

Iterators can also be created from functions, in which case the iterator can be an endless source of values:

> ifun <- iter(function() sample(0:9, 4, replace=TRUE))
> nextElem(ifun)
[1] 9 5 2 8
> nextElem(ifun)
[1] 3 4 2 2

For practical applications, iterators can be paired with foreach to obtain parallel results quite easily:

> x <- matrix(rnorm(1000000), ncol = 1000)
> itx <- iter(x, by = 'row')
> foreach(i = itx, .combine = c) %dopar% mean(i)

Some Special Iterators

The notion of an iterator is new to R, but should be familiar to users of languages such as Python. The iterators package includes a number of special functions that generate iterators for some common scenarios. For example, the irnorm function creates an iterator for which each value is drawn from a specified random normal distribution:

> library(iterators)
> itrn <- irnorm(1, count = 10)
> nextElem(itrn)
[1] 0.6300053
> nextElem(itrn)
[1] 1.242886

Similarly, the irunif, irbinom, and irpois functions create iterators which draw their values from uniform, binomial, and Poisson distributions, respectively. (These functions use the standard R distribution functions to generate random numbers, and these are not necessarily useful in a distributed or parallel environment. When using random numbers with foreach, we recommend using the doRNG package to ensure independent random number streams on each worker.)

We can then use these functions just as we used irnorm:

> itru <- irunif(1, count = 10)
> nextElem(itru)
[1] 0.4960539
> nextElem(itru)
[1] 0.4071111

The icount function returns an iterator that counts starting from one:

> it <- icount(3)
> nextElem(it)
[1] 1
> nextElem(it)
[1] 2
> nextElem(it)
[1] 3

Writing Iterators

There will be times when you need an iterator that isn’t provided by the iterators package. That is when you need to write your own custom iterator.

Basically, an iterator is an S3 object whose base class is iter, and has iter and nextElem methods. The purpose of the iter method is to return an iterator for the specified object. For iterators, that usually just means returning itself, which seems odd at first. But the iter method can be defined for other objects that don’t define a nextElem method. We call those objects iterables, meaning that you can iterate over them. The iterators package defines iter methods for vectors, lists, matrices, and data frames, making those objects iterables. By defining an iter method for iterators, they can be used in the same context as an iterable, which can be convenient. For example, the foreach function takes iterables as arguments. It calls the iter method on those arguments in order to create iterators for them. By defining the iter method for all iterators, we can pass iterators to foreach that we created using any method we choose. Thus, we can pass vectors, lists, or iterators to foreach, and they are all processed by foreach in exactly the same way.

The iterators package comes with an iter method defined for the iter class that simply returns itself. That is usually all that is needed for an iterator. However, if you want to create an iterator for some existing class, you can do that by writing an iter method that returns an appropriate iterator. That will allow you to pass an instance of your class to foreach, which will automatically convert it into an iterator. The alternative is to write your own function that takes arbitrary arguments, and returns an iterator. You can choose whichever method is most natural.

The most important method required for iterators is nextElem. This simply returns the next value, or throws an error. Calling the stop function with the string StopIteration indicates that there are no more values available in the iterator.

In most cases, you don’t actually need to write the iter and nextElem methods; you can inherit them. By inheriting from the class abstractiter, you can use the following methods as the basis of your own iterators:

> iterators:::iter.iter
function (obj, ...)
{
        obj
}
<environment: namespace:iterators>
> iterators:::nextElem.abstractiter
function (obj, ...)
{
        obj$nextElem()
}
<environment: namespace:iterators>

The following function creates a simple iterator that uses these two methods:

iforever <- function(x) {
    nextEl <- function() x
    obj <- list(nextElem = nextEl)
    class(obj) <- c('iforever', 'abstractiter', 'iter')
    obj
}

Note that we called the internal function nextEl rather than nextElem to avoid masking the standard nextElem generic function. That causes problems when you want your iterator to call the nextElem method of another iterator, which can be quite useful.

We create an instance of this iterator by calling the iforever function, and then use it by calling the nextElem method on the resulting object:

it <- iforever(42)
nextElem(it)
nextElem(it)

Notice that it doesn’t make sense to implement this iterator by defining a new iter method, since there is no natural iterable on which to dispatch. The only argument that we need is the object for the iterator to return, which can be of any type. Instead, we implement this iterator by defining a normal function that returns the iterator.

This iterator is quite simple to implement, and possibly even useful, but exercise caution if you use it. Passing it to foreach will result in an infinite loop unless you pair it with a finite iterator. Similarly, never pass this iterator to as.list without the n argument.

The iterator returned by iforever is a list that has a single element named nextElem, whose value is a function that returns the value of x. Because we are subclassing abstractiter, we inherit a nextElem method that will call this function, and because we are subclassing iter, we inherit an iter method that will return itself.

Of course, the reason this iterator is so simple is because it doesn’t contain any state. Most iterators need to contain some state, or it will be difficult to make it return different values and eventually stop. Managing the state is usually the real trick to writing iterators.

As an example of writing a stateful iterator, let’s modify the previous iterator to put a limit on the number of values that it returns. We’ll call the new function irep, and give it another argument called times:

irep <- function(x, times) {
    nextEl <- function() {
        if (times > 0) {
            times <<- times - 1
        }
        else {
            stop('StopIteration')
        }
        x
    }
    obj <- list(nextElem = nextEl)
    class(obj) <- c('irep', 'abstractiter', 'iter')
    obj
}

Now let’s try it out:

it <- irep(7, 6)
unlist(as.list(it))

The real difference between iforever and irep is in the function that gets called by the nextElem method. This function not only accesses the values of the variables x and times, but it also modifies the value of times. This is accomplished by means of the <<- operator, and the magic of lexical scoping. Technically, this kind of function is called a closure, and is a somewhat advanced feature of R. The important thing to remember is that nextEl is able to get the value of variables that were passed as arguments to irep, and it can modify those values using the <<- operator. These are not global variables: they are defined in the enclosing environment of the nextEl function. You can create as many iterators as you want using the irep function, and they will all work as expected without conflicts.

Note that this iterator only uses the arguments to irep to store its state. If any other state variables are needed, they can be defined anywhere inside the irep function.

More examples of writing iterators can be found in the vignette “Writing Custom Iterators” in the iterators package.

مشاركة عبر