Chunking in future.apply
future.apply currently relies on the internal makeChunks function (makeChunks.R) to partition the elements of the input into "chunks" that are sent to workers for processing. makeChunks returns a list of integer vectors, where each vector is a "chunk" whose elements are the indices of the elements to be processed in the input object (often a list).
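To make the shape of that output concrete, here is a minimal sketch. It assumes `parallel::splitIndices()` as a stand-in for the internal makeChunks; the real function may partition elements differently.

```r
# Illustrative only: parallel::splitIndices() stands in for the internal
# makeChunks(); the real function may partition differently.
X <- as.list(letters[1:10])   # input object with 10 elements
n_workers <- 3L

chunks <- parallel::splitIndices(length(X), n_workers)
str(chunks)   # a list of integer vectors, one vector per chunk

# Each chunk holds indices into X, so a worker would process X[chunk]
first_chunk_elements <- X[chunks[[1]]]
```

Every index of X appears in exactly one chunk, which is the property any replacement for these chunks would also need to satisfy.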
Users have some control over the generation of chunks via the future.apply arguments future.chunk.size (which specifies the average number of elements per chunk) and future.scheduling (which specifies the average number of chunks per worker). Furthermore, they can control how elements are ordered across chunks with the ordering attribute of future.chunk.size or future.scheduling.
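For reference, a short sketch of those existing controls, based on my reading of the future.apply documentation. The toy function `square` and the sequential plan are assumptions made only to keep the example portable.

```r
if (requireNamespace("future.apply", quietly = TRUE)) {
  # Sequential backend keeps the sketch portable; any plan would do
  future::plan(future::sequential)

  X <- 1:20
  square <- function(x) x^2   # toy workload, hypothetical

  # Roughly 5 elements per chunk
  res_size <- future.apply::future_lapply(X, square, future.chunk.size = 5)

  # On average two chunks per worker
  res_sched <- future.apply::future_lapply(X, square, future.scheduling = 2.0)

  # Assign elements to chunks at random via the `ordering` attribute
  res_rand <- future.apply::future_lapply(
    X, square,
    future.scheduling = structure(TRUE, ordering = "random")
  )
}
```

In all three calls the results come back in the original element order; only the grouping of elements into chunks changes.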
Nevertheless, this control is limited, and even the sensible defaults of makeChunks can produce substantial load imbalance across workers, with resulting inefficiency. Some of this inefficiency could be reduced if users had better control over chunk generation. While some of it may be averted by the dynamic balancing of future.apply, the costs of dynamic balancing itself can be non-trivial.
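A toy illustration of the imbalance, using made-up per-element costs. The contiguous split is only a stand-in for default chunking, and the greedy assignment is one simple static balancing heuristic, not something future.apply does.

```r
# Hypothetical per-element costs (seconds); a few elements dominate
cost <- c(8, 7, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1)
n_workers <- 3L

# Contiguous, equal-sized chunks: one chunk of 4 elements per worker
contiguous <- split(seq_along(cost), rep(seq_len(n_workers), each = 4))
load_contiguous <- vapply(contiguous, function(idx) sum(cost[idx]), numeric(1))

# Greedy balancing: assign elements, most expensive first, to the
# currently least-loaded worker
balanced <- rep(list(integer(0)), n_workers)
load_balanced <- numeric(n_workers)
for (i in order(cost, decreasing = TRUE)) {
  w <- which.min(load_balanced)
  balanced[[w]] <- c(balanced[[w]], i)
  load_balanced[w] <- load_balanced[w] + cost[i]
}

# Wall time is set by the slowest worker
max(load_contiguous)  # 22
max(load_balanced)    # 10
```

With these costs the contiguous split leaves two workers idle most of the time, while a user who knows the costs in advance could supply the balanced chunks directly.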
The Purpose of makeChunks
Currently, makeChunks accomplishes two tasks:
- Generates chunks by partitioning elements to be processed.
- Specifies the order in which chunks are processed.
Ideally, both tasks would be redundant. The former would be unnecessary if the elements of the object users pass to future.apply were already the chunks they want processed, with nbrOfElements == nbrOfWorkers. The latter would be unnecessary if those elements were already indexed in the order in which they should be processed. This would allow for efficient static load balancing, with chunks already balanced and one chunk per worker. However, users often pass objects where the ordering is ad hoc and the chunking is not planned.
Adding customChunks
I envision two approaches to improving the flexibility of chunking in future.apply:
- Add a customChunks argument to future.apply functions
Users could pass a list to customChunks. future.apply would use this list instead of the list that makeChunks returns. If is.null(customChunks) == TRUE, then the status quo internal makeChunks function is used. If is.null(customChunks) == FALSE, makeChunks is not executed and all other chunk-related arguments are ignored.
The primary motivation for this is that users may wish to (a) have complete control over chunking and the ordering of chunks, (b) do so without modifying their input to future.apply (i.e. avoid creating deeper objects or repeatedly rearranging the elements of their object just for processing), and (c) write more interpretable code that distinguishes between the input object, the plan for processing, and the processing itself. This also helps decouple functions for working in parallel from functions for serial pre-processing.
- Add a customChunks argument to future.apply functions and export makeChunks
Users could pass their object to makeChunks and pass the result to the customChunks argument of future.apply. In the event that is.null(customChunks) == TRUE, future.apply would call makeChunks as usual. This would allow users to generate chunks with makeChunks either inside or outside of future.apply. The upside to this is that users can directly observe and edit the output of makeChunks.
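A rough sketch of the semantics I have in mind for both approaches. Everything here is hypothetical: `chunked_lapply()` is not a future.apply function, `parallel::splitIndices()` stands in for makeChunks, and base `lapply()` stands in for dispatching each chunk to a worker.

```r
# Hypothetical sketch: chunked_lapply() and customChunks are not part of
# future.apply; base lapply() stands in for sending chunks to workers.
chunked_lapply <- function(X, FUN, ..., customChunks = NULL, n_workers = 2L) {
  if (is.null(customChunks)) {
    # Status quo: default chunking (stand-in for the internal makeChunks)
    chunks <- parallel::splitIndices(length(X), n_workers)
  } else {
    # User-supplied chunks must cover every index of X exactly once
    stopifnot(
      is.list(customChunks),
      identical(sort(as.integer(unlist(customChunks))), seq_along(X))
    )
    chunks <- customChunks
  }

  # Process chunks in the order given; each chunk would become one future
  per_chunk <- lapply(chunks, function(idx) lapply(X[idx], FUN, ...))

  # Reassemble results in the original element order
  out <- vector("list", length(X))
  out[unlist(chunks)] <- unlist(per_chunk, recursive = FALSE)
  names(out) <- names(X)
  out
}

X <- list(a = 1:3, b = 1:1000, c = 4:6, d = 1:900)

# Put the two large elements in separate chunks and process them first
my_chunks <- list(c(2L, 1L), c(4L, 3L))
res_custom <- chunked_lapply(X, sum, customChunks = my_chunks)

# Without customChunks, the default chunking is used
res_default <- chunked_lapply(X, sum)
```

The input X is never rearranged; the chunk list alone carries the processing plan, and the result matches what `lapply(X, sum)` would return.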
I could submit a pull request implementing this, but I'm not sure when or how makeChunks is called internally. I don't see it in the makefiles or in the definitions of the future.apply functions.