Replies: 1 comment 2 replies
We started developing a way to switch the parallel backend of
I've been using Hugging Face's `datasets` library for some computer vision tasks, and `dataset.map(..., num_proc > 1)` is really handy for running things in parallel. But most of my processing happens in C libraries that don't actually hold the GIL (OpenCV, NumPy, Pillow, etc.), so spinning up new subprocesses seems a bit overkill, especially because `dataset.map` uses the fork method, which may carry over some of the parent process's lifecycle callbacks, for example.

I wonder if it would be interesting for `dataset.map()` to receive an optional `pool: concurrent.futures.Executor` parameter, which would be either a `ProcessPoolExecutor` or a `ThreadPoolExecutor`, so the caller could choose which type of parallelization better suits their use case.

There is a gotcha with this proposal: `ProcessPoolExecutor` seems to use `spawn` instead of the `fork` approach currently used by `dataset.map`, so it wouldn't work with inner functions and lambdas. Because of this, we probably shouldn't replace the current `num_proc` way of doing it, but I wonder if a new `pool` parameter could be useful for more people.

If this proposal seems reasonable, I can prepare a PR to further discuss the implementation.
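To make the idea more concrete, here is a rough sketch of the calling pattern I have in mind. It does not touch `datasets` at all: `map_with_pool` and `brighten` are hypothetical names invented for illustration, and a plain list of dicts stands in for a dataset. The only real API used is `concurrent.futures`. The per-batch helper is kept at module level so the same code could, in principle, be handed a `ProcessPoolExecutor` as well, provided the mapped function is picklable.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import partial
from typing import Any, Callable, List


def _apply_to_batch(function: Callable[[Any], Any], batch: List[Any]) -> List[Any]:
    # Module-level helper (not a closure) so it can be pickled if the
    # caller passes a process-based executor.
    return [function(row) for row in batch]


def map_with_pool(
    rows: List[Any],
    function: Callable[[Any], Any],
    pool: Executor,
    batch_size: int = 2,
) -> List[Any]:
    """Hypothetical stand-in for `dataset.map(function, pool=pool)`."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    results: List[Any] = []
    # Executor.map yields results in submission order, so row order is kept.
    for processed in pool.map(partial(_apply_to_batch, function), batches):
        results.extend(processed)
    return results


def brighten(row: dict) -> dict:
    # Placeholder for real work done in a GIL-releasing C library.
    return {"pixels": [min(255, p + 10) for p in row["pixels"]]}


rows = [{"pixels": [0, 100, 250]}, {"pixels": [5, 5, 5]}, {"pixels": [255]}]

# The caller picks the executor; threads avoid spawning subprocesses.
with ThreadPoolExecutor(max_workers=4) as pool:
    brightened = map_with_pool(rows, brighten, pool)
```

With a thread pool, lambdas and inner functions work as the mapped function because nothing is pickled; the restriction mentioned above would only apply when a process-based executor is passed in.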
EDIT: related bug report: #5976