@@ -302,85 +302,6 @@ void ProducerInputOutputCUDA::produce(edm::Event& iEvent, edm::EventSetup& iSetu
302302[Complete example](../CUDATest/plugins/TestCUDAProducerGPUEW.cc)
303303
304304
305- ### Producer with CUDA input and output, and internal chain of CPU and GPU tasks (with ExternalWork)
306-
307- ``` cpp
308- class ProducerInputOutputCUDA : public edm::stream::EDProducer<ExternalWork> {
309- public:
310-   ...
311-   void acquire(edm::Event const& iEvent, edm::EventSetup const& iSetup, edm::WaitingTaskWithArenaHolder waitingTaskHolder) override;
312-   void produce(edm::Event& iEvent, edm::EventSetup const& iSetup) override;
313-   ...
314- private:
315-   void addMoreWork(cms::cuda::ScopedContextTask& ctx);
316- 
317-   ...
318-   ProducerInputGPUAlgo gpuAlgo_;
319-   edm::EDGetTokenT<cms::cuda::Product<InputData>> inputToken_;
320-   edm::EDPutTokenT<cms::cuda::Product<OutputData>> outputToken_;
321- };
322- ...
323- void ProducerInputOutputCUDA::acquire(edm::Event const& iEvent, edm::EventSetup const& iSetup, edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
324-   cms::cuda::Product<InputData> const& inputDataWrapped = iEvent.get(inputToken_);
325- 
326-   // Set the current device to the same one that was used to produce
327-   // InputData, and also use the same CUDA stream
328-   cms::cuda::ScopedContextAcquire ctx{inputDataWrapped, std::move(waitingTaskHolder), ctxState_};
329- 
330-   // Grab the real input data. Checks that the input data is on the
331-   // current device. If the input data was produced in a different CUDA
332-   // stream than the one cms::cuda::ScopedContextAcquire holds, create an
333-   // inter-stream synchronization point with a CUDA event and cudaStreamWaitEvent()
334-   auto const& inputData = ctx.get(inputDataWrapped);
335- 
336-   // Queue asynchronous data transfers and kernels to the CUDA stream
337-   // returned by cms::cuda::ScopedContextAcquire::stream()
338-   gpuAlgo_.makeAsync(inputData, ctx.stream());
339- 
340-   // Push a functor on top of the "stack of tasks" to be run as the next
341-   // task after the work queued above, before produce(). Here the ctx
342-   // argument is a context constructed by the calling TBB task, so the
343-   // current device and CUDA stream have already been set up. The ctx
344-   // internally holds the WaitingTaskWithArenaHolder for the next task.
345- 
346-   ctx.pushNextTask([this](cms::cuda::ScopedContextTask ctx) {
347-     addMoreWork(ctx);
348-   });
349- 
350-   // The destructor of ctx queues a callback to the CUDA stream that notifies
351-   // waitingTaskHolder when the queued asynchronous work has finished,
352-   // and saves the device and CUDA stream to ctxState_
353- }
354- 
355- // Called after the asynchronous work queued in acquire() has finished
356- void ProducerInputOutputCUDA::addMoreWork(cms::cuda::ScopedContextTask& ctx) {
357-   // The current device and CUDA stream have already been set
358- 
359-   // Queue more asynchronous data transfers and kernels to the CUDA
360-   // stream returned by cms::cuda::ScopedContextTask::stream()
361-   gpuAlgo_.makeMoreAsync(ctx.stream());
362- 
363-   // The destructor of ctx queues a callback to the CUDA stream that notifies
364-   // waitingTaskHolder when the queued asynchronous work has finished
365- }
366- 
367- // Called after the asynchronous work queued in addMoreWork() has finished
368- void ProducerInputOutputCUDA::produce(edm::Event& iEvent, edm::EventSetup const& iSetup) {
369-   // Set the current device again, and use the CUDA stream created in acquire()
370-   cms::cuda::ScopedContextProduce ctx{ctxState_};
371- 
372-   // Now getResult() returns data in GPU memory that is passed to the
373-   // constructor of OutputData. cms::cuda::ScopedContextProduce::emplace() wraps
374-   // the OutputData into cms::cuda::Product<OutputData>. cms::cuda::Product<T>
375-   // also stores the current device and the CUDA stream, since those will be
376-   // needed on the consumer side.
377-   ctx.emplace(iEvent, outputToken_, gpuAlgo_.getResult());
378- }
379- ```
380-
381- [Complete example](../CUDATest/plugins/TestCUDAProducerGPUEWTask.cc)
382-
383-
384305### Producer with CUDA input and output (without ExternalWork)
385306
386307If the producer does not need to transfer anything back to CPU (like
@@ -791,79 +712,6 @@ The `cms::cuda::ScopedContextAcquire` saves its state to the `ctxState_` in
791712the destructor, and `cms::cuda::ScopedContextProduce` then restores the
792713context.
793714
794- #### Module-internal chain of CPU and GPU tasks
795- 
796- Technically `ExternalWork` works such that the framework calls
797- `acquire()` with an `edm::WaitingTaskWithArenaHolder` that holds an
798- `edm::WaitingTask` (which inherits from `tbb::task`) for calling
799- `produce()` with `std::shared_ptr`-like semantics: the task is spawned
800- when its reference count hits `0`. It is also possible to create a longer
801- chain of such tasks, alternating between CPU and GPU work. This mechanism
802- can also be used to re-run (part of) the GPU work.
803-
804- The "next tasks" to run are essentially structured as a stack, such
805- that
806- - `cms::cuda::ScopedContextAcquire`/`cms::cuda::ScopedContextTask::pushNextTask()`
807-   pushes a new functor on top of the stack
808- - Completion of both the asynchronous work and the queueing function
809-   pops the top task off the stack and enqueues it (so that TBB
810-   eventually runs the task)
811-   * Technically the task is made eligible to run when all copies of
812-     the `edm::WaitingTaskWithArenaHolder` of the `acquire()` (or "previous"
813-     function) have either been destructed or their `doneWaiting()`
814-     has been called
815-   * The code calling `acquire()` or the functor holds one copy of
816-     the `edm::WaitingTaskWithArenaHolder`, so it is guaranteed that
817-     the next function will not run before the earlier one has
818-     finished
819-
820- Below is an example of how to push a functor on top of the stack of tasks
821- to run next (following the example of the previous section):
822- ``` cpp
823- void FooProducerCUDA::acquire(...) {
824-   ...
825-   ctx.pushNextTask([this](cms::cuda::ScopedContextTask ctx) {
826-     ...
827-   });
828-   ...
829- }
830- ```
831-
832- In this case the `ctx` argument to the function is a
833- `cms::cuda::ScopedContextTask` object constructed by the TBB task calling the
834- user-given function. It follows that the current device and CUDA
835- stream have already been set up. `pushNextTask()` can be called
836- many times; each invocation pushes a new task
837- on top of the stack (i.e. to the front of the chain). It follows that in
838- ```cpp
839- void FooProducerCUDA::acquire(...) {
840-   ...
841-   ctx.pushNextTask([this](cms::cuda::ScopedContextTask ctx) {
842-     ... // function 1
843-   });
844-   ctx.pushNextTask([this](cms::cuda::ScopedContextTask ctx) {
845-     ... // function 2
846-   });
847-   ctx.pushNextTask([this](cms::cuda::ScopedContextTask ctx) {
848-     ... // function 3
849-   });
850-   ...
851- }
852- ```
853- the functions will be run in the order 3, 2, 1.
854-
855- **Note** that the `CUDAService` is **not** available (nor is any other
856- service) in these intermediate tasks. In the near future memory
857- allocations etc. will be made possible by moving them out of
858- the `CUDAService`.
859-
860- The `cms::cuda::ScopedContextAcquire`/`cms::cuda::ScopedContextTask` also have a
861- more generic member function, `replaceWaitingTaskHolder()`, that can
862- be used to simply replace the currently-held
863- `edm::WaitingTaskWithArenaHolder` (which will get notified by the
864- callback function) with anything. In this case the caller is
865- responsible for creating the task(s) and setting up the chain of them.
866-
867715
868716#### Transferring GPU data to CPU
869717