| Status        | Proposed                                              |
| :------------ | :---------------------------------------------------- |
| **Author(s)** | James Ring ([email protected]), Anna Revinskaya ([email protected]) |
| **Sponsor**   | Günhan Gülsoy ([email protected])                  |
| **Updated**   | 2019-08-14                                            |

Ideally, this API will be as close as possible to the existing non-ABI-stable
TensorFlow C++ API, so that kernels and ops currently implemented in C++ may be
ported to the ABI-stable C++ API with as little implementation churn as
possible.

## Device C API for Kernels

So far, this document has not dealt with the challenges of providing an
ABI-stable API for kernels that run on GPUs. This section describes an API that
addresses these challenges.

There are a few approaches kernels currently take to schedule computation on
devices:

* Assign computation to an Eigen device (see, for example, `OneHot`,
  `Transpose`, and the training ops). (>200 occurrences in TensorFlow)

* Call `device.parallelFor` (see, for example, `BatchSelect`). (4 occurrences)

* Call `ThreadPool::ParallelFor` (see, for example, `MatrixDiag`). This is a
  TensorFlow wrapper that eventually calls into Eigen; specifically,
  `ThreadPool::ParallelFor` calls `device.parallelFor` in Eigen. (29
  occurrences)

* Call `Shard` (e.g. `CTCGreedyDecoder`). This approach is deprecated in favor
  of `ThreadPool::TransformRangeConcurrently`, but no kernels use the latter
  yet. (42 occurrences)

* Call `GpuLaunchKernel` or `CudaLaunchKernel` directly, i.e. without calling
  Eigen. (58 occurrences)

* Call `StreamExecutor` directly (e.g. the `MatMul` op).

* Possibly others.

In all of the approaches above, TensorFlow core is responsible for maintaining
the respective device queues, streams, or thread pools, and kernels use these
to schedule computation. Therefore, our primary goal is to implement a C API
that enables this scheduling. To give an example, one approach we can take is
to have the kernel pass a callback across the C API; TensorFlow core would
then invoke this callback. See the diagram below:

![Kernel passing a callback across the C API to TensorFlow core](20190814-kernel-and-op-registration/device_api.png)

Furthermore, note that most of the approaches listed above eventually call into
Eigen to parallelize computation and forward it to the device. For example, the
first approach uses Eigen APIs directly. Consequently, we need to understand
how Eigen works with devices and, in some cases, make changes to the Eigen
codebase as well.

Finally, we should aim to keep the new API small. Some of the approaches listed
above are very similar to one another; for example, calling `parallelFor` in
Eigen is quite similar to calling `ThreadPool::ParallelFor`. Therefore, we will
only provide C API equivalents for the following:

* `ThreadPool` and its methods.

* The `CudaLaunchKernel` function.

* Computation assignment to a device in Eigen.

This proposal focuses on these three components for now. Due to the complexity
and variety of TensorFlow kernels, it is very likely that we will need to
consider more approaches going forward. For example, how the `MatMul` op would
call `StreamExecutor` directly has not yet been investigated.

### ThreadPool API

Here, we can simply wrap the relevant methods of the `ThreadPool` class:

```c++
TF_CAPI_EXPORT extern void TF_ThreadPool_Schedule(
    TF_OpKernelContext* context,
    void (*fn)());

TF_CAPI_EXPORT extern void TF_ThreadPool_ScheduleWithHint(
    TF_OpKernelContext* context,
    void (*fn)(),
    int start,
    int limit);

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelFor(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t));

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelForWithWorkerId(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t, int));
```

Note that we just pass a `TF_OpKernelContext` instead of a `ThreadPool`
instance. The implementation of these functions on the TensorFlow core side
can then retrieve the actual `ThreadPool` object using:

```c++
OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
auto thread_pool =
    ctx->device()->tensorflow_cpu_worker_threads()->workers;
```

For details on how we plan to convert between `std::function<void()>` and
`void (*fn)()`, see Appendix 1 below.
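
To make the core side of these wrappers concrete, here is a minimal sketch of
how `TF_ThreadPool_ParallelFor` might be implemented in terms of the lookup
above. This is an illustration rather than the final implementation; a
capture-free function pointer converts to `std::function` implicitly, while
callbacks that capture state need the `FuncWrap` machinery from Appendix 1:

```c++
void TF_ThreadPool_ParallelFor(TF_OpKernelContext* context, int64_t total,
                               int64_t cost_per_unit,
                               void (*fn)(int64_t, int64_t)) {
  OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
  auto* thread_pool =
      ctx->device()->tensorflow_cpu_worker_threads()->workers;
  // The plain function pointer converts implicitly to the
  // std::function<void(int64, int64)> that ThreadPool::ParallelFor expects.
  thread_pool->ParallelFor(total, cost_per_unit, fn);
}
```
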
398 | | -
|
### Device Assignment API

This approach lets us construct device objects (e.g. `Eigen::GpuDevice`) on the
plugin side. Essentially, we get an Eigen device object and can apply to it any
operation we currently apply to an Eigen device.

We could wrap `Eigen::StreamInterface`, `Eigen::ThreadPoolInterface`, and
`Eigen::Allocator`. These wrappers will consist of a C API and a C++ wrapper on
top of the C API. A sample C API for `StreamInterface` is given below:

```c++
TF_CAPI_EXPORT extern TF_EigenStream* TF_GetEigenStreamHandle(
    TF_OpKernelContext*);
TF_CAPI_EXPORT extern gpuStream_t* TF_EigenStream_GetCudaStream(
    TF_EigenStream*);
TF_CAPI_EXPORT extern gpuDeviceProp_t* TF_EigenStream_GetDeviceProperties(
    TF_EigenStream*);
TF_CAPI_EXPORT extern void* TF_EigenStream_Allocate(
    TF_EigenStream*, size_t num_bytes);
TF_CAPI_EXPORT extern void TF_EigenStream_Deallocate(
    TF_EigenStream*, void* buffer);
TF_CAPI_EXPORT extern void* TF_EigenStream_Scratchpad(
    TF_EigenStream*);
TF_CAPI_EXPORT extern unsigned int* TF_EigenStream_Semaphore(
    TF_EigenStream*);
// Deletes only the C API handle; the underlying stream is owned by
// TensorFlow core.
TF_CAPI_EXPORT extern void TF_DeleteEigenStreamHandle(
    TF_EigenStream*);
```

The following C++ class wraps the C API to provide a `StreamInterface`
implementation on the kernel plugin side:

```c++
class EigenGpuStream : public Eigen::StreamInterface {
 public:
  explicit EigenGpuStream(TF_EigenStream* eigen_stream)
      : eigen_stream_(eigen_stream) {}

  const gpuStream_t& stream() const override {
    return *TF_EigenStream_GetCudaStream(eigen_stream_);
  }

  const gpuDeviceProp_t& deviceProperties() const override {
    return *TF_EigenStream_GetDeviceProperties(eigen_stream_);
  }

  void* allocate(size_t num_bytes) const override {
    return TF_EigenStream_Allocate(eigen_stream_, num_bytes);
  }

  void deallocate(void* buffer) const override {
    TF_EigenStream_Deallocate(eigen_stream_, buffer);
  }

  void* scratchpad() const override {
    return TF_EigenStream_Scratchpad(eigen_stream_);
  }

  unsigned int* semaphore() const override {
    return TF_EigenStream_Semaphore(eigen_stream_);
  }

 private:
  TF_EigenStream* eigen_stream_;
};
```

Now, a kernel can create an instance of `Eigen::GpuDevice` using this stream:

```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
EigenGpuStream gpu_stream(eigen_stream);
Eigen::GpuDevice device(&gpu_stream);
...
tensor->device(device) = < computation >;
...
TF_DeleteEigenStreamHandle(eigen_stream);
```

Note: `gpuStream_t` and `gpuDeviceProp_t` might be aliased to ROCm's objects
instead of Cuda structs. See Appendix 2 for details on how we are going to
handle ROCm support.

Wrapping `Allocator` using a similar approach should be trivial. However,
`ThreadPoolInterface` takes a `std::function<void()>`, and this approach would
require passing a `std::function` across the C API, which is non-trivial. For
details on how we are going to handle this, see Appendix 1.

### Alternative for GPU Device Assignment API

We can take an approach similar to the CPU device assignment API. On the CPU
side, the corresponding Eigen object, `ThreadPoolInterface`, has a `Schedule`
method that schedules a kernel function on a thread pool.

Similarly, we can add a `Launch`/`Schedule` function to `StreamInterface`. The
default implementation would have the same behavior as `LAUNCH_GPU_KERNEL` in
Eigen. However, we can customize it on the TensorFlow side and implement the
launch logic in core TensorFlow instead of in the kernel. This way,
`cudaStream_t` and `hipStream_t` only need to be referenced in core.

<!-- TODO: add examples that are currently only available internally -->

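As a rough sketch of the shape this could take, the plugin-side stream wrapper
might expose the new method as follows. The `launchKernel` method and the
`TF_EigenStream_LaunchKernel` C function are hypothetical; neither exists in
Eigen or TensorFlow today:

```c++
class EigenGpuStream : public Eigen::StreamInterface {
 public:
  // ... methods from the previous section ...

  // Hypothetical addition: forward the launch across the C API so that only
  // core TensorFlow touches cudaStream_t/hipStream_t. Eigen's default
  // implementation would behave like LAUNCH_GPU_KERNEL.
  void launchKernel(const void* kernel, dim3 grid_dim, dim3 block_dim,
                    void** args, size_t shared_mem) const {
    TF_EigenStream_LaunchKernel(eigen_stream_, kernel, grid_dim, block_dim,
                                args, shared_mem);
  }

 private:
  TF_EigenStream* eigen_stream_;
};
```
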
Advantages of this approach:

* We don't need to pass `hipStream_t` and `cudaStream_t` across the API
  boundary.

* It supports customization of the `launchKernel` call, which might be useful
  if we want to handle it differently later.

Disadvantages of this approach:

* It is a more invasive change to Eigen.

### CudaLaunchKernel API

`CudaLaunchKernel` appears to be a fairly thin wrapper around
`cudaLaunchKernel`, which is part of the Cuda Runtime library's C API.

For reference, this is the signature of `cudaLaunchKernel`:

```c++
extern __host__ cudaError_t CUDARTAPI cudaLaunchKernel(
    const void *func,
    dim3 gridDim,
    dim3 blockDim,
    void **args,
    size_t sharedMem,
    cudaStream_t stream);
```

where `dim3` is a simple struct and `cudaStream_t` is an opaque handle type.
This is trivial either to wrap with the TensorFlow C API or to call into
directly from plugins.

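For illustration, such a wrapper could be as thin as the sketch below;
`TF_GpuLaunchKernel` and the way the stream is recovered from the context are
assumptions, not an existing API:

```c++
// Hypothetical wrapper: the plugin supplies the kernel function and launch
// geometry, while core supplies the stream, so cudaStream_t never crosses
// the API boundary.
void TF_GpuLaunchKernel(TF_OpKernelContext* context, const void* func,
                        dim3 grid_dim, dim3 block_dim, void** args,
                        size_t shared_mem) {
  cudaStream_t stream = GetDeviceStream(context);  // hypothetical lookup
  cudaLaunchKernel(func, grid_dim, block_dim, args, shared_mem, stream);
}
```
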
However, the ROCm side of things is harder than Cuda: `gpuLaunchKernel` might
call ROCm's `hipLaunchKernelGGL` instead, whose signature uses templates.
Fortunately, AMD is planning to add an equivalent function that provides a C
API (see Appendix 2 for details).

### Getting Status when using device APIs

The kernel device APIs described in this document rely on wrapping certain
Eigen interfaces, such as `Eigen::StreamInterface`, to provide a C API.
Implementations of these interfaces might set an `OpKernelContext` status,
which is not on the interface surface. Therefore, I propose that we add a new
function that updates a given `TF_Status` with the current `OpKernelContext`
status:

```c++
TF_CAPI_EXPORT extern void TF_OpKernelContext_UpdateStatus(
    TF_OpKernelContext* context, TF_Status* status);
```

This would allow kernel implementations to return as soon as they see a
failing status. For example:

```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
... run computation using eigen_stream ...

TF_Status* context_status = TF_NewStatus();
TF_OpKernelContext_UpdateStatus(context, context_status);
if (TF_GetCode(context_status) != TF_OK) {
  TF_DeleteStatus(context_status);
  return;
}
TF_DeleteStatus(context_status);
```

## Appendix 1: Passing `std::function` across the C API

Certain parts of our design involve kernel plugins calling a function in
TensorFlow core of the form:

```c++
void foo(std::function<void()> arg) { ... }
```

We can't pass a `std::function` across the C API boundary. Instead, we plan to
wrap it in a struct and break this call up into three steps:

* Wrap the `std::function<void()>` in a struct. The struct contains pointers
  to callbacks that manipulate the `std::function<void()>` pointer. (This
  happens on the kernel plugin side.)

* Pass the struct across the C API boundary.

* Wrap the struct in a callable object that can be used as a
  `std::function<void()>`. (This happens on the TensorFlow core side.)

Step 1: The wrapper struct will be defined as follows:

```c++
// Wraps std::function<void()> so that it can be called across the C API.
struct FuncWrap {
  void* func_ptr;  // pointer to a heap-allocated std::function<void()>

  // Function that takes the std::function<void()> pointer as an argument
  // and calls that function.
  void (*call_func_ptr)(void*);

  // Function that takes the std::function<void()> pointer as an argument
  // and deletes it.
  void (*delete_func_ptr)(void*);
};
```

Note that we need to move the `std::function` to the heap because `FuncWrap`
might be placed in a queue and called later. `FuncWrap` construction happens
on the kernel plugin side and will have the following implementation:

```c++
// Wraps a std::function<void()> in a FuncWrap struct.
FuncWrap get_func_wrap(std::function<void()> f) {
  // Move the function to the heap.
  auto* f_heap = new std::function<void()>(std::move(f));

  return {
      // Argument to pass to the callbacks below.
      f_heap,
      // Callback that calls f_heap.
      [](void* wrapped_f) {
        auto* f_std = static_cast<std::function<void()>*>(wrapped_f);
        (*f_std)();
      },
      // Callback that deletes f_heap.
      [](void* wrapped_f) {
        auto* f_std = static_cast<std::function<void()>*>(wrapped_f);
        delete f_std;
      }};
}
```

Step 2: A `FuncWrap` struct constructed in this manner can now be passed
across the C API to core.

Step 3: Since we place the `std::function` on the heap, we need to manage its
deletion. Therefore, we wrap it in a class on the TensorFlow core side so that
it can be deleted once all references are gone:

```c++
class CallFuncWrap {
 public:
  explicit CallFuncWrap(FuncWrap wrap)
      : wrap_(new FuncWrap(wrap), [](FuncWrap* ptr) {
          ptr->delete_func_ptr(ptr->func_ptr);
          delete ptr;
        }) {}

  void operator()() { wrap_->call_func_ptr(wrap_->func_ptr); }

 private:
  // CallFuncWrap might be copied when it is passed to functions taking
  // std::function as an argument. We use a shared_ptr so that there is only
  // one copy of FuncWrap even if CallFuncWrap is copied; the pointer stored
  // in FuncWrap must only be deleted once.
  std::shared_ptr<FuncWrap> wrap_;
};
```

Now, a `CallFuncWrap` instance can be passed wherever a `std::function<void()>`
argument is expected:

```c++
CallFuncWrap call_func_wrap(func_wrap);
foo(call_func_wrap);  // foo takes a std::function<void()> argument
```

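Putting the three steps together, an end-to-end flow for a stateful callback
might look like the sketch below. The `FuncWrap`-taking variant of
`TF_ThreadPool_Schedule` and the `RunStep` callback are assumptions for
illustration:

```c++
// Plugin side: wrap a capturing lambda and pass it across the C API.
int step = 42;
FuncWrap wrap = get_func_wrap([step]() { RunStep(step); });
TF_ThreadPool_Schedule(context, wrap);

// Core side: rewrap the struct as a callable and hand it to the thread
// pool, which accepts std::function<void()>.
void TF_ThreadPool_Schedule(TF_OpKernelContext* context, FuncWrap wrap) {
  OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
  ctx->device()->tensorflow_cpu_worker_threads()->workers->Schedule(
      CallFuncWrap(wrap));
}
```
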
## Appendix 2: Working with ROCm across the C API

We need to access `hipStream_t` on both sides of the C API. Since its
implementation is actually in C++, we will treat it as an opaque pointer that
we get from a HIP function (on the TensorFlow core side) and pass to another
HIP function (on the kernel side).

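A minimal sketch of this opaque-pointer pattern, with illustrative HIP calls
(`dst`, `src`, and `num_bytes` stand in for real kernel arguments):

```c++
// Core side: create the stream with HIP's C API and expose it opaquely.
hipStream_t stream;
hipStreamCreate(&stream);
void* opaque_stream = stream;  // hipStream_t is itself a pointer type

// Kernel plugin side: pass the opaque pointer straight back into HIP C API
// calls without depending on the C++ definition behind hipStream_t.
hipMemcpyAsync(dst, src, num_bytes, hipMemcpyDeviceToDevice,
               static_cast<hipStream_t>(opaque_stream));
```
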
Ideally, we should only rely on the `extern "C"` parts of `hip_runtime_api.h`.
There is currently no C API equivalent of `hipLaunchKernelGGL`, but AMD might
add one in the near future.

Note that we will have to update `LAUNCH_GPU_KERNEL` in Eigen to call the HIP
C API once it becomes available.