| Status        | Proposed                                              |
| :------------ | :---------------------------------------------------- |
| **Author(s)** | James Ring ([email protected]), Anna Revinskaya ([email protected]) |
| **Sponsor**   | Günhan Gülsoy ([email protected])                  |
| **Updated**   | 2019-08-14                                            |

Ideally, this API will be as close as possible to the existing non-ABI-stable
TensorFlow C++ API, so that kernels and ops currently implemented in C++ may be
ported to the ABI-stable C++ API with as little implementation churn as
possible.

## Device C API for Kernels

So far, this document has not dealt with the challenges of providing an
ABI-stable API for kernels that run on GPUs. This section describes an API that
addresses these challenges.

There are a few approaches kernels currently take to schedule computation on
devices:

* Assign computation to an Eigen device (see, for example, `OneHot`,
  `Transpose`, and the training ops). (>200 occurrences in TensorFlow)

* Call `device.parallelFor` (see, for example, `BatchSelect`). (4 occurrences)

* Call `ThreadPool::ParallelFor` (see, for example, `MatrixDiag`). This is a
  TensorFlow wrapper that eventually calls into Eigen; specifically,
  `ThreadPool::ParallelFor` calls `device.parallelFor` in Eigen. (29
  occurrences)

* Call `Shard` (e.g. `CTCGreedyDecoder`). This approach is deprecated in favor
  of `ThreadPool::TransformRangeConcurrently`, but no kernels use the latter
  yet. (42 occurrences)

* Call `GpuLaunchKernel` or `CudaLaunchKernel` directly, i.e. without calling
  Eigen. (58 occurrences)

* Call `StreamExecutor` directly (e.g. the `MatMul` op).

* Possibly others.

In all of the approaches above, TensorFlow core is responsible for maintaining
the respective device queues, streams, or thread pools, and kernels use these
to schedule computation. Therefore, our primary goal is to implement a C API
that enables this scheduling. To give an example, one approach we can take is
to have the kernel pass a callback across the C API; TensorFlow core would
then invoke this callback. See the diagram below:

![Kernel passing a callback across the C API to TensorFlow core](20190814-kernel-and-op-registration/device_api.png)

Furthermore, note that most of the approaches listed above eventually call into
Eigen to parallelize computation and forward it to the device. For example, the
first approach uses Eigen APIs directly. Consequently, we need to understand
how Eigen works with devices and, in some cases, make changes to the Eigen
codebase as well.

Finally, we should aim to keep the new API small. Some of the approaches listed
above are very similar to one another; for example, calling `parallelFor` in
Eigen is quite similar to calling `ThreadPool::ParallelFor`. Therefore, we will
only provide C API equivalents for the following:

* `ThreadPool` and its methods.

* The `CudaLaunchKernel` function.

* Computation assignment to a device in Eigen.

This proposal focuses on these three components for now. Due to the complexity
and variety of TensorFlow kernels, it is very likely that we will need to
consider more approaches going forward. For example, how the `MatMul` op would
call `StreamExecutor` directly has not yet been investigated.

### ThreadPool API

Here, we can simply wrap the relevant methods of the `ThreadPool` class:

```c++
TF_CAPI_EXPORT extern void TF_ThreadPool_Schedule(
    TF_OpKernelContext* context,
    void (*fn)());

TF_CAPI_EXPORT extern void TF_ThreadPool_ScheduleWithHint(
    TF_OpKernelContext* context,
    void (*fn)(),
    int start,
    int limit);

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelFor(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t));

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelForWithWorkerId(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t, int));
```

Note that we just pass a `TF_OpKernelContext` instead of a `ThreadPool`
instance. The implementation of these functions on the TensorFlow core side
can then retrieve the actual `ThreadPool` object using:

```c++
OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
auto thread_pool =
    ctx->device()->tensorflow_cpu_worker_threads()->workers;
```

For details on how we plan to convert between `std::function<void()>` and
`void (*fn)()`, see Appendix 1 below.
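
To make the core side of these wrappers concrete, here is a minimal sketch of
how `TF_ThreadPool_ParallelFor` might be implemented in terms of the lookup
above. This is an illustration rather than the final implementation; a
capture-free function pointer converts to `std::function` implicitly, while
callbacks that capture state need the `FuncWrap` machinery from Appendix 1:

```c++
void TF_ThreadPool_ParallelFor(TF_OpKernelContext* context, int64_t total,
                               int64_t cost_per_unit,
                               void (*fn)(int64_t, int64_t)) {
  OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
  auto* thread_pool =
      ctx->device()->tensorflow_cpu_worker_threads()->workers;
  // The plain function pointer converts implicitly to the
  // std::function<void(int64, int64)> that ThreadPool::ParallelFor expects.
  thread_pool->ParallelFor(total, cost_per_unit, fn);
}
```
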
398 | | -
|
### Device Assignment API

This approach lets us construct device objects (e.g. `Eigen::GpuDevice`) on the
plugin side. Essentially, we get an Eigen device object and can apply to it any
operation we currently apply to an Eigen device.

We could wrap `Eigen::StreamInterface`, `Eigen::ThreadPoolInterface`, and
`Eigen::Allocator`. These wrappers will consist of a C API and a C++ wrapper on
top of the C API. A sample C API for `StreamInterface` is given below:

```c++
TF_CAPI_EXPORT extern TF_EigenStream* TF_GetEigenStreamHandle(
    TF_OpKernelContext*);
TF_CAPI_EXPORT extern gpuStream_t* TF_EigenStream_GetCudaStream(
    TF_EigenStream*);
TF_CAPI_EXPORT extern gpuDeviceProp_t* TF_EigenStream_GetDeviceProperties(
    TF_EigenStream*);
TF_CAPI_EXPORT extern void* TF_EigenStream_Allocate(
    TF_EigenStream*, size_t num_bytes);
TF_CAPI_EXPORT extern void TF_EigenStream_Deallocate(
    TF_EigenStream*, void* buffer);
TF_CAPI_EXPORT extern void* TF_EigenStream_Scratchpad(
    TF_EigenStream*);
TF_CAPI_EXPORT extern unsigned int* TF_EigenStream_Semaphore(
    TF_EigenStream*);
// Deletes only the C API handle; the underlying stream is owned by
// TensorFlow core.
TF_CAPI_EXPORT extern void TF_DeleteEigenStreamHandle(
    TF_EigenStream*);
```

The following C++ class wraps the C API to provide a `StreamInterface`
implementation on the kernel plugin side:

```c++
class EigenGpuStream : public Eigen::StreamInterface {
 public:
  explicit EigenGpuStream(TF_EigenStream* eigen_stream)
      : eigen_stream_(eigen_stream) {}

  const gpuStream_t& stream() const override {
    return *TF_EigenStream_GetCudaStream(eigen_stream_);
  }

  const gpuDeviceProp_t& deviceProperties() const override {
    return *TF_EigenStream_GetDeviceProperties(eigen_stream_);
  }

  void* allocate(size_t num_bytes) const override {
    return TF_EigenStream_Allocate(eigen_stream_, num_bytes);
  }

  void deallocate(void* buffer) const override {
    TF_EigenStream_Deallocate(eigen_stream_, buffer);
  }

  void* scratchpad() const override {
    return TF_EigenStream_Scratchpad(eigen_stream_);
  }

  unsigned int* semaphore() const override {
    return TF_EigenStream_Semaphore(eigen_stream_);
  }

 private:
  TF_EigenStream* eigen_stream_;
};
```

Now, a kernel can create an instance of `Eigen::GpuDevice` using this stream:

```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
EigenGpuStream gpu_stream(eigen_stream);
Eigen::GpuDevice device(&gpu_stream);
...
tensor->device(device) = < computation >;
...
TF_DeleteEigenStreamHandle(eigen_stream);
```

Note: `gpuStream_t` and `gpuDeviceProp_t` might be aliased to ROCm's objects
instead of Cuda structs. See Appendix 2 for details on how we are going to
handle ROCm support.

Wrapping `Allocator` using a similar approach should be trivial. However,
`ThreadPoolInterface` takes a `std::function<void()>`, and this approach would
require passing a `std::function` across the C API, which is non-trivial. For
details on how we are going to handle this, see Appendix 1.

### Alternative for GPU Device Assignment API

We can take an approach similar to the CPU device assignment API. On the CPU
side, the corresponding Eigen object, `ThreadPoolInterface`, has a `Schedule`
method that schedules a kernel function on a thread pool.

Similarly, we can add a `Launch`/`Schedule` function to `StreamInterface`. The
default implementation would have the same behavior as `LAUNCH_GPU_KERNEL` in
Eigen. However, we can customize it on the TensorFlow side and implement the
launch logic in core TensorFlow instead of in the kernel. This way,
`cudaStream_t` and `hipStream_t` only need to be referenced in core.

<!-- TODO: add examples that are currently only available internally -->

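As a rough sketch of the shape this could take, the plugin-side stream wrapper
might expose the new method as follows. The `launchKernel` method and the
`TF_EigenStream_LaunchKernel` C function are hypothetical; neither exists in
Eigen or TensorFlow today:

```c++
class EigenGpuStream : public Eigen::StreamInterface {
 public:
  // ... methods from the previous section ...

  // Hypothetical addition: forward the launch across the C API so that only
  // core TensorFlow touches cudaStream_t/hipStream_t. Eigen's default
  // implementation would behave like LAUNCH_GPU_KERNEL.
  void launchKernel(const void* kernel, dim3 grid_dim, dim3 block_dim,
                    void** args, size_t shared_mem) const {
    TF_EigenStream_LaunchKernel(eigen_stream_, kernel, grid_dim, block_dim,
                                args, shared_mem);
  }

 private:
  TF_EigenStream* eigen_stream_;
};
```
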
Advantages of this approach:

* We don't need to pass `hipStream_t` and `cudaStream_t` across the API
  boundary.

* It supports customization of the `launchKernel` call, which might be useful
  if we want to handle it differently later.

Disadvantages of this approach:

* It is a more invasive change to Eigen.

### CudaLaunchKernel API

`CudaLaunchKernel` appears to be a fairly thin wrapper around
`cudaLaunchKernel`, which is part of the Cuda Runtime library's C API.

For reference, this is the signature of `cudaLaunchKernel`:

```c++
extern __host__ cudaError_t CUDARTAPI cudaLaunchKernel(
    const void *func,
    dim3 gridDim,
    dim3 blockDim,
    void **args,
    size_t sharedMem,
    cudaStream_t stream);
```

where `dim3` is a simple struct and `cudaStream_t` is an opaque handle type.
This is trivial either to wrap with the TensorFlow C API or to call into
directly from plugins.

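For illustration, such a wrapper could be as thin as the sketch below;
`TF_GpuLaunchKernel` and the way the stream is recovered from the context are
assumptions, not an existing API:

```c++
// Hypothetical wrapper: the plugin supplies the kernel function and launch
// geometry, while core supplies the stream, so cudaStream_t never crosses
// the API boundary.
void TF_GpuLaunchKernel(TF_OpKernelContext* context, const void* func,
                        dim3 grid_dim, dim3 block_dim, void** args,
                        size_t shared_mem) {
  cudaStream_t stream = GetDeviceStream(context);  // hypothetical lookup
  cudaLaunchKernel(func, grid_dim, block_dim, args, shared_mem, stream);
}
```
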
However, the ROCm side of things is harder than Cuda: `gpuLaunchKernel` might
call ROCm's `hipLaunchKernelGGL` instead, whose signature uses templates.
Fortunately, AMD is planning to add an equivalent function that provides a C
API (see Appendix 2 for details).

### Getting Status when using device APIs

The kernel device APIs described in this document rely on wrapping certain
Eigen interfaces, such as `Eigen::StreamInterface`, to provide a C API.
Implementations of these interfaces might set an `OpKernelContext` status,
which is not on the interface surface. Therefore, I propose that we add a new
function that updates a given `TF_Status` with the current `OpKernelContext`
status:

```c++
TF_CAPI_EXPORT extern void TF_OpKernelContext_UpdateStatus(
    TF_OpKernelContext* context, TF_Status* status);
```

This would allow kernel implementations to return as soon as they see a
failing status. For example:

```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
... run computation using eigen_stream ...

TF_Status* context_status = TF_NewStatus();
TF_OpKernelContext_UpdateStatus(context, context_status);
if (TF_GetCode(context_status) != TF_OK) {
  TF_DeleteStatus(context_status);
  return;
}
TF_DeleteStatus(context_status);
```

## Appendix 1: Passing `std::function` across the C API

Certain parts of our design involve kernel plugins calling a function in
TensorFlow core of the form:

```c++
void foo(std::function<void()> arg) { ... }
```

We can't pass a `std::function` across the C API boundary. Instead, we plan to
wrap it in a struct and break this call up into three steps:

* Wrap the `std::function<void()>` in a struct. The struct contains pointers
  to callbacks that manipulate the `std::function<void()>` pointer. (This
  happens on the kernel plugin side.)

* Pass the struct across the C API boundary.

* Wrap the struct in a callable object that can be used as a
  `std::function<void()>`. (This happens on the TensorFlow core side.)

Step 1: The wrapper struct will be defined as follows:

```c++
// Wraps std::function<void()> so that it can be called across the C API.
struct FuncWrap {
  void* func_ptr;  // pointer to a heap-allocated std::function<void()>

  // Function that takes the std::function<void()> pointer as an argument
  // and calls that function.
  void (*call_func_ptr)(void*);

  // Function that takes the std::function<void()> pointer as an argument
  // and deletes it.
  void (*delete_func_ptr)(void*);
};
```

Note that we need to move the `std::function` to the heap because `FuncWrap`
might be placed in a queue and called later. `FuncWrap` construction happens
on the kernel plugin side and will have the following implementation:

```c++
// Wraps a std::function<void()> in a FuncWrap struct.
FuncWrap get_func_wrap(std::function<void()> f) {
  // Move the function to the heap.
  auto* f_heap = new std::function<void()>(std::move(f));

  return {
      // Argument to pass to the callbacks below.
      f_heap,
      // Callback that calls f_heap.
      [](void* wrapped_f) {
        auto* f_std = static_cast<std::function<void()>*>(wrapped_f);
        (*f_std)();
      },
      // Callback that deletes f_heap.
      [](void* wrapped_f) {
        auto* f_std = static_cast<std::function<void()>*>(wrapped_f);
        delete f_std;
      }};
}
```

Step 2: A `FuncWrap` struct constructed in this manner can now be passed
across the C API to core.

Step 3: Since we place the `std::function` on the heap, we need to manage its
deletion. Therefore, we wrap it in a class on the TensorFlow core side so that
it can be deleted once all references are gone:

```c++
class CallFuncWrap {
 public:
  explicit CallFuncWrap(FuncWrap wrap)
      : wrap_(new FuncWrap(wrap), [](FuncWrap* ptr) {
          ptr->delete_func_ptr(ptr->func_ptr);
          delete ptr;
        }) {}

  void operator()() { wrap_->call_func_ptr(wrap_->func_ptr); }

 private:
  // CallFuncWrap might be copied when it is passed to functions taking
  // std::function as an argument. We use a shared_ptr so that there is only
  // one copy of FuncWrap even if CallFuncWrap is copied; the pointer stored
  // in FuncWrap must only be deleted once.
  std::shared_ptr<FuncWrap> wrap_;
};
```

Now, a `CallFuncWrap` instance can be passed wherever a `std::function<void()>`
argument is expected:

```c++
CallFuncWrap call_func_wrap(func_wrap);
foo(call_func_wrap);  // foo takes a std::function<void()> argument
```

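Putting the three steps together, an end-to-end flow for a stateful callback
might look like the sketch below. The `FuncWrap`-taking variant of
`TF_ThreadPool_Schedule` and the `RunStep` callback are assumptions for
illustration:

```c++
// Plugin side: wrap a capturing lambda and pass it across the C API.
int step = 42;
FuncWrap wrap = get_func_wrap([step]() { RunStep(step); });
TF_ThreadPool_Schedule(context, wrap);

// Core side: rewrap the struct as a callable and hand it to the thread
// pool, which accepts std::function<void()>.
void TF_ThreadPool_Schedule(TF_OpKernelContext* context, FuncWrap wrap) {
  OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
  ctx->device()->tensorflow_cpu_worker_threads()->workers->Schedule(
      CallFuncWrap(wrap));
}
```
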
## Appendix 2: Working with ROCm across the C API

We need to access `hipStream_t` on both sides of the C API. Since its
implementation is actually in C++, we will treat it as an opaque pointer that
we get from a HIP function (on the TensorFlow core side) and pass to another
HIP function (on the kernel side).

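A minimal sketch of this opaque-pointer pattern, with illustrative HIP calls
(`dst`, `src`, and `num_bytes` stand in for real kernel arguments):

```c++
// Core side: create the stream with HIP's C API and expose it opaquely.
hipStream_t stream;
hipStreamCreate(&stream);
void* opaque_stream = stream;  // hipStream_t is itself a pointer type

// Kernel plugin side: pass the opaque pointer straight back into HIP C API
// calls without depending on the C++ definition behind hipStream_t.
hipMemcpyAsync(dst, src, num_bytes, hipMemcpyDeviceToDevice,
               static_cast<hipStream_t>(opaque_stream));
```
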
Ideally, we should only rely on the `extern "C"` parts of `hip_runtime_api.h`.
There is currently no C API equivalent of `hipLaunchKernelGGL`, but AMD might
add one in the near future.

Note that we will have to update `LAUNCH_GPU_KERNEL` in Eigen to call the HIP
C API once it becomes available.