This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit ac4ae7f

Removing device part of the proposal
I removed the device section of this doc. We never made a final decision on the device solution, and I then moved on to work on something else. If we come up with a final decision for device, we can create another RFC. (I removed my name from the authors as well, since my only contribution was the device API.) The rest of the doc can be merged.
1 parent 1c87177 commit ac4ae7f

File tree

1 file changed (+1, -383 lines)


rfcs/20190814-kernel-and-op-registration.md

Lines changed: 1 addition & 383 deletions
@@ -2,7 +2,7 @@
 
 | Status        | Proposed |
 :-------------- |:---------------------------------------------------- |
-| **Author(s)** | James Ring ([email protected]), Anna Revinskaya ([email protected]) |
+| **Author(s)** | James Ring ([email protected]). |
 | **Sponsor**   | Günhan Gülsoy ([email protected]) |
 | **Updated**   | 2019-08-14 |

@@ -293,385 +293,3 @@ API will contain classes such as `Tensor`, `OpKernelContext`, and
Ideally, this API will be as close as possible to the existing non-ABI-stable
Tensorflow C++ API, so that kernels and ops currently implemented in C++ may be
ported to the ABI-stable C++ with as little implementation churn as possible.
## Device C API for Kernels

So far, this document has not dealt with the challenges of providing an
ABI-stable API for kernels that run on GPUs. This section describes an API that
addresses these challenges.

There are a few approaches to running kernels on GPUs:
* Assign computation to an Eigen device (e.g., see `OneHot`, `Transpose`, and
  the training ops). (>200 occurrences in TensorFlow)

* Call `device.parallelFor` (e.g., see `BatchSelect`). (4 occurrences)

* Call `ThreadPool::ParallelFor` (e.g., see `MatrixDiag`). This is a
  TensorFlow wrapper that eventually calls into Eigen; for example,
  `ThreadPool::ParallelFor` calls `device.parallelFor` in Eigen. (29
  occurrences)

* Call `Shard` (e.g., see `CTCGreedyDecoder`). This approach is deprecated in
  favor of `ThreadPool::TransformRangeConcurrently`, but no kernels use the
  latter yet. (42 occurrences)

* Call `GpuLaunchKernel` or `CudaLaunchKernel` directly, i.e. without going
  through Eigen. (58 occurrences)

* The `MatMul` op calls directly into `StreamExecutor`.

* Possibly others.
In all of the approaches above, TensorFlow core is responsible for maintaining
the respective device queues, streams, or pools. Kernels then use these queues
to schedule computation. Therefore, our primary goal is to implement a C API
that enables this scheduling. To give an example, one approach we can take is
to have the kernel pass a callback across the C API; TensorFlow core would then
call this callback. See the diagram below:

![device API](20190814-kernel-and-op-registration/device_api_overview.png)
Furthermore, note that most of the approaches listed above eventually call into
Eigen to parallelize computation and forward it to the device. For example, the
first approach above uses Eigen APIs directly. Consequently, we need to
understand how Eigen works with devices and, in some cases, make changes to the
Eigen codebase as well.

Finally, we should aim to create a smaller API. Some of the approaches listed
above seem to be very similar. For example, calling `parallelFor` in Eigen is
quite similar to calling into `ThreadPool::ParallelFor`. Therefore, we will
only provide C API equivalents for the following:
* `ThreadPool` and its methods.

* The `CudaLaunchKernel` function.

* Computation assignment to a device in Eigen.
This proposal focuses on these three components for now. Due to the complexity
and variety of TensorFlow kernels, it is very likely that we will need to
consider more approaches going forward. For example, how the `MatMul` op would
call `StreamExecutor` directly has not been investigated yet.
### ThreadPool API

Here, we can just wrap the relevant methods of the `ThreadPool` class.
```c++
TF_CAPI_EXPORT extern void TF_ThreadPool_Schedule(
    TF_OpKernelContext* context,
    void (*fn)());

TF_CAPI_EXPORT extern void TF_ThreadPool_ScheduleWithHint(
    TF_OpKernelContext* context,
    void (*fn)(),
    int start,
    int limit);

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelFor(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t));

TF_CAPI_EXPORT extern void TF_ThreadPool_ParallelForWithWorkerId(
    TF_OpKernelContext* context,
    int64_t total,
    int64_t cost_per_unit,
    void (*fn)(int64_t, int64_t, int));
```
Note that we just pass a `TF_OpKernelContext` instead of a `ThreadPool`
instance. The implementation of these functions on the TensorFlow core side can
then retrieve the actual `ThreadPool` object using:
```c++
OpKernelContext* ctx = reinterpret_cast<OpKernelContext*>(context);
auto thread_pool =
    ctx->device()->tensorflow_cpu_worker_threads()->workers;
```
For details on how we plan to switch between `std::function<void()>` and
`void (*fn)()`, see Appendix 1 below.
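To make the intended usage concrete, here is a minimal sketch of how a plugin
kernel might shard work with this API. The kernel and the `ComputeChunk` helper
are hypothetical; only the `TF_ThreadPool_ParallelFor` signature comes from the
proposal above:

```c++
// Hypothetical worker: processes elements in [start, limit).
static void ComputeChunk(int64_t start, int64_t limit) {
  for (int64_t i = start; i < limit; ++i) {
    // ... process element i ...
  }
}

// Hypothetical plugin-side compute function.
void MyKernel_Compute(void* kernel, TF_OpKernelContext* context) {
  const int64_t total = 4096;         // number of work items
  const int64_t cost_per_unit = 100;  // estimated cost per item, guides sharding
  TF_ThreadPool_ParallelFor(context, total, cost_per_unit, &ComputeChunk);
}
```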
### Device Assignment API

This approach lets us construct device objects (e.g. `Eigen::GpuDevice`) on the
plugin side. Essentially, we obtain an Eigen device object and can then apply
any operations we currently apply to an Eigen device.
We could wrap `Eigen::StreamInterface`, `Eigen::ThreadPoolInterface`, and
`Eigen::Allocator`. These wrappers will consist of a C API and a C++ wrapper on
top of the C API. A sample C API for `StreamInterface` is given below:
```c++
TF_CAPI_EXPORT extern TF_EigenStream* TF_GetEigenStreamHandle(
    TF_OpKernelContext*);
TF_CAPI_EXPORT extern gpuStream_t* TF_EigenStream_GetCudaStream(
    TF_EigenStream*);
TF_CAPI_EXPORT extern gpuDeviceProp_t* TF_EigenStream_GetDeviceProperties(
    TF_EigenStream*);
TF_CAPI_EXPORT extern void* TF_EigenStream_Allocate(
    TF_EigenStream*, size_t num_bytes);
TF_CAPI_EXPORT extern void TF_EigenStream_Deallocate(
    TF_EigenStream*, void* buffer);
TF_CAPI_EXPORT extern void* TF_EigenStream_Scratchpad(
    TF_EigenStream*);
TF_CAPI_EXPORT extern unsigned int* TF_EigenStream_Semaphore(
    TF_EigenStream*);
// This would just delete the C API handle for TF_EigenStream.
TF_CAPI_EXPORT extern void TF_DeleteEigenStreamHandle(
    TF_EigenStream*);
```
The following C++ API will wrap the C API to provide a `StreamInterface`
implementation on the kernel plugin side:
```c++
class EigenGpuStream : public Eigen::StreamInterface {
 public:
  explicit EigenGpuStream(TF_EigenStream* eigen_stream) :
      eigen_stream_(eigen_stream) {}

  const gpuStream_t& stream() const override {
    return *TF_EigenStream_GetCudaStream(eigen_stream_);
  }

  const gpuDeviceProp_t& deviceProperties() const override {
    return *TF_EigenStream_GetDeviceProperties(eigen_stream_);
  }

  void* allocate(size_t num_bytes) const override {
    return TF_EigenStream_Allocate(eigen_stream_, num_bytes);
  }

  void deallocate(void* buffer) const override {
    TF_EigenStream_Deallocate(eigen_stream_, buffer);
  }

  void* scratchpad() const override {
    return TF_EigenStream_Scratchpad(eigen_stream_);
  }

  unsigned int* semaphore() const override {
    return TF_EigenStream_Semaphore(eigen_stream_);
  }

 private:
  TF_EigenStream* eigen_stream_;
};
```
Now, a kernel can create an instance of `Eigen::GpuDevice` using this stream:
```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
EigenGpuStream stream(eigen_stream);
Eigen::GpuDevice device(&stream);
...
tensor->device(device) = < computation >
...
TF_DeleteEigenStreamHandle(eigen_stream);
```
Note: `gpuStream_t` and `gpuDeviceProp_t` might be aliased to ROCm's objects
instead of CUDA structs. See Appendix 2 for details on how we are going to
handle ROCm support.
Wrapping `Allocator` using a similar approach should be trivial; a sketch
follows below. However, `ThreadPoolInterface` takes a `std::function<void()>`,
and this approach would require passing a `std::function` across the C API,
which is non-trivial. For details on how we are going to handle it, see
Appendix 1.
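For illustration, a minimal sketch of what the corresponding `Allocator` C API
could look like, mirroring the `TF_EigenStream` handle functions above (the
`TF_EigenAllocator` names are hypothetical, not part of the proposal):

```c++
// Hypothetical C API mirroring the TF_EigenStream handle functions.
TF_CAPI_EXPORT extern TF_EigenAllocator* TF_GetEigenAllocatorHandle(
    TF_OpKernelContext*);
TF_CAPI_EXPORT extern void* TF_EigenAllocator_Allocate(
    TF_EigenAllocator*, size_t num_bytes);
TF_CAPI_EXPORT extern void TF_EigenAllocator_Deallocate(
    TF_EigenAllocator*, void* buffer);
TF_CAPI_EXPORT extern void TF_DeleteEigenAllocatorHandle(
    TF_EigenAllocator*);
```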
### Alternative for GPU Device Assignment API

We can take an approach similar to the CPU device assignment API. On the CPU
side, the corresponding Eigen object, `ThreadPoolInterface`, has a `Schedule`
method that schedules a kernel function in a thread pool.
Similarly, we can add a `Launch`/`Schedule` function to `StreamInterface`. The
default implementation would have the same behavior as `LAUNCH_GPU_KERNEL` in
Eigen. However, we can customize it on the TensorFlow side and implement the
launch logic in core TensorFlow instead of the kernel. This way, `cudaStream_t`
and `hipStream_t` only need to be referenced in core.

<!-- TODO: add examples that are currently only available internally -->
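As a rough sketch of the shape this could take (the method name and signature
are assumptions, not a settled design), the extended interface might look like:

```c++
// Hypothetical extension to Eigen::StreamInterface (sketch only).
class StreamInterface {
 public:
  // ... existing methods: stream(), deviceProperties(), allocate(), ...

  // Launches a GPU kernel on this stream. The default implementation would
  // mirror Eigen's LAUNCH_GPU_KERNEL; TensorFlow core could override it so
  // that cudaStream_t/hipStream_t never cross the C API boundary.
  virtual void launchKernel(const void* kernel_fn, dim3 grid_dim,
                            dim3 block_dim, void** args,
                            size_t shared_mem) const = 0;
};
```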
Advantages of this approach:

* Don't need to pass `hipStream_t` and `cudaStream_t` across the API boundary.

* Supports customization of the `launchKernel` call, which might be useful if
we want to handle it differently later.

Disadvantages of this approach:

* More invasive change to Eigen.
### CudaLaunchKernel API

`CudaLaunchKernel` appears to be a fairly thin wrapper around
`cudaLaunchKernel`, which is part of the CUDA Runtime library's C API.

For reference, this is the signature of `cudaLaunchKernel`:
```c++
extern __host__ cudaError_t CUDARTAPI cudaLaunchKernel(
    const void *func,
    dim3 gridDim,
    dim3 blockDim,
    void **args,
    size_t sharedMem,
    cudaStream_t stream);
```

where `dim3` and `cudaStream_t` are structs.
This is trivial either to wrap with the TensorFlow C API or for plugins to call
into directly.
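If we do wrap it, the wrapper could be as thin as the following sketch (the
`TF_CudaLaunchKernel` name and the use of `TF_Status` to surface `cudaError_t`
are illustrative assumptions):

```c++
// Hypothetical thin wrapper; core would supply the stream from the context.
TF_CAPI_EXPORT extern void TF_CudaLaunchKernel(
    TF_OpKernelContext* context,
    const void* func,
    dim3 grid_dim,
    dim3 block_dim,
    void** args,
    size_t shared_mem,
    TF_Status* status);
```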
However, the ROCm side of things is harder than CUDA. `gpuLaunchKernel` might
call ROCm's `hipLaunchKernelGGL` here instead, and its signature uses
templates. Fortunately, AMD is planning to add an equivalent function that
provides a C API (see Appendix 2 for details).
### Getting Status when using device APIs

The kernel device APIs described in this document rely on wrapping certain
Eigen interfaces, such as `Eigen::StreamInterface`, to provide a C API.
Implementations of these interfaces might set an `OpKernelContext` status,
which is not on the interface surface. Therefore, I propose that we add a new
function that would update a given `TF_Status` with the current
`OpKernelContext` status:
```c++
TF_CAPI_EXPORT extern void TF_OpKernelContext_UpdateStatus(
    TF_OpKernelContext*, TF_Status*);
```
This would allow kernel implementations to return as soon as they see a failing
status. For example:
```c++
TF_EigenStream* eigen_stream = TF_GetEigenStreamHandle(context);
... run computation using eigen_stream ...

TF_Status* context_status = TF_NewStatus();
TF_OpKernelContext_UpdateStatus(context, context_status);
if (TF_GetCode(context_status) != TF_OK) {
  TF_DeleteStatus(context_status);
  return;
}
```
## Appendix 1

Certain parts of our design involve kernel plugins calling a function in
TensorFlow core of the form:

```c++
void foo(std::function<void()> arg) { ... }
```
We can't pass a `std::function` across the C API boundary. Instead, we plan to
wrap it with a struct and break this call up into three steps:

* Wrap `std::function<void()>` with a struct. The struct will contain pointers
to callbacks for manipulating the `std::function<void()>` pointer. (This will
happen on the kernel plugin side.)

* Pass the struct across the C API boundary.

* Wrap the struct with a callable object which can be used as a
`std::function<void()>`. (This will happen on the TensorFlow core side.)
Step 1: The wrapper struct will be defined as follows:
```c++
// Wraps std::function<void()> so that it can be called across the C API.
struct FuncWrap {
  void* func_ptr;  // pointer to std::function<void()>

  // Function that takes a std::function<void()> pointer as an argument
  // and calls that function.
  void (*call_func_ptr)(void*);

  // Function that takes a std::function<void()> pointer as an argument
  // and deletes it.
  void (*delete_func_ptr)(void*);
};
```
Note that we need to move the `std::function` to the heap because `FuncWrap`
might be placed in a queue and called later. Specifically, `FuncWrap`
construction will happen on the kernel plugin side and will have the following
implementation:
```c++
// Wraps std::function<void()> with a FuncWrap struct.
FuncWrap get_func_wrap(std::function<void()> f) {
  // Move the function to the heap.
  auto* f_heap = new std::function<void()>(f);

  return {
      // Argument to pass to the callbacks below to call/delete it.
      f_heap,
      // Callback that calls f_heap.
      [](void* wrapped_f) {
        std::function<void()>* f_std =
            static_cast<std::function<void()>*>(wrapped_f);
        (*f_std)();
      },
      // Callback that deletes f_heap.
      [](void* wrapped_f) {
        std::function<void()>* f_std =
            static_cast<std::function<void()>*>(wrapped_f);
        delete f_std;
      }};
}
```
Step 2: A `FuncWrap` struct constructed in this manner can now be passed across
the C API to core.
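Under this scheme, the `ThreadPool` entry points shown earlier would accept a
`FuncWrap` rather than a bare function pointer; for example (a sketch, not a
final signature):

```c++
// Sketch: TF_ThreadPool_Schedule revised to take a FuncWrap.
TF_CAPI_EXPORT extern void TF_ThreadPool_Schedule(
    TF_OpKernelContext* context, FuncWrap fn);
```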
Step 3: Since we place the `std::function` on the heap, we need to manage its
deletion. Therefore, we wrap it with a class in TensorFlow core so that it can
be deleted once all references are gone:
```c++
class CallFuncWrap {
 public:
  explicit CallFuncWrap(FuncWrap wrap) :
      wrap_(new FuncWrap(wrap), [](FuncWrap* ptr) {
        ptr->delete_func_ptr(ptr->func_ptr);
        delete ptr;
      }) {}

  void operator()() {
    wrap_->call_func_ptr(wrap_->func_ptr);
  }

 private:
  // CallFuncWrap might be copied when it is passed to functions taking
  // std::function as an argument.
  // We use shared_ptr to make sure we only have one copy of FuncWrap
  // even if CallFuncWrap is copied. We want a single copy of FuncWrap
  // because the pointer stored in FuncWrap should only be deleted once.
  std::shared_ptr<FuncWrap> wrap_;
};
```
Now, a `CallFuncWrap` instance can be passed in as a `std::function<void()>`
argument:

```c++
CallFuncWrap call_func_wrap(func_wrap);
foo(call_func_wrap);  // foo here takes a std::function<void()> argument
```
## Appendix 2: Working with ROCm across C API

We need to access `hipStream_t` on both sides of the C API. Since its
implementation is actually in C++, we will treat it as an opaque pointer that
we get from a HIP function (on the TensorFlow core side) and pass to another
HIP function (on the kernel side).
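Concretely, the C API would only ever traffic in an opaque pointer; a sketch of
the intended pattern (the `TF_GetHipStreamHandle` name is hypothetical):

```c++
// Core side: obtain the hipStream_t from HIP and expose it opaquely.
TF_CAPI_EXPORT extern void* TF_GetHipStreamHandle(TF_OpKernelContext*);

// Plugin side: cast back to hipStream_t only at the point of use, e.g.
//   hipStream_t stream =
//       static_cast<hipStream_t>(TF_GetHipStreamHandle(context));
```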
Ideally, we should only rely on the extern "C" parts of `hip_runtime_api.h`.
There is no equivalent in the C API right now for `hipLaunchKernelGGL`;
however, AMD might add an equivalent function to the C API in the near future.

Note that we will have to update `LAUNCH_GPU_KERNEL` in Eigen to call the HIP C
API once it is available.