This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit d1a3262

Benchmark update, wording update
`final` keyword removes previously observed large regression
1 parent 5ebe1ba commit d1a3262

1 file changed

+11
-13
lines changed

rfcs/20200712-tfrt-kernel-fallback.md

Lines changed: 11 additions & 13 deletions
@@ -5,7 +5,7 @@
55
| **RFC #** | [266](https://github.com/tensorflow/community/pull/266) |
66
| **Author(s)** | Anna Revinskaya ([email protected]), Jeremy Lau ([email protected]) |
77
| **Sponsor** | Jeremy Lau ([email protected]) |
8-
| **Updated** | 2020-07-16 |
8+
| **Updated** | 2020-09-09 |
99

1010
## Objective
1111

@@ -242,10 +242,7 @@ and templating approaches. Key findings are summarized below:
242242
multiplication, division, addition) was only 0.3% slower on mobile with
243243
inheritance compared to templates. The benchmark was run on a real device (Pixel 3) with ABI: arm64-v8a and SDK version: 29.
244244
* [basic\_ops\_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc?q=basic_ops_benchmark_test)
245-
with inheritance is significantly slower: ~7% (median) or ~19% (mean)
246-
(running on Linux). Note that this difference was measured *without* Kernel
247-
Fallback. Adding inheritance would impact all existing TensorFlow kernels
248-
even those that don't support Kernel Fallback.
245+
with inheritance was originally measured to be significantly slower: ~7% (median). However, we determined that the regression goes away if we use `final` keywords. (More details in [Appendix 2](#appendix-2-extension-options).)
249246
* Binary size increase when using templates compared to inheritance is
250247
estimated at 2.6% (based on adding `AddN` op).
251248

@@ -260,7 +257,7 @@ that calls per-device pure-virtual implementations.
260257

261258
We will then introduce `TFRTOpKernelConstruction` and `TFRTOpKernelContext`
262259
subclasses that implement `OpKernelConstructionInterface` and
263-
`OpKernelContextInterface` in terms of TFRT data structures. Example how
260+
`OpKernelContextInterface` in terms of TFRT data structures. Here's an example of how
264261
`TFRTOpKernelConstruction` might look:
265262

266263
```cpp
@@ -745,7 +742,7 @@ REGISTER_FALLBACK_KERNEL(
745742
</td>
746743
<td>Same
747744
</td>
748-
<td>Increase (vtable lookups) (negligible for model benchmarks, 7% median/19% mean increase for `basic_ops_benchmark`)
745+
<td>We expect an increase due to vtable lookups. However, the increase is negligible (0-2%) in our benchmarks when using `final` keywords *
749746
</td>
750747
</tr>
751748
<tr>
@@ -761,7 +758,7 @@ REGISTER_FALLBACK_KERNEL(
761758
</td>
762759
<td>Increase the most (2.6% estimate for AddN)
763760
</td>
764-
<td>Increase in some cases*
761+
<td>Increase in some cases**
765762
</td>
766763
</tr>
767764
<tr>
@@ -790,14 +787,15 @@ REGISTER_FALLBACK_KERNEL(
790787
</tr>
791788
</table>
792789
793-
\* Increase will happen when we have intermediate subclass of `OpKernel`. For example, [AveragePoolingOp](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/avgpooling_op.cc;l=56?q=%22:%20public%20UnaryOp%22) extends `UnaryOp` and `UnaryOp` extends `OpKernel`. In this case, `UnaryOp` is the *intermediate subclass*. Now that a kernel can inherit either from `OpKernel` or `OpKernelBase`, we would need two implementations: `UnaryOp` and `UnaryOpBase` respectively. Kernels that support Kernel Fallback and inherit `UnaryOp` now will instead switch to inherit `UnaryOpBase`. Addition of `UnaryOpBase` increases binary size.
794790
795-
### Selected approach
791+
&ast; We initially measured a ~7% increase in latency for [basic_ops_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc;l=65;drc=51caa2b03f2975be51ab3f03999f35046b34f4af). This benchmark runs a series of scalar multiplications and divisions and primarily measures kernel overhead. However, we determined that declaring `OpKernelContext` and `OpKernelConstruction` `final` gets rid of this regression. `final` helps because a call made by a kernel is only the tip of the iceberg: the called function then makes multiple calls to other functions in the same class. For example, the [OpKernelContext::forward_input_or_allocate_output](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=1647;drc=b64dfc0c63defad2704f224dff2aa3cf97469f91) implementation calls more than 10 other functions in `OpKernelContext`.
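
The effect of `final` described in this footnote can be illustrated with a minimal sketch. The names `ContextInterface`, `Context`, `SumInputs` and `ComputeWithFinalContext` are hypothetical stand-ins invented for this example; they are not the actual TensorFlow classes, and whether a given compiler actually devirtualizes these calls depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an interface such as
// OpKernelContextInterface. Every method is virtual.
class ContextInterface {
 public:
  virtual ~ContextInterface() = default;
  virtual int num_inputs() const = 0;
  virtual int input(int index) const = 0;
  virtual int SumInputs() const = 0;
};

// Hypothetical concrete context. Because the class is `final`, no subclass
// can override its methods, so the compiler is allowed to resolve calls made
// on a `Context` statically, including the nested calls that SumInputs()
// makes to num_inputs() and input().
class Context final : public ContextInterface {
 public:
  explicit Context(std::vector<int> inputs) : inputs_(std::move(inputs)) {}

  int num_inputs() const override { return static_cast<int>(inputs_.size()); }

  int input(int index) const override {
    assert(index >= 0 && index < num_inputs());
    return inputs_[index];
  }

  // One call from the kernel fans out into many calls within the class.
  int SumInputs() const override {
    int sum = 0;
    for (int i = 0; i < num_inputs(); ++i) sum += input(i);
    return sum;
  }

 private:
  std::vector<int> inputs_;
};

// A kernel-like function written against the concrete, final type.
int ComputeWithFinalContext(const Context& ctx) { return ctx.SumInputs(); }
```

Without `final`, each nested call inside `SumInputs()` would in general have to go through the vtable, since a subclass could override `num_inputs()` or `input()`.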
792+
796793
797-
Currently we are thinking of proceeding with the inheritance approach.
794+
&ast;&ast; The increase happens when there is an intermediate subclass of `OpKernel`. For example, [AveragePoolingOp](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/avgpooling_op.cc;l=56?q=%22:%20public%20UnaryOp%22) extends `UnaryOp` and `UnaryOp` extends `OpKernel`. In this case, `UnaryOp` is the *intermediate subclass*. Now that a kernel can inherit from either `OpKernel` or `OpKernelBase`, we would need two implementations: `UnaryOp` and `UnaryOpBase` respectively. Kernels that support Kernel Fallback and currently inherit from `UnaryOp` will instead inherit from `UnaryOpBase`. The addition of `UnaryOpBase` increases binary size.
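
The duplication described in this footnote might look roughly like the sketch below. It is an illustration only: the class bodies, the `Compute(int)` and `Apply(int)` signatures, and the `SquareOp` / `SquareFallbackOp` kernels are hypothetical, and `OpKernel` and `OpKernelBase` are modeled as unrelated classes purely to keep the example short, which is an assumption rather than the RFC's actual design.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-ins for the two kernel base classes.
class OpKernel {
 public:
  virtual ~OpKernel() = default;
  virtual int Compute(int input) const = 0;
};

class OpKernelBase {
 public:
  virtual ~OpKernelBase() = default;
  virtual int Compute(int input) const = 0;
};

// Existing intermediate subclass: shared logic for single-input kernels.
class UnaryOp : public OpKernel {
 public:
  int Compute(int input) const final {
    assert(input >= 0);  // Shared validation (illustrative only).
    return Apply(input);
  }

 protected:
  virtual int Apply(int input) const = 0;
};

// Second copy of the same shared logic, needed because fallback-capable
// kernels derive from OpKernelBase. This duplicate is the source of the
// binary size increase.
class UnaryOpBase : public OpKernelBase {
 public:
  int Compute(int input) const final {
    assert(input >= 0);  // Same validation, duplicated.
    return Apply(input);
  }

 protected:
  virtual int Apply(int input) const = 0;
};

// A kernel that does not support Kernel Fallback keeps using UnaryOp.
class SquareOp : public UnaryOp {
 protected:
  int Apply(int input) const override { return input * input; }
};

// A kernel that supports Kernel Fallback switches to UnaryOpBase.
class SquareFallbackOp : public UnaryOpBase {
 protected:
  int Apply(int input) const override { return input * input; }
};
```

Both kernels behave identically, but the binary now carries two copies of the intermediate class logic.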
795+
796+
### Selected approach
798797
799-
Inheritance seems to add negligible overhead to kernels for most benchmarks that we ran.
800-
However, it does introduce a ~7% median, ~19% mean penalty for [basic_ops_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc?q=basic_ops_benchmark&ss=tensorflow%2Ftensorflow) which runs a series of scalar multiplications and is used to measure kernel overhead.
798+
Currently we are thinking of proceeding with the inheritance approach as it doesn't seem to cause a significant performance regression based on our benchmarks.
801799
802800
Therefore, we expect that using inheritance would not add a noticeable overhead in most real world models. At the same time, inheritance can simplify code structure and debugging.
803801
