This repository was archived by the owner on Jul 10, 2025. It is now read-only.
with inheritance is significantly slower: ~7% (median) or ~19% (mean)
246
-
(running on Linux). Note that this difference was measured *without* Kernel
247
-
Fallback. Adding inheritance would impact all existing TensorFlow kernels
248
-
even those that don't support Kernel Fallback.
245
+
with inheritance was originally measured to be significantly slower: ~7% (median). However, we determined that the regression goes away if we use `final` keywords. (More details in [Appendix 2](#appendix-2-extension-options).)
249
246
* Binary size increase when using templates compared to inheritance is
250
247
estimated at 2.6% (based on adding `AddN` op).
251
248
@@ -260,7 +257,7 @@ that calls per-device pure-virtual implementations.
260
257
261
258
We will then introduce `TFRTOpKernelConstruction` and `TFRTOpKernelContext`
262
259
subclasses that implement `OpKernelConstructionInterface` and
263
-
`OpKernelContextInterface` in terms of TFRT data structures. Example how
260
+
`OpKernelContextInterface` in terms of TFRT data structures. Here's an example of how
264
261
`TFRTOpKernelConstruction` might look:
265
262
266
263
```cpp
@@ -745,7 +742,7 @@ REGISTER_FALLBACK_KERNEL(
745
742
</td>
746
743
<td>Same
747
744
</td>
748
-
<td>Increase (vtable lookups) (negligible for model benchmarks, 7% median/19% mean increase for `basic_ops_benchmark`)
745
+
<td>We expect an increase due to vtable lookups. However, the increase is negligible (0-2%) in our benchmarks when using `final` keywords *
749
746
</td>
750
747
</tr>
751
748
<tr>
@@ -761,7 +758,7 @@ REGISTER_FALLBACK_KERNEL(
761
758
</td>
762
759
<td>Increase the most (2.6% estimate for AddN)
763
760
</td>
764
-
<td>Increase in some cases*
761
+
<td>Increase in some cases**
765
762
</td>
766
763
</tr>
767
764
<tr>
@@ -790,14 +787,15 @@ REGISTER_FALLBACK_KERNEL(
790
787
</tr>
791
788
</table>
792
789
793
-
\* Increase will happen when we have intermediate subclass of `OpKernel`. For example, [AveragePoolingOp](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/avgpooling_op.cc;l=56?q=%22:%20public%20UnaryOp%22) extends `UnaryOp` and `UnaryOp` extends `OpKernel`. In this case, `UnaryOp` is the *intermediate subclass*. Now that a kernel can inherit either from `OpKernel` or `OpKernelBase`, we would need two implementations: `UnaryOp` and `UnaryOpBase` respectively. Kernels that support Kernel Fallback and inherit `UnaryOp` now will instead switch to inherit `UnaryOpBase`. Addition of `UnaryOpBase` increases binary size.
794
790
795
-
### Selected approach
791
+
* We initially measured a ~7% increase in latency for [basic_ops_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc;l=65;drc=51caa2b03f2975be51ab3f03999f35046b34f4af). This benchmark runs a series of scalar multiplications and divisions and primarily measures kernel overhead. However, we determined that declaring `OpKernelContext` and `OpKernelConstruction` final gets rid of this regression. `final` helps because a call made by a kernel is only the tip of the iceberg: the called function then makes multiple calls to other functions in the same class. For example, the [OpKernelContext::forward_input_or_allocate_output](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=1647;drc=b64dfc0c63defad2704f224dff2aa3cf97469f91) implementation calls >10 other functions in `OpKernelContext`.
792
+
796
793
797
-
Currently we are thinking of proceeding with the inheritance approach.
794
+
** The increase will happen when we have an intermediate subclass of `OpKernel`. For example, [AveragePoolingOp](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/avgpooling_op.cc;l=56?q=%22:%20public%20UnaryOp%22) extends `UnaryOp` and `UnaryOp` extends `OpKernel`. In this case, `UnaryOp` is the *intermediate subclass*. Now that a kernel can inherit from either `OpKernel` or `OpKernelBase`, we would need two implementations: `UnaryOp` and `UnaryOpBase` respectively. Kernels that support Kernel Fallback and currently inherit from `UnaryOp` will instead switch to inheriting from `UnaryOpBase`. The addition of `UnaryOpBase` increases binary size.
795
+
796
+
### Selected approach
798
797
799
-
Inheritance seems to add negligible overhead to kernels for most benchmarks that we ran.
800
-
However, it does introduce a ~7% median, ~19% mean penalty for [basic_ops_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc?q=basic_ops_benchmark&ss=tensorflow%2Ftensorflow) which runs a series of scalar multiplications and is used to measure kernel overhead.
798
+
Currently we are thinking of proceeding with the inheritance approach as it doesn't seem to cause a significant performance regression based on our benchmarks.
801
799
802
800
Therefore, we expect that using inheritance would not add noticeable overhead in most real-world models. At the same time, inheritance can simplify code structure and debugging.
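
As a rough illustration of why `final` can remove the vtable overhead discussed above, here is a minimal sketch. The names `ContextInterface`, `Context`, and `SumThroughInterface` are hypothetical stand-ins and are not the actual TensorFlow classes. The assumption is that the compiler, knowing the class cannot be subclassed, may resolve the nested `num_inputs()` and `input()` calls inside `sum_inputs()` statically, so only the outermost call made through the interface needs a vtable lookup.

```cpp
// Hypothetical interface, standing in for something like
// OpKernelContextInterface. Not the real TensorFlow API.
class ContextInterface {
 public:
  virtual ~ContextInterface() = default;
  virtual int num_inputs() const = 0;
  virtual int input(int index) const = 0;
  virtual int sum_inputs() const = 0;
};

// Because Context is declared final, no subclass can override its methods.
// The compiler is therefore allowed to devirtualize (and possibly inline)
// the calls to num_inputs() and input() made from within sum_inputs().
class Context final : public ContextInterface {
 public:
  Context(int a, int b) : values_{a, b} {}

  int num_inputs() const override { return 2; }
  int input(int index) const override { return values_[index]; }

  // One call from a kernel fans out into several calls on the same class.
  int sum_inputs() const override {
    int total = 0;
    for (int i = 0; i < num_inputs(); ++i) {
      total += input(i);
    }
    return total;
  }

 private:
  int values_[2];
};

// A kernel-like caller that only sees the interface. This call still goes
// through the vtable; the nested calls inside Context do not have to.
int SumThroughInterface(const ContextInterface& ctx) {
  return ctx.sum_inputs();
}
```

Whether a given compiler actually devirtualizes these calls depends on the compiler and optimization level, so any speedup should be confirmed with a benchmark rather than assumed.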