Design Doc: Tensorflow as a backend
Speed up large NIMBLE DSL computations by compiling DSL code to Tensorflow.
See also: Vectorizing model operations
Nimble currently uses the Eigen C++ library as a back-end for tensor and linear algebra computations.
Tensorflow is a tensor computation library that targets all of: multicore CPUs, NVIDIA GPUs (via CUDA), AMD GPUs (via OpenCL), and Google TPUs. The Tensorflow architecture includes a C++ core exposed through a C API, and multiple language clients, including a Python client (the most mature), a C++ client (less mature), and an R client (which wraps the Python client).
This design doc proposes to support Tensorflow as an alternative to Eigen as a computational back-end for Nimble.
Nimble currently handles Eigenization logic in the sizeProcessing compiler stage.
The first step to supporting Tensorflow as a NIMBLE back-end is to factor out this Eigen-specific logic into a separate compiler stage, so that we can implement an alternative Tensorflow compiler stage.
To take full advantage of Tensorflow, we plan to compile large chunks of DSL code to Tensorflow. These chunks can be much larger than the Eigen expressions that Nimble currently compiles. Specifically, we can compile math expressions, multiple assignment statements, some conditional statements, and limited control flow to Tensorflow. Tran et al. (2017) found that Edward achieved a 6x speedup over PyMC3 in one task because Edward compiled to a single Tensorflow graph, whereas PyMC3 compiled to multiple smaller graphs and was bottlenecked by shuttling data between CPU and GPU.
See Prototype.
As of June 2017, it appears that the best-supported method for dynamically generating Tensorflow code is to split code generation into two parts and use different APIs for each part.
1. Dynamically generate Tensorflow graphs using the R tensorflow package. The results of this step will be serialized Tensorflow graphs.
2. Dynamically generate C++ code to pass tensors between Nimble and the Tensorflow C API.
The prototype uses the _pywrap_tensorflow_internal.so shared library for both the R API (via the Python API) and the C API.
Support for dual API use is under review for upstreaming into Tensorflow, and is available immediately as the v1.1.0-nimble branch of Nimble's fork of Tensorflow.
See the prototype for instructions on building.
As of June 2017, Tensorflow supports custom C++ extensions (e.g. for custom ops), but does not distribute a C++ library interface for using Tensorflow in existing projects (see GitHub issue).
The C API is stable and can be built on a local machine and linked as libtensorflow.so, but this conflicts with the _pywrap_tensorflow_internal.so loaded by the R tensorflow package (see GitHub issue).
The C API is very limited, yet it does support running existing graphs (constructed by another interface) and I/O of tensor data in dense row-major order.
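To make this concrete, here is a rough sketch of what the dynamically generated glue code might look like: it imports a graph serialized by the R client and runs it once through a session. The in-memory graph bytes and the op names "x" and "y" are hypothetical placeholders, not details of the prototype, and error handling is elided.

```c
#include <tensorflow/c/c_api.h>
#include <stddef.h>

// Hypothetical sketch: load a serialized GraphDef and run it once.
// Assumes graph_bytes/graph_len hold a graph serialized by the R client,
// with one input op named "x" and one output op named "y".
TF_Tensor* run_graph(const void* graph_bytes, size_t graph_len,
                     TF_Tensor* input) {
  TF_Status* status = TF_NewStatus();

  // Import the serialized graph.
  TF_Graph* graph = TF_NewGraph();
  TF_Buffer* graph_def = TF_NewBufferFromString(graph_bytes, graph_len);
  TF_ImportGraphDefOptions* import_opts = TF_NewImportGraphDefOptions();
  TF_GraphImportGraphDef(graph, graph_def, import_opts, status);
  TF_DeleteImportGraphDefOptions(import_opts);
  TF_DeleteBuffer(graph_def);

  // Create a session and run the graph on the input tensor.
  TF_SessionOptions* session_opts = TF_NewSessionOptions();
  TF_Session* session = TF_NewSession(graph, session_opts, status);
  TF_DeleteSessionOptions(session_opts);
  TF_Output input_op = {TF_GraphOperationByName(graph, "x"), 0};
  TF_Output output_op = {TF_GraphOperationByName(graph, "y"), 0};
  TF_Tensor* output = NULL;
  TF_SessionRun(session, NULL,
                &input_op, &input, 1,    // feeds
                &output_op, &output, 1,  // fetches
                NULL, 0,                 // targets
                NULL, status);

  // Cleanup (error handling elided for brevity).
  TF_CloseSession(session, status);
  TF_DeleteSession(session, status);
  TF_DeleteGraph(graph);
  TF_DeleteStatus(status);
  return output;
}
```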
Note that the C API does not need C++11 support, in contrast to the C++ API.
An advantage of using the R tensorflow package is that it gives Nimble access to the most mature Tensorflow client interface (Python), and in particular gives us access to the well-engineered distributions library for statistical distributions.
Nimble currently uses the custom NimArr<> type for tensor memory management, and uses Eigen::Map<> to alias this memory for interaction with Eigen.
We plan to dynamically generate conversion code to feed inputs to, and read outputs from, Tensorflow graphs (as an alternative to the existing compiler code that generates Eigen::Map<>s).
To use Tensorflow efficiently, Nimble will need modifications to NimArr<> to use aligned memory. This might be accomplished by any of:
- allowing `NimArr<>` to alias an existing block of contiguous memory (rather than insisting on owning the memory), as sketched after this list, or
- aligning memory using `Eigen::aligned_allocator`, or
- replacing `NimArr<>` with `Eigen::Tensor` (would require C++11).
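The aliasing option could build on TF_NewTensor from the C API (reproduced below), which wraps an existing buffer and takes a client-supplied deallocator. Here is a minimal hypothetical sketch, assuming the NimArr buffer already satisfies Tensorflow's alignment preferences:

```c
#include <tensorflow/c/c_api.h>
#include <stddef.h>

// Hypothetical sketch of the aliasing option: wrap a NimArr buffer in a
// TF_Tensor without copying. Valid only if the buffer is suitably aligned.
static void noop_deallocator(void* data, size_t len, void* arg) {
  // NimArr retains ownership of the memory, so there is nothing to free.
}

TF_Tensor* alias_nimarr(double* data, const int64_t* dims, int num_dims,
                        size_t len) {
  return TF_NewTensor(TF_DOUBLE, dims, num_dims, data, len,
                      noop_deallocator, /*deallocator_arg=*/NULL);
}
```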
As a short-term workaround, we can copy rather than alias.
```cpp
// Create a NimArr with dimensions 2 x 3 x 5.
NimArr<3, double> my_array;
my_array.initialize(0.0, false, 2, 3, 5);
// Create a TF_Tensor.
const int num_dims = 3;
const int64_t dims[3] = {5, 3, 2}; // Reversed: TF_Tensor is row-major, NimArr is column-major.
const size_t len = 2 * 3 * 5 * sizeof(double); // Byte size.
TF_Tensor* my_tensor = TF_AllocateTensor(TF_DOUBLE, dims, num_dims, len);
// Copy from NimArr to TF_Tensor.
memcpy(TF_TensorData(my_tensor), my_array.getPtr(), TF_TensorByteSize(my_tensor));
// Copy from TF_Tensor to NimArr.
memcpy(my_array.getPtr(), TF_TensorData(my_tensor), TF_TensorByteSize(my_tensor));
```

For reference, here is the TF_Tensor portion of the C API:

```c
// --------------------------------------------------------------------------
// TF_Tensor holds a multi-dimensional array of elements of a single data type.
// For all types other than TF_STRING, the data buffer stores elements
// in row major order. E.g. if data is treated as a vector of TF_DataType:
//
// element 0: index (0, ..., 0)
// element 1: index (0, ..., 1)
// ...
//
// The format for TF_STRING tensors is:
// start_offset: array[uint64]
// data: byte[...]
//
// The string length (as a varint), followed by the contents of the string
// is encoded at data[start_offset[i]]. TF_StringEncode and TF_StringDecode
// facilitate this encoding.
typedef struct TF_Tensor TF_Tensor;
// Return a new tensor that holds the bytes data[0,len-1].
//
// The data will be deallocated by a subsequent call to TF_DeleteTensor via:
// (*deallocator)(data, len, deallocator_arg)
// Clients must provide a custom deallocator function so they can pass in
// memory managed by something like numpy.
TF_CAPI_EXPORT extern TF_Tensor* TF_NewTensor(
TF_DataType, const int64_t* dims, int num_dims, void* data, size_t len,
void (*deallocator)(void* data, size_t len, void* arg),
void* deallocator_arg);
// Allocate and return a new Tensor.
//
// This function is an alternative to TF_NewTensor and should be used when
// memory is allocated to pass the Tensor to the C API. The allocated memory
// satisfies TensorFlow's memory alignment preferences and should be preferred
// over calling malloc and free.
//
// The caller must set the Tensor values by writing them to the pointer returned
// by TF_TensorData with length TF_TensorByteSize.
TF_CAPI_EXPORT extern TF_Tensor* TF_AllocateTensor(TF_DataType,
const int64_t* dims,
int num_dims, size_t len);
// Deletes `tensor` and returns a new TF_Tensor with the same content if
// possible. Returns nullptr and leaves `tensor` untouched if not.
TF_CAPI_EXPORT extern TF_Tensor* TF_TensorMaybeMove(TF_Tensor* tensor);
// Destroy a tensor.
TF_CAPI_EXPORT extern void TF_DeleteTensor(TF_Tensor*);
// Return the type of a tensor element.
TF_CAPI_EXPORT extern TF_DataType TF_TensorType(const TF_Tensor*);
// Return the number of dimensions that the tensor has.
TF_CAPI_EXPORT extern int TF_NumDims(const TF_Tensor*);
// Return the length of the tensor in the "dim_index" dimension.
// REQUIRES: 0 <= dim_index < TF_NumDims(tensor)
TF_CAPI_EXPORT extern int64_t TF_Dim(const TF_Tensor* tensor, int dim_index);
// Return the size of the underlying data in bytes.
TF_CAPI_EXPORT extern size_t TF_TensorByteSize(const TF_Tensor*);
// Return a pointer to the underlying data buffer.
TF_CAPI_EXPORT extern void* TF_TensorData(const TF_Tensor*);
```