From 5df328d31c8b91e16e88a53338a8b6aab4ee3dcc Mon Sep 17 00:00:00 2001
From: Mircea Trofin
Date: Thu, 8 May 2025 21:56:48 -0700
Subject: [PATCH 1/3] [docs][mlgo] Document `MLModelRunner`

---
 llvm/docs/MLGO.rst | 168 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 157 insertions(+), 11 deletions(-)

diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index 28518b83d8c3e..28ff0bd31007c 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -1,19 +1,28 @@
-====
-MLGO
-====
+=============================================
+Machine Learning - Guided Optimization (MLGO)
+=============================================
 
 Introduction
 ============
 
-MLGO is a framework for integrating ML techniques systematically in LLVM. It is
-designed primarily to replace heuristics within LLVM with machine learned
-models. Currently there is upstream infrastructure for the following
-heuristics:
+MLGO refers to integrating ML techniques, primarily to replace heuristics within
+LLVM with machine learned models.
+
+Currently the following heuristics feature such integration:
 
 * Inlining for size
 * Register allocation (LLVM greedy eviction heuristic) for performance
 
-This document is an outline of the tooling that composes MLGO.
+This document is an outline of the tooling and APIs facilitating MLGO.
+
+Note that tools for orchestrating ML training are not part of LLVM, as they are
+dependency-heavy - both on the ML infrastructure choice, and on choices of
+distributed computing. For the training scenario, LLVM only contains facilities
+enabling it, such as corpus extraction, training data extraction, evaluation of
+models during training.
+
+
+.. contents::
 
 Corpus Tooling
 ==============
@@ -21,8 +30,145 @@ Corpus Tooling
 
 .. TODO(boomanaiden154): Write this section.
 
-Model Runner Interfaces
-=======================
+Interacting with ML models
+==========================
+
+We interact with ML models in two primary scenarios. One is training such a
+model. The other - inference - is using the model during compilation to make
+optimization decisions.
+
+For a specific optimization problem - i.e. inlining, or regalloc eviction - we
+first separate correctness-preserving decisions from optimization decisions.
+For example, not inlining functions marked "no inline" is an example of the
+former; not evicting an unevictable live range is another. An example of the
+latter is deciding to inline a function that will bloat the caller size, just
+because we have reason to believe that later, the effect will be some constant
+propagation that will actually reduce the size (or dynamic instruction count).
+
+ML models can be understood as functions. Their inputs are tensors - buffers of
+scalars. The output (in our case, singular) is a scalar. For example, for
+inlining, the inputs are properties of the caller, callee, and the callsite
+being analyzed for inlining. The output is a boolean.
+
+Inputs and outputs are named, have a scalar type (e.g. int32_t) and a shape
+(e.g. 3x4). These are the elements that we use to bind to an ML model.
+
+In both training and inference, we want to expose to ML (training algorithms or
+trained model, respectivelly) the features we want to make optimization
+decisions on. In that regard, the interface from the compiler side to the ML
+side is the same: pass features, and get a decision. It's essentially a function
+call, where the parameters and result are bound by name and are described by
+name, scalar type, and shape tuples.
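+
+As an illustration only - the names below are made up for this example, and the
+signature is conceptual, not an actual API - the exchange for inlining can be
+pictured as:
+
+.. code-block:: c++
+
+  // Conceptual sketch: each parameter is a named tensor - here, int64_t
+  // scalars (shape {1}) describing the caller, callee, and callsite - and
+  // the result is the optimization decision.
+  bool should_inline(int64_t callee_basic_block_count, int64_t callsite_height,
+                     int64_t node_count /* ...more features... */);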
+
+The main types in LLVM are:
+
+- ``MLModelRunner`` - an abstraction for the decision-making mechanism
+- ``TensorSpec`` which describes a tensor.
+
+TensorSpec
+----------
+
+See ``llvm/Analysis/TensorSpec.h``. This is a simple data bag, identifying a
+tensor by name (a string), scalar type, and shape (a vector of ints). The scalar
+type can only be int (8, 16, 32, or 64), signed or unsigned; float; or double.
+
+MLModelRunner
+-------------
+
+See ``llvm/Analysis/MLModelRunner.h``. The abstraction has a pure virtual
+method, ``evaluateUntyped``, but the contract with implementers is a bit more
+involved:
+
+Implementers
+^^^^^^^^^^^^
+
+At construction, the implementer is expected to receive a list of ``TensorSpec``
+for input features and the ``TensorSpec`` of the output (e.g.
+``std::vector<TensorSpec>``). The list type is not contractual, but it must be
+a 0-based indexing array-like container. In the order of appearance in the input
+list, for a ``TensorSpec`` with a name "N", shape D1xD2x...Dn, and scalar type
+"T", the implementer must set up a contiguous buffer sized
+``sizeof(T) * D1 * D2 * ... * Dn``. This buffer's lifetime must be the same as
+the lifetime of the implementer object; finally, for each given ``TensorSpec``,
+the implementer must call ``MLModelRunner::setUpBufferForTensor``.
+
+Internally, the expectation is that the implementer uses the name (and maybe
+shape) of a ``TensorSpec`` for binding (e.g. lookup in an underlying ML model).
+
+``MLModelRunner::setUpBufferForTensor`` stores each buffer at the corresponding
+index (i.e. its position in the list used at construction). The expectation is
+that the user will use that position when calling ``MLModelRunner::getTensor``
+to retrieve the underlying buffer (more on that in a bit).
+
+The implementation of ``evaluateUntyped`` is expected to use the values in the
+buffers described above, carry out whatever computation is needed (e.g. evaluate
+an ML model), and then place the outcome in an output buffer which will be
+returned to the caller. Importantly, ``evaluateUntyped`` must not reset the
+input buffers. This is because during training we may want to log the features
+and decisions, and since the data is already buffered, there's no reason to
+force backing it up elsewhere.
+
+Users
+^^^^^
+
+The users must pass the input ``TensorSpec`` list at the construction of a
+specific ``MLModelRunner`` object. After that, users can be agnostic of the
+specific implementation, and would typically follow this workflow:
+
+- call ``getTensor`` or ``getTensorUntyped`` for each input tensor, identified
+  by its index (i.e. the index of the corresponding ``TensorSpec`` in the list
+  used at construction).
+- populate the tensor buffer of each input tensor with values. Users can take
+  advantage of the stability of the tensor buffers - for example, by setting
+  only once those that don't change, or by caching the buffer address.
+- call ``evaluate`` and use its result.
+
+Versioning
+^^^^^^^^^^
+
+We support a model "knowing" fewer inputs than the compiler. This is supported
+by ``MLModelRunner::setUpBufferForTensor``. If a ``TensorSpec`` requested by the
+compiler is not supported by the underlying model, the ``MLModelRunner``
+implementer must still call ``setUpBufferForTensor`` with a ``nullptr`` value
+for the buffer. In turn, ``MLModelRunner`` will allocate an appropriately-sized
+buffer and track its lifetime. The user can safely populate that buffer. Since
+the rest of the inputs are still provided, this allows an evolution strategy
+where we first add features to the compiler and continue using older models
+without regressing. Then, the new compiler can be used to train new models.
+Deprecating features in the compiler then involves first training a model
+without those features.
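+
+For example, a sketch of the implementer's side of this (``modelKnows`` and
+``getModelBuffer`` are hypothetical stand-ins for however the underlying model
+exposes its inputs):
+
+.. code-block:: c++
+
+  // Bind every compiler-requested feature. For features the underlying model
+  // doesn't know about, pass nullptr; MLModelRunner then allocates and owns an
+  // appropriately-sized buffer, which the user may still safely populate.
+  for (size_t I = 0; I < Inputs.size(); ++I) {
+    const TensorSpec &Spec = Inputs[I];
+    void *Buffer =
+        modelKnows(Spec.name()) ? getModelBuffer(Spec.name()) : nullptr;
+    setUpBufferForTensor(I, Spec, Buffer);
+  }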
+
+``MLModelRunner`` implementations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We currently feature 3 implementations:
+
+- ``ModelUnderTrainingRunner``. This requires the compiler be built with TFLite
+  support. It allows loading a TFLite model dynamically and is primarily
+  intended for training scenarios, but it can be used relatively easily in
+  production build environments, as it does not change how the compiler operates
+  (why this remark is necessary will become clear in a few paragraphs)
+
+- ``ReleaseModeModelRunner``. This is intended for inference scenarios. This
+  uses the rules defined in ``llvm/cmake/modules/TensorFlowCompile.cmake`` to
+  convert, at the time the compiler is built, TensorFlow Saved Models into a
+  header (.h) and native object (.o). The latter is a CPU-based implementation of
+  the neural network, together with its weights (essentially, loops performing
+  matrix multiplications)
+
+NOTE: we are activelly working on replacing this with an EmitC implementation
+requiring no out of tree build-time dependencies.
+
+- ``InteractiveModelRunner``. This is intended for training scenarios where the
+  training algorithm drives compilation. This model runner has no special
+  dependencies, and relies on I/O pipes to communicate with a separate
+  process - presumably a python training algorithm. We do not envision using
+  this in a production environment.
+
+Note that training leaves it to the training infrastructure to handle
+distributed computing. The assumed architecture has python processes
+communicating remotely among themselves, while communicating locally with
+clang.
 
 ..
-  TODO(mtrofin): Write this section.
+  TODO(mtrofin):
+  - logging, and the use in interactive mode.
+  - discuss an example (like the inliner)

From 674074be6b63a74dddb9a513bd6705361113f511 Mon Sep 17 00:00:00 2001
From: Mircea Trofin
Date: Fri, 9 May 2025 07:40:47 -0700
Subject: [PATCH 2/3] feedback

---
 llvm/docs/MLGO.rst | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index 28ff0bd31007c..07a67d80fbac5 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -18,8 +18,8 @@ This document is an outline of the tooling and APIs facilitating MLGO.
 Note that tools for orchestrating ML training are not part of LLVM, as they are
 dependency-heavy - both on the ML infrastructure choice, and on choices of
 distributed computing. For the training scenario, LLVM only contains facilities
-enabling it, such as corpus extraction, training data extraction, evaluation of
-models during training.
+enabling it, such as corpus extraction, training data extraction, and evaluation
+of models during training.
 
 
 .. contents::
@@ -54,7 +54,7 @@ Inputs and outputs are named, have a scalar type (e.g. int32_t) and a shape
 (e.g. 3x4). These are the elements that we use to bind to an ML model.
 
 In both training and inference, we want to expose to ML (training algorithms or
-trained model, respectivelly) the features we want to make optimization
+trained model, respectively) the features we want to make optimization
 decisions on. In that regard, the interface from the compiler side to the ML
 side is the same: pass features, and get a decision. It's essentially a function
 call, where the parameters and result are bound by name and are described by
@@ -83,12 +83,13 @@ Implementers
 At construction, the implementer is expected to receive a list of ``TensorSpec``
 for input features and the ``TensorSpec`` of the output (e.g.
 ``std::vector<TensorSpec>``). The list type is not contractual, but it must be
-a 0-based indexing array-like container. In the order of appearance in the input
-list, for a ``TensorSpec`` with a name "N", shape D1xD2x...Dn, and scalar type
-"T", the implementer must set up a contiguous buffer sized
-``sizeof(T) * D1 * D2 * ... * Dn``. This buffer's lifetime must be the same as
-the lifetime of the implementer object; finally, for each given ``TensorSpec``,
-the implementer must call ``MLModelRunner::setUpBufferForTensor``.
+a 0-based indexing array-like container. Given a ``TensorSpec`` at index "I" in
+the input list, which has a name "N", shape "D1 x D2 x ... Dn", and scalar type
+"T", the implementer must:
+- set up a contiguous buffer sized ``sizeof(T) * D1 * D2 * ... * Dn``. This
+  buffer's lifetime must be the same as the lifetime of the implementer object;
+- call ``MLModelRunner::setUpBufferForTensor`` passing I, the ``TensorSpec``,
+  and the buffer above.
 
 Internally, the expectation is that the implementer uses the name (and maybe
 shape) of a ``TensorSpec`` for binding (e.g. lookup in an underlying ML model).
@@ -154,7 +155,7 @@ We currently feature 3 implementations:
 the neural network, together with its weights (essentially, loops performing
 matrix multiplications)
 
-NOTE: we are activelly working on replacing this with an EmitC implementation
+NOTE: we are actively working on replacing this with an EmitC implementation
 requiring no out of tree build-time dependencies.
 
 - ``InteractiveModelRunner``. This is intended for training scenarios where the

From 633e73d17e8df9469e1b1e12158e4d0ea86e6d57 Mon Sep 17 00:00:00 2001
From: Mircea Trofin
Date: Fri, 9 May 2025 07:47:53 -0700
Subject: [PATCH 3/3] indenting fix

---
 llvm/docs/MLGO.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index 07a67d80fbac5..382dbd1ece7c5 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -86,8 +86,9 @@ for input features and the ``TensorSpec`` of the output (e.g.
 a 0-based indexing array-like container. Given a ``TensorSpec`` at index "I" in
 the input list, which has a name "N", shape "D1 x D2 x ... Dn", and scalar type
 "T", the implementer must:
+
 - set up a contiguous buffer sized ``sizeof(T) * D1 * D2 * ... * Dn``. This
-  buffer's lifetime must be the same as the lifetime of the implementer object;
+  buffer's lifetime must be the same as the lifetime of the implementer object.
 - call ``MLModelRunner::setUpBufferForTensor`` passing I, the ``TensorSpec``,
   and the buffer above.
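+
+Putting the above together, a minimal sketch of a compliant implementer
+constructor could look like the following (illustrative only - not one of the
+in-tree implementations; it assumes the protected ``MLModelRunner`` constructor
+and ``TensorSpec::getTotalTensorBufferSize`` as declared in the headers above):
+
+.. code-block:: c++
+
+  class SketchModelRunner : public MLModelRunner {
+  public:
+    SketchModelRunner(LLVMContext &Ctx, const std::vector<TensorSpec> &Inputs)
+        : MLModelRunner(Ctx, MLModelRunner::Kind::Development, Inputs.size()) {
+      for (size_t I = 0; I < Inputs.size(); ++I) {
+        const TensorSpec &Spec = Inputs[I];
+        // One contiguous, zero-initialized buffer per input, sized
+        // sizeof(T) * D1 * D2 * ... * Dn. Owning the buffers here ties their
+        // lifetime to that of the runner object, per the contract above.
+        Buffers.push_back(
+            std::make_unique<char[]>(Spec.getTotalTensorBufferSize()));
+        setUpBufferForTensor(I, Spec, Buffers.back().get());
+      }
+    }
+
+  private:
+    // A real implementation would evaluate the underlying model here, using
+    // the values in the input buffers set up above, and return its output.
+    void *evaluateUntyped() override { return nullptr; }
+
+    std::vector<std::unique_ptr<char[]>> Buffers;
+  };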