
Conversation

@christopherbate
Contributor

Previously, the BufferizableOpInterface implementation for 'tensor.reshape'
listed the 'shape' operand as an alias for the result tensor, causing
unnecessary conflicts with ops that "write" to the shape operand.

@llvmbot
Member

llvmbot commented Feb 24, 2025

@llvm/pr-subscribers-mlir-tensor

@llvm/pr-subscribers-mlir

Author: Christopher Bate (christopherbate)

Changes

Previously, the BufferizableOpInterface implementation for 'tensor.reshape'
listed the 'shape' operand as an alias for the result tensor, causing
unnecessary conflicts with ops that "write" to the shape operand.


Full diff: https://github.com/llvm/llvm-project/pull/128590.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp (+4)
  • (modified) mlir/test/Dialect/Tensor/one-shot-bufferize.mlir (+27)
diff --git a/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
index 81404fa664cd4..8b7aee67ea5c2 100644
--- a/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
@@ -862,6 +862,10 @@ struct ReshapeOpInterface
 
   AliasingValueList getAliasingValues(Operation *op, OpOperand &opOperand,
                                       const AnalysisState &state) const {
+    // Only the 'source' operand aliases the result.
+    auto reshapeOp = cast<tensor::ReshapeOp>(op);
+    if (reshapeOp.getSourceMutable() != opOperand)
+      return {};
     return {{op->getOpResult(0), BufferRelation::Equivalent}};
   }
 
diff --git a/mlir/test/Dialect/Tensor/one-shot-bufferize.mlir b/mlir/test/Dialect/Tensor/one-shot-bufferize.mlir
index af4f84640890b..2983cd30258a5 100644
--- a/mlir/test/Dialect/Tensor/one-shot-bufferize.mlir
+++ b/mlir/test/Dialect/Tensor/one-shot-bufferize.mlir
@@ -398,6 +398,33 @@ func.func @tensor.reshape() -> tensor<2x2x5xf32> {
 
 // -----
 
+// CHECK-LABEL: func @tensor_reshape_aliasing
+//  CHECK-SAME:  (%[[ARG0:.+]]: index, %[[ARG1:.+]]: index)
+func.func @tensor_reshape_aliasing(%arg0: index, %arg1: index) -> tensor<?x?xf32> {
+  %t1_static = arith.constant dense<0.> : tensor<10xf32>
+  // CHECK-DAG: %[[T1:.+]] = memref.cast
+  %t1 = tensor.cast %t1_static : tensor<10xf32> to tensor<?xf32>
+
+  // CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index
+  %c0 = arith.constant 0 : index
+  // CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index
+  %c1 = arith.constant 1 : index
+
+  // CHECK-DAG: %[[SHAPE:.+]] = memref.alloc() {{.*}} : memref<2xindex>
+  %shape = bufferization.alloc_tensor() : tensor<2xindex>
+  // CHECK: memref.store %[[ARG0]], %[[SHAPE]][%[[C0]]]
+  %shape.0 = tensor.insert %arg0 into %shape[%c0] : tensor<2xindex>
+  // CHECK: memref.store %[[ARG1]], %[[SHAPE]][%[[C1]]]
+  %shape.1 = tensor.insert %arg1 into %shape.0[%c1] : tensor<2xindex>
+
+  // CHECK: %[[RESHAPED:.+]] = memref.reshape %[[T1]](%[[SHAPE]])
+  %reshaped = tensor.reshape %t1(%shape.1) : (tensor<?xf32>, tensor<2xindex>) -> tensor<?x?xf32>
+  // CHECK: return %[[RESHAPED]]
+  return %reshaped : tensor<?x?xf32>
+}
+
+// -----
+
 // CHECK-LABEL: @reshape_with_non_identity_layout(
 // CHECK-SAME:    %[[INPUT:[a-zA-Z0-9]*]]: memref<2x2xf32, strided<[?, ?], offset: ?>, 3>,
 // CHECK-SAME:    %[[LAYOUT:[a-zA-Z0-9]*]]: memref<2xi32, strided<[?], offset: ?>>,

@christopherbate christopherbate merged commit 3438dfc into llvm:main Mar 13, 2025
11 checks passed
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the
    storage a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.
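The union-based storage described above can be sketched roughly as follows. This is a minimal illustration, not the actual mlir-tensorrt `ScalarValue` API; the type and method names are assumptions for the example:

```cpp
#include <cassert>
#include <complex>

// Hypothetical sketch: a scalar runtime value whose storage is a union of
// real and complex data, with a flag selecting the active representation.
struct ScalarValue {
  bool isComplex = false;
  union {
    double real;
    double cplx[2]; // {real part, imaginary part}
  } storage;

  static ScalarValue makeReal(double v) {
    ScalarValue s;
    s.isComplex = false;
    s.storage.real = v;
    return s;
  }
  static ScalarValue makeComplex(std::complex<double> v) {
    ScalarValue s;
    s.isComplex = true;
    s.storage.cplx[0] = v.real();
    s.storage.cplx[1] = v.imag();
    return s;
  }
  double getReal() const {
    assert(!isComplex && "not a real scalar");
    return storage.real;
  }
  std::complex<double> getComplex() const {
    assert(isComplex && "not a complex scalar");
    return {storage.cplx[0], storage.cplx[1]};
  }
};
```

Storing the complex value as two doubles keeps the union members trivially copyable, which sidesteps the restrictions on non-trivial types in unions.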

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to the executable (which is a flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when the attr is an elided dense resource, and it results
    in a segfault. This MR handles this situation by replacing the elided resource with a `DenseElementsAttr`
    of all `one`s (`true` in the case of booleans).

    IR with elided resources is usually seen only during testing of passes and is not useful for
    e2e functional execution. Testing of the `ExecuteConstantFoldableSubgraphs` pass is one such case.
    Thus, MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process of creating a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.
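The static-counter approach described above can be sketched as a standalone helper. This is an illustrative sketch, not the actual TRT translation code; the function name is an assumption:

```cpp
#include <string>

// Hypothetical sketch: instead of repeatedly probing candidate names
// (which can grow the name very large), append a monotonically
// increasing static counter to guarantee uniqueness in constant time.
std::string makeUniqueLayerName(const std::string &base) {
  static unsigned counter = 0;
  return base + "_" + std::to_string(counter++);
}
```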

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing;
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIT test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on a
    lot of potential optimizations that are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier for analyses to gauge the
    benefit of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.
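The "features imply other features" behavior (e.g. "tensorrt" and "nccl" both pulling in "cuda") can be sketched as follows. This is a minimal illustration; the class and method names are assumptions, not the actual RuntimeSessionOptions API:

```cpp
#include <set>
#include <string>

// Hypothetical sketch: a feature is just a string naming a runtime
// component; enabling certain features transitively enables their
// dependencies.
class FeatureSet {
public:
  void enable(const std::string &feature) {
    features.insert(feature);
    // Certain features cause others to be added to the set automatically.
    if (feature == "tensorrt" || feature == "nccl")
      enable("cuda");
  }
  bool isEnabled(const std::string &feature) const {
    return features.count(feature) != 0;
  }

private:
  std::set<std::string> features;
};
```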

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signed integer types. If the "verify-each" option is off,
    then the issue will not be caught, since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signed integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is to create an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desired to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to the `mlir::createRegionOpFromCluster` method. This callback function
    has the signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>`, which takes the cluster values used
    outside the cluster (in a SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the func.func return value order.

    Valid MLIR test is added.
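The reordering callback can be sketched with standard containers standing in for `llvm::SetVector`/`SmallVectorImpl`. This is an illustrative simplification under those assumptions, not the actual mlir-tensorrt implementation:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for mlir::Value.
using Value = std::string;

// Sketch: stably reorder the collected yield values so their order
// matches the order of the function's return values.
void reorderYieldValues(std::vector<Value> &yieldValues,
                        const std::vector<Value> &funcReturnOrder) {
  auto position = [&](const Value &v) {
    // Values not found in the return list sort to the end.
    return std::find(funcReturnOrder.begin(), funcReturnOrder.end(), v) -
           funcReturnOrder.begin();
  };
  std::stable_sort(yieldValues.begin(), yieldValues.end(),
                   [&](const Value &a, const Value &b) {
                     return position(a) < position(b);
                   });
}
```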

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of the `after` and `before` arguments of `scf.while` is now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.
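
    For orientation, the shape of the rewrite such a pass performs can be
    sketched in plain C++ (illustrative only; function names are
    hypothetical, not the pass's API): a function that allocates and
    returns its result buffer is rewritten so the caller supplies the
    output buffer, which is what makes promoting dynamic allocations to
    out-params possible.

    ```cpp
    #include <cassert>
    #include <vector>

    // Before: the callee allocates its result buffer and returns it.
    std::vector<int> iotaReturning(int n) {
      std::vector<int> out(n);
      for (int i = 0; i < n; ++i)
        out[i] = i;
      return out;
    }

    // After: the caller owns the allocation; the result becomes an
    // out-parameter, which is the form buffer-results-to-out-params-style
    // passes produce.
    void iotaOutParam(int n, std::vector<int> &out) {
      out.resize(n);
      for (int i = 0; i < n; ++i)
        out[i] = i;
    }

    int main() {
      std::vector<int> out;
      iotaOutParam(4, out);
      assert(out == iotaReturning(4));
      return 0;
    }
    ```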

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantic of a "feature" depends
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.
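
    A hedged sketch of the implied-feature mechanism described above (the
    names, dependency table, and `enableFeature` helper are illustrative,
    not the actual RuntimeSessionOptions API): enabling a feature
    transitively enables the features it requires, e.g. "tensorrt" and
    "nccl" both pull in "cuda".

    ```cpp
    #include <cassert>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical dependency table: feature -> features it requires.
    static const std::map<std::string, std::vector<std::string>> kImplied = {
        {"tensorrt", {"cuda"}},
        {"nccl", {"cuda"}},
        {"cuda", {"core"}},
    };

    // Enable a feature plus everything it transitively implies.
    void enableFeature(const std::string &name, std::set<std::string> &enabled) {
      if (!enabled.insert(name).second)
        return; // already enabled; also guards against dependency cycles
      auto it = kImplied.find(name);
      if (it == kImplied.end())
        return;
      for (const std::string &dep : it->second)
        enableFeature(dep, enabled);
    }

    int main() {
      std::set<std::string> features;
      enableFeature("tensorrt", features);
      assert(features.count("cuda") == 1);
      assert(features.count("core") == 1);
      return 0;
    }
    ```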

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function argument/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signful integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency in the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is to create an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desired to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the func.func return value order.
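
    The effect of such a reordering callback can be sketched with plain
    containers (a simplified stand-in; the real callback operates on
    `SetVector<Value>` and `SmallVectorImpl<Type>`, and the helper name
    below is hypothetical): the yield values collected from op uses are
    permuted to follow the function's return value order.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    // Reorder 'yieldValues' so it matches the order given in 'returnOrder'.
    // Values not mentioned in 'returnOrder' keep their relative order at the end.
    void reorderYieldValues(std::vector<std::string> &yieldValues,
                            const std::vector<std::string> &returnOrder) {
      auto pos = [&](const std::string &v) {
        return std::find(returnOrder.begin(), returnOrder.end(), v) -
               returnOrder.begin();
      };
      std::stable_sort(yieldValues.begin(), yieldValues.end(),
                       [&](const std::string &a, const std::string &b) {
                         return pos(a) < pos(b);
                       });
    }

    int main() {
      // Yield order as discovered from op uses (non-deterministic in general).
      std::vector<std::string> yields = {"%c", "%a", "%b"};
      // Desired order: the func.func return value order.
      reorderYieldValues(yields, {"%a", "%b", "%c"});
      assert((yields == std::vector<std::string>{"%a", "%b", "%c"}));
      return 0;
    }
    ```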

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation and for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preperation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiences. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change inroduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantic of a "feature" depends
    on the the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function argument/results. This override mechnanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one
    such it should not have been introduced and was not bound to any
    option. Further, remove unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommmon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fullfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless-and-signfull integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, first step in lowering a
    cluster containing TensorRT ops is created inline group op.
    Operands to the yield op (i.e. terminator) of inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic and sometimes
    it is desired to get yield results in a certian order.

    This MR makes the following changes,
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, `ReorderRegionOpYieldValues`
    callback is implemented to make sure inline group op yield value order
    is same as func.func return values order.

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the
    storage a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow-on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.
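As a rough illustration of why einsum avoids the reshape/transpose dance, the einsum equation for a `dot_general` can be derived directly from its dimension numbers. This Python sketch is illustrative only and is not the compiler's code; it assumes the standard `dot_general` result layout (batch dims, then lhs free dims, then rhs free dims):

```python
import string

def dot_general_einsum(lhs_rank, rhs_rank,
                       lhs_batch, rhs_batch,
                       lhs_contract, rhs_contract):
    """Build an einsum equation from dot_general dimension numbers.

    Batch dims share a letter and stay in the output; contracting dims
    share a letter and are summed away; free dims get fresh letters.
    """
    letters = iter(string.ascii_lowercase)
    lhs = [None] * lhs_rank
    rhs = [None] * rhs_rank
    for l, r in zip(lhs_batch, rhs_batch):
        lhs[l] = rhs[r] = next(letters)
    for l, r in zip(lhs_contract, rhs_contract):
        lhs[l] = rhs[r] = next(letters)
    for spec in (lhs, rhs):
        for i, c in enumerate(spec):
            if c is None:
                spec[i] = next(letters)
    skip_l = set(lhs_batch) | set(lhs_contract)
    skip_r = set(rhs_batch) | set(rhs_contract)
    out = ([lhs[d] for d in lhs_batch]
           + [c for i, c in enumerate(lhs) if i not in skip_l]
           + [c for i, c in enumerate(rhs) if i not in skip_r])
    return f"{''.join(lhs)},{''.join(rhs)}->{''.join(out)}"
```

For example, a batched matmul (rank-3 operands, batch dim 0, contracting dims 2 and 1) maps to a single einsum with no inserted reshapes or transpositions.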

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since that pass is sensitive to mismatched host spaces for patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to an executable (which is a flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when the attr is an elided dense resource and results
    in a segfault. This MR handles this situation by replacing the elided resource with a `DenseElementsAttr`
    of all ones (`true` in the case of booleans).

    IR with elided resources is usually seen only during testing of passes and is not useful for
    e2e functional execution. Testing of the `ExecuteConstantFoldableSubgraphs` pass is one such case.
    Thus, MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.
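The general idea of the fix, a monotonically increasing counter instead of repeated string concatenation inside the uniquing loop, can be sketched as follows. The function and its signature are hypothetical, not TensorRT's API:

```python
import itertools

# Module-level counter standing in for the commit's "static counter".
_suffix = itertools.count()

def unique_layer_name(base: str, used: set) -> str:
    """Return a unique layer name without unbounded name growth.

    Repeatedly appending to `base` on each collision made names grow very
    large; a counter keeps every candidate short and the loop fast.
    """
    if base not in used:
        used.add(base)
        return base
    while True:
        candidate = f"{base}_{next(_suffix)}"
        if candidate not in used:
            used.add(candidate)
            return candidate
```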

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the former does nothing;
    only `config.parallelism_group` has any effect.
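For reference, the working pattern looks roughly like the following `lit.cfg.py` fragment. The group name and limit are placeholders; `config` and `lit_config` are the objects LIT injects into config files:

```python
# lit.cfg.py (sketch) -- the group must be selected via `config`, not
# `lit_config`; setting lit_config.parallelism_group has no effect.
lit_config.parallelism_groups["mlir-tensorrt"] = 8  # cap concurrent tests
config.parallelism_group = "mlir-tensorrt"
```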

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIT test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick a
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on a
    lot of potential optimizations that are rooted in `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier for analysis to gauge the
    benefit of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of the `after` and `before` arguments of `scf.while` is now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signful integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency on the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, first step in lowering a
    cluster containing TensorRT ops is created inline group op.
    Operands to the yield op (i.e. terminator) of inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic and sometimes
    it is desired to get yield results in a certian order.

    This MR makes the following changes,
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, `ReorderRegionOpYieldValues`
    callback is implemented to make sure inline group op yield value order
    is same as func.func return values order.

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation and for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preperation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.
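    A hypothetical sketch of the override (attribute placement and the `host`
    space name are assumptions for illustration): the backend supplies the
    default encoding, and an argument-level attribute overrides it.

```mlir
func.func @main(%arg0: tensor<4xf32>,
                %arg1: tensor<4xf32> {plan.memory_space = #plan.memory_space<host>})
    -> tensor<4xf32> {
  // %arg0 receives the backend's default space; %arg1 is pinned to host.
  return %arg0 : tensor<4xf32>
}
```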

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR because `arith.constant` does not
    support signed or unsigned (non-signless) integer types. If the
    "verify-each" option is off, then the issue will not be caught, since it
    happens to be self-correcting. However, the issue can still cause
    verification failures while debugging. This change fixes the issue by
    adding a `builtin.unrealized_conversion_cast` operation to bridge the
    type change between signless and signed/unsigned integer types.
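    The bridging pattern can be illustrated as follows (types are illustrative);
    the cast keeps the IR verifiable mid-conversion, since `arith.constant`
    only accepts signless element types.

```mlir
%c = arith.constant dense<1> : tensor<4xi32>
%u = builtin.unrealized_conversion_cast %c : tensor<4xi32> to tensor<4xui32>
```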

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is to create an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking whether they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desired to get yield results in a certain order.

    This MR makes the following changes,
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the func.func return value order.

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the
    storage a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`,
    since that pass is sensitive to mismatched host spaces for patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to an executable (which is a flatbuffer) uses MLIR attribute
    serialization to serialize `ElementsAttr`. However, this does not work when
    the attribute is an elided dense resource, and it results in a segfault. This
    MR handles the situation by replacing the elided resource with a
    `DenseElementsAttr` of all ones (`true` in the boolean case).

    IR with elided resources is usually seen only during testing of passes and is
    not useful for e2e functional execution. Testing of the
    `ExecuteConstantFoldableSubgraphs` pass is one such case, so MLIR test cases
    for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process of creating a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the former does nothing;
    only `config.parallelism_group` has any effect.
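    The fix amounts to a one-line change in the lit config (group name
    illustrative):

```python
# lit.cfg.py fragment
# lit_config.parallelism_group = "gpu"   # wrong: silently ignored
config.parallelism_group = "gpu"         # correct: lit honors this attribute
```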

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIT test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder consumes
    a much larger amount of host memory. This can cause OOM when running the
    LIT test suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick a
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on a
    lot of potential optimizations that are rooted in `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.
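    The shape of the uplift can be illustrated as follows (a sketch with
    illustrative SSA names, after the induction-variable block argument has
    been scalarized):

```mlir
// Counting loop as scf.while:
%r = scf.while (%i = %c0) : (index) -> index {
  %cond = arith.cmpi slt, %i, %ub : index
  scf.condition(%cond) %i : index
} do {
^bb0(%i: index):
  %next = arith.addi %i, %step : index
  scf.yield %next : index
}

// Uplifted form:
scf.for %i = %c0 to %ub step %step {
  // loop body
}
```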

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.
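    The promotion performed by the pass can be illustrated on a function
    signature (types and names are illustrative):

```mlir
// Before: the callee allocates and returns a buffer.
func.func private @f(memref<4xf32>) -> memref<4xf32>

// After: the result becomes an out-parameter written by the callee.
func.func private @f(memref<4xf32>, memref<4xf32>)
```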

shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.
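The memory-based parallelism calculation described above can be sketched like this (the helper and its parameter names are hypothetical; the real logic lives in the LIT config scripts):

```cpp
#include <algorithm>
#include <cassert>

// Illustrative sketch: estimate how many tests can run concurrently when
// only a fraction (here 50%) of host memory counts as usable and each test
// reserves a fixed amount. Always allow at least one worker.
int estimateParallelism(double availableHostMemGB, double perTestMemGB,
                        double usableFraction = 0.5) {
  double usable = availableHostMemGB * usableFraction;
  return std::max(1, static_cast<int>(usable / perTestMemGB));
}
```

Raising `perTestMemGB` or lowering `usableFraction` both reduce the estimated parallelism, which is the direction of the change described above.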

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    a fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on a
    lot of potential optimizations that are rooted in `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier for analyses to gauge the
    benefit of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.
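The shape of the "for"-style pattern the uplift looks for can be illustrated with a toy data model (the struct and function below are purely illustrative; the real patterns match `scf.while` regions and SSA values): the condition compares a carried scalar against a loop-invariant bound, and the body increments that scalar by a loop-invariant step.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of a while loop's shape for checking the for-style pattern.
struct Expr {
  std::string op, lhs, rhs; // e.g. {"lt", "iv", "ub"}
};
struct LoopShape {
  Expr condition;                   // checked in the `before` region
  Expr update;                      // applied in the `after` region
  std::set<std::string> invariants; // loop-invariant value names
};

// A loop is uplift-able to a for loop if it reads `iv < ub; iv = iv + step`
// with `ub` and `step` loop-invariant.
bool canUpliftToFor(const LoopShape &l) {
  return l.condition.op == "lt" && l.condition.lhs == "iv" &&
         l.invariants.count(l.condition.rhs) > 0 &&
         l.update.op == "add" && l.update.lhs == "iv" &&
         l.invariants.count(l.update.rhs) > 0;
}
```

Anything that fails this shape check (e.g. a multiplicative update) stays an `scf.while`, which is why the scalarization heuristics feeding this pattern matter.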

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of the `after` and `before` arguments of `scf.while` is now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    as it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signed integer types. If the "verify-each" option is off,
    then the issue will not be caught, since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signed integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is creating an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desirable to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to the `mlir::createRegionOpFromCluster` method. This callback
    has the signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>`, which takes the cluster values used
    outside the cluster (in a SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the `func.func` return value order.

    A valid MLIR test is added.
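The job of such a reorder callback can be sketched with strings standing in for SSA values (the function below is illustrative; the real callback operates on an MLIR `SetVector<Value>` and also reorders the corresponding types):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: reorder the cluster's yield values so they match the
// enclosing function's return value order; values not found in the return
// list are kept stably at the end.
void reorderYieldValues(std::vector<std::string> &yieldValues,
                        const std::vector<std::string> &returnOrder) {
  std::map<std::string, size_t> pos;
  for (size_t i = 0; i < returnOrder.size(); ++i)
    pos[returnOrder[i]] = i;
  auto rank = [&](const std::string &v) {
    auto it = pos.find(v);
    return it == pos.end() ? returnOrder.size() : it->second;
  };
  std::stable_sort(yieldValues.begin(), yieldValues.end(),
                   [&](const std::string &a, const std::string &b) {
                     return rank(a) < rank(b);
                   });
}
```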

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jun 30, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the
    storage a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.
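The storage change described above can be sketched as a tagged union (all names below are illustrative; the real class lives in the Executor runtime and supports more element types):

```cpp
#include <cassert>

// Illustrative sketch: ScalarValue's payload becomes a union that can hold
// either real or complex data, discriminated by a kind tag.
struct ScalarValue {
  struct Complex {
    double re, im;
  };
  enum class Kind { Real, Complex } kind;
  union {
    double real;
    Complex cplx;
  };

  static ScalarValue makeReal(double v) {
    ScalarValue s;
    s.kind = Kind::Real;
    s.real = v;
    return s;
  }

  static ScalarValue makeComplex(double re, double im) {
    ScalarValue s;
    s.kind = Kind::Complex;
    s.cplx = Complex{re, im};
    return s;
  }
};
```

Callers check `kind` before reading the corresponding union member, which is the usual discipline for this kind of discriminated storage.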

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation and for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preperation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiences. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change inroduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantic of a "feature" depends
    on the the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function argument/results. This override mechnanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one
    such it should not have been introduced and was not bound to any
    option. Further, remove unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommmon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR, because `arith.constant` does not support
    signed or unsigned (non-signless) integer types. If the "verify-each" option is off,
    the issue will not be caught, since it happens to be self-correcting.
    However, it can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signed/unsigned
    integer types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency in the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is creating an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking whether they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desirable to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    matches the `func.func` return value order.

    Valid MLIR test is added.
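The role of such a reordering callback can be sketched in plain Python (a stand-in for the C++ `std::function` above; the function name and list-based types here are illustrative, not the actual API):

```python
def reorder_yield_values(yield_values, return_order):
    """Reorder cluster yield values to match the enclosing func.func
    return value order; values not in the return list keep their
    relative order at the end."""
    ordered = [v for v in return_order if v in yield_values]
    rest = [v for v in yield_values if v not in return_order]
    return ordered + rest

# Non-deterministic use order produced (%a, %b), but the function
# returns (%b, %a); the callback restores the return order.
print(reorder_yield_values(["%a", "%b"], ["%b", "%a"]))  # ['%b', '%a']
```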

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 1, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the
    storage a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.
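A minimal Python analogue of the widened storage (a hypothetical class for illustration; the real `ScalarValue` is C++ and stores a union):

```python
class ScalarValue:
    """Scalar runtime value whose storage now admits complex data
    in addition to real data (mirroring the real/complex union)."""
    def __init__(self, value):
        self.is_complex = isinstance(value, complex)
        self.value = complex(value)  # single storage wide enough for both

    def real(self):
        return self.value.real

print(ScalarValue(3.0 + 4.0j).is_complex)  # True
print(ScalarValue(2.0).real())             # 2.0
```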

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.
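The trade-off can be seen with NumPy as a stand-in: a contraction whose dimensions are not already in batched-matmul layout is a single einsum equation, while the matmul route needs explicit transposes (illustrative only; the pass emits `tensorrt.einsum` / `tensorrt.matrix_multiply` ops, not NumPy):

```python
import numpy as np

# A dot_general-style contraction: batch dim b, contracting dim k.
lhs = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)  # [b, k, m]
rhs = np.arange(2 * 5 * 3, dtype=np.float32).reshape(2, 5, 3)  # [b, n, k]

# Einsum expresses the contraction directly.
direct = np.einsum("bkm,bnk->bmn", lhs, rhs)

# The matrix_multiply route must first coerce both operands into
# batched-matmul layout -- the kind of inserted transposes/reshapes
# this commit tries to avoid.
coerced = np.matmul(lhs.transpose(0, 2, 1), rhs.transpose(0, 2, 1))

assert np.allclose(direct, coerced) and direct.shape == (2, 4, 5)
```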

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`,
    since that pass is sensitive to mismatched host spaces in patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to an executable (which is a flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when the attr is an elided dense resource, and it results
    in a segfault. This MR handles the situation by replacing the elided resource with a `DenseElementsAttr`
    of all `one`s (`true` in the boolean case).

    IR with elided resources is usually seen only during testing of passes and is not useful for
    e2e functional execution. Testing of the `ExecuteConstantFoldableSubgraphs` pass is one such case.
    Thus, MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process of creating a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently the former does nothing;
    only `config.parallelism_group` has any effect.
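For reference, the working shape of this in a `lit.cfg.py` fragment (the group name and width below are illustrative):

```python
# Declare the group and its width on lit_config...
lit_config.parallelism_groups["trt-tests"] = 4
# ...but opt the suite in via config; this is the setting that
# actually takes effect.
config.parallelism_group = "trt-tests"
```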

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIT test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the estimated parallelism by
      increasing host memory requirements and counting only 50% of available
      host memory for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    a fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.
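The core of the while-to-for uplift, sketched in Python (the real transformation operates on `scf.while`/`scf.for`; the scalarized counter `i` is exactly the for-style induction variable the heuristics look for):

```python
def sum_as_while(lb, ub, step):
    # Before: while-form loop; the induction variable is just
    # another carried value tested in the condition.
    i, acc = lb, 0
    while i < ub:
        acc += i
        i += step
    return acc

def sum_as_for(lb, ub, step):
    # After: for-form loop with explicit bounds and step, which
    # unrolling and constant folding can reason about directly.
    acc = 0
    for i in range(lb, ub, step):
        acc += i
    return acc

print(sum_as_while(0, 10, 2), sum_as_for(0, 10, 2))  # 20 20
```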

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.
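The implication rule ("tensorrt" and "nccl" both require "cuda") amounts to a small transitive closure over a feature table; a self-contained sketch (the feature names come from the commit, the function itself is illustrative):

```python
# Enabling a feature on the left implicitly enables those on the right.
IMPLIED_FEATURES = {"tensorrt": {"cuda"}, "nccl": {"cuda"}}

def resolve_features(requested):
    """Return the requested feature set plus all transitively
    implied features."""
    resolved, worklist = set(), list(requested)
    while worklist:
        feature = worklist.pop()
        if feature not in resolved:
            resolved.add(feature)
            worklist.extend(IMPLIED_FEATURES.get(feature, ()))
    return resolved

print(sorted(resolve_features({"core", "tensorrt"})))  # ['core', 'cuda', 'tensorrt']
```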

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function argument/results. This override mechnanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one
    such it should not have been introduced and was not bound to any
    option. Further, remove unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommmon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fullfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless-and-signfull integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, first step in lowering a
    cluster containing TensorRT ops is created inline group op.
    Operands to the yield op (i.e. terminator) of inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic and sometimes
    it is desired to get yield results in a certian order.

    This MR makes the following changes,
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to `mlir::createRegionOpFromCluster` method. This callback function
    has signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` which takes cluster values used
    outside the cluster (in SetVector) and their types. By default this is
    set to nullptr.
    2. TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, `ReorderRegionOpYieldValues`
    callback is implemented to make sure inline group op yield value order
    is same as func.func return values order.

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 1, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.
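
A sketch of how such constraints might look on a function (the `plan.memory_space` attribute name comes from the commit message; the memory-space values and exact attribute syntax are assumptions):

```mlir
// Function-level default space plus a per-argument override:
func.func @main(
    %arg0: tensor<4xf32> {plan.memory_space = #plan.memory_space<host>},
    %arg1: tensor<4xf32>)
    -> (tensor<4xf32> {plan.memory_space = #plan.memory_space<device>})
    attributes {plan.memory_space = #plan.memory_space<device>} {
  // %arg1 and intermediate tensors inherit the function-level 'device'
  // default; %arg0 is pinned to 'host' by its argument attribute.
  %0 = stablehlo.add %arg1, %arg1 : tensor<4xf32>
  return %0 : tensor<4xf32>
}
```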

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.
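
The basic rewrite is the same as in the upstream pass: a returned buffer becomes an output argument. A minimal sketch (function and value names are illustrative):

```mlir
// Before: the callee allocates and returns a buffer.
func.func @producer(%arg0: memref<4xf32>) -> memref<4xf32> {
  %alloc = memref.alloc() : memref<4xf32>
  memref.copy %arg0, %alloc : memref<4xf32> to memref<4xf32>
  return %alloc : memref<4xf32>
}

// After: the caller provides the output buffer and the result is dropped.
func.func @producer(%arg0: memref<4xf32>, %out: memref<4xf32>) {
  memref.copy %arg0, %out : memref<4xf32> to memref<4xf32>
  return
}
```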

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.
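
A sketch of what "loops on tensors" means here: an elementwise Linalg op is lowered to an `scf.for` that threads the result tensor through `iter_args` instead of going through memrefs first (shapes and names are illustrative):

```mlir
// An elementwise add over tensor<4xf32>, lowered to a loop that stays on
// tensors via tensor.extract / tensor.insert:
%res = scf.for %i = %c0 to %c4 step %c1
    iter_args(%acc = %init) -> (tensor<4xf32>) {
  %a = tensor.extract %lhs[%i] : tensor<4xf32>
  %b = tensor.extract %rhs[%i] : tensor<4xf32>
  %sum = arith.addf %a, %b : f32
  %upd = tensor.insert %sum into %acc[%i] : tensor<4xf32>
  scf.yield %upd : tensor<4xf32>
}
```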

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' member
    variable and an inherited 'disabled' member variable. Delete the inherited
    one, since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signful integer
    types.
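
Concretely, the fix bridges the signedness mismatch with a cast (a sketch based on the commit description; the unsigned element type is illustrative):

```mlir
// stablehlo.constant may carry an unsigned (signful) element type that
// arith.constant cannot represent directly:
//   %c = stablehlo.constant dense<1> : tensor<4xui32>
// The pass now materializes the signless constant and bridges the types:
%cst = arith.constant dense<1> : tensor<4xi32>
%c = builtin.unrealized_conversion_cast %cst : tensor<4xi32> to tensor<4xui32>
```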

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is to create an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking whether they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desirable to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to the `mlir::createRegionOpFromCluster` method. This callback
    has the signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>` and takes the cluster values used
    outside the cluster (in a SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the `func.func` return value order.

    Valid MLIR test is added.
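
To illustrate the intent of the reordering (the `plan.inline_group` op name and its syntax here are hypothetical placeholders for the actual region op):

```mlir
func.func @engine(%a: tensor<4xf32>, %b: tensor<4xf32>)
    -> (tensor<4xf32>, tensor<4xf32>) {
  %0:2 = plan.inline_group ... {
    %x = tensorrt.element_wise ...
    %y = tensorrt.element_wise ...
    // With the callback installed, the yield order is forced to match
    // the enclosing function's return order:
    yield %x, %y : tensor<4xf32>, tensor<4xf32>
  }
  return %0#0, %0#1 : tensor<4xf32>, tensor<4xf32>
}
```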

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 1, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    lot of potential optimizations which are rooted on `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes analysis much easier to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearence of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of `after` and `before` arguments of `scf.while` are now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider number of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation and for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preperation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiences. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function argument/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signful integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependencies for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is creating an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desired to get yield results in a certain order.

    This MR makes the following changes,
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to the `mlir::createRegionOpFromCluster` method. This callback function
    has the signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>`, which takes the cluster values used
    outside the cluster (in a SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    is the same as the `func.func` return value order.

    Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 1, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue, which represents a scalar runtime value, did not
    support complex types. This MR adds support for complex types by making the storage
    a union of real and complex data instead of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow-on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops were scalarized. The transform should only scalarize operands
    that are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving transfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to an executable (which is a flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when the attr is an elided dense resource, and results
    in a segfault. This MR handles this situation by replacing the elided resource with a `DenseElementsAttr`
    of all `one`s (`true` in the case of booleans).

    IR with elided resources is usually seen only during testing of passes and is not useful for
    e2e functional execution. Testing of the `ExecuteConstantFoldableSubgraphs` pass is one such case.
    Thus, MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process of creating a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing;
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIT test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick
    a fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on
    a lot of potential optimizations that are rooted in `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier for analyses to gauge the benefit
    of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of the `after` and `before` arguments of `scf.while` is now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider range of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

shelkesagar29 pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 2, 2025
This PR moves the following internal changes to OSS,

commit 2986cac97888b5a9c1cd9064e4728cb38ca9dc45
Author: Sagar Shelke <[email protected]>

    [executor] Add complex type support to `ScalarValue`

    Previously, ScalarValue which represents scalar runtime value did not
    support complex type. This MR adds support for complex type by making storage
    union of real and complex data instaed of just real.

    MLIR tests are added via constant subgraph execution.

commit cf83a0d318b8035695d0b9fd24d578733632e253
Author: Christopher Bate <[email protected]>

    [compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

    Previously, we relied on canonicalization of `stablehlo.dot_general` to
    put all such contraction operations into a form that could be converted to
    `tensorrt.matrix_multiply`. Based on recent experiments, this can actually
    produce very inefficient TensorRT programs due to the number of reshapes
    and transpositions that must be inserted to coerce general
    `stablehlo.dot_general` into batched matrix multiplications. This change
    enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
    the pass and patterns now contain configurable parameters to control whether
    `tensorrt.einsum` is used as the primary method or only for fallback when
    conversion to `tensorrt.matrix_multiply` is not possible.

    A follow on change will revamp the Stablehlo preprocessing that we
    perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
    and enable wider use of this pattern.

commit 528651ed1cd36c36376180c1c2232526ce972fef
Author: Christopher Bate <[email protected]>

    [compiler] Fix stablehlo-to-scf scalarization heuristics

    Fixes an issue where float tensors in the 'before' region of converted
    while loops where scalarized. The transform should only scalarize operands
    which are likely to be for-style induction variables.

commit 1d52e0a9e30dc104178c4761c1a24153abc7ea90
Author: Christopher Bate <[email protected]>

    [compiler] NFC: Drop dead code from StablehloToExecutableTask

commit f1c8d8c7cd860aedfe339d76ef7fb953baf9bd55
Author: Chris Bate <[email protected]>

    [compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

    Adds a simple pass to promote "host" tensors to "host-pinned" tensors
    in common cases where we know a tensor will be transferred between host
    and device spaces. This pass runs after `plan-optimize-memory-spaces`
    since the former is sensitive to mismatching host spaces for patterns
    related to moving tranfers out of loops.

commit c27d56ea7a9661395e17fa895c610a79a92fa0c2
Author: Sagar Shelke <[email protected]>

    [executor] Handle elided dense resource elements attr during translation

    Translation to executable (which is flatbuffer) uses MLIR attr serialization to serialize
    `ElementsAttr`. However, this doesn't work when attr is elided dense resource and results
    in segfault. This MR handles this situation by replacing elided resource with `DenseElementsAttr`
    of all `one`s (`true` in case of boolean).

    IR with elided resource is usally seen only during testing of passes and not useful for
    e2e functional execution. Testing of `ExecuteConstantFoldableSubgraphs` pass is such case.
    Thus,  MLIR test cases for this pass are added.

commit 920a84e648833764563d3dc1de544a8f1b9f027e
Author: Chris Bate <[email protected]>

    [tensorrt] Fix TRT layer name generation function

    The TRT layer naming had some faulty logic that could cause the layer
    name to grow very large in the process to create a unique name. Fix
    the issue and use a static counter to reduce time spent in the loop.

commit ff0c5fa4bf5321ad0ce18579598c49f4b552fb37
Author: Christopher Bate <[email protected]>

    Further fixes to LIT configs

    Previously, we were setting `lit_config.parallelism_group` instead of
    `config.parallelism_group`. Apparently, the previous method does nothing,
    only `config.parallelism_group` has any effect.

commit d65c220b712c262992dbdf5a87fa3220a06bfb21
Author: Chris Bate <[email protected]>

    Update LIG test parallelism configs

    In more recent versions of TensorRT (10.11+ at least), the builder is taking a
    much larger amount of host memory. This can cause OOM when running the LIT test
    suites under their existing configurations.

    This change updates all LIT configs:

    - Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
      stall if there are not enough GPU or host resources available. Add
      a hard limit that there must be at least 5GB of host memory available.

    - Update configurations to reduce the amount of estimated parallelism by
      increasing host memory requirements and reducing the amount of host
      memory to 50% for the purposes of the parallelism calculation.

    - Force all tests to use a common parallelism group unless otherwise
      specified in the test config.

commit 1f996f607640d81bf7137a4ed874b20c2a16cca2
Author: Christopher Bate <[email protected]>

    [compiler] Fix failure case in stablehlo-to-scf

    Fixes a failure case due to one of the recently introduced rewrites in
    `stablehlo-to-scf`.

commit 2779b632465fc3e840f5ce987f6233e824fe2ed3
Author: Christopher Bate <[email protected]>

    [compiler] Further improvements to plan bufferization pipeline

    - Split `plan-assign-memory-spaces` into three passes:
      - `plan-assign-memory-spaces`
      - `plan-optimize-memory-spaces`
      - `plan-materialize-explicit-transfers`
    - The last one is the only new code: `plan-materialize-explicit-transfers`
      converts `tensor.cast` ops that change the memory space encoding into
      explicit `bufferization.alloc_tensor` +
      `bufferization.materialize_in_destination` operations.
    - Improve handling of `bufferization.alloc_tensor` and optimization of
      `scf.for` iteration args in `plan-assign-memory-spaces`.
    - Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
    - Fix handling of `tensor.reshape` when rewriting functions to be in
      DPS style in `plan-alloc-tensors`.

    This change also updates the LLVM dependencies in order to cherry-pick a
    fix to the `tensor.reshape` bufferization interface that I merged
    upstream (llvm/llvm-project#128590).

    In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

commit 312170d8cbcd4c1fcf9cefdd628583e3dbdcc4f5
Author: Chris Bate <[email protected]>

    [compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

    Stablehlo only has one type of loop construct, `stablehlo.while`. The
    `while` loop can represent "for"-style loops as well, but if we only
    have `scf.while` loops after conversion to SCF, then we miss out on a
    lot of potential optimizations that are rooted in `scf.for`.
    Experiments show that complicated JAX programs like the
    PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
    to `scf.for` where possible. This improves opportunities for
    constant folding and makes it much easier for analyses to gauge the
    benefit of transforms like unrolling.

    This change adds some patterns to the Stablehlo-to-Scf pass to enable
    While-to-For conversion after the Stablehlo-to-Scf conversion. This
    transformation is combined with the Stablehlo-to-Scf conversion because
    the While-to-For patterns require first scalarizing block arguments of
    the While operation. The heuristics for which block arguments should
    be scalarized are implemented as control callbacks for the scalarization
    patterns. These callbacks need Stablehlo-specific logic, so it makes sense
    to test the combined conversion as a single pass. From the pass users'
    perspective, it gives the appearance of going directly from `stablehlo.while`
    to `scf.for`.

    The test cases are updated to cover the new patterns.

commit 425d19e749104354b5ea9e76e7509d029f9eac59
Author: Chris Bate <[email protected]>

    [compiler] Fix assign-memory-spaces pass to respect function-level constraints

    Fixes an issue where the `plan.memory_space` attribute on a function was
    not being respected when converting function signatures.

    MR: initialdl/mlir-tensorrt!2146

commit b612d5a22e7e3c4f08bf80fd504df5193b370bd3
Author: Chris Bate <[email protected]>

    [compiler] Update scf.while detensorization to increase flexibility

    In order to incorporate the upstream "uplift scf.while to scf.for"
    transformation as part of the `stablehlo-to-scf` conversion, we need to
    detensorize the operands of `scf.while` that are likely to correspond to
    the loop induction variable. This change refactors our existing 'scf.while'
    detensorization transformation to give more flexibility and control. The
    TensorKindAnalysis is no longer required in order to use the pattern(s).
    Detensorization of the `after` and `before` arguments of `scf.while` is now
    controlled separately.

commit 3e21bf465b90e1eaaad872da40c305b70253cce0
Author: Chris Bate <[email protected]>

    [compiler] Improve handling of memory space constraints in the Plan dialect

    This commit improves the handling of memory space constraints in the
    Plan dialect. Constraints are now specified using a common attribute
    'plan.memory_space' that can be applied to functions or individual
    arguments/results. In addition, patterns in `plan-alloc-tensors`
    and `plan-assign-memory-spaces` are updated to avoid introducing
    unnecessary transfers between memory spaces.

commit 36a3b4a77242685e473817cb692a4010f690c0b3
Author: Chris Bate <[email protected]>

    [compiler] Add plan-buffer-results-to-out-params pass

    This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
    This pass is based on the upstream Bufferization pass
    `buffer-results-to-out-params`, but it can handle a wider variety of
    cases (such as promoting dynamic allocations) and uses alias analysis
    utilities to guard against failure cases that the upstream pass currently
    cannot handle. These improvements should eventually be upstreamed back
    to the Bufferization dialect.

commit 9e7127ca1e61be72b032a54d270a3da0d75639b2
Author: Chris Bate <[email protected]>

    [compiler] Update func conversion in host-to-emitc

    In the EmitC conversion/translation process, you can use `func.func`
    or `emitc.func` to define functions. Previously, we converted all
    `func.func` to `emitc.func`. However, `emitc.func` does not have a
    path for supporting multiple return values. Therefore, prefer use
    of type conversions on `func.func` instead of converting the entire
    op to `emitc.func`. Add tests to verify that we can support multiple
    return values.

commit 934db1f78ef3e7bedb67f1252b41ded7419010f8
Author: Chris Bate <[email protected]>

    [compiler] Fix two host-to-emitc bugs

    This change fixes two bugs exposed by new 'host-to-emitc' conversion
    testing:

    - The `!emitc.size_t` type does not have DataLayout information specified
      upstream. Therefore, to ensure that the type can be queried using
      DataLayout, we add a DataLayoutTypeInterface external model to the type.
      All queries are simply mapped to queries to the `index` type.

    - The upstream `func.call` conversion has a bug where it does not
      correctly convert the result types of the call operation, which
      can lead to a type mismatch for any type that does not have an
      identity conversion.

    Additional tests are added to `host-to-emitc`. Eventually the fixes for
    both these issues should be moved upstream.

commit 9d27f08ee4429f4ffbb72023babc193c7724a700
Author: Chris Bate <[email protected]>

    [common] Add Linalg-to-loops (on tensors) implementation and conversion pass

    Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
    a conversion pass is added that converts ToLoopsOpInterface operations to loops.

commit 3a419f120808eafc31f45516977ed6169b809ab9
Author: Chris Bate <[email protected]>

    NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

    Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
    is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
    while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
    pipeline.

    MR: initialdl/mlir-tensorrt!2137

commit 442bea12b763dd36fce864695f63896912438d87
Author: Christopher Bate <[email protected]>

    NFC: Fix formatting across several files

commit b2a65bc3e806aaa95d932af512cfa4750a9cbe4e
Author: Chris Bate <[email protected]>

    [executor] Introduce RuntimeSession "features" to control loading of runtime modules

    Previously, the RuntimeSession would always load all available runtime
    modules. This causes some inefficiencies. For example, in certain integration
    tests for the Executor runtime, we don't use CUDA at all. However, because
    CUDA is still initialized by default, we would still require a GPU to
    be present just to run the integration test. Furthermore, some experimental
    modules (e.g. the Lua cublas module) are not ready for "production" use and
    are only really invoked inside special integration tests.

    This change introduces a notion of "features" to the RuntimeSession and
    RuntimeSessionOptions. A feature is just a string that identifies a
    particular runtime component. The particular semantics of a "feature" depend
    on the actual runtime implementation. For example, for the
    LuaRuntimeSession, the feature names correspond to the available Lua
    "modules" (a module is just a group of C++ Lua extension functions),
    e.g. "core", "cuda", "tensorrt", etc.

    The RuntimeSessionOptions gains methods for enabling/disabling features.
    Certain features cause others to be added to the set automatically, e.g.
    "tensorrt" and "nccl" both require "cuda" to be added.

    The API is piped through all the way to the Python bindings to allow
    control of loaded modules at all levels. To preserve existing behavior,
    RuntimeSessions created from Python will load all available modules by
    default, but the `executor-runner|mlir-tensorrt-runner` tools now require
    features to be explicitly specified.
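The implied-feature behavior ("tensorrt" and "nccl" both pull in "cuda") can be sketched as a transitive closure over a dependency table (illustrative Python standing in for the C++ RuntimeSessionOptions methods; the table contents and function name are assumptions based on the description):

```python
# Hypothetical feature-dependency table mirroring the description:
# enabling some features automatically enables others.
FEATURE_DEPS = {
    "tensorrt": {"cuda"},
    "nccl": {"cuda"},
    "cuda": set(),
    "core": set(),
}

def enable_features(requested: set[str]) -> set[str]:
    """Return the requested features plus all transitive dependencies."""
    enabled: set[str] = set()
    stack = list(requested)
    while stack:
        feature = stack.pop()
        if feature in enabled:
            continue
        enabled.add(feature)
        # Push this feature's dependencies for processing.
        stack.extend(FEATURE_DEPS.get(feature, set()))
    return enabled
```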

commit b90f8f345b2941e958f3a1cc5bcac21daebe783b
Author: Christopher Bate <[email protected]>

    NFC: Fix include guard for 'mlir-executor/Support/Status.h'

commit cdbe1f560483047291a30115a043a60bdce34d99
Author: Sagar Shelke <[email protected]>

    [compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

    This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
    pipeline.

    MLIR test is added.

commit 6ea3ab77aa2909cee11d08aa24543f247e8a24bf
Author: Chris Bate <[email protected]>

    [compiler] Add "default memory space" to ClusterKindAttrInterface

    Adds a new method to the ClusterKindAttrInterface so that backends can
    control the default tensor encoding (#plan.memory_space<..>) assigned by
    the `plan.assign-memory-spaces` pass at a function-scope level. In
    addition, we also allow an attribute to override the default space
    at function arguments/results. This override mechanism was previously
    lacking and will help resolve a long-standing issue where users cannot
    control the memory space of arguments/results reliably.

commit 0ea59238f5c280ab3ffbc340bb9aee7ed7bfbebb
Author: Christopher Bate <[email protected]>

    [compiler] Fix some issues related to pipeline extension mechanism

    The StablehloToExecutableTensorRTExtension had both a 'disable' and
    an inherited 'disabled' member variable. Delete the inherited one,
    since it should not have been introduced and was not bound to any
    option. Further, remove the unused 'extensions' vector from
    CompilationTaskOptionsBase.

commit 372476d77fcaa399460965ab7bfc052f0e44c99f
Author: Christopher Bate <[email protected]>

    [executor] Fix ptrtoint and inttoptr op translation to Lua

    Previously, we could generate conflicting function types (due to pointer
    address space) when converting `executor.ptrtoint` and `executor.inttoptr`
    ops to opaque calls. Instead, defer the conversion to function call until
    the actual Lua translation point. At that point we can generate a
    function name without having to consider the pointer address space.

commit 75d18534fa67b452dd2253d6981bda6954bf1056
Author: Chris Bate <[email protected]>

    Introduce 'MLIRTensorRTCommmon' sub-project

    Certain targets need to be used across multiple sub-projects. For example,
    the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
    the sub-projects need to be independently buildable. This change introduces
    another sub-project under the 'common' directory where shared code can be
    placed. This allows us to use `find_package` to declare the dependency, and
    downstream consumers to meet the requirement using any number of
    techniques to fulfill the 'find_package' call.

commit d7d8104087cf272bdd08f6330f27734754f0d71d
Author: Chris Bate <[email protected]>

    [compiler] Harden `stablehlo.constant` to `arith.constant` conversion

    There is a utility pass that runs in the stablehlo-to-executable pipeline
    that converts `stablehlo.constant` to `arith.constant`. This pass
    can temporarily create invalid IR due to `arith.constant` not supporting
    signful integer types. If the "verify-each" option is off,
    then the issue will not be caught since it happens to be self-correcting.
    However, the issue can still cause verification failures while debugging.
    This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
    operation to bridge the type change between signless and signful integer
    types.

commit a500de82a7bd70d6bfe32234719b4daa7cf32a8a
Author: Chris Bate <[email protected]>

    Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

commit cd56aa6a511e2091fcd86106f20d27ff3673db75
Author: Christopher Bate <[email protected]>

    Fix build with BUILD_SHARED_LIBS=ON

    The new InferTensorValueRangeInterface was used without correctly
    specifying the library dependency for the PlanIR and StablehloExtIR
    libraries.

commit cf1aff0ad0997947ab87485cfeec4595cb0285d7
Author: Sagar Shelke <[email protected]>

    [compiler] Maintain output order in TensorRT engine.

    For TensorRT engine conversion, the first step in lowering a
    cluster containing TensorRT ops is creating an inline group op.
    Operands to the yield op (i.e. the terminator) of the inline group op
    are values from the cluster that are used outside the cluster.
    These values are collected by getting the uses of each op (with
    `op->getUses()`) and checking if they are outside the cluster.
    However, this use order is not deterministic, and sometimes
    it is desired to get yield results in a certain order.

    This MR makes the following changes:
    1. Add a function callback option named `ReorderRegionOpYieldValues`
    to the `mlir::createRegionOpFromCluster` method. This callback function
    has the signature `std::function<void(SetVector<Value> &yieldValues,
    SmallVectorImpl<Type> &yieldTypes)>`, which takes the cluster values used
    outside the cluster (in a SetVector) and their types. By default this is
    set to nullptr.
    2. The TensorRTToExecutable task is used in cases where a single `func.func`
    represents a single TensorRT engine. In this case, the `ReorderRegionOpYieldValues`
    callback is implemented to make sure the inline group op yield value order
    matches the `func.func` return value order.

    Valid MLIR test is added.
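The reordering callback described in item 2 above can be sketched as follows (illustrative Python standing in for the C++ `std::function`; the real code operates on MLIR `Value`s and their types, and the function name is an assumption):

```python
def reorder_yield_values(yield_values: list, return_values: list) -> list:
    """Reorder cluster yield values so they match the enclosing
    function's return-value order. Yielded values that are not among
    the function's return values keep their relative order at the end."""
    returned = [v for v in return_values if v in yield_values]
    rest = [v for v in yield_values if v not in return_values]
    return returned + rest
```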

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f
christopherbate pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 10, 2025
christopherbate pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 10, 2025
This PR moves the following internal changes to OSS,

Author: Sagar Shelke <[email protected]>

[executor] Add complex type support to `ScalarValue`

Previously, ScalarValue, which represents a scalar runtime value, did not
support complex types. This MR adds support for complex types by making the
storage a union of real and complex data instead of just real.

MLIR tests are added via constant subgraph execution.

Author: Christopher Bate <[email protected]>

[compiler] Enable more `stablehlo.dot_general` to TensorRT using `tensorrt.einsum`

Previously, we relied on canonicalization of `stablehlo.dot_general` to
put all such contraction operations into a form that could be converted to
`tensorrt.matrix_multiply`. Based on recent experiments, this can actually
produce very inefficient TensorRT programs due to the number of reshapes
and transpositions that must be inserted to coerce general
`stablehlo.dot_general` into batched matrix multiplications. This change
enables conversion of `stablehlo.dot_general` to `tensorrt.einsum`, and
the pass and patterns now contain configurable parameters to control whether
`tensorrt.einsum` is used as the primary method or only for fallback when
conversion to `tensorrt.matrix_multiply` is not possible.

A follow-on change will revamp the Stablehlo preprocessing that we
perform on 'stablehlo.dot_general' to avoid creating inefficient patterns
and enable wider use of this pattern.
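
The einsum-based lowering can be sketched with a hypothetical helper (not the actual pass code): label every lhs dimension, share labels with the rhs across the batch and contracting dimension pairs, and emit the result dimensions in `dot_general` order (batch, then lhs free, then rhs free), so no reshapes or transposes are needed:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: build an einsum equation string from
// stablehlo.dot_general-style dimension numbers.
std::string dotGeneralToEinsum(int lhsRank, int rhsRank,
                               const std::vector<int> &lhsBatch,
                               const std::vector<int> &rhsBatch,
                               const std::vector<int> &lhsContract,
                               const std::vector<int> &rhsContract) {
  auto contains = [](const std::vector<int> &v, int d) {
    return std::find(v.begin(), v.end(), d) != v.end();
  };
  char next = 'a';
  std::vector<char> lhs(lhsRank), rhs(rhsRank, 0);
  for (char &c : lhs)
    c = next++; // label every lhs dimension
  for (size_t i = 0; i < lhsBatch.size(); ++i)
    rhs[rhsBatch[i]] = lhs[lhsBatch[i]]; // batch dims share labels
  for (size_t i = 0; i < lhsContract.size(); ++i)
    rhs[rhsContract[i]] = lhs[lhsContract[i]]; // contracting dims too
  for (char &c : rhs)
    if (!c)
      c = next++; // fresh label for each rhs free dimension
  // Result dims: batch, then lhs free, then rhs free.
  std::string out;
  for (int d : lhsBatch)
    out += lhs[d];
  for (int d = 0; d < lhsRank; ++d)
    if (!contains(lhsBatch, d) && !contains(lhsContract, d))
      out += lhs[d];
  for (int d = 0; d < rhsRank; ++d)
    if (!contains(rhsBatch, d) && !contains(rhsContract, d))
      out += rhs[d];
  return std::string(lhs.begin(), lhs.end()) + "," +
         std::string(rhs.begin(), rhs.end()) + "->" + out;
}
```

For a batched matmul (batch dims {0}/{0}, contracting dims {2}/{1}) this yields the familiar `abc,acd->abd`.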

Author: Christopher Bate <[email protected]>

[compiler] Fix stablehlo-to-scf scalarization heuristics

Fixes an issue where float tensors in the 'before' region of converted
while loops were scalarized. The transform should only scalarize operands
which are likely to be for-style induction variables.

Author: Christopher Bate <[email protected]>

[compiler] NFC: Drop dead code from StablehloToExecutableTask

Author: Chris Bate <[email protected]>

[compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

Adds a simple pass to promote "host" tensors to "host-pinned" tensors
in common cases where we know a tensor will be transferred between host
and device spaces. This pass runs after `plan-optimize-memory-spaces`
since `plan-optimize-memory-spaces` is sensitive to mismatched host spaces
in patterns related to moving transfers out of loops.

Author: Sagar Shelke <[email protected]>

[executor] Handle elided dense resource elements attr during translation

Translation to an executable (which is a flatbuffer) uses MLIR attr serialization to serialize
`ElementsAttr`. However, this doesn't work when the attr is an elided dense resource and results
in a segfault. This MR handles this situation by replacing the elided resource with a
`DenseElementsAttr` of all `one`s (`true` in the case of boolean).

IR with elided resources is usually seen only during testing of passes and is not useful for
e2e functional execution. Testing of the `ExecuteConstantFoldableSubgraphs` pass is one such case.
Thus, MLIR test cases for this pass are added.

Author: Chris Bate <[email protected]>

[tensorrt] Fix TRT layer name generation function

The TRT layer naming had some faulty logic that could cause the layer
name to grow very large in the process of creating a unique name. Fix
the issue and use a static counter to reduce time spent in the loop.
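
A minimal sketch of the fixed scheme (a hypothetical helper, not the real translation code): each candidate name is derived from the original base name rather than from the previous candidate, so names cannot snowball, and a function-static counter avoids rescanning suffixes from zero on every call:

```cpp
#include <cassert>
#include <set>
#include <string>

// Return a name not yet in `used`, registering it as used. Candidates are
// always "<base>_<counter>", never built on top of a prior candidate, so
// the name length stays bounded.
std::string makeUniqueLayerName(std::set<std::string> &used,
                                const std::string &base) {
  static unsigned counter = 0; // persists across calls to skip taken suffixes
  std::string name = base;
  while (!used.insert(name).second) // insert fails while the name is taken
    name = base + "_" + std::to_string(counter++);
  return name;
}
```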

Author: Christopher Bate <[email protected]>

Further fixes to LIT configs

Previously, we were setting `lit_config.parallelism_group` instead of
`config.parallelism_group`. Apparently, the previous method does nothing;
only `config.parallelism_group` has any effect.

Author: Chris Bate <[email protected]>

Update LIT test parallelism configs

In more recent versions of TensorRT (10.11+ at least), the builder is taking a
much larger amount of host memory. This can cause OOM when running the LIT test
suites under their existing configurations.

This change updates all LIT configs:

- Make sure to use `%pick-one-gpu` in the LIT command line to ensure we
    stall if there are not enough GPU or host resources available. Add
    a hard limit that there must be at least 5GB of host memory available.

- Update configurations to reduce the amount of estimated parallelism by
    increasing host memory requirements and reducing the amount of host
    memory considered available to 50% for the purposes of the parallelism
    calculation.

- Force all tests to use a common parallelism group unless otherwise
    specified in the test config.

Author: Christopher Bate <[email protected]>

[compiler] Fix failure case in stablehlo-to-scf

Fixes a failure case due to one of the recently introduced rewrites in
`stablehlo-to-scf`.

Author: Christopher Bate <[email protected]>

[compiler] Further improvements to plan bufferization pipeline

- Split `plan-assign-memory-spaces` into three passes:
    - `plan-assign-memory-spaces`
    - `plan-optimize-memory-spaces`
    - `plan-materialize-explicit-transfers`
- The last one is the only new code: `plan-materialize-explicit-transfers`
    converts `tensor.cast` ops that change the memory space encoding into
    explicit `bufferization.alloc_tensor` +
    `bufferization.materialize_in_destination` operations.
- Improve handling of `bufferization.alloc_tensor` and optimization of
    `scf.for` iteration args in `plan-assign-memory-spaces`.
- Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
- Fix handling of `tensor.reshape` when rewriting functions to be in
    DPS style in `plan-alloc-tensors`.

This change also updates the LLVM dependencies in order to cherry-pick a
fix to the `tensor.reshape` bufferization interface that I merged
upstream (llvm/llvm-project#128590).

In addition, fix APInt assertions in `plan-execute-constant-foldable-subgraphs`.

Author: Chris Bate <[email protected]>

[compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

Stablehlo only has one type of loop construct, `stablehlo.while`. The
`while` loop can represent "for"-style loops as well, but if we only
have `scf.while` loops after conversion to SCF, then we miss out on a
lot of potential optimizations which are rooted in `scf.for`.
Experiments show that complicated JAX programs like the
PhysicalIntelligence Pi0 model can benefit from converting `scf.while`
to `scf.for` where possible. This improves opportunities for
constant folding and makes it much easier to gauge the benefit
of transforms like unrolling.

This change adds some patterns to the Stablehlo-to-Scf pass to enable
While-to-For conversion after the Stablehlo-to-Scf conversion. This
transformation is combined with the Stablehlo-to-Scf conversion because
the While-to-For patterns require first scalarizing block arguments of
the While operation. The heuristics for which block arguments should
be scalarized are implemented as control callbacks for the scalarization
patterns. These callbacks need Stablehlo-specific logic, so it makes sense
to test the combined conversion as a single pass. From the pass users'
perspective, it gives the appearance of going directly from `stablehlo.while`
to `scf.for`.

The test cases are updated to cover the new patterns.
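
The uplift can be pictured with ordinary loops (a plain C++ analogy, not the MLIR patterns themselves): once the induction variable is scalarized and has a recognizable lower bound, upper bound, and constant step, the while form is equivalent to a for loop:

```cpp
#include <cassert>

// "While" form: the induction variable iv is carried as a loop argument,
// compared against ub in the "before" region, and incremented by a
// constant step in the "after" region.
int sumWhileStyle(int lb, int ub, int step) {
  int acc = 0, iv = lb;
  while (iv < ub) { // "before" region: condition on the scalar iv
    acc += iv;      // "after" region: loop body
    iv += step;     // constant-step increment
  }
  return acc;
}

// Uplifted "for" form: the same computation with explicit lb/ub/step,
// which is far friendlier to constant folding and unrolling analyses.
int sumForStyle(int lb, int ub, int step) {
  int acc = 0;
  for (int iv = lb; iv < ub; iv += step)
    acc += iv;
  return acc;
}
```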

Author: Chris Bate <[email protected]>

[compiler] Fix assign-memory-spaces pass to respect function-level constraints

Fixes an issue where the `plan.memory_space` attribute on a function was
not being respected when converting function signatures.

MR: initialdl/mlir-tensorrt!2146

Author: Chris Bate <[email protected]>

[compiler] Update scf.while detensorization to increase flexibility

In order to incorporate the upstream "uplift scf.while to scf.for"
transformation as part of the `stablehlo-to-scf` conversion, we need to
detensorize the operands of `scf.while` that are likely to correspond to
the loop induction variable. This change refactors our existing 'scf.while'
detensorization transformation to give more flexibility and control. The
TensorKindAnalysis is no longer required in order to use the pattern(s).
Detensorization of the `after` and `before` arguments of `scf.while` is now
controlled separately.

Author: Chris Bate <[email protected]>

[compiler] Improve handling of memory space constraints in the Plan dialect

This change improves how memory space constraints are handled in the
Plan dialect. Constraints are now specified using a common attribute
'plan.memory_space' that can be applied to functions or individual
arguments/results. In addition, patterns in `plan-alloc-tensors`
and `plan-assign-memory-spaces` are updated to avoid introducing
unnecessary transfers between memory spaces.

Author: Chris Bate <[email protected]>

[compiler] Add plan-buffer-results-to-out-params pass

This change adds a new Plan dialect pass `plan-buffer-results-to-out-params`.
This pass is based on the upstream Bufferization pass
`buffer-results-to-out-params`, but it can handle a wider range of
cases (such as promoting dynamic allocations) and uses alias analysis
utilities to guard against failure cases that the upstream pass currently
cannot handle. These improvements should eventually be upstreamed back
to the Bufferization dialect.
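
The transformation the pass performs can be illustrated at the C++ level (illustrative functions, not the pass's actual output): a function that returns a callee-allocated buffer is rewritten so the caller supplies the output buffer:

```cpp
#include <cassert>
#include <cstdlib>

// Before: "returned buffer" style -- the callee allocates the result,
// analogous to a function returning a freshly allocated memref.
int *iotaAlloc(int n) {
  int *buf = static_cast<int *>(std::malloc(n * sizeof(int)));
  for (int i = 0; i < n; ++i)
    buf[i] = i;
  return buf;
}

// After: "out-param" style -- the caller owns and provides the buffer.
// Promoting even the dynamic allocation above into this form is the kind
// of case the pass handles beyond the upstream version.
void iotaOut(int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = i;
}
```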

Author: Chris Bate <[email protected]>

[compiler] Update func conversion in host-to-emitc

In the EmitC conversion/translation process, you can use `func.func`
or `emitc.func` to define functions. Previously, we converted all
`func.func` to `emitc.func`. However, `emitc.func` does not have a
path for supporting multiple return values. Therefore, prefer use
of type conversions on `func.func` instead of converting the entire
op to `emitc.func`. Add tests to verify that we can support multiple
return values.

Author: Chris Bate <[email protected]>

[compiler] Fix two host-to-emitc bugs

This change fixes two bugs exposed by new 'host-to-emitc' conversion
testing:

- The `!emitc.size_t` type does not have DataLayout information specified
    upstream. Therefore, to ensure that the type can be queried using
    DataLayout, we add a DataLayoutTypeInterface external model to the type.
    All queries are simply mapped to queries to the `index` type.

- The upstream `func.call` conversion has a bug where it does not
    correctly convert the result types of the call operation, which
    can lead to a type mismatch for any type that does not have an
    identity conversion.

Additional tests are added to `host-to-emitc`. Eventually the fixes for
both these issues should be moved upstream.

Author: Chris Bate <[email protected]>

[common] Add Linalg-to-loops (on tensors) implementation and conversion pass

Adds a ToLoopsOpInterface implementation for Linalg operations. In addition,
a conversion pass is added that converts ToLoopsOpInterface operations to loops.

Author: Chris Bate <[email protected]>

NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project. This
is in preparation for enabling the ToLoopsOpInterface on LinalgOp (lowering
while still using Tensor types) to replace the `convert-stablehlo-arith-to-scalar`
pipeline.

MR: initialdl/mlir-tensorrt!2137

Author: Christopher Bate <[email protected]>

NFC: Fix formatting across several files

Author: Chris Bate <[email protected]>

[executor] Introduce RuntimeSession "features" to control loading of runtime modules

Previously, the RuntimeSession would always load all available runtime
modules. This causes some inefficiencies. For example, in certain integration
tests for the Executor runtime, we don't use CUDA at all. However, because
CUDA is still initialized by default, we would still require a GPU to
be present just to run the integration test. Furthermore, some experimental
modules (e.g. Lua cublas module) are not ready for "production" use and
are only really invoked inside special integration tests.

This change introduces a notion of "features" to the RuntimeSession and
RuntimeSessionOptions. A feature is just a string that identifies a
particular runtime component. The particular semantics of a "feature" depend
on the actual runtime implementation. For example, for the
LuaRuntimeSession, the feature names correspond to the available Lua
"modules" (a module is just a group of C++ Lua extension functions),
e.g. "core", "cuda", "tensorrt", etc.

The RuntimeSessionOptions gains methods for enabling/disabling features.
Certain features cause others to be added to the set automatically, e.g.
"tensorrt" and "nccl" both require "cuda" to be added.

The API is piped through all the way to the Python bindings to allow
control of loaded modules at all levels. To preserve existing behavior,
RuntimeSessions created from Python will load all available modules by
default, but the `executor-runner|mlir-tensorrt-runner` tools now require
features to be explicitly specified.
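
The implied-dependency behavior can be sketched as a closure over the requested feature set (the function name and dependency table are assumptions for illustration, not the real RuntimeSessionOptions API):

```cpp
#include <cassert>
#include <set>
#include <string>

// Expand a requested feature set with implied dependencies: per the
// description above, "tensorrt" and "nccl" both pull in "cuda".
std::set<std::string> enableFeatures(std::set<std::string> requested) {
  if (requested.count("tensorrt") || requested.count("nccl"))
    requested.insert("cuda");
  return requested;
}
```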

Author: Christopher Bate <[email protected]>

NFC: Fix include guard for 'mlir-executor/Support/Status.h'

Author: Sagar Shelke <[email protected]>

[compiler/lib] Add stablehlo composite to call pass to pre-processing pipeline

This MR adds `StablehloLegalizeCompositeToCallPass` to the pre-processing
pipeline.

MLIR test is added.

Author: Chris Bate <[email protected]>

[compiler] Add "default memory space" to ClusterKindAttrInterface

Adds a new method to the ClusterKindAttrInterface so that backends can
control the default tensor encoding (#plan.memory_space<..>) assigned by
the `plan.assign-memory-spaces` pass at a function-scope level. In
addition, we also allow an attribute to override the default space
at function arguments/results. This override mechanism was previously
lacking and will help resolve a long-standing issue where users cannot
control the memory space of arguments/results reliably.

Author: Christopher Bate <[email protected]>

[compiler] Fix some issues related to pipeline extension mechanism

The StablehloToExecutableTensorRTExtension had both 'disable' and
an inherited 'disabled' member variable. Delete the inherited one,
since it should not have been introduced and was not bound to any
option. Further, remove the unused 'extensions' vector from
CompilationTaskOptionsBase.

Author: Christopher Bate <[email protected]>

[executor] Fix ptrtoint and inttoptr op translation to Lua

Previously, we could generate conflicting function types (due to pointer
address space) when converting `executor.ptrtoint` and `executor.inttoptr`
ops to opaque calls. Instead, defer the conversion to function call until
the actual Lua translation point. At that point we can generate a
function name without having to consider the pointer address space.

Author: Chris Bate <[email protected]>

Introduce 'MLIRTensorRTCommmon' sub-project

Certain targets need to be used across multiple sub-projects. For example,
the 'TensorRTDynamicLoader' target is used in all sub-projects. In addition,
the sub-projects need to be independently buildable. This change introduces
another sub-project under the 'common' directory where shared code can be
placed. This allows us to use `find_package` to declare the dependency, and
downstream consumers to meet the requirement using any number of
techniques to fulfill the 'find_package' call.

Author: Chris Bate <[email protected]>

[compiler] Harden `stablehlo.constant` to `arith.constant` conversion

There is a utility pass that runs in the stablehlo-to-executable pipeline
that converts `stablehlo.constant` to `arith.constant`. This pass
can temporarily create invalid IR due to `arith.constant` not supporting
signful integer types. If the "verify-each" option is off,
then the issue will not be caught since it happens to be self-correcting.
However, the issue can still cause verification failures while debugging.
This change fixes the issue by adding a `builtin.unrealized_conversion_cast`
operation to bridge the type change between signless and signful integer
types.

Author: Chris Bate <[email protected]>

Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

Author: Christopher Bate <[email protected]>

Fix build with BUILD_SHARED_LIBS=ON

The new InferTensorValueRangeInterface was used without correctly
specifying the library dependency on the PlanIR and StablehloExtIR
libraries.

Author: Sagar Shelke <[email protected]>

[compiler] Maintain output order in TensorRT engine.

For TensorRT engine conversion, the first step in lowering a
cluster containing TensorRT ops is to create an inline group op.
Operands to the yield op (i.e. the terminator) of the inline group op
are values from the cluster that are used outside the cluster.
These values are collected by getting the uses of each op (with
`op->getUses()`) and checking if they are outside the cluster.
However, this use order is not deterministic, and sometimes
it is desired to get the yield results in a certain order.

This MR makes the following changes:
1. Add a function callback option named `ReorderRegionOpYieldValues`
to the `mlir::createRegionOpFromCluster` method. This callback function
has the signature `std::function<void(SetVector<Value> &yieldValues,
SmallVectorImpl<Type> &yieldTypes)>`, which takes the cluster values used
outside the cluster (in a SetVector) and their types. By default this is
set to nullptr.
2. The TensorRTToExecutable task is used in cases where a single
`func.func` represents a single TensorRT engine. In this case, the
`ReorderRegionOpYieldValues` callback is implemented to make sure the
inline group op yield value order is the same as the func.func return
value order.

Valid MLIR test is added.

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f

christopherbate pushed a commit to NVIDIA/TensorRT-Incubator that referenced this pull request Jul 10, 2025
Integrate internal changes

Author: Sagar Shelke <[email protected]>

[executor] Add complex type support to `ScalarValue`

Previously, ScalarValue, which represents a scalar runtime value, did
not support complex types. This MR adds complex support by making the
storage a union of real and complex data instead of real data only.

MLIR tests are added via constant subgraph execution.
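The storage change can be sketched in isolation. This is a hypothetical, simplified analogue (the real `ScalarValue` lives in the executor runtime and handles MLIR-typed data); here `std::variant` plays the role of the real/complex union:

```cpp
#include <cassert>
#include <complex>
#include <variant>

// Hypothetical sketch: a scalar value whose storage is a union (variant)
// of real and complex data, rather than a single real field.
class ScalarValue {
public:
  explicit ScalarValue(double real) : storage(real) {}
  explicit ScalarValue(std::complex<double> c) : storage(c) {}

  bool isComplex() const {
    return std::holds_alternative<std::complex<double>>(storage);
  }

  // The real part is accessible uniformly for both alternatives.
  double real() const {
    if (auto *c = std::get_if<std::complex<double>>(&storage))
      return c->real();
    return std::get<double>(storage);
  }

  // A purely real value has a zero imaginary part.
  double imag() const {
    if (auto *c = std::get_if<std::complex<double>>(&storage))
      return c->imag();
    return 0.0;
  }

private:
  std::variant<double, std::complex<double>> storage;
};
```
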

Author: Christopher Bate <[email protected]>

[compiler] Enable converting more `stablehlo.dot_general` ops to TensorRT
using `tensorrt.einsum`

Previously, we relied on canonicalization of `stablehlo.dot_general`
to put all such contraction operations into a form that could be
converted to `tensorrt.matrix_multiply`. Based on recent experiments,
this can actually produce very inefficient TensorRT programs due to
the number of reshapes and transpositions that must be inserted to
coerce general `stablehlo.dot_general` into batched matrix
multiplications. This change enables conversion of
`stablehlo.dot_general` to `tensorrt.einsum`, and the pass and
patterns now contain configurable parameters to control whether
`tensorrt.einsum` is used as the primary method or only for fallback
when conversion to `tensorrt.matrix_multiply` is not possible.
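The core of such a conversion is deriving an einsum equation from the dot_general dimension numbers: batching dims share a label on both operands and stay in the output, contracting dims share a label but are dropped, and free dims get fresh labels. A standalone sketch (the helper name and exact label ordering are assumptions, not the pass's actual implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: build an einsum equation string from
// dot_general-style batching/contracting dimension numbers.
std::string buildEinsumEquation(int lhsRank, int rhsRank,
                                std::vector<int> lhsBatch,
                                std::vector<int> rhsBatch,
                                std::vector<int> lhsContract,
                                std::vector<int> rhsContract) {
  std::string lhs(lhsRank, '?'), rhs(rhsRank, '?');
  char next = 'a';
  // Batching dims: same label on both operands, kept in the result.
  for (size_t i = 0; i < lhsBatch.size(); ++i)
    lhs[lhsBatch[i]] = rhs[rhsBatch[i]] = next++;
  // Contracting dims: same label on both operands, dropped from the result.
  for (size_t i = 0; i < lhsContract.size(); ++i)
    lhs[lhsContract[i]] = rhs[rhsContract[i]] = next++;
  // Remaining (free) dims each get a fresh label.
  for (char &c : lhs) if (c == '?') c = next++;
  for (char &c : rhs) if (c == '?') c = next++;
  // Result: batch labels, then lhs free labels, then rhs free labels.
  std::string out;
  for (size_t i = 0; i < lhsBatch.size(); ++i) out += lhs[lhsBatch[i]];
  for (int d = 0; d < lhsRank; ++d)
    if (std::find(lhsBatch.begin(), lhsBatch.end(), d) == lhsBatch.end() &&
        std::find(lhsContract.begin(), lhsContract.end(), d) == lhsContract.end())
      out += lhs[d];
  for (int d = 0; d < rhsRank; ++d)
    if (std::find(rhsBatch.begin(), rhsBatch.end(), d) == rhsBatch.end() &&
        std::find(rhsContract.begin(), rhsContract.end(), d) == rhsContract.end())
      out += rhs[d];
  return lhs + "," + rhs + "->" + out;
}
```

For a batched matmul (batch dims [0]x[0], contracting dims [2]x[1]) this produces the familiar three-letter batched-matmul equation; no reshapes or transposes need to be materialized in the IR.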

A follow-on change will revamp the Stablehlo preprocessing that we
perform on 'stablehlo.dot_general' to avoid creating inefficient
patterns and enable wider use of this pattern.

Author: Christopher Bate <[email protected]>

[compiler] Fix stablehlo-to-scf scalarization heuristics

Fixes an issue where float tensors in the 'before' region of converted
while loops were scalarized. The transform should only scalarize
operands that are likely to be for-style induction variables.

Author: Christopher Bate <[email protected]>

[compiler] NFC: Drop dead code from StablehloToExecutableTask

Author: Chris Bate <[email protected]>

[compiler] Add `plan-promote-host-tensors-to-host-pinned` pass

Adds a simple pass to promote "host" tensors to "host-pinned" tensors
in common cases where we know a tensor will be transferred between
host and device spaces. This pass runs after
`plan-optimize-memory-spaces`, since the latter is sensitive to
mismatching host spaces in its patterns for moving transfers out of
loops.

Author: Sagar Shelke <[email protected]>

[executor] Handle elided dense resource elements attr during
translation

Translation to an executable (which is a flatbuffer) uses MLIR attr
serialization to serialize `ElementsAttr`. However, this doesn't work
when the attr is an elided dense resource and results in a segfault.
This MR handles the situation by replacing the elided resource with a
`DenseElementsAttr` of all ones (`true` in the boolean case).

IR with elided resources is usually seen only during testing of passes
and is not useful for e2e functional execution. Testing of the
`ExecuteConstantFoldableSubgraphs` pass is such a case, so MLIR test
cases for this pass are added.

Author: Chris Bate <[email protected]>

[tensorrt] Fix TRT layer name generation function

The TRT layer naming had some faulty logic that could cause the layer
name to grow very large while making the name unique. Fix the issue
and use a static counter to reduce time spent in the loop.
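The counter-based scheme can be sketched standalone (hypothetical helper; the real code operates on TensorRT layer names). Instead of repeatedly appending suffixes until a name is unique, which can make names grow without bound, a shared counter produces a short suffix:

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch of counter-based unique-name generation. A static
// counter avoids rescanning candidate suffixes from zero on every call,
// and names stay "<base>_<n>" rather than growing by accretion.
std::string makeUniqueLayerName(std::set<std::string> &usedNames,
                                const std::string &base) {
  if (usedNames.insert(base).second)
    return base; // Base name was still free.
  static unsigned counter = 0; // Shared across calls.
  std::string candidate;
  do {
    candidate = base + "_" + std::to_string(counter++);
  } while (!usedNames.insert(candidate).second);
  return candidate;
}
```
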

Author: Christopher Bate <[email protected]>

Further fixes to LIT configs

Previously, we were setting `lit_config.parallelism_group` instead of
`config.parallelism_group`. Apparently, the former does nothing; only
`config.parallelism_group` has any effect.

Author: Chris Bate <[email protected]>

Update LIT test parallelism configs

In more recent versions of TensorRT (10.11+ at least), the builder is
taking a much larger amount of host memory. This can cause OOM when
running the LIT test suites under their existing configurations.

This change updates all LIT configs:

- Make sure to use `%pick-one-gpu` in the LIT command line to ensure
    we stall if there are not enough GPU or host resources available.
    Add a hard limit that there must be at least 5GB of host memory
    available.

- Update configurations to reduce the amount of estimated parallelism
    by increasing host memory requirements and reducing the amount of
    host memory to 50% for the purposes of the parallelism
    calculation.

- Force all tests to use a common parallelism group unless otherwise
    specified in the test config.

Author: Christopher Bate <[email protected]>

[compiler] Fix failure case in stablehlo-to-scf

Fixes a failure case due to one of the recently introduced rewrites in
`stablehlo-to-scf`.

Author: Christopher Bate <[email protected]>

[compiler] Further improvements to plan bufferization pipeline

- Split `plan-assign-memory-spaces` into three passes:
    - `plan-assign-memory-spaces`
    - `plan-optimize-memory-spaces`
    - `plan-materialize-explicit-transfers`
- The last one is the only new code:
    `plan-materialize-explicit-transfers` converts `tensor.cast` ops
    that change the memory space encoding into explicit
    `bufferization.alloc_tensor` +
    `bufferization.materialize_in_destination` operations.
- Improve handling of `bufferization.alloc_tensor` and optimization of
    `scf.for` iteration args in `plan-assign-memory-spaces`.
- Improve handling of `tensor.reshape` in `plan-assign-memory-spaces`.
- Fix handling of `tensor.reshape` when rewriting functions to be in
    DPS style in `plan-alloc-tensors`.

This change also updates the LLVM dependencies in order to cherry-pick
a fix to the `tensor.reshape` bufferization interface that I merged
upstream (llvm/llvm-project#128590).

In addition, fix APInt assertions in
`plan-execute-constant-foldable-subgraphs`.

Author: Chris Bate <[email protected]>

[compiler] Enable While-to-For conversion in Stablehlo-to-Scf pass

This change adds some patterns to the Stablehlo-to-Scf pass to enable
While-to-For conversion after the Stablehlo-to-Scf conversion. This
transformation is combined with the Stablehlo-to-Scf conversion
because the While-to-For patterns require first scalarizing block
arguments of the While operation. The heuristics for which block
arguments should be scalarized are implemented as control callbacks
for the scalarization patterns. These callbacks need
Stablehlo-specific logic, so it makes sense to test the combined
conversion as a single pass. From the pass user's perspective, it
gives the appearance of going directly from `stablehlo.while` to
`scf.for`.

The test cases are updated to cover the new patterns.

Author: Chris Bate <[email protected]>

[compiler] Fix assign-memory-spaces pass to respect function-level
constraints

Fixes an issue where the `plan.memory_space` attribute on a function
was not being respected when converting function signatures.

MR: initialdl/mlir-tensorrt!2146

Author: Chris Bate <[email protected]>

[compiler] Update scf.while detensorization to increase flexibility

In order to incorporate the upstream "uplift scf.while to scf.for"
transformation as part of the `stablehlo-to-scf` conversion, we need
to detensorize the operands of `scf.while` that are likely to
correspond to the loop induction variable. This change refactors our
existing 'scf.while' detensorization transformation to give more
flexibility and control. The TensorKindAnalysis is no longer required
in order to use the pattern(s). Detensorization of `after` and
`before` arguments of `scf.while` are now controlled separately.

Author: Chris Bate <[email protected]>

[compiler] Improve handling of memory space constraints in the Plan
dialect

This change reworks how memory space constraints are handled in the
Plan dialect. Constraints are now specified using a common attribute,
'plan.memory_space', that can be applied to functions or individual
arguments/results. In addition, patterns in `plan-alloc-tensors` and
`plan-assign-memory-spaces` are updated to avoid introducing
unnecessary transfers between memory spaces.

Author: Chris Bate <[email protected]>

[compiler] Add plan-buffer-results-to-out-params pass

This change adds a new Plan dialect pass
`plan-buffer-results-to-out-params`. This pass is based on the
upstream Bufferization pass `buffer-results-to-out-params`, but it can
handle a wider number of cases (such as promoting dynamic allocations)
and uses alias analysis utilities to guard against failure cases that
the upstream pass currently cannot handle. These improvements should
eventually be upstreamed back to the Bufferization dialect.

Author: Chris Bate <[email protected]>

[compiler] Update func conversion in host-to-emitc

In the EmitC conversion/translation process, you can use `func.func`
or `emitc.func` to define functions. Previously, we converted all
`func.func` to `emitc.func`. However, `emitc.func` does not have a
path for supporting multiple return values. Therefore, prefer use of
type conversions on `func.func` instead of converting the entire op to
`emitc.func`. Add tests to verify that we can support multiple return
values.

Author: Chris Bate <[email protected]>

[compiler] Fix two host-to-emitc bugs

This change fixes two bugs exposed by new 'host-to-emitc' conversion
testing:

- The `!emitc.size_t` type does not have DataLayout information
    specified upstream. Therefore, to ensure that the type can be
    queried using DataLayout, we add a DataLayoutTypeInterface
    external model to the type. All queries are simply mapped to
    queries to the `index` type.

- The upstream `func.call` conversion has a bug where it does not
    correctly convert the result types of the call operation, which
    can lead to a type mismatch for any type that does not have an
    identity conversion.

Additional tests are added to `host-to-emitc`. Eventually the fixes
for both these issues should be moved upstream.

Author: Chris Bate <[email protected]>

[common] Add Linalg-to-loops (on tensors) implementation and
conversion pass

Adds a ToLoopsOpInterface implementation for Linalg operations. In
addition, a conversion pass is added that converts ToLoopsOpInterface
operations to loops.

Author: Chris Bate <[email protected]>

NFC: Move ToLoopsOpInterface to 'mlir-tensorrt-common'

Moves the ToLoopsOpInterface to the 'mlir-tensorrt-common' project.
This is in preparation for enabling the ToLoopsOpInterface on LinalgOp
(lowering while still using tensor types) to replace the
`convert-stablehlo-arith-to-scalar` pipeline.

MR: initialdl/mlir-tensorrt!2137

Author: Christopher Bate <[email protected]>

NFC: Fix formatting across several files

Author: Chris Bate <[email protected]>

[executor] Introduce RuntimeSession "features" to control loading of
runtime modules

Previously, the RuntimeSession would always load all available runtime
modules. This causes some inefficiencies. For example, certain
integration tests for the Executor runtime don't use CUDA at all.
However, because CUDA was still initialized by default, we would still
require a GPU to be present just to run the integration test.
Furthermore, some experimental modules (e.g. the Lua cublas module)
are not ready for "production" use and are only really invoked inside
special integration tests.

This change introduces a notion of "features" to the RuntimeSession
and RuntimeSessionOptions. A feature is just a string that identifies
a particular runtime component. The precise semantics of a "feature"
depend on the actual runtime implementation. For example, for the
LuaRuntimeSession, the feature names correspond to the available Lua
"modules" (a module is just a group of C++ Lua extension functions),
e.g. "core", "cuda", "tensorrt", etc.

The RuntimeSessionOptions gains methods for enabling/disabling
features. Certain features cause others to be added to the set
automatically, e.g. "tensorrt" and "nccl" both require "cuda" to be
added.
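The dependency-closure behavior can be sketched in isolation. The feature names come from the text above, but the helper name and the dependency table itself are assumptions for illustration:

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch: enabling a feature also enables the features it
// depends on, e.g. "tensorrt" and "nccl" both pull in "cuda".
std::set<std::string> resolveFeatures(const std::set<std::string> &requested) {
  std::set<std::string> result = requested;
  for (const std::string &f : requested) {
    // Assumed dependency table; the real options object would own this.
    if (f == "tensorrt" || f == "nccl")
      result.insert("cuda");
  }
  return result;
}
```
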

The API is piped through all the way to the Python bindings to allow
control of loaded modules at all levels. To preserve existing
behavior, RuntimeSessions created from Python will load all available
modules by default, but the `executor-runner|mlir-tensorrt-runner`
tools now require features to be explicitly specified.

Author: Christopher Bate <[email protected]>

NFC: Fix include guard for 'mlir-executor/Support/Status.h'

Author: Sagar Shelke <[email protected]>

[compiler/lib] Add stablehlo composite to call pass to pre-processing
pipeline

This MR adds `StablehloLegalizeCompositeToCallPass` to the
pre-processing pipeline.

MLIR test is added.

Author: Chris Bate <[email protected]>

[compiler] Add "default memory space" to ClusterKindAttrInterface

Adds a new method to the ClusterKindAttrInterface so that backends can
control the default tensor encoding (#plan.memory_space<..>) assigned
by the `plan.assign-memory-spaces` pass at a function-scope level. In
addition, we also allow an attribute to override the default space at
function argument/results. This override mechnanism was previously
lacking and will help resolve a long-standing issue where users cannot
control the memory space of arguments/results reliably.

Author: Christopher Bate <[email protected]>

[compiler] Fix some issues related to pipeline extension mechanism

The StablehloToExecutableTensorRTExtension had both a 'disable' and an
inherited 'disabled' member variable. Delete the inherited one, since
it should not have been introduced and was not bound to any option.
Further, remove the unused 'extensions' vector from
CompilationTaskOptionsBase.

Author: Christopher Bate <[email protected]>

[executor] Fix ptrtoint and inttoptr op translation to Lua

Previously, we could generate conflicting function types (due to
pointer address space) when converting `executor.ptrtoint` and
`executor.inttoptr` ops to opaque calls. Instead, defer the conversion
to function call until the actual Lua translation point. At that point
we can generate a function name without having to consider the pointer
address space.
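A minimal sketch of the idea: derive the helper function's name from the integer bit width alone, so two casts that differ only in pointer address space share one function name and one function type. The naming scheme here is hypothetical, not the translator's actual one:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: helper names depend only on the integer type,
// never on the pointer's address space, so no conflicting function
// types can arise for the same bit width.
std::string getPtrToIntHelperName(unsigned intBitWidth) {
  return "_ptrtoint_i" + std::to_string(intBitWidth);
}

std::string getIntToPtrHelperName(unsigned intBitWidth) {
  return "_inttoptr_i" + std::to_string(intBitWidth);
}
```
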

Author: Chris Bate <[email protected]>

Introduce 'MLIRTensorRTCommon' sub-project

Certain targets need to be used across multiple sub-projects. For
example, the 'TensorRTDynamicLoader' target is used in all
sub-projects. In addition, the sub-projects need to be independently
buildable. This change introduces another sub-project under the
'common' directory where shared code can be placed. This allows us to
use `find_package` to declare the dependency, and downstream consumers
to meet the requirement using any number of techniques to fulfill the
'find_package' call.

Author: Chris Bate <[email protected]>

[compiler] Harden `stablehlo.constant` to `arith.constant` conversion

There is a utility pass that runs in the stablehlo-to-executable
pipeline that converts `stablehlo.constant` to `arith.constant`. This
pass can temporarily create invalid IR, because `arith.constant` does
not support signful integer types. If the "verify-each" option is off,
the issue will not be caught, since it happens to be self-correcting.
However, the issue can still cause verification failures while
debugging. This change fixes the issue by adding a
`builtin.unrealized_conversion_cast` operation to bridge the type
change between signless and signful integer types.

Author: Chris Bate <[email protected]>

Integrate LLVM at f137c3d592e96330e450a8fd63ef7e8877fc1908

Author: Christopher Bate <[email protected]>

Fix build with BUILD_SHARED_LIBS=ON

The new InferTensorValueRangeInterface was used without correctly
specifying the library dependency in the PlanIR and StablehloExtIR
libraries.

Author: Sagar Shelke <[email protected]>

[compiler] Maintain output order in TensorRT engine.

For TensorRT engine conversion, the first step in lowering a cluster
containing TensorRT ops is creating an inline group op. Operands to
the yield op (i.e. the terminator) of the inline group op are values
from the cluster that are used outside the cluster. These values are
collected by getting the uses of each op (with `op->getUses()`) and
checking whether they are outside the cluster. However, this use order
is not deterministic, and sometimes it is desired to get yield results
in a certain order.

This MR makes the following changes:
1. Add a function callback option named `ReorderRegionOpYieldValues`
to the `mlir::createRegionOpFromCluster` method. This callback
function has the signature `std::function<void(SetVector<Value>
&yieldValues, SmallVectorImpl<Type> &yieldTypes)>`, which takes the
cluster values used outside the cluster (in a SetVector) and their
types. By default this is set to nullptr.
2. The TensorRTToExecutable task is used in cases where a single
`func.func` represents a single TensorRT engine. In this case, the
`ReorderRegionOpYieldValues` callback is implemented to make sure the
inline group op yield value order matches the func.func return value
order.

A valid MLIR test is added.
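Outside of MLIR, the callback's job can be sketched with plain strings standing in for `Value`s. This is a hypothetical, simplified analogue of `ReorderRegionOpYieldValues` (the real callback mutates a `SetVector<Value>` in place):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: the yielded values arrive in nondeterministic
// use order; reorder them to match the enclosing function's return
// order. Strings stand in for MLIR Values.
std::vector<std::string>
reorderYieldValues(std::vector<std::string> yieldValues,
                   const std::vector<std::string> &returnOrder) {
  std::vector<std::string> reordered;
  // Walk the desired order and keep each value that was actually yielded.
  for (const std::string &wanted : returnOrder)
    for (const std::string &yielded : yieldValues)
      if (yielded == wanted)
        reordered.push_back(yielded);
  return reordered;
}
```
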

GitOrigin-RevId: 630a69d8e14506db43cfefe4be2c790f9352da4f

Co-authored-by: Copybara Bot <[email protected]>