
Commit 63aca2f

De-tensorize function boundary pass (#1904)
**Context:** Scalar tensors across function boundaries are prevalent due to JAX, requiring extra instructions to extract elements from tensor arguments and to construct tensors from scalar elements when passing values to functions.

**Description of the Change:** A new MLIR pass detensorizes function boundaries. Bridging `tensor.extract` and `tensor.from_elements` ops are inserted and subsequently folded. An illustrative before/after IR sketch is included at the end of this message.

**Benefits:** Reduced instruction count.

### Time score

|Workflow |Catalyst (qjit+exec)| |Catalyst (exec)| |
| :--- | ---: | :---: | ---: | :---: |
|QPE |0.99 | |0.99 | |
|QSVT |0.99 | |1.00 | |
|XAS |1.40 |🟢|0.99 | |
|shor |0.98 | |0.98 | |
|molecular_hamiltonian| - | | - | |
|sampling |1.00 | |1.02 | |
|stateprep |0.98 | |0.68 |🔴|
|grover |1.02 | |0.96 |🔴|
|QAOA_layers_scaling |1.00 | |1.01 | |
|QML |0.98 | |1.00 | |
|QML_jaxjit | - | | - | |
|UCCSD |0.98 | |1.02 | |
|VQE | - | |1.00 | |

## Detailed Per-Workflow Results

- The assumed noise level for runtime improvements/regressions is 3.0%
- ⚠️ marks workflows with runtime fluctuations greater than 5.0% (std/mean)

### Catalyst (compilation + execution)

|Workflow |time [s]|std/mean| |time score| |virt mem [MB]|virt mem score| |phys mem [MB]|phys mem score| |
| :--- | ---: | ---: | :---: | ---: | :---: | ---: | ---: | :---: | ---: | ---: | :---: |
|QPE[11-12] |1.196 |7.1% |⚠️|0.97 | |365.729 | - | |116.785 | - | |
|QPE[12-12] |2.222 |6.5% |⚠️|1.00 | |499.972 | - | |106.713 | - | |
|QSVT[9] |26.464 |18.0% |⚠️|0.99 | |4,279.710 | - | |2,471.539 | - | |
|XAS[2-1-9] |28.783 |0.8% | |1.46 |🟢|4,424.084 | - | |2,501.321 | - | |
|XAS[2-2-9] |38.110 |0.6% | |1.33 |🟢|4,424.159 | - | |2,500.624 | - | |
|shor[15] |1.288 |7.7% |⚠️|0.98 | |237.171 | - | |138.236 | - | |
|shor[33] |3.168 |3.2% | |0.99 | |237.761 | - | |141.140 | - | |
|sampling[24-2] |1.563 |5.0% | |1.02 | |1,046.111 | - | |116.871 | - | |
|sampling[25-2] |3.411 |3.1% | |0.99 | |1,852.645 | - | |106.881 | - | |
|stateprep[12-MottonenStatePreparation]|7.099 |2.8% | |0.98 | |404.347 | - | |229.323 | - | |
|grover[18] |9.678 |0.9% | |1.02 | |459.678 | - | |291.979 | - | |
|QAOA_layers_scaling[19-4] |111.156 |0.4% | |1.00 | |584.816 | - | |5,387.448 | - | |
|QML[IQPKernelClassifier-12-10] |121.986 |1.1% | |0.98 | |2,122.109 | - | |912.646 | - | |
|UCCSD[H2O-STO-3G] |12.956 |1.4% | |0.99 | |532.305 | - | |286.573 | - | |
|UCCSD[NH3-STO-3G] |33.076 |2.3% | |0.97 | |1,025.818 | - | |537.739 | - | |
|VQE[H2O-STO-3G] |21.477 |0.8% | |0.99 | |475.394 | - | |271.380 | - | |
|VQE[NH3-STO-3G] |72.532 |0.7% | | - | |1,224.448 | - | |785.674 | - | |

### Catalyst (execution only)

|Workflow |time [s]|std/mean| |time score| |virt mem [MB]|virt mem score| |phys mem [MB]|phys mem score| |
| :--- | ---: | ---: | :---: | ---: | :---: | ---: | ---: | :---: | ---: | ---: | :---: |
|QPE[11-12] |0.896 |1.0% | |0.98 | |365.954 | - | |117.281 | - | |
|QPE[12-12] |2.060 |0.7% | |0.99 | |500.201 | - | |116.912 | - | |
|QSVT[9] |0.143 |6.2% |⚠️|1.00 | |4,279.643 | - | |2,471.780 | - | |
|XAS[2-1-9] |9.320 |0.7% | |1.00 | |4,424.198 | - | |2,502.304 | - | |
|XAS[2-2-9] |18.713 |0.4% | |0.99 | |4,424.159 | - | |2,496.918 | - | |
|shor[15] |0.051 |6.1% |⚠️|0.96 |🔴|237.228 | - | |138.162 | - | |
|shor[33] |1.817 |1.5% | |1.00 | |237.699 | - | |140.599 | - | |
|sampling[24-2] |1.368 |1.6% | |1.03 |🟢|1,046.060 | - | |103.055 | - | |
|sampling[25-2] |3.186 |1.3% | |1.00 | |1,856.906 | - | |123.036 | - | |
|stateprep[12-MottonenStatePreparation]|0.033 |19.5% |⚠️|0.68 |🔴|404.091 | - | |242.000 | - | |
|grover[18] |1.561 |1.6% | |0.96 |🔴|468.432 | - | |308.101 | - | |
|QAOA_layers_scaling[19-4] |106.051 |0.5% | |1.01 | |584.730 | - | |5,745.373 | - | |
|QML[IQPKernelClassifier-12-10] |97.194 |3.2% | |1.00 | |1,184.518 | - | |623.976 | - | |
|UCCSD[H2O-STO-3G] |0.113 |1.9% | |1.05 |🟢|530.752 | - | |292.098 | - | |
|UCCSD[NH3-STO-3G] |1.153 |4.2% | |0.99 | |1,025.717 | - | |550.011 | - | |
|VQE[H2O-STO-3G] |2.067 |1.8% | |1.00 | |475.332 | - | |272.044 | - | |
|VQE[NH3-STO-3G] |16.754 |1.0% | |1.01 | |1,224.450 | - | |785.883 | - | |

**Possible Drawbacks:**

**Related GitHub Issues:** [sc-95476]
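**Illustrative IR sketch (not part of this diff):** a minimal hand-written example of the rewrite the pass performs; function and value names are made up. A rank-0 `tensor<f64>` crossing a call boundary becomes a plain `f64`, bridged by `tensor.extract`/`tensor.from_elements` ops that the subsequent `canonicalize` pass can fold, after which `symbol-dce` removes the dead original callee.

```mlir
// Before: a scalar wrapped in a rank-0 tensor crosses the boundary of @callee.
func.func private @callee(%arg0: tensor<f64>) -> tensor<f64> {
  return %arg0 : tensor<f64>
}

func.func @caller(%arg0: tensor<f64>) -> tensor<f64> {
  %0 = call @callee(%arg0) : (tensor<f64>) -> tensor<f64>
  return %0 : tensor<f64>
}

// After detensorize-function-boundary: a clone with a scalar signature is created
// and the call site is bridged. The original @callee becomes dead (later removed
// by symbol-dce), and matching extract/from_elements pairs fold under canonicalize.
func.func private @callee.detensorized(%arg0: f64) -> f64 {
  %0 = tensor.from_elements %arg0 : tensor<f64>
  %1 = tensor.extract %0[] : tensor<f64>
  return %1 : f64
}

func.func @caller(%arg0: tensor<f64>) -> tensor<f64> {
  %0 = tensor.extract %arg0[] : tensor<f64>
  %1 = call @callee.detensorized(%0) : (f64) -> f64
  %2 = tensor.from_elements %1 : tensor<f64>
  return %2 : tensor<f64>
}
```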
1 parent 6b2db96 commit 63aca2f

9 files changed: +402, -1 lines changed


doc/releases/changelog-dev.md

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@

 <h3>Improvements 🛠</h3>

+* Added `detensorize-function-boundary` pass to remove scalar tensors across function boundaries and enabled the `symbol-dce` pass to remove dead functions, reducing the number of instructions for compilation.
+  [(#1904)](https://github.com/PennyLaneAI/catalyst/pull/1904)
+
 * Workflows `for_loop`, `while_loop` and `cond` now error out if `qml.capture` is enabled.
   [(#1945)](https://github.com/PennyLaneAI/catalyst/pull/1945)

frontend/catalyst/pipelines.py

Lines changed: 2 additions & 0 deletions
@@ -239,7 +239,9 @@ def get_hlo_lowering_stage(_options: CompileOptions) -> List[str]:
         "cse",
         "func.func(linalg-detensorize{aggressive-mode})",
         "detensorize-scf",
+        "detensorize-function-boundary",
         "canonicalize",
+        "symbol-dce",
     ]
     return hlo_lowering
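The pass ordering above is deliberate: `canonicalize` runs immediately after `detensorize-function-boundary` so the inserted bridge ops fold away, and `symbol-dce` then deletes the original tensorized functions once nothing calls them. A rough sketch of the fold, relying on the upstream `tensor.extract` folder (the function name is illustrative, not from this PR):

```mlir
func.func @fold_example(%x: f64) -> f64 {
  // Bridge ops of the kind inserted at a detensorized boundary:
  %t = tensor.from_elements %x : tensor<f64>
  %y = tensor.extract %t[] : tensor<f64>
  // After canonicalize, %y folds to %x and both bridge ops are erased.
  return %y : f64
}
```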

mlir/include/Catalyst/Transforms/Passes.h

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ std::unique_ptr<mlir::Pass> createArrayListToMemRefPass();
 std::unique_ptr<mlir::Pass> createBufferDeallocationPass();
 std::unique_ptr<mlir::Pass> createCatalystBufferizationPass();
 std::unique_ptr<mlir::Pass> createCatalystConversionPass();
+std::unique_ptr<mlir::Pass> createDetensorizeFunctionBoundaryPass();
 std::unique_ptr<mlir::Pass> createDetensorizeSCFPass();
 std::unique_ptr<mlir::Pass> createDisableAssertionPass();
 std::unique_ptr<mlir::Pass> createGEPInboundsPass();

mlir/include/Catalyst/Transforms/Passes.td

Lines changed: 10 additions & 0 deletions
@@ -17,6 +17,16 @@

 include "mlir/Pass/PassBase.td"

+def DetensorizeFunctionBoundaryPass : Pass<"detensorize-function-boundary"> {
+    let summary = "Detensorize across function boundary.";
+
+    let dependentDialects = [
+        "tensor::TensorDialect",
+    ];
+
+    let constructor = "catalyst::createDetensorizeFunctionBoundaryPass()";
+}
+
 def DetensorizeSCFPass : Pass<"detensorize-scf"> {
     let summary = "Detensorize for, if, while operations from the SCF dialect.";

mlir/lib/Catalyst/Transforms/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ file(GLOB SRC
     BufferizableOpInterfaceImpl.cpp
     catalyst_to_llvm.cpp
     DetectQNodes.cpp
+    DetensorizeFunctionBoundaryPass.cpp
     DetensorizeSCFPass.cpp
     disable_assertion.cpp
     DisableAssertionPatterns.cpp

mlir/lib/Catalyst/Transforms/DetensorizeFunctionBoundaryPass.cpp

Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
#define DEBUG_TYPE "detensorize-func-boundary"

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/IRMapping.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

#include "Catalyst/IR/CatalystDialect.h"

using namespace llvm;
using namespace mlir;
using namespace catalyst;

namespace {
bool isScalarTensor(Type type)
{
    if (auto rankedType = dyn_cast<RankedTensorType>(type)) {
        return rankedType.getRank() == 0;
    }
    return false;
}

Type getScalarOrOriginalType(Type type)
{
    if (isScalarTensor(type)) {
        return dyn_cast<RankedTensorType>(type).getElementType();
    }
    else {
        return type;
    }
}

bool hasScalarTensorSignature(func::FuncOp funcOp)
{
    for (Type type : funcOp.getFunctionType().getInputs()) {
        if (isScalarTensor(type)) {
            return true;
        }
    }
    for (Type type : funcOp.getFunctionType().getResults()) {
        if (isScalarTensor(type)) {
            return true;
        }
    }
    return false;
}

struct DetensorizeCallSitePattern : public OpRewritePattern<func::CallOp> {
    using OpRewritePattern<func::CallOp>::OpRewritePattern;

    LogicalResult matchAndRewrite(func::CallOp callOp, PatternRewriter &rewriter) const override
    {
        auto funcOp =
            SymbolTable::lookupNearestSymbolFrom<func::FuncOp>(callOp, callOp.getCalleeAttr());

        // Skip for main function
        if (!funcOp || funcOp->hasAttr("llvm.emit_c_interface")) {
            return failure();
        }

        if (!hasScalarTensorSignature(funcOp)) {
            return failure();
        }

        // Skip for QNodes
        // Some Gradient boundaries only work for Tensor signatures
        // and not scalar ones, hence we skip them here.
        if (funcOp->hasAttr("qnode")) {
            return failure();
        }

        // Create detensorized FuncOp if it does not already exist
        auto module = callOp->getParentOfType<ModuleOp>();
        std::string newFuncName = funcOp.getName().str() + ".detensorized";
        auto newFuncOp = module.lookupSymbol<func::FuncOp>(newFuncName);

        if (!newFuncOp) {
            OpBuilder::InsertionGuard guard(rewriter);
            rewriter.setInsertionPointToEnd(module.getBody());

            // Create the new function with a detensorized signature
            FunctionType funcType = funcOp.getFunctionType();
            SmallVector<Type> newArgTypes, newResultTypes;
            SmallVector<NamedAttribute> newAttrs;
            extractDetensorizedOpSignature(funcType, funcOp, newArgTypes, newResultTypes, newAttrs);

            // Create the new function, passing the collected signature
            auto newFuncType = FunctionType::get(getContext(), newArgTypes, newResultTypes);
            newFuncOp =
                rewriter.create<func::FuncOp>(funcOp.getLoc(), newFuncName, newFuncType, newAttrs);

            // Map FuncOp body and return operation
            Block *newEntryBlock = newFuncOp.addEntryBlock();
            IRMapping mapper;
            mapFuncOpBodyAndReturnOp(rewriter, newEntryBlock, funcOp, mapper);
        }

        // Rewrite the original call site to use the new detensorized function
        replaceCallOp(rewriter, callOp, newFuncOp);
        return success();
    }

    void extractDetensorizedOpSignature(FunctionType &funcType, func::FuncOp &funcOp,
                                        SmallVector<Type> &newArgTypes,
                                        SmallVector<Type> &newResultTypes,
                                        SmallVector<NamedAttribute> &newAttrs) const
    {
        for (Type type : funcType.getInputs()) {
            newArgTypes.push_back(getScalarOrOriginalType(type));
        }
        for (Type type : funcType.getResults()) {
            newResultTypes.push_back(getScalarOrOriginalType(type));
        }

        // Collect all attributes from the original function
        for (const NamedAttribute &attr : funcOp->getAttrs()) {
            if (attr.getName() == funcOp.getSymNameAttrName() ||
                attr.getName() == funcOp.getFunctionTypeAttrName()) {
                continue;
            }
            newAttrs.push_back(attr);
        }
    }

    void mapFuncOpBodyAndReturnOp(PatternRewriter &rewriter, Block *newEntryBlock,
                                  func::FuncOp &funcOp, IRMapping &mapper) const
    {
        rewriter.setInsertionPointToStart(newEntryBlock);
        for (const auto &it : llvm::enumerate(funcOp.getArguments())) {
            Value oldArg = it.value();
            Value newArg = newEntryBlock->getArgument(it.index());

            if (isScalarTensor(oldArg.getType())) {
                // Insert a FromElementsOp if the old argument is a scalar tensor
                auto fromElementsOp = rewriter.create<tensor::FromElementsOp>(
                    newArg.getLoc(), oldArg.getType(), newArg);
                mapper.map(oldArg, fromElementsOp.getResult());
            }
            else {
                mapper.map(oldArg, newArg);
            }
        }

        // Clone the operations from the body of old function (excluding the old return)
        rewriter.setInsertionPointToEnd(newEntryBlock);
        for (Operation &op : funcOp.front().without_terminator()) {
            rewriter.clone(op, mapper);
        }

        // Create a new return operation with the mapped results
        auto oldReturnOp = cast<func::ReturnOp>(funcOp.front().getTerminator());
        SmallVector<Value> newReturnOperands;
        newReturnOperands.reserve(oldReturnOp.getNumOperands());
        for (Value operand : oldReturnOp.getOperands()) {
            Value newOperand = mapper.lookup(operand);
            if (isScalarTensor(newOperand.getType())) {
                // Insert ExtractOp if the operand is a scalar tensor
                auto extractOp = rewriter.create<tensor::ExtractOp>(oldReturnOp.getLoc(),
                                                                    newOperand, ValueRange{});
                newReturnOperands.push_back(extractOp.getResult());
            }
            else {
                newReturnOperands.push_back(newOperand);
            }
        }
        rewriter.create<func::ReturnOp>(oldReturnOp.getLoc(), newReturnOperands);
    }

    void replaceCallOp(PatternRewriter &rewriter, func::CallOp &callOp,
                       func::FuncOp &newFuncOp) const
    {
        rewriter.setInsertionPoint(callOp);
        SmallVector<Value> newOperands;
        for (Value operand : callOp.getOperands()) {
            // Insert ExtractOp if the old operand is a scalar tensor to bridge the detensorized
            // function
            if (isScalarTensor(operand.getType())) {
                auto extractOp =
                    rewriter.create<tensor::ExtractOp>(callOp.getLoc(), operand, ValueRange{});
                newOperands.push_back(extractOp.getResult());
            }
            else {
                newOperands.push_back(operand);
            }
        }

        auto newCallOp = rewriter.create<func::CallOp>(callOp.getLoc(), newFuncOp, newOperands);

        SmallVector<Value> newResults;
        for (size_t i = 0; i < callOp.getNumResults(); ++i) {
            Value oldResult = callOp.getResult(i);
            Value newResult = newCallOp.getResult(i);
            if (isScalarTensor(oldResult.getType())) {
                // Insert a FromElementsOp if the old result is a scalar tensor to bridge the
                // detensorized function
                auto fromElementsOp = rewriter.create<tensor::FromElementsOp>(
                    callOp.getLoc(), oldResult.getType(), newResult);
                newResults.push_back(fromElementsOp.getResult());
            }
            else {
                newResults.push_back(newResult);
            }
        }

        rewriter.replaceOp(callOp, newResults);
    }
};
} // namespace

namespace catalyst {
#define GEN_PASS_DEF_DETENSORIZEFUNCTIONBOUNDARYPASS
#include "Catalyst/Transforms/Passes.h.inc"

struct DetensorizeFunctionBoundaryPass
    : public impl::DetensorizeFunctionBoundaryPassBase<DetensorizeFunctionBoundaryPass> {
    using impl::DetensorizeFunctionBoundaryPassBase<
        DetensorizeFunctionBoundaryPass>::DetensorizeFunctionBoundaryPassBase;
    void runOnOperation() override
    {
        MLIRContext *context = &getContext();
        RewritePatternSet patterns(context);

        patterns.add<DetensorizeCallSitePattern>(context);

        GreedyRewriteConfig config;
        if (failed(applyPatternsGreedily(getOperation(), std::move(patterns), config))) {
            signalPassFailure();
        }
    }
};

std::unique_ptr<Pass> createDetensorizeFunctionBoundaryPass()
{
    return std::make_unique<DetensorizeFunctionBoundaryPass>();
}

} // namespace catalyst
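No lit test for the new pass appears in this diff. As a rough sketch only, a FileCheck-style case exercising the QNode skip rule implemented above could look like the following; the `quantum-opt` RUN line, file placement, and CHECK lines are assumptions, not part of this commit.

```mlir
// RUN: quantum-opt --detensorize-function-boundary %s | FileCheck %s

// A callee carrying the qnode attribute is skipped: no ".detensorized" clone is
// created and the call keeps its scalar-tensor signature.
func.func private @circuit(%arg0: tensor<f64>) -> tensor<f64> attributes {qnode} {
  return %arg0 : tensor<f64>
}

// CHECK-LABEL: func.func @caller(
func.func @caller(%arg0: tensor<f64>) -> tensor<f64> {
  // CHECK: call @circuit(%{{.*}}) : (tensor<f64>) -> tensor<f64>
  %0 = call @circuit(%arg0) : (tensor<f64>) -> tensor<f64>
  return %0 : tensor<f64>
}
// CHECK-NOT: @circuit.detensorized
```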

mlir/lib/Catalyst/Transforms/DetensorizeSCFPass.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-#define DEBUG_TYPE "myhelloworld"
+#define DEBUG_TYPE "detensorize-scf"

 #include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"

mlir/lib/Catalyst/Transforms/RegisterAllPasses.cpp

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ void catalyst::registerAllCatalystPasses()
     mlir::registerPass(catalyst::createDecomposeNonCliffordPPRPass);
     mlir::registerPass(catalyst::createDecomposeCliffordPPRPass);
     mlir::registerPass(catalyst::createCountPPMSpecsPass);
+    mlir::registerPass(catalyst::createDetensorizeFunctionBoundaryPass);
     mlir::registerPass(catalyst::createDetensorizeSCFPass);
     mlir::registerPass(catalyst::createDisableAssertionPass);
     mlir::registerPass(catalyst::createDisentangleCNOTPass);
