
@lukamac
Contributor

@lukamac lukamac commented Jul 8, 2025

The main contribution of this PR is abstracting the DMA logic into the AsyncDma class and showing that it works with the currently supported DMAs: PULP's mchan and L3 DMA, and Snitch's cluster DMA.
The main goal was to enable any-dimensional transfers, which is implemented as the AnydimAsyncDmaTransferAdapter.

Performance comparison

Comparison of network execution on Siracusa emulated in GVSoC
| Network | Conf | L1 size | Cycles before | Cycles after | Diff | Perc. diff |
|---|---|---|---|---|---|---|
| Simple regression | SB L2 | 64k | 636738 | 637758 | 1020 | 0.2% |
| Simple regression | SB L3 | 64k | 1007791 | 989804 | -17987 | -1.8% |
| Simple regression | DB L2 | 64k | 648051 | 643213 | -4838 | -0.7% |
| Simple regression | DB L3 | 64k | 954903 | 988656 | 33753 | 3.5% |
| MobileNetV2 | SB L3 | 64k | Not runnable | 198539808 | n/a | n/a |
| MobileNetV2 | DB L3 | 64k | Not runnable | 194710802 | n/a | n/a |
| miniMobileNetV2 | SB L2 | 16k | 120580 | 110003 | -10577 | -8.8% |
| miniMobileNetV2 | SB L3 | 16k | 392763 | 361312 | -31451 | -8.0% |
| miniMobileNetV2 | DB L2 | 16k | 122368 | 110997 | -11371 | -9.3% |
| miniMobileNetV2 | DB L3 | 16k | 362411 | 361283 | -1128 | -0.3% |
| microLlama/microLlama1 | SB L2 | 10k | 657015 | 519142 | -137873 | -21.0% |
| microLlama/microLlama1 | SB L3 | 10k | 4461938 | 3977350 | -484588 | -10.9% |
| microLlama/microLlama1 | DB L2 | 10k | 707464 | 542643 | -164821 | -23.3% |
| microLlama/microLlama1 | DB L3 | 10k | 3913082 | 4047015 | 133933 | 3.4% |
| CCT/CCT_1_16_16_8 | SB L2 | 64k | 493469 | 485747 | -7722 | -1.6% |
| CCT/CCT_1_16_16_8 | SB L3 | 64k | 1255862 | 1219939 | -35923 | -2.9% |
| CCT/CCT_1_16_16_8 | DB L2 | 64k | 513180 | 504340 | -8840 | -1.7% |
| CCT/CCT_1_16_16_8 | DB L3 | 64k | 1227566 | 1213998 | -13568 | -1.1% |

I also wanted to test Snitch, but the program only prints 0 cycles. From visually checking the code, there is no fundamental change except that we now emit fewer barriers.

Added

  • AsyncDma abstraction of DMAs
  • test runner per DMA and a script that tests all the DMAs
  • generic Single/DoubleBufferingTilingCodeGeneration classes
  • TilingVariableReplacementUpdate class that updates the variable replacement refs
  • TilingHoistingMixIn class that encapsulates all the hoisting helper functions of tiling
  • sorting of input memory allocations to allow references that live in the same memory level as the memory they are referencing
  • a function that tests the tiling solution for correctness which currently only tests buffer allocation for byte alignment
  • IntrospectiveCodeTransformation: _indexPointer(), indexVars(), dereferenceVars(). The *Vars functions index/dereference a list of variables (useful for tiling)
  • NetworkContext: unravelReference() that unravels a _ReferenceBuffer until the base buffer
  • NetworkContext: is_object() - helper function that determines whether the string represents a name of a local or global object
  • NetworkContext: is_buffer() - helper function that determines whether the string represents a name of a buffer
  • missing checks for environment variables
  • _permuteHyperRectangle helper function

Changed

  • mchan HAL is now reduced to bare-bones
  • refactor of the IntrospectiveCodeTransformation to work on the Mako template, which imo makes it clearer
  • refactor of memory allocation code transformation passes
  • _ReferenceBuffer accepts an optional offset argument to offset the reference
  • NetworkContext: hoistReference - accepts the actual buffer as reference instead of name, accepts shape, offset, and override_type arguments, and returns the actual buffer, not its name
  • _mangleNodeRep -> _mangleOpRepr - the canonical way we use is OperatorRepresentation. NodeRep and ParseDict are old iterations of that.
  • rename of permutation functions to follow this convention: permute is an action that permutes something, permutation is a function that generates a permutation
  • _permuteList to just _permute
  • removed manual buffer name mangling since we do it in the ExecutionBlock generate() function, simplifies templates
  • we now check that buffer shapes/hyperrectangles/tiling ranks match which required changing a few serializeTilingSolution functions to preserve the same shape rank
  • big refactor of the code generation part of the TilingExtension
  • port of PULPOpen tiling code generation
  • port of Snitch tiling code generation
  • PULPClusterTilingSB and PULPClusterTilingDB now allow for transfers of any rank (dimensionality)
  • PULP's final output diff is now calculated as absolute error, instead of just subtraction
  • common code generation code between testMVP/generateNetwork/... was extracted into a single generateTestNetwork function
  • in some functions, instead of passing the name of a buffer, the actual buffer is just passed
  • tile function allows overriding the optimizer with external tilingSolution and memoryMap
  • refactor of the permutation functions for clarity
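
The naming convention in this list (permute is an action, permutation generates a permutation) can be illustrated with a minimal sketch. The names `permute` and `inverse_permutation` are hypothetical stand-ins chosen for this example; they are not claimed to match the signatures of `_permute` or the other helpers in Deeploy.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def permute(values: Sequence[T], perm: Sequence[int]) -> List[T]:
    """Action: reorder `values` so that output[i] == values[perm[i]]."""
    assert sorted(perm) == list(range(len(values))), "perm must be a permutation of the indices"
    return [values[p] for p in perm]


def inverse_permutation(perm: Sequence[int]) -> List[int]:
    """Generator: build the permutation that undoes `perm`."""
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return inverse


shape = [1, 16, 32, 8]            # an example NCHW shape (assumed values)
nchw_to_nhwc = [0, 2, 3, 1]       # permutation that moves channels last
nhwc_shape = permute(shape, nchw_to_nhwc)
restored = permute(nhwc_shape, inverse_permutation(nchw_to_nhwc))
```

Keeping the two roles separate means a call site reads as "apply this permutation to that sequence", which is the distinction the rename is after.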

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@Xeratec Xeratec added the Feature Addition of new features label Jul 8, 2025
@Xeratec Xeratec added this to the Release 0.2.1 milestone Jul 8, 2025
@Xeratec Xeratec added this to Deeploy Jul 8, 2025
@Xeratec Xeratec moved this to In review in Deeploy Jul 8, 2025
@Xeratec Xeratec moved this from In review to In progress in Deeploy Jul 30, 2025
@lukamac lukamac marked this pull request as draft July 31, 2025 11:18
@lukamac lukamac changed the title from "PULPOpen: Allow for any-dimensional transfers between L1 and L2" to "Refactor tiling code generation" Jul 31, 2025
@lukamac lukamac force-pushed the rewrite-pulp-dma branch 5 times, most recently from 32d9155 to e8869ed August 5, 2025 16:43
@lukamac lukamac force-pushed the rewrite-pulp-dma branch 2 times, most recently from 0f96aea to 2147f44 August 8, 2025 08:13
@lukamac lukamac marked this pull request as ready for review August 8, 2025 11:29
@Xeratec Xeratec moved this from In progress to In review in Deeploy Aug 11, 2025
@Xeratec
Member

Xeratec commented Aug 23, 2025

@coderabbitai summary

@coderabbitai
Contributor

coderabbitai bot commented Aug 23, 2025

✅ Actions performed

Summary regeneration triggered.

@coderabbitai
Contributor

coderabbitai bot commented Aug 23, 2025

Summary by CodeRabbit

  • New Features
    • Async DMA support across targets (L3, MCHAN, Snitch) with single/double-buffer tiling.
    • Enhanced tiling/memory pipeline, improved IO discovery, and minimal type helpers.
  • Refactor
    • Major DMA-aware tiling/codegen overhaul; new single/double-buffer generators.
    • Updated permutation utilities and template/introspection flow.
    • Removed legacy DMA headers/paths; unified offset/rectangle utilities.
  • Tests
    • New DMA test runners and comprehensive matrix harness.
    • Unified test network generator; revamped type-mapping.
    • Stricter env checks and absolute diff reporting.
  • Chores
    • Added CI job to run DMA tests.
    • Updated platform includes and build guards.

Walkthrough

Adds DMA-aware tiling and codegen infrastructure (AsyncDma, single/double buffering), integrates DMA in PULP/Snitch pipelines, refactors tiling/memory mapping APIs, revises permutation/minimization utilities, updates reference/aliasing semantics, adjusts multiple templates to avoid name mangling, replaces/updates PULP DMA headers, introduces new DMA test runners and CI job, and updates tests for new type-mapping API.

Changes

Cohort / File(s) Summary
CI
.github/workflows/CI.yml
Adds job deeploy-test-dmas to install package and run DeeployTest/testDmas.py on dynamic runner/image.
Closure generation
Deeploy/CommonExtensions/CodeTransformationPasses/Closure.py
Tweaks dynamic reference extraction (named arg unrollStructs=True) and removes dedup loops during closure struct gen.
Introspective transformation refactor
Deeploy/CommonExtensions/CodeTransformationPasses/IntrospectiveCodeTransformation.py
Switches NodeTemplate→Template flow; compiles source via codegen; adds pointer index/deref helpers; revises dynamic-expr extraction signatures and typing.
Memory allocation pass refactor
Deeploy/CommonExtensions/CodeTransformationPasses/MemoryAllocation.py
Reworks to memory-level buffers, renames ctor arg, adds static classifiers/topo sort, simplifies passthrough; updates imports/types.
Data types helpers
Deeploy/CommonExtensions/DataTypes.py
Adds minimalIntegerType and minimalFloatType; typing imports updated.
Permutation utilities and usages
Deeploy/CommonExtensions/OptimizationPasses/TopologyOptimizationPasses/LoweringOptimizationPasses.py
Introduces generic _permute, _permuteHyperRectangle, renames permutation helpers, tightens typings; updates call sites.
Types and references
Deeploy/DeeployTypes.py
Overhauls _ReferenceBuffer (offsets), alias resolution, reference hoisting API, object/buffer checks, IO discovery, and op-repr mangling rename.
Engine coloring message
Deeploy/EngineExtension/NetworkDeployers/EngineColoringDeployer.py
Error now lists uncolored node names and operations.
Name mangling removals (targets)
Deeploy/Targets/CortexM/Templates/CMSISUtils.py, Deeploy/Targets/Generic/Templates/DebugPrintTemplate.py, .../ITAMaxTemplate.py, Deeploy/Targets/MemPool/Templates/ITAMaxTemplate.py, .../ITATemplate.py, .../GemmTemplate.py, .../RQGemmTemplate.py, .../RQMatMulTemplate.py
Replace ctxt._mangle(...) with raw names or capture returned transient buffer names where applicable.
Generic tiling constraints updates
Deeploy/Targets/Generic/TileConstraints/TransposeTileConstraint.py, .../iRMSNormTileConstraint.py
Use _permuteHyperRectangle; simplify schedule serialization; construct weight cube directly.
Neureka tile constraints offset API
Deeploy/Targets/Neureka/TileConstraints/Neureka{Dense,Depthwise,Pointwise}Constraint.py
Switch to calculateFlatOffsetInBytes for weight offsets.
PULP bindings and platform
Deeploy/Targets/PULPOpen/Bindings.py, .../Platform.py
Integrates L3/MCHAN DMA, adds variable-replacement update pass, reorders pipeline, adds L3 memory generation; replaces dory_dma.h include with mchan_siracusa.h.
PULP cluster/L3 tiling refactor
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py, .../PULPL3Tiling.py
New constructors (externalMemory, localMemory, dma); adopt new SB/DB base classes; inline class variants; sequential SB/DB in Snitch analog.
Removed legacy PULP tiling modules
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTilingSB.py, .../PULPClusterTilingDB.py, .../PULPL3TilingSB.py, .../PULPL3TilingDB.py
Delete old SB/DB tiling codegen modules.
PULP DMA implementations (new)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py, .../MchanDma.py
Add AsyncDma drivers with futures, templates, checks, and blocking adapter for L3; MCHAN 1D/2D commands and waiting strategy.
PULP auto-transpose DMA
Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py
Rework to _permuteHyperRectangle, minimizeRectangle; adjust stride/shape handling and return values.
Snitch bindings and tiling
Deeploy/Targets/Snitch/Bindings.py, .../CodeTransformationPasses/SnitchClusterTiling.py, .../DMA/SnitchDma.py, .../CodeTransformationPasses/SnitchClusterTilingSB.py
Add Snitch DMA; refactor tiling to SB/DB classes with (externalMemory, localMemory, dma); remove old SB module.
Tiling core: DMA-based codegen (new)
Deeploy/TilingExtension/AsyncDma.py, .../CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py, .../DoubleBufferingTilingCodeGeneration.py
Introduce AsyncDma/Future/primitives, blocking/anydim adapters, and SB/DB tiling codegen passes using DMA futures.
Tiling codegen refactor
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
Rework to DMA-centered transfers, new init signature, helpers, multi-schedule support, and meta info handling.
Tiling hoisting mixin (new)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py
Add dictOfArrays, hoisting/prefix utilities, multi-buffer reference hoisting, tile count/idx hoisting.
Tiling variable replacement refactor
Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py
Arena-based allocations, updated apply flow, adds TilingVariableReplacementUpdate.
Tiling prototypes meta changes
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
TilingMetaInfo fields updated (numTiles→str, add totalNumTiles, tileIdxPtr); adjust measurement arrays and loops.
Tiling utilities API changes
Deeploy/TilingExtension/TilingCodegen.py
Replace minimization/offset APIs; add pad/stride/flat offset helpers; new computeTileHyperRectangles.
Tiler extension & scheduler
Deeploy/TilingExtension/TilerExtension.py, .../MemoryScheduler.py, .../TileConstraint.py, .../MemoryConstraints.py
Separate memory map from tiling solution; add validation/annotation; use _permute; switch to computeTileHyperRectangles; broaden shape typing.
Tests: new DMA matrix and runners
DeeployTest/testDmas.py, .../testRunner_siracusa_mchandma.py, .../testRunner_siracusa_l3dma.py, .../testRunner_snitch_dma.py
Add runners for MCHAN/L3/Snitch DMA with pipelines; testDmas.py iterates configurations and launches runners.
Tests: type-mapping API change
DeeployTest/testUtils/typeMapping.py, usages in DeeployTest/*
Replace inferInputType with inferTypeAndOffset; add helpers (minimal type, dtype mapping). Update callers across tests.
Tests: code generation API
DeeployTest/testUtils/codeGenerate.py, usages in DeeployTest/*
Consolidate generation into generateTestNetwork; update headers/impl generation signatures and verbosity handling.
Tests: tiling utils
DeeployTest/testUtils/tilingUtils.py
Add DBOnlyL3Tiler, DBTiler, SBTiler with multiBufferStrategy.
Tests: platform mapping typing
DeeployTest/testUtils/platformMapping.py
Strengthen inputTypes to Dict[str, Type[Pointer]].
Tests: minor updates
DeeployTest/Platforms/Siracusa/src/deeploytest.c, .../testUtils/testRunner.py, .../testMVP.py, .../generateNetwork.py, .../testSlice_PULP.py, .../testSchedulingExtension.py, .../testPrintInputOutputTransformation.py, .../deeployStateEqualityTest.py, .../testTilerExtension.py
Adjust diff calc to absolute; assert LLVM env; switch to new type-mapping and generation APIs; update flows and logs.
Target libraries (PULP)
TargetLibraries/PULPOpen/inc/mchan_siracusa.h (add), .../inc/mchan_v6.h (add), .../inc/mchan_v7.h (add), .../inc/dory_dma.h (del), .../inc/mchan.h (del), .../src/dory_dma.c (del)
Replace DORY DMA with MCHAN v6/v7 headers and Siracusa config; remove dory DMA sources/headers.
Build guards
TargetLibraries/PULPOpen/CMakeLists.txt, cmake/snitch/snitch.cmake
Fix typo; add fatal guard for SNITCH_HOME env.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Deployer
  participant TilingPass as TilingCodeGeneration (SB/DB)
  participant Adapter as AnydimAsyncDmaTransferAdapter
  participant DMA as AsyncDma
  participant Future

  User->>Deployer: apply()
  Deployer->>TilingPass: generateTilingLoop(tilingSchedules)
  TilingPass->>Adapter: transfer(external, local, shape, strides, dir, future)
  alt kernelRank == transferRank
    Adapter->>DMA: transfer(..., future)
  else kernelRank < transferRank
    Adapter->>Adapter: emit nested loops + offset ptrs
    Adapter->>DMA: transfer(inner-shape,..., future)
  else kernelRank > transferRank
    Adapter->>DMA: transfer(padded-shape,..., future)
  end
  DMA->>Future: get/init/wait/deinit (via waiting strategy)
  TilingPass-->>Deployer: ExecutionBlock (tile loop + DMA calls)
sequenceDiagram
  autonumber
  participant Pipeline as PULPOpen Bindings Pipeline
  participant TVR as TilingVariableReplacement
  participant TVRUpd as TilingVariableReplacementUpdate
  participant Cluster as PULPClusterTiling("L2","L1", MchanDma)
  participant L3 as PULPL3Tiling("L3","L2", l3DmaHack)
  participant MemGen as MemoryManagementGeneration

  Pipeline->>TVR: apply(arena alloc, tiled refs)
  TVR->>TVRUpd: apply(update refs per tile)
  TVRUpd->>Cluster: apply(SB/DB tiling with DMA)
  Cluster->>L3: apply(L3 tiling with DMA)
  L3->>MemGen: manage L2/L3/default allocations

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

Suggested reviewers

  • Victor-Jung
  • Xeratec


Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title “Refactor tiling code generation” succinctly captures the primary focus of the changeset, highlighting the core architectural overhaul of the TilingExtension and related DMA abstractions without extraneous detail, and is clear and specific enough for a reviewer to understand the main intent at a glance.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Description Check ✅ Passed The pull request description clearly outlines the core contributions—such as the AsyncDma abstraction, Any-dimensional DMA adapter, and refactored tiling code generation—corresponding directly to the detailed changes presented in the summary, and it provides relevant performance data and lists of added and modified components that match the files changed in the PR.

@Xeratec
Member

Xeratec commented Aug 25, 2025

@coderabbitai Provide further details about the modification to the tiling extension (All the files under Deeploy/TilingExtension/)

@coderabbitai
Contributor

coderabbitai bot commented Aug 25, 2025


@Xeratec I'll provide a comprehensive analysis of the tiling extension modifications. This represents a major architectural refactor of the entire tiling system.

Major Architectural Changes

1. Async DMA Framework Introduction (AsyncDma.py)

  • New Module: Introduces a complete asynchronous DMA transfer framework
  • Key Components:
    • AsyncDma base class with transfer validation and rank-based templates
    • Future abstraction for async operations with init/deinit/wait lifecycle
    • Waiting strategies: PerTensorWaitingStrategy and TensorGroupWaitingStrategy
    • AnydimAsyncDmaTransferAdapter for arbitrary dimensional transfers using nested loops
    • BlockingDmaFromAsyncDmaAdapter to bridge async/blocking semantics
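
The "arbitrary dimensional transfers using nested loops" idea can be sketched as follows. This is only a model under stated assumptions: `decompose_transfer` is a hypothetical function, offsets are counted in elements, and the real AnydimAsyncDmaTransferAdapter emits code rather than returning a list. The sketch covers two cases: a DMA kernel of lower rank than the transfer (loop over the outer dimensions) and a kernel of equal or higher rank (pad the shape with leading 1s).

```python
from itertools import product
from typing import List, Sequence, Tuple


def decompose_transfer(
    shape: Sequence[int],
    ext_strides: Sequence[int],
    loc_strides: Sequence[int],
    kernel_rank: int,
) -> List[Tuple[int, int, Tuple[int, ...]]]:
    """Split a rank-N transfer into transfers of rank `kernel_rank`.

    Returns one (external_offset, local_offset, inner_shape) entry per
    kernel-level transfer. Offsets are in elements.
    """
    assert len(shape) == len(ext_strides) == len(loc_strides)
    assert kernel_rank >= 1
    rank = len(shape)
    if kernel_rank >= rank:
        # The kernel covers the whole transfer; pad with leading 1s.
        padded = (1,) * (kernel_rank - rank) + tuple(shape)
        return [(0, 0, padded)]

    outer = rank - kernel_rank
    inner_shape = tuple(shape[outer:])
    transfers = []
    # One kernel transfer per index combination of the outer dimensions.
    for idx in product(*(range(d) for d in shape[:outer])):
        ext_off = sum(i * s for i, s in zip(idx, ext_strides))
        loc_off = sum(i * s for i, s in zip(idx, loc_strides))
        transfers.append((ext_off, loc_off, inner_shape))
    return transfers


# A (2, 3, 4) tile cut out of a (2, 8, 16) tensor in external memory,
# stored densely in local memory, moved with a 2-D DMA kernel.
plan = decompose_transfer(
    shape=(2, 3, 4),
    ext_strides=(128, 16, 1),
    loc_strides=(12, 4, 1),
    kernel_rank=2,
)
```

Here `plan` holds two 2-D transfers of shape (3, 4), the second one offset by one outer stride on each side.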

2. Tiling Code Generation Restructure

The core tiling generation has been completely rewritten with three new classes:

Base Class Refactor (TilingCodeGeneration.py)

  • From: Static, single-memory path tiling
  • To: DMA-driven, tile-aware, hoisted codegen with robust memory constraints
  • Key Changes:
    • Now inherits from TilingHoistingMixIn for memory management
    • Constructor signature: __init__(externalMemory, localMemory, dma, bufferCount)
    • generateTilingLoop now accepts List[TilingSchedule] instead of single schedule
    • Uses TensorMemoryConstraint instead of NodeMemoryConstraint
    • Introduces transfer legalization and arbitrary-dimensional support

Single Buffering (SingleBufferingTilingCodeGeneration.py)

  • New Class: Extends TilingCodeGeneration for single-buffer tiling
  • Features:
    • DMA-based transfer scheduling between external/local memories
    • Future-based synchronization for ingress/egress paths
    • Per-tile code generation with dynamic operator representation updates
    • Integrated shape validation and transfer legalization

Double Buffering (DoubleBufferingTilingCodeGeneration.py)

  • New Class: Implements sophisticated double-buffering strategy
  • Features:
    • Two-buffer scheme with modular arithmetic-based selection (TILING_I % 2)
    • Simultaneous input/output transfers with async DMA coordination
    • Buffer-chaining logic with future synchronization
    • Integrated setup/teardown with DMA lifecycle management
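
The two-buffer scheme with modulo-based selection can be modeled as an event schedule. This is a simplified sketch that assumes only the input (ingress) path and uses a hypothetical `double_buffer_schedule` function; the actual pass also handles output transfers and future lifecycle management.

```python
from typing import List


def double_buffer_schedule(num_tiles: int) -> List[str]:
    """Return the ordered events of a double-buffered tile loop (inputs only)."""
    events: List[str] = []
    if num_tiles == 0:
        return events
    # Prologue: prefetch the first tile before entering the loop.
    events.append("load tile 0 -> buf 0")
    for i in range(num_tiles):
        buf = i % 2  # buffer that holds the current tile
        events.append(f"wait tile {i}")
        if i + 1 < num_tiles:
            # Start fetching the next tile into the other buffer so the
            # transfer overlaps with the computation below.
            events.append(f"load tile {i + 1} -> buf {(i + 1) % 2}")
        events.append(f"compute tile {i} on buf {buf}")
    return events


schedule = double_buffer_schedule(3)
```

The load for tile `i + 1` is issued before the compute on tile `i`, which is where the overlap of data movement and computation comes from.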

3. Memory Management Revolution

Hoisting Infrastructure (TilingHoistingMixIn.py)

  • New Mixin: Provides comprehensive memory-hoisting utilities
  • Key Features:
    • _hoistValues(): Integer sequences as ConstantBuffers with minimal type selection
    • _hoistReference(): VariableBuffer references with shape/offset support
    • _hoistTileNumAndIdxPtr(): Tile counting and indexing infrastructure
    • _hoistOpReprUpdates(): Dynamic operator representation management
    • _hoistMultibufferReferences(): Double-buffering setup with validation
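
"Minimal type selection" for hoisted integer sequences can be sketched as picking the narrowest integer type whose range covers all values. The type table and the function name `minimal_integer_type` are assumptions made for illustration; Deeploy's `minimalIntegerType` may order or choose types differently.

```python
from typing import Iterable, Tuple

# (name, min, max), ordered from narrowest to widest. Hypothetical table.
_INT_TYPES: Tuple[Tuple[str, int, int], ...] = (
    ("uint8_t", 0, 2**8 - 1),
    ("int8_t", -(2**7), 2**7 - 1),
    ("uint16_t", 0, 2**16 - 1),
    ("int16_t", -(2**15), 2**15 - 1),
    ("uint32_t", 0, 2**32 - 1),
    ("int32_t", -(2**31), 2**31 - 1),
)


def minimal_integer_type(values: Iterable[int]) -> str:
    """Return the first (narrowest) type whose range contains every value."""
    values = list(values)
    if not values:
        raise ValueError("cannot pick a type for an empty sequence")
    lo, hi = min(values), max(values)
    for name, tmin, tmax in _INT_TYPES:
        if tmin <= lo and hi <= tmax:
            return name
    raise ValueError(f"no integer type fits the range [{lo}, {hi}]")
```

For tiling this matters because per-tile tables (sizes, offsets) are stored as constant buffers, so a narrower element type directly reduces their footprint.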

Arena-Based Memory (TilingVariableReplacement.py)

  • Deep Refactor: From pointer-level dereference to arena-based allocations
  • New Features:
    • Arena-based allocation via _arenaAllocate()
    • Explicit variable tiling with TilingVariableReplacementUpdate class
    • Modular handling of transients, variable replacements, and tiled tensors
    • Template-based update generation for post-tiling variable updates

4. Profiling and Instrumentation Overhaul (TilingPrototypes.py)

  • Unified Profiling: Converts per-event timing to unified measurement arrays
  • Key Changes:
    • TilingMetaInfo gains totalNumTiles: int and tileIdxPtr: str
    • numTiles changes from int to str for macro-driven access
    • Measurement storage becomes parameter-driven with single arrays
    • Loop indexing becomes macro-driven with string-typed tile counts

5. Shape and Stride Management (TilingCodegen.py)

  • Rectangle Minimization API: minimizeRectangleDims() → minimizeRectangle()
    • From buffer-centric to shape-oriented approach
    • Returns (HyperRectangle, Tuple[int, ...]) instead of two HyperRectangles
  • New Utilities:
    • padShape(), padOffset(), padStride() for rank normalization
    • stridesFromShape(), calculateFlatOffset() for memory layout calculations
    • computeTileHyperRectangles() for tile generation with validation
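
The rank-normalization and layout helpers named above can be approximated with a few lines of Python. The functions below are hypothetical re-implementations assuming a dense row-major layout with element (not byte) units; they are not the code in TilingCodegen.py.

```python
from typing import Sequence, Tuple


def strides_from_shape(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major (C-order) strides in elements."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def flat_offset(offset: Sequence[int], shape: Sequence[int]) -> int:
    """Flat element offset of a multi-dimensional offset inside `shape`."""
    assert len(offset) == len(shape)
    return sum(o * s for o, s in zip(offset, strides_from_shape(shape)))


def pad_shape(shape: Sequence[int], rank: int) -> Tuple[int, ...]:
    """Prepend 1s so that the shape has exactly `rank` dimensions."""
    assert rank >= len(shape)
    return (1,) * (rank - len(shape)) + tuple(shape)
```

For example, a tile starting at index (1, 2, 3) of a (2, 8, 16) tensor begins 1*128 + 2*16 + 3 elements into the buffer.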

6. Dual-Path Tiling Workflow (TilerExtension.py)

  • Manual Override Support: New MemoryMap type alias and dual-path workflow
  • API Changes:
    • computeTilingSchedule() returns only TilingSolution
    • New computeMemoryMap(), annotateMemoryLevel(), testTilingSolutionCorrectness()
    • tile() method supports manual tilingSolution and memoryMap override
  • Validation: Built-in correctness testing for buffer alignment and memory constraints
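
A buffer-alignment check of the kind mentioned under Validation might look like the sketch below. The function names, the dictionary-of-offsets input, and the 4-byte default are all assumptions for illustration; the PR description only states that the correctness test currently checks buffer allocation for byte alignment.

```python
from typing import Dict, List


def find_misaligned(buffer_offsets: Dict[str, int], alignment: int = 4) -> List[str]:
    """Return the names of buffers whose byte offset is not a multiple of `alignment`."""
    return sorted(name for name, off in buffer_offsets.items() if off % alignment != 0)


def align_up(offset: int, alignment: int = 4) -> int:
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


memory_map = {"input_tile": 0, "weights_tile": 6, "output_tile": 8}  # example offsets
bad = find_misaligned(memory_map)
```

Reporting the offending buffer names, rather than a bare pass/fail, makes it easier to trace a bad manual memoryMap override back to its source.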

7. Enhanced Type Safety and Constraints

  • Memory Constraints (MemoryConstraints.py): Shape annotation updated to Tuple[int, ...] for multi-dimensional support
  • Tile Constraints (TileConstraint.py): Updated to use computeTileHyperRectangles() and enhanced memory path validation
  • Memory Scheduler (MemoryScheduler.py): Migrated to new _permute() function for type-safe permutations

Impact and Benefits

Performance

  • Async DMA enables overlapped computation and data movement
  • Double buffering maximizes memory bandwidth utilization
  • Optimized transfer scheduling reduces idle time

Flexibility

  • Support for arbitrary dimensional transfers (1D, 2D, 3D, N-D)
  • Pluggable DMA backends with different capabilities
  • Manual tiling override for custom optimization strategies

Maintainability

  • Modular architecture with clear separation of concerns
  • Type-safe interfaces with comprehensive validation
  • Unified profiling and instrumentation framework

Scalability

  • Arena-based memory management for efficient allocation
  • Hoisting infrastructure for optimal memory usage
  • Support for complex multi-level memory hierarchies

This represents a complete modernization of the tiling system, moving from a basic static approach to a sophisticated, async, multi-buffered, and highly configurable framework suitable for complex heterogeneous computing scenarios.

Member

@Xeratec Xeratec left a comment


@lukamac, I reviewed all your changes except for the refactoring of the TilingExtension. There are a lot of changes, and I am not super familiar with the previous code. It is clear that the newer code is nicer and better to understand, but I am struggling with the changes in TilingCodeGeneration.py and TilingVariableReplacement.py.

I would prefer an offline walkthrough through the remaining changes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
Deeploy/CommonExtensions/DataTypes.py (1)

111-116: Fix type annotation mismatch: return a tuple, not a list.

sorted(...) returns a list; annotation declares a Tuple. Wrap with tuple(...).

-IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = (sorted((
-    *SignedIntegerDataTypes,
-    *UnsignedIntegerDataTypes,
-),
-                                                               key = lambda _type: _type.typeWidth))
+IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = tuple(sorted(
+    (*SignedIntegerDataTypes, *UnsignedIntegerDataTypes),
+    key = lambda _type: _type.typeWidth
+))
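
The mismatch is easy to reproduce in isolation. The width values below are placeholders standing in for the real type classes:

```python
from typing import Tuple

signed_widths = (32, 8, 16)      # placeholder for SignedIntegerDataTypes
unsigned_widths = (8, 32, 16)    # placeholder for UnsignedIntegerDataTypes

# sorted() always returns a list, regardless of the input container.
as_list = sorted((*signed_widths, *unsigned_widths))

# Wrapping in tuple() makes the value agree with a Tuple[...] annotation.
as_tuple: Tuple[int, ...] = tuple(sorted((*signed_widths, *unsigned_widths)))
```

At runtime the annotation is not enforced, so the original code works, but a static type checker would flag it and callers relying on tuple behavior (for example hashing) would be surprised.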
DeeployTest/testUtils/codeGenerate.py (1)

37-50: Remove buffer shape mutation and flatten outputs in _shapeBroadcast
In _shapeBroadcast, drop ctxt.lookup(f"{name}").shape = broadcastNum.shape and change the else branch to broadcastNum = value.reshape(-1) so that both global and local paths emit 1-D data without mutating the buffer’s stored shape. Retain ctxt.is_global for the global check.

Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)

177-178: Python slicing bug with -0 yields empty tuple.

When len(fixedPermStrides) == 0, rect.dims[:-0] is empty. Return full dims in that case.

-    return struct, remainderStrides, rect.dims[:-len(fixedPermStrides)]
+    drop = len(fixedPermStrides)
+    return struct, remainderStrides, rect.dims[:-drop] if drop > 0 else rect.dims
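
The slicing pitfall can be checked with a short standalone example. `drop_trailing` is a hypothetical helper that mirrors the guarded expression in the suggested diff:

```python
from typing import Tuple

dims = (4, 8, 16)

# -0 is just 0, so this is dims[:0], which is empty.
empty = dims[:-0]


def drop_trailing(values: Tuple[int, ...], drop: int) -> Tuple[int, ...]:
    """Remove the last `drop` entries; keep everything when drop is 0."""
    return values[:-drop] if drop > 0 else values


kept_all = drop_trailing(dims, 0)
kept_first = drop_trailing(dims, 2)
```

An equivalent form without the conditional is `values[:len(values) - drop]`, which behaves correctly for `drop == 0` as well.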
DeeployTest/testMVP.py (1)

272-279: Bug: signProp is hard-coded to False, disabling offsetting for sign-prop platforms

This disables the input/output offset path below, likely breaking platforms that rely on sign propagation (and deployers that don’t accept inputOffsets). Compute signProp from platform to keep behavior consistent with setupDeployer().

Apply this diff:

-    platform = deployer.Platform
-    signProp = False
+    platform = deployer.Platform
+    # Keep main() consistent with setupDeployer(): derive platform sign-prop preference
+    _, signProp = mapPlatform(args.platform)
♻️ Duplicate comments (43)
Deeploy/Targets/MemPool/Templates/ITATemplate.py (1)

325-326: Sanitize nodeName before interpolating into C identifiers (prevents invalid/broken codegen).

If nodeName contains '/', '-', '.', spaces, or starts with a digit, the generated identifiers will be invalid C and compilation will fail. Use a C-safe variant when forming ${data_in_array} and ${quant_params_array}.

Apply this diff here:

-        operatorRepresentation['data_in_array'] = f"{nodeName}_data_in_array"
-        operatorRepresentation['quant_params_array'] = f"{nodeName}_quant_params_array"
+        operatorRepresentation['data_in_array'] = f"{nodeName_safe}_data_in_array"
+        operatorRepresentation['quant_params_array'] = f"{nodeName_safe}_quant_params_array"

Add the helper once at module scope and derive nodeName_safe near its first use:

@@
-from typing import Dict, Tuple
+from typing import Dict, Tuple
+import re
 import numpy as np
@@
+def _c_ident(s: str) -> str:
+    s = re.sub(r'[^A-Za-z0-9_]', '_', str(s))
+    if not re.match(r'[A-Za-z_]', s):
+        s = f'_{s}'
+    return s
@@
-        nodeName = operatorRepresentation['nodeName']
+        nodeName = operatorRepresentation['nodeName']
+        nodeName_safe = _c_ident(nodeName)

Optional: consider using nodeName_safe consistently for other identifier-like names built from nodeName in this file for uniform safety.
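
To see what the proposed helper does to typical node names, it can be run on a few samples. The function body follows the diff above (renamed to `c_ident` here); the sample names are made up for illustration.

```python
import re


def c_ident(s: str) -> str:
    """Map an arbitrary string to a valid C identifier."""
    # Replace every character that is not allowed in a C identifier.
    s = re.sub(r"[^A-Za-z0-9_]", "_", str(s))
    # Identifiers must not start with a digit (and must not be empty).
    if not re.match(r"[A-Za-z_]", s):
        s = f"_{s}"
    return s


samples = ("/encoder/layer.0/attn", "0_conv", "MatMul-3", "ok_name")
sanitized = {name: c_ident(name) for name in samples}
```

One caveat of this approach: distinct names such as `a-b` and `a.b` both map to `a_b`, so uniqueness of the generated identifiers would still have to be ensured elsewhere.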

DeeployTest/testUtils/testRunner.py (2)

324-336: Verify CI/export passes this new check.

Ensure workflows set LLVM_INSTALL_DIR or pass --toolchain_install_dir wherever TestRunner is used.

#!/bin/bash
# Find TestRunner invocations and check if toolchain dir is provided or env is set.
rg -nP '\bpython\s+.*testRunner.*\.py\b' .github/workflows -C3 || true
rg -n "LLVM_INSTALL_DIR" .github/workflows || true

325-327: Replace assert with explicit validation; handle toolchain and path.

Asserts are stripped with -O and Ruff flags the f-string. Validate deterministically and check existence.

-        assert self._args.toolchain_install_dir is not None, f"Environment variable LLVM_INSTALL_DIR is not set"
-        self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+        # Validate toolchain install dir (LLVM only). Avoid asserts in production.
+        if self._args.toolchain.upper() == "LLVM":
+            if not self._args.toolchain_install_dir:
+                raise ValueError("Missing toolchain install dir: set --toolchain_install_dir or LLVM_INSTALL_DIR.")
+            self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+            if not os.path.isdir(self._dir_toolchain):
+                raise FileNotFoundError(f"Toolchain directory not found: {self._dir_toolchain}")
+        else:
+            self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir) if self._args.toolchain_install_dir else ""
Deeploy/Targets/Generic/Templates/ITAMaxTemplate.py (1)

45-49: Use hoisted buffer’s actual name.

hoistTransientBuffer may mangle/alter the name; store and use the returned buffer name.

-        ctxt.hoistTransientBuffer(name, size)
-        operatorRepresentation['ctxtBuffer'] = name
+        buf = ctxt.hoistTransientBuffer(name, size)
+        operatorRepresentation['ctxtBuffer'] = getattr(buf, "name", name)
         operatorRepresentation['ctxtBufferSize'] = size
-
-        return ctxt, operatorRepresentation, [name]
+        return ctxt, operatorRepresentation, [operatorRepresentation['ctxtBuffer']]
DeeployTest/Platforms/Siracusa/src/deeploytest.c (1)

172-179: Avoid unsigned wraparound; widen before subtraction and fix printf casts.

Compute error in wider signed type; print with matching formats to avoid UB.

-        OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
-        OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
-        int error = expected - actual;
-        OUTPUTTYPE diff = (OUTPUTTYPE)(error < 0 ? -error : error);
+        OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
+        OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
+        int32_t error = (int32_t)expected - (int32_t)actual;
+        uint32_t diff = (uint32_t)(error < 0 ? -error : error);
 
         if (diff) {
           tot_err += 1;
-          printf("Expected: %4d  ", expected);
-          printf("Actual: %4d  ", actual);
-          printf("Diff: %4d at Index %12u in Output %u\r\n", diff, i, buf);
+          printf("Expected: %4d  ", (int)expected);
+          printf("Actual: %4d  ", (int)actual);
+          printf("Diff: %4u at Index %12u in Output %u\r\n", diff, (unsigned)i, (unsigned)buf);
         }
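
The wraparound being guarded against can be illustrated in Python with `ctypes` fixed-width integers standing in for C types. This is only a sketch under the assumption that `OUTPUTTYPE` is a 32-bit unsigned type; for narrower types, C integer promotion changes the picture, and the function names are hypothetical.

```python
import ctypes


def diff_in_uint32(expected: int, actual: int) -> int:
    """Subtract as if in uint32_t arithmetic: a negative result wraps modulo 2**32."""
    return ctypes.c_uint32(expected - actual).value


def abs_diff_widened(expected: int, actual: int) -> int:
    """Widen both operands to a signed 64-bit type first, then take the magnitude."""
    error = ctypes.c_int64(expected).value - ctypes.c_int64(actual).value
    return -error if error < 0 else error
```

For `expected = 3` and `actual = 5`, the unsigned subtraction yields 4294967294, while the widened version yields 2.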
Deeploy/EngineExtension/NetworkDeployers/EngineColoringDeployer.py (1)

65-70: Replace assert with deterministic exception; treat empty/None engine as uncolored.

Asserts can be stripped; raise a stable, informative error.

-        uncoloredNodes = [node for node in graph.nodes if "engine" not in node.attrs]
-        uncoloredOperations = set(node.op for node in uncoloredNodes)
-        assert len(
-            uncoloredNodes
-        ) == 0, f"Missing engine color for nodes {[node.name for node in uncoloredNodes]} with operations {uncoloredOperations}"
+        uncoloredNodes = [n for n in graph.nodes if not isinstance(n.attrs.get("engine"), str) or not n.attrs.get("engine")]
+        if uncoloredNodes:
+            nodeNames = sorted((n.name or "<unnamed>") for n in uncoloredNodes)
+            uncoloredOps = sorted({n.op for n in uncoloredNodes if n.op is not None})
+            raise ValueError(f"Missing engine color for nodes {nodeNames} with operations {uncoloredOps}")
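
A self-contained sketch of the check above, assuming a simplified node type. The `Node` dataclass here is a hypothetical stand-in for the graph node class and carries only the fields the check reads.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (name, op, attrs only)."""
    name: Optional[str]
    op: Optional[str]
    attrs: Dict[str, object] = field(default_factory=dict)


def check_engine_coloring(nodes: List[Node]) -> None:
    """Raise ValueError if any node lacks a non-empty string 'engine' attribute."""
    uncolored = [
        n for n in nodes
        if not isinstance(n.attrs.get("engine"), str) or not n.attrs.get("engine")
    ]
    if uncolored:
        # Sort for a stable, reproducible error message.
        names = sorted((n.name or "<unnamed>") for n in uncolored)
        ops = sorted({n.op for n in uncolored if n.op is not None})
        raise ValueError(f"Missing engine color for nodes {names} with operations {ops}")
```

Sorting the names and operations keeps the message identical across runs, which a set-based message does not guarantee.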
Deeploy/Targets/PULPOpen/Platform.py (1)

256-258: Switch to mchan_siracusa.h looks fine; verify there are no residual dory_dma usages and that headers resolve on all PULP variants.

Run the quick scan below and fix any remaining references if found.

#!/bin/bash
# Residual includes/usages of dory_dma
rg -nP 'dory_dma\.h|dory_dma_' -g '!**/build/**'

# Confirm new header is referenced and present
rg -nP 'mchan_siracusa\.h' -g '!**/build/**'
fd -HI 'mchan_siracusa.h'

# Optional: ensure include paths expose TargetLibraries/PULPOpen/inc
rg -nP 'include_paths|CFLAGS|CPPFLAGS' -n . -g '!**/build/**'
TargetLibraries/PULPOpen/inc/mchan_v6.h (1)

40-47: Good: enforce mutual exclusivity for event vs. polled modes.

Deeploy/Targets/Generic/TileConstraints/iRMSNormTileConstraint.py (1)

31-31: Fix potential overflow and dtype of "size" replacement.

np.prod can exceed 16-bit; hardcoding uint16_t risks truncation. Derive the minimal adequate integer type after computing sizes and ensure Python ints.

Apply:

-from Deeploy.CommonExtensions.DataTypes import uint16_t
+from Deeploy.CommonExtensions.DataTypes import minimalIntegerType
-        replacements = {"size": []}
-        replacementTypes = {"size": PointerClass(uint16_t)}
+        replacements = {"size": []}
-        for cube in outputCubes:
-            newSize = np.prod(cube.dims)
-            replacements["size"].append(newSize)
+        for cube in outputCubes:
+            newSize = int(np.prod(cube.dims))
+            replacements["size"].append(newSize)
-        variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)
+        replacementTypes = {"size": PointerClass(minimalIntegerType(replacements["size"]))}
+        variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)

Also applies to: 79-80, 81-83, 95-97
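
To make the overflow concern concrete, here is a minimal sketch of picking the narrowest unsigned width for a list of sizes. `minimal_unsigned_width` is a hypothetical helper written for illustration; it is not the `minimalIntegerType` function referenced in the diff and makes no claim about how that function behaves.

```python
import math
from typing import Iterable


def minimal_unsigned_width(values: Iterable[int]) -> int:
    """Return the smallest of 8/16/32/64 bits that can hold every value (hypothetical helper)."""
    vals = list(values)
    if any(v < 0 for v in vals):
        raise ValueError("Only non-negative values are supported")
    largest = max(vals, default=0)
    for width in (8, 16, 32, 64):
        if largest < (1 << width):
            return width
    raise OverflowError(f"{largest} does not fit in 64 bits")


# A 64 x 32 x 32 tile has 65536 elements, one more than uint16_t can hold.
tile_size = math.prod((64, 32, 32))
```

Here `minimal_unsigned_width([tile_size])` returns 32, which is why a hardcoded 16-bit type would truncate this size.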

DeeployTest/testRunner_siracusa_mchandma.py (1)

50-53: Replace asserts with explicit validation and use zip(strict=True).
Same as the earlier feedback: asserts can be stripped with -O, and the messages contain a minor grammar error ("smaller then" should be "smaller than").

Apply:

-assert len(inputShape) == len(tileShape), \
-    f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.'
-assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \
-    f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}'
+if len(inputShape) != len(tileShape):
+    raise ValueError(
+        f"Input and tile shape must have the same dimensionality. Got {len(inputShape)}D vs {len(tileShape)}D.")
+if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)):
+    raise ValueError(
+        f"Each tile dimension must be <= the corresponding input one. Got tile={tileShape}, input={inputShape}.")
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)

39-41: Consider extracting rank-aware cube construction into a shared helper

GEMM/MatMul (and Snitch counterparts) now duplicate rank/prefix handling. A single utility (e.g., make_ranked_cube(offset2d, dims2d, batch_off, batch_size, b_off=0, b_size=1)) would de-duplicate and harden behavior across targets.

I can sketch this helper and migrate both PULPOpen and Snitch variants in a follow-up if desired.

Deeploy/Targets/PULPOpen/TileConstraints/MatMulTileConstraint.py (1)

115-119: Bug: transA/transB ignored in serialization (tiles A/B as non-transposed)

NSize and A/B sub-rectangles don’t account for transpose flags. This mis-tiles when transA/transB == 1.

Apply:

@@
-        buffA = ctxt.lookup(operatorRepresentation['A'])
-        buffB = ctxt.lookup(operatorRepresentation['B'])
+        buffA = ctxt.lookup(operatorRepresentation['A'])
+        buffB = ctxt.lookup(operatorRepresentation['B'])
+        transA = int(operatorRepresentation.get("transA", 0))
+        transB = int(operatorRepresentation.get("transB", 0))
@@
-        NSize = buffA.shape[-1]
+        NSize = buffA.shape[-2] if transA else buffA.shape[-1]
@@
-            AMatrixOffsets = (MOffset, NOffset)
-            AMatrixShape = (MSize, NSize)
+            if transA == 0:
+                AMatrixOffsets = (MOffset, NOffset)
+                AMatrixShape = (MSize, NSize)
+            else:
+                AMatrixOffsets = (NOffset, MOffset)
+                AMatrixShape = (NSize, MSize)
@@
-            BMatrixOffsets = (NOffset, OOffset)
-            BMatrixShape = (NSize, OSize)
+            if transB == 0:
+                BMatrixOffsets = (NOffset, OOffset)
+                BMatrixShape = (NSize, OSize)
+            else:
+                BMatrixOffsets = (OOffset, NOffset)
+                BMatrixShape = (OSize, NSize)
#!/bin/bash
# Check for other MatMul/GEMM serializers still using buffA.shape[-1] for N unconditionally
rg -n -C2 -P 'serializeTilingSolution\(|shape\[-1\]\)' Deeploy/Targets | rg -n -P 'MatMul|GEMM'

Also applies to: 148-153
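
The transpose handling in the diff can be sketched independently of the tiling classes. This is a minimal sketch with hypothetical function names; it only reproduces the offset and shape arithmetic for the trailing two dimensions, with M, N, O named as in the diff.

```python
from typing import Sequence, Tuple

Rect = Tuple[Tuple[int, int], Tuple[int, int]]  # ((row_off, col_off), (rows, cols))


def reduction_size(a_shape: Sequence[int], trans_a: int) -> int:
    """N is the last dim of A, or the second-to-last when A is stored transposed."""
    return a_shape[-2] if trans_a else a_shape[-1]


def operand_tiles(m_off: int, m_size: int, o_off: int, o_size: int, n_size: int,
                  trans_a: int = 0, trans_b: int = 0, n_off: int = 0) -> Tuple[Rect, Rect]:
    """Return the (offsets, shape) rectangles of the A and B tiles for one output tile."""
    if trans_a:
        a = ((n_off, m_off), (n_size, m_size))
    else:
        a = ((m_off, n_off), (m_size, n_size))
    if trans_b:
        b = ((o_off, n_off), (o_size, n_size))
    else:
        b = ((n_off, o_off), (n_size, o_size))
    return a, b
```

For an A stored as 16 x 4 with `transA == 1`, `reduction_size` returns 16 rather than 4, and the A tile rectangle swaps its row and column roles accordingly.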

TargetLibraries/PULPOpen/inc/mchan_v7.h (2)

68-71: Make register pointers uintptr_t-safe and assert 32-bit target.

Prevents silent truncation when compiled on hosts with 64-bit pointers.

Apply this diff:

-#include "pmsis.h"
+#include "pmsis.h"
+#include <stdint.h>
+
+_Static_assert(sizeof(uintptr_t) == 4, "MCHAN assumes 32-bit addresses.");
@@
-static volatile uint32_t *const cmd_ptr =
-    (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x0);
-static volatile uint32_t *const status_ptr =
-    (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x4);
+static volatile uint32_t *const cmd_ptr =
+    (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x0u);
+static volatile uint32_t *const status_ptr =
+    (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x4u);

73-113: Inline helpers and cast pointers via uintptr_t before truncation.

Avoid multiple-definition bloat and UB when casting pointers to 32-bit regs.

Apply this diff:

-static void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) {
+static inline void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
 }
@@
-static void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext,
+static inline void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext,
                                           uint32_t loc_size_1d,
                                           uint32_t loc_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)loc_size_1d;
   *cmd_ptr = (uint32_t)loc_stride_2d;
 }
@@
-static void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext,
+static inline void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext,
                                           uint32_t ext_size_1d,
                                           uint32_t ext_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)ext_size_1d;
   *cmd_ptr = (uint32_t)ext_stride_2d;
 }
@@
-static void mchan_transfer_2d_loc_strided_ext_strided(
+static inline void mchan_transfer_2d_loc_strided_ext_strided(
     uint32_t cmd, void *loc, void *ext, uint32_t loc_size_1d,
     uint32_t loc_stride_2d, uint32_t ext_size_1d, uint32_t ext_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)ext_size_1d;
   *cmd_ptr = (uint32_t)ext_stride_2d;
   *cmd_ptr = (uint32_t)loc_size_1d;
   *cmd_ptr = (uint32_t)loc_stride_2d;
 }
DeeployTest/testDmas.py (1)

29-46: Harden subprocess call: build argv list; avoid shell=True.

Prevents shell injection and quoting issues; aligns with static analysis (S602). Also keep "-DNUM_CORES=8" as a single argv element to preserve current parsing.

-    cmd = [f"python {testRunner}", f"-t test{dma}", "-DNUM_CORES=8"]
-    cmd.append(f"--input-shape {' '.join(str(x) for x in inputShape)}")
-    cmd.append(f"--tile-shape {' '.join(str(x) for x in tileShape)}")
-    cmd.append(f"--node-count {nodeCount}")
-    cmd.append(f"--type {dataType}")
+    import sys, shlex
+    cmd = [
+        sys.executable, testRunner,
+        "-t", f"test{dma}",
+        "-DNUM_CORES=8",
+        "--input-shape", *[str(x) for x in inputShape],
+        "--tile-shape", *[str(x) for x in tileShape],
+        "--node-count", str(nodeCount),
+        "--type", dataType,
+    ]
     if doublebuffer:
         cmd.append("--doublebuffer")
 
-    full_cmd = " ".join(cmd)
+    full_cmd = shlex.join(cmd)
 
     print(f"Running command:\n{full_cmd}\n")
 
     try:
-        subprocess.run(full_cmd, shell = True, check = True)
+        subprocess.run(cmd, check = True)
     except subprocess.CalledProcessError:
         print(f"test{dma}: Failed test:" + cfg_str)
         print(f"Rerun with command:\n{full_cmd}")
-        exit(-1)
+        sys.exit(1)
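
A minimal sketch of the argv-list approach as a standalone helper. The function name `build_dma_test_cmd` and its parameters are hypothetical; the flags are the ones shown in the diff, and `shlex.join` is used only to produce a copy-pasteable string for logging.

```python
import shlex
import sys
from typing import List, Sequence


def build_dma_test_cmd(test_runner: str, dma: str, input_shape: Sequence[int],
                       tile_shape: Sequence[int], node_count: int, data_type: str,
                       doublebuffer: bool = False) -> List[str]:
    """Build the argument vector for one DMA test run (no shell involved)."""
    cmd = [
        sys.executable, test_runner,
        "-t", f"test{dma}",
        "-DNUM_CORES=8",  # kept as a single argv element
        "--input-shape", *[str(x) for x in input_shape],
        "--tile-shape", *[str(x) for x in tile_shape],
        "--node-count", str(node_count),
        "--type", data_type,
    ]
    if doublebuffer:
        cmd.append("--doublebuffer")
    return cmd


def printable(cmd: Sequence[str]) -> str:
    """Shell-quoted form for logs; pass the list itself to subprocess.run."""
    return shlex.join(cmd)
```

The list goes to `subprocess.run(cmd, check=True)` unchanged, while `printable(cmd)` is only for the "Rerun with command" message.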
DeeployTest/testRunner_snitch_dma.py (1)

88-93: Ensure float inputs are float32; drop unused variable.

np.random.rand returns float64; cast to float32. Remove unused np.iinfo.

-if not testRunner._args.skipgen:
-    if dtype == np.float32:
-        test_inputs = np.random.rand(*inputShape)
-    else:
-        info = np.iinfo(dtype)
-        test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)
+if not testRunner._args.skipgen:
+    if dtype == np.float32:
+        test_inputs = np.random.rand(*inputShape).astype(np.float32)
+    else:
+        test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)
DeeployTest/testRunner_siracusa_l3dma.py (2)

50-53: Replace asserts with explicit validation and use zip(strict=...)

Asserts can be stripped with -O, and without strict=True zip silently truncates inputs of mismatched length. Use explicit checks.

-assert len(inputShape) == len(tileShape), \
-    f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.'
-assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \
-    f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}'
+if len(inputShape) != len(tileShape):
+    raise ValueError(
+        f'Input and tile shape must have the same dimensionality. Received {len(inputShape)}D vs. {len(tileShape)}D.'
+    )
+# If Python < 3.10, drop strict=True.
+if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)):
+    raise ValueError(
+        f'Each tile dimension must be <= the corresponding input one. Received tiles {tileShape} > input {inputShape}.'
+    )

81-86: Fix float32 dtype and remove unused variable

np.random.rand yields float64; cast to match float32 tensors. Remove the unused info.

-if dtype == np.float32:
-    test_inputs = np.random.rand(*inputShape)
+if dtype == np.float32:
+    test_inputs = np.random.rand(*inputShape).astype(dtype)
 else:
-    info = np.iinfo(dtype)
     test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)

33-36: Replace asserts with runtime checks; fix device name and typos (duplicate).

Asserts can be stripped with -O and messages reference “Mchan” and “contigous”. Use explicit exceptions, correct device name to L3Dma, and “contiguous”.

-        assert strideExt[-1] == 1, \
-            "Mchan supports only contigous transfers of the innermost dimension for external memory"
-        assert strideLoc[0] == shape[1] and strideLoc[1] == 1, \
-            f"Mchan supports only contigous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}"
+        if strideExt[-1] != 1:
+            raise ValueError("L3Dma supports only contiguous transfers of the innermost dimension for external memory")
+        if not (len(shape) >= 2 and strideLoc[0] == shape[1] and strideLoc[1] == 1):
+            raise ValueError(
+                f"L3Dma supports only contiguous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}"
+            )

43-48: Pass DMA sizes in bytes, not elements (duplicate).

pi_cl_ram_copy_2d expects size/stride/length in bytes. Multiply by element width.

-        operatorRepresentation.update({
-            "ext2loc": 1 if direction == "ExternalToLocal" else 0,
-            "transfer_size": math.prod(shape),
-            "length": shape[1],
-            "stride": strideExt[0],
-        })
+        bytes_per_elem = externalBuffer._type.referencedType.typeWidth // 8
+        operatorRepresentation.update({
+            "ext2loc": 1 if direction == "ExternalToLocal" else 0,
+            "transfer_size": math.prod(shape) * bytes_per_elem,
+            "length": shape[1] * bytes_per_elem,
+            "stride": strideExt[0] * bytes_per_elem,
+        })
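
The element-to-byte conversion can be sketched in isolation. This is a minimal sketch of the arithmetic only: the function name and the `type_width_bits` parameter are hypothetical stand-ins for `externalBuffer._type.referencedType.typeWidth`, and the claim that the copy routine takes byte counts is the review's, not something this sketch verifies.

```python
import math
from typing import Dict, Sequence


def copy_2d_params_in_bytes(shape: Sequence[int], stride_ext: Sequence[int],
                            type_width_bits: int) -> Dict[str, int]:
    """Convert a 2D transfer described in elements into byte-based parameters."""
    if type_width_bits % 8 != 0:
        raise ValueError(f"Type width {type_width_bits} is not a whole number of bytes")
    bytes_per_elem = type_width_bits // 8
    return {
        "transfer_size": math.prod(shape) * bytes_per_elem,  # total bytes moved
        "length": shape[1] * bytes_per_elem,                 # bytes per contiguous row
        "stride": stride_ext[0] * bytes_per_elem,            # byte distance between rows
    }
```

For a 4 x 8 tile of 32-bit elements cut from rows of 16 elements, this gives a transfer size of 128 bytes, a row length of 32 bytes and a stride of 64 bytes.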
Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)

41-42: Bug: HyperRectangle ctor args swapped (offset vs dims).

Constructor expects (offset, dims); passing (dims, offset) breaks minimization/stride derivation.

-    maxTransferRect = HyperRectangle(maxTransferDims, inRect.offset)
+    maxTransferRect = HyperRectangle(offset=inRect.offset, dims=maxTransferDims)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)

78-88: Shared structure with SB path — consider factoring common pieces into base.

Ingress/egress scheduling, reference hoisting, and future orchestration duplicate SB logic. A small helper in the base could reduce drift.


132-140: Fix DMA prefetch future: not initialized/waited/deinitialized — race with first tile.

Reuse the main future so the existing init/wait/deinit covers the prefetch. The standalone initialFuture is never init'ed/waited/deinit'ed.

-        gen = AnydimAsyncDmaTransferAdapter(self.dma)
-
-        initialFuture = self.dma.getFuture(tensorName + "_init")
-        initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
-                                               stridesFromShape(externalBufferShape),
-                                               stridesFromShape(rectangles[0].dims), "ExternalToLocal",
-                                               initialFuture, math.prod(externalBufferShape))
+        gen = AnydimAsyncDmaTransferAdapter(self.dma)
+        # Reuse the same future so it's properly init'ed, waited, and deinit'ed
+        initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
+                                               stridesFromShape(externalBufferShape),
+                                               stridesFromShape(rectangles[0].dims), "ExternalToLocal",
+                                               future, math.prod(externalBufferShape))
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4)

111-118: Enforce tile-count invariants; keep single source of truth for counts

Validate ingress/egress lengths match and ensure totalNumTiles reflects the schedule length.

-        metaInfo = TilingMetaInfo(
+        if len(tilingSchedule.inputLoadSchedule) != len(tilingSchedule.outputLoadSchedule):
+            raise ValueError(
+                f"Tiling schedule ingress/egress length mismatch: "
+                f"{len(tilingSchedule.inputLoadSchedule)} vs {len(tilingSchedule.outputLoadSchedule)}"
+            )
+        metaInfo = TilingMetaInfo(
             nodeName = operatorRepresentation['nodeName'] + f"_{self.externalMemory}",
             nodeOps = operatorRepresentation['nodeOps'],
             numTiles = operatorRepresentation['numTiles'],
             totalNumTiles = len(tilingSchedule.outputLoadSchedule),
             tileIdxPtr = operatorRepresentation['tileIdxPtr'],
             tileIdxVar = "TILING_I",

Would you like a follow-up patch to assert equality with the value used by _hoistTileNumAndIdxPtr?


27-27: Replace Set-based futures with ordered, de-duplicated List; remove unused variable; ensure deterministic init/deinit/wait order

Sets make codegen order non-deterministic and break stable teardown ordering; also referenceUpdates is unused. Switch to List[Future], de-duplicate while preserving insertion order, and reverse deinit.

-from typing import Dict, List, Set, Tuple
+from typing import Dict, List, Tuple
@@
-            tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], Set[Future]]:
+            tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], List[Future]]:
         callStack: List[CodeSnippet] = []
-        referenceUpdates: List[CodeSnippet] = []
-        futures: Set[Future] = set()
+        futures: List[Future] = []
@@
-            future = self.dma.getFuture(tensorName)
-            futures.add(future)
+            future = self.dma.getFuture(tensorName)
+            if future not in futures:
+                futures.append(future)
@@
-        ingressDmaWaitStatements = [future.wait() for future in ingressFutures]
-        egressDmaWaitStatements = [future.wait() for future in egressFutures]
+        ingressDmaWaitStatements = [future.wait() for future in ingressFutures]
+        egressDmaWaitStatements = [future.wait() for future in egressFutures]
@@
-        setupStatements = self.dma.setup()
-        setupStatements += [f.init() for f in ingressFutures | egressFutures]
-
-        teardownStatements = self.dma.teardown()
-        teardownStatements.extend(f.deinit() for f in ingressFutures | egressFutures)
+        setupStatements = self.dma.setup()
+        allFutures: List[Future] = []
+        allFutures.extend(ingressFutures)
+        for f in egressFutures:
+            if f not in allFutures:
+                allFutures.append(f)
+        setupStatements += [f.init() for f in allFutures]
+
+        teardownStatements = self.dma.teardown()
+        teardownStatements.extend(f.deinit() for f in reversed(allFutures))

Also applies to: 49-53, 99-107
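
The ordering idea can be sketched without the codegen classes. This is a minimal sketch assuming the futures are hashable (they were stored in a set before); `ordered_unique` is a hypothetical helper, and plain strings stand in for `Future` objects.

```python
from typing import Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def ordered_unique(*groups: Iterable[T]) -> List[T]:
    """Concatenate the groups, dropping duplicates but keeping first-seen order."""
    seen: Dict[T, None] = {}
    for group in groups:
        for item in group:
            seen.setdefault(item, None)  # dicts preserve insertion order
    return list(seen)


ingress = ["fut_in0", "fut_shared"]
egress = ["fut_shared", "fut_out0"]

all_futures = ordered_unique(ingress, egress)
init_order = all_futures                    # set up in first-use order
deinit_order = list(reversed(all_futures))  # tear down in reverse
```

Unlike `ingressFutures | egressFutures` on sets, the resulting order does not depend on hash values, so the generated code is stable between runs.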


56-63: Replace asserts with explicit checks; guard missing constraints/shape

Asserts can be stripped with -O, and indexing without .get can raise an uninformative KeyError. Raise clear exceptions and validate presence.

-            assert localBuffer._memoryLevel == self.localMemory
-            assert isinstance(localBuffer, _ReferenceBuffer)
-            externalBuffer = ctxt.lookup(localBuffer._referenceName)
-            assert isinstance(externalBuffer, VariableBuffer)
-            tensorMemoryConstraint = tensorMemoryConstraintDict[externalBuffer.name]
-            externalBufferShape = tensorMemoryConstraint.memoryConstraints[self.externalMemory].shape
-            assert externalBufferShape is not None
+            if localBuffer._memoryLevel != self.localMemory:
+                raise ValueError(
+                    f"Local buffer '{localBuffer.name}' expected in '{self.localMemory}', "
+                    f"got '{localBuffer._memoryLevel}'"
+                )
+            if not isinstance(localBuffer, _ReferenceBuffer):
+                raise TypeError(f"Expected _ReferenceBuffer for '{localBuffer.name}', got {type(localBuffer).__name__}")
+            externalBuffer = ctxt.lookup(localBuffer._referenceName)
+            if not isinstance(externalBuffer, VariableBuffer):
+                raise TypeError(
+                    f"Expected VariableBuffer for external reference '{localBuffer._referenceName}', "
+                    f"got {type(externalBuffer).__name__}"
+                )
+            tensorMemoryConstraint = tensorMemoryConstraintDict.get(externalBuffer.name)
+            if tensorMemoryConstraint is None:
+                raise KeyError(f"Missing TensorMemoryConstraint for '{externalBuffer.name}'")
+            memCnstr = tensorMemoryConstraint.memoryConstraints.get(self.externalMemory)
+            if memCnstr is None or memCnstr.shape is None:
+                raise KeyError(
+                    f"Missing shape for '{externalBuffer.name}' in memory level '{self.externalMemory}'"
+                )
+            externalBufferShape = memCnstr.shape

92-98: Use DmaDirection enum, not string literals

Passing strings breaks type expectations and templates.

-        ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls(
-            ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule,
-            nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", "ExternalToLocal")
+        ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls(
+            ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule,
+            nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", DmaDirection.ExternalToLocal)
@@
-        ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls(
-            ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule,
-            nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", "LocalToExternal")
+        ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls(
+            ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule,
+            nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", DmaDirection.LocalToExternal)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)

33-38: Replace assert; fix typo and stray f-string

Use explicit exceptions; “contiguous” spelling; drop unnecessary f-prefix.

     def checkTransfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, localBuffer: VariableBuffer,
                       shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...],
                       direction: DmaDirection) -> None:
         super().checkTransfer(ctxt, externalBuffer, localBuffer, shape, strideExt, strideLoc, direction)
-        assert strideLoc[1] == 1 and strideExt[1] == 1, f"Supports only contigous transfers in the innermost dimension"
+        if not (strideLoc[1] == 1 and strideExt[1] == 1):
+            raise ValueError("Supports only contiguous transfers in the innermost dimension")

39-51: Use DmaDirection enum and convert element counts to byte counts for DMA API

snrt expects sizes/strides in bytes. Multiply by element size and avoid string directions.

     def transferOpRepr(self, externalBuffer: VariableBuffer, localBuffer: VariableBuffer, shape: Tuple[int, ...],
                        strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...], direction: DmaDirection,
                        future: Future) -> OperatorRepresentation:
         _ = future
+        bytes_per_elem = localBuffer._type.referencedType.typeWidth // 8
         operatorRepresentation: OperatorRepresentation = {
-            "dest": localBuffer.name if direction == "ExternalToLocal" else externalBuffer.name,
-            "src": externalBuffer.name if direction == "ExternalToLocal" else localBuffer.name,
-            "repeat": shape[0],
-            "size": shape[1],
-            "stride_dest": strideLoc[0] if direction == "ExternalToLocal" else strideExt[0],
-            "stride_src": strideExt[0] if direction == "ExternalToLocal" else strideLoc[0],
+            "dest": localBuffer.name if direction == DmaDirection.ExternalToLocal else externalBuffer.name,
+            "src": externalBuffer.name if direction == DmaDirection.ExternalToLocal else localBuffer.name,
+            "repeat": shape[0],
+            "size": shape[1] * bytes_per_elem,
+            "stride_dest": (strideLoc[0] if direction == DmaDirection.ExternalToLocal else strideExt[0]) * bytes_per_elem,
+            "stride_src": (strideExt[0] if direction == DmaDirection.ExternalToLocal else strideLoc[0]) * bytes_per_elem,
         }
         return operatorRepresentation
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)

55-57: Fix off-by-one error in size validation.

17 bits can encode values from 0 to 2^17-1. The assertion should use strict inequality.

Apply this fix:

-        assert mchanTransferSize <= 2**17, (
+        assert mchanTransferSize < 2**17, (
             "The Dma transfer size for mchan should be representable with 17 bits, "
             f"current number of bits required is {math.ceil(math.log2(mchanTransferSize))}")
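
The boundary case can be checked with plain integer arithmetic. This is a minimal sketch; `fits_in_bits` is a hypothetical helper. It also shows that `math.ceil(math.log2(n))`, as used in the assertion message, under-reports the required bits when `n` is an exact power of two, whereas `int.bit_length()` does not.

```python
import math


def fits_in_bits(value: int, bits: int) -> bool:
    """True if value is representable as an unsigned integer of the given width."""
    return 0 <= value < (1 << bits)


limit = 2**17

ok_max = fits_in_bits(limit - 1, 17)    # largest 17-bit value
ok_limit = fits_in_bits(limit, 17)      # one past the largest 17-bit value

bits_by_log2 = math.ceil(math.log2(limit))  # what the current message would print
bits_exact = limit.bit_length()             # actual number of bits needed
```

Here `ok_max` is `True` and `ok_limit` is `False`, while `bits_by_log2` is 17 and `bits_exact` is 18, so the message would claim 17 bits for a value that does not fit in 17 bits.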

61-64: Convert element counts to byte counts for DMA parameters.

The DMA hardware expects byte counts, but the code is passing element counts for size_1d and stride_2d.

You need to multiply by the element size in bytes. Add this before line 61:

# Get element size from buffer type
elementSizeInBytes = externalBuffer._type.referencedType.typeWidth // 8

Then update the assignments:

         if transferRank == 2:
-            operatorRepresentation["size_1d"] = shape[1]
-            operatorRepresentation["stride_2d"] = strideExt[0]
+            operatorRepresentation["size_1d"] = shape[1] * elementSizeInBytes
+            operatorRepresentation["stride_2d"] = strideExt[0] * elementSizeInBytes
Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py (1)

53-58: Improve arena allocation robustness and formatting.

The arena allocation method has several issues:

  1. No bounds checking for the offset
  2. F-string embedded in template string makes it less readable
  3. Missing arena size validation

Consider this improved implementation:

 def _arenaAllocate(self, ctxt: NetworkContext, buffer: VariableBuffer, offset: int) -> VariableBuffer:
     arena = ctxt.lookup(self.arenaName)
-    buffer.allocTemplate = NodeTemplate(" \
-    ${type.typeName} ${name} = (${type.typeName}) " + f"((char*){str(arena._instance)} + {offset});")
+    # Add bounds checking if arena has size attribute
+    if hasattr(arena, 'size'):
+        assert 0 <= offset < arena.size, f"Offset {offset} out of bounds for arena {self.arenaName} (size={arena.size})"
+    
+    # Use cleaner template with placeholders
+    buffer.allocTemplate = NodeTemplate("""
+    ${type.typeName} ${name} = (${type.typeName}) ((char*)${arena_ptr} + ${offset});""")
+    buffer._arena_ptr = str(arena._instance)
+    buffer._offset = offset
     buffer.deallocTemplate = NodeTemplate("")
     return buffer

Also update the buffer's _bufferRepresentation method to include these fields:

def _bufferRepresentation(self) -> Dict:
    repr = super()._bufferRepresentation()
    if hasattr(self, '_arena_ptr'):
        repr['arena_ptr'] = self._arena_ptr
        repr['offset'] = self._offset
    return repr
Deeploy/TilingExtension/TilerExtension.py (1)

938-941: Fix off-by-one and avoid private field access for lifetime.

Outputs should be alive through the last step index len(schedule) - 1. Also use the public lifetime property.

-        for tensor in graph.outputs:
-            assert memoryBlockMap[tensor.name]._lifetime[-1] == len(
-                schedule), "Invalid memory map! Output buffer is not alive at the last step!"
+        for tensor in graph.outputs:
+            assert memoryBlockMap[tensor.name].lifetime[-1] == len(schedule) - 1, \
+                "Invalid memory map! Output buffer is not alive at the last step!"
DeeployTest/testUtils/dmaUtils.py (1)

354-370: Validate tileShape vs input shape before generating tiling.

Preempt invalid tilings early (rank, positivity, bounds, divisibility).

 def prepare_deployer_with_custom_tiling(deployer: NetworkDeployer, defaultMemory: str, targetMemory: str,
                                         tileShape: Tuple[int, ...], doublebuffer: bool) -> None:
     # Decomposed deployer.prepare() to enter a custom tiling solution
     deployer.frontEnd()
     super(TilerDeployerWrapper, deployer).bind()
 
+    inputShape = tuple(deployer.graph.inputs[0].shape)
+    assert len(tileShape) == len(inputShape), \
+        f"Tile rank {len(tileShape)} doesn't match input rank {len(inputShape)}"
+    for i, (t, s) in enumerate(zip(tileShape, inputShape)):
+        assert t > 0, f"Tile dim {i} must be > 0"
+        assert t <= s, f"Tile dim {i}: {t} exceeds input dim {s}"
+        assert s % t == 0, f"Input dim {i}={s} not divisible by tile dim {t}"
+
     tilingSolution, memoryMap = generate_tiling(
         ctxt = deployer.ctxt,
         memoryStart = defaultMemory,
         memoryOrder = [defaultMemory, targetMemory],
         memoryHierarchy = deployer.Platform.memoryHierarchy,
-        inputShape = deployer.graph.inputs[0].shape,
+        inputShape = inputShape,
         tileShape = tileShape,
         graph = deployer.graph,
         _type = deployer.inputTypes['input_0'].referencedType,
         doublebuffer = doublebuffer,
     )
Deeploy/CommonExtensions/CodeTransformationPasses/IntrospectiveCodeTransformation.py (2)

49-73: Update template source and refresh parse-tree cache after reconstruction.

Without syncing _source and re-keying parseTreeDict, subsequent mutations use stale trees.

     def _reconstructCode(template: Template, node: TemplateNode) -> Template:
@@
         module = types.ModuleType(template.module_id)
         code = compile(source, template.module_id, "exec")
         exec(code, module.__dict__, module.__dict__)
 
+        # Refresh source and re-key the parse-tree cache
+        old_hash = hash(getattr(template, "_source", source))
+        template._source = source
+        new_hash = hash(template._source)
+        d = IntrospectiveCodeTransformationMixIn.parseTreeDict
+        d.pop(old_hash, None)  # drop the stale entry, if any
+        d[new_hash] = node
+
         template._code = code
         template.module = module
         template.callable_ = template.module.render_body
         return template

171-181: Don’t use assert for runtime type checks in library code.

Replace with explicit exception; asserts can be stripped with -O.

-            def _unrollStructReferences(val: Struct) -> List[str]:
-                assert isinstance(val, Struct)
+            def _unrollStructReferences(val: Struct) -> List[str]:
+                if not isinstance(val, Struct):
+                    raise TypeError(f"Expected Struct, got {type(val).__name__}")
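
To make the `-O` concern concrete, here is a minimal sketch that compiles the same check twice using the built-in `compile()` with its `optimize` argument, which mirrors the interpreter's `-O` flag. The helper name `runs_cleanly` is hypothetical, and `str` stands in for `Struct` so the snippet is self-contained.

```python
def runs_cleanly(src: str, optimize: int) -> bool:
    """Compile `src` at the given optimization level and report whether it ran without raising."""
    code = compile(src, "<demo>", "exec", optimize = optimize)
    try:
        exec(code, {})
    except (AssertionError, TypeError):
        return False
    return True


# `str` is a stand-in for Struct (assumption for illustration only).
ASSERT_CHECK = "assert isinstance(3, str)"
EXPLICIT_CHECK = "if not isinstance(3, str):\n    raise TypeError('Expected str, got int')"

# The assert catches the bad value only when asserts are enabled (optimize = 0).
assert_caught_debug = not runs_cleanly(ASSERT_CHECK, 0)
assert_caught_optimized = not runs_cleanly(ASSERT_CHECK, 1)
# The explicit raise survives optimization.
explicit_caught_optimized = not runs_cleanly(EXPLICIT_CHECK, 1)
```

With `optimize = 1` the assert is compiled away, so the bad value passes silently, while the explicit `raise TypeError` still fires.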
Deeploy/TilingExtension/TilingCodegen.py (1)

247-251: Missing documentation and tests for stride calculation remain unaddressed

Based on the past review comment, there are still no unit tests or documentation for the stridesFromShape function. The function assumes row-major ordering, which should be documented and tested.

Would you like me to generate unit tests and documentation for this critical utility function to ensure correct stride calculations across different tensor shapes?
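
As a starting point for such tests, a minimal sketch of a row-major stride computation follows. It is not the actual `stridesFromShape` implementation: the function name `row_major_strides` is hypothetical, and it assumes strides are counted in elements rather than bytes.

```python
from typing import Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Return element strides for a row-major (C-order) layout.

    The innermost (last) dimension has stride 1; each outer stride is the
    product of all dimension sizes to its right.
    """
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))
```

For a shape of `(2, 3, 4)` this gives `(12, 4, 1)`: stepping once along the first axis skips a full `3 * 4` block. A unit test for the real function could compare against a reference like this across ranks 0 to N.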

Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (1)

130-143: Potential template/snippet count mismatch can cause silent failures

The loop assumes each transfer produces the same number of snippets as templates, which could drift and cause misaligned templating if the adapter returns different counts for different tiles.

Add a validation check after generating snippets:

     for rect in transfers:
         snippets = gen.transfer(ctxt, externalBuffer,
                                localBuffer, rect.dims, stridesFromShape(externalBuffer.shape),
                                stridesFromShape(rect.dims), direction, future, math.prod(externalBuffer.shape))
+        if len(snippets) != len(templates):
+            raise RuntimeError(f"Adapter returned {len(snippets)} snippets, expected {len(templates)} "
+                              f"for rectangle {rect}")
         for i, snippet in enumerate(snippets):
             opReprUpdates[i].append(snippet.operatorRepresentation)
Deeploy/DeeployTypes.py (1)

539-565: Missing boundary checking for reference buffer offset

As noted in past reviews, there's no validation that the offset is within valid bounds of the referenced buffer. This could lead to out-of-bounds memory access at runtime.

The offset validation is critical for memory safety. Would you like me to implement the boundary checking logic that was discussed in the previous review, or would you prefer to track this as a separate issue for future implementation?
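
For discussion, a minimal sketch of what such a check could look like. The function name `check_reference_offset` and its parameters are hypothetical, and it assumes the offset and both buffer sizes are expressed in elements of a contiguous buffer; the real reference-buffer code in `DeeployTypes.py` may represent these differently.

```python
import math
from typing import Sequence


def check_reference_offset(base_shape: Sequence[int], ref_shape: Sequence[int], offset: int) -> None:
    """Raise ValueError if a reference of `ref_shape` at `offset` leaves the base buffer."""
    base_size = math.prod(base_shape)
    ref_size = math.prod(ref_shape)
    if offset < 0:
        raise ValueError(f"Offset {offset} must be non-negative")
    if offset + ref_size > base_size:
        raise ValueError(f"Reference [{offset}, {offset + ref_size}) exceeds base buffer of {base_size} elements")
```

Raising an exception rather than asserting keeps the check active under `-O`, in line with the other comments in this review.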

Deeploy/TilingExtension/AsyncDma.py (4)

149-151: Use size_t for loop indices.

Loop bounds are sizes; prefer size_t to avoid narrowing/truncation.

-                templateStr += f"for (uint32_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"
+                templateStr += f"for (size_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"

74-82: Replace asserts with exceptions; fix stray f-string.

Asserts may be stripped under -O and the first f-string has no placeholders. Raise ValueError instead.

-        assert transferRank == len(strideLoc) and transferRank == len(
-            strideExt), f"The shape and stride rank should match"
-        assert transferRank in self.supportedTransferRanks(
-        ), f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}"
+        if not (transferRank == len(strideLoc) and transferRank == len(strideExt)):
+            raise ValueError("The shape and stride rank should match")
+        if transferRank not in self.supportedTransferRanks():
+            raise ValueError(
+                f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}"
+            )

163-169: Widen offset type to size_t.

Offsets can exceed 32 bits on 64-bit targets; use size_t.

-            templateStr = f"const uint32_t {name} = "
+            templateStr = f"const size_t {name} = "
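
To illustrate what the generated loops and offsets compute, here is a minimal Python sketch of splitting an any-dimensional transfer into lower-rank DMA calls. It is not the `AnydimAsyncDmaTransferAdapter` code itself: the function name `outer_offsets` and its parameters are hypothetical, and strides are assumed to be in the same unit (elements or bytes) as the offsets.

```python
import itertools
from typing import Iterator, Sequence, Tuple


def outer_offsets(shape: Sequence[int], stride_ext: Sequence[int], stride_loc: Sequence[int],
                  kernel_rank: int) -> Iterator[Tuple[int, int]]:
    """Yield (external_offset, local_offset) for each kernel-rank sub-transfer.

    The leading `len(shape) - kernel_rank` dimensions are iterated in software;
    the trailing `kernel_rank` dimensions are left to the DMA itself.
    """
    outer = len(shape) - kernel_rank
    for idx in itertools.product(*(range(n) for n in shape[:outer])):
        ext = sum(i * s for i, s in zip(idx, stride_ext))
        loc = sum(i * s for i, s in zip(idx, stride_loc))
        yield ext, loc


# A (2, 3, 4) tile cut from a (4, 5, 8) external tensor, with a rank-1 DMA kernel.
offsets = list(outer_offsets((2, 3, 4), (40, 8, 1), (12, 4, 1), kernel_rank = 1))
```

Each yielded pair corresponds to one iteration of the emitted `for` nest. The offsets grow with the product of outer indices and strides, which is why the width of the C type holding them matters.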

171-171: Fix non-standard void* arithmetic; cast through uintptr_t.

Standard C forbids void* arithmetic, and the offset should be size_t/uintptr_t. Also ensure the generated TU includes <stdint.h> (for uintptr_t) and <stddef.h> (for size_t).

-    offsetPtrTemplate = NodeTemplate("void * const ${resultPtr} = (void *)${basePtr} + ${offset};")
+    offsetPtrTemplate = NodeTemplate(
+        "void * const ${resultPtr} = (void *)((uintptr_t)${basePtr} + ${offset});"
+    )

Action: verify the codegen path emits the necessary includes once per TU.

@Xeratec
Member

Xeratec commented Sep 9, 2025

@lukamac Thanks for the changes and the high-quality work. I went over your changes and the (annoyingly many) comments from the AI. FYI, I have now disabled auto-review. However, I think some of the comments are valid. Please go over the open conversations and fix or resolve them. Afterwards, I am happy to merge this PR.

Please do not forget to extend the CHANGELOG.md!

@pulp-platform pulp-platform deleted a comment from coderabbitai bot Sep 9, 2025
Member

@Xeratec Xeratec left a comment

LGTM, thanks a lot for the great work!

@Xeratec Xeratec merged commit 05d6403 into pulp-platform:devel Sep 10, 2025
122 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Deeploy Sep 10, 2025
This was referenced Oct 16, 2025
marchioa pushed a commit to FondazioneChipsIT/Deeploy that referenced this pull request Oct 20, 2025
* Refactor mchan hal

* Refactor IntrospectiveCodeTransformation

* Refactor MemoryAllocation

* Add minimalIntegerType helper function

* Small refactor DeeployTypes

* Change Neureka tile constraints to new TilingCodegen function

* Small refactors

Check for LLVM_INSTALL_DIR environment variable

Fix typo

Check for SNITCH_HOME environment variable and crash if not present

Change test output difference to absolute difference

Improve engine coloring error message

Fix type hint

* Permutation refactor

* Refactor TransposeTileConstraint

* Remove manual name mangling from templates since it's automatically done in the ExecutionBlock.generate()

* Change serialize to produce same shape rank as original

* Refactor TilingExtension

* Port PULPOpen

* Port Snitch

* DeeployTest: Extract generic tiling code into tilingUtils.py

* DeeployTest: Extract common test generation code

* DeeployTest: Add Dma tests

* Apply Philip's comments

Remove dory_dma.h

Fix hoistReference doc comment

Use the shape argument of the _hoistReference function

Rename dma test runners

Change kernelLevelTiling HACK comment to a TODO

Add DMA folder to targets with DMAs

Fix wrong deeployStateDir

Single source of truth for the tiling arena name

* Add unravelReference doc comment and fix the dealiasBuffer's

* Refactor type inference and minimal(Integer|Float)Type

* Revert extra inputs hack

* Add mchan check for both event- and poll-based event checking flags being set

* Fix HyperRectangle arg order

* Fix mchan check whether size is representable within 17 bits

* Fix init, deinit, wait on initialFuture in DoubleBuffering, rename gen to anydimAdapter

* Fix GEMM tile constraint serialization to check transA and transB

* Fix inherit from ABC in AsyncDma and AsyncDmaWaitingStrategy

* Fix use tileSizeInBytes to check whether it fits in memory

* Update changelog

* Add missing transferOpRepr abstract method from the BlockingAsyncDmaAdapter
diaconuccalin pushed a commit to diaconuccalin/Deeploy that referenced this pull request Oct 27, 2025
* Refactor mchan hal

* Refactor IntrospectiveCodeTransformation

* Refactor MemoryAllocation

* Add minimalIntegerType helper function

* Small refactor DeeployTypes

* Change Neureka tile constraints to new TilingCodegen function

* Small refactors

Check for LLVM_INSTALL_DIR environment variable

Fix typo

Check for SNITCH_HOME environment variable and crash if not present

Change test output difference to absolute difference

Improve engine coloring error message

Fix type hint

* Permutation refactor

* Refactor TransposeTileConstraint

* Remove manual name mangling from templates since it's automatically done in the ExecutionBlock.generate()

* Change serialize to produce same shape rank as original

* Refactor TilingExtension

* Port PULPOpen

* Port Snitch

* DeeployTest: Extract generic tiling code into tilingUtils.py

* DeeployTest: Extract common test generation code

* DeeployTest: Add Dma tests

* Apply Philip's comments

Remove dory_dma.h

Fix hoistReference doc comment

Use the shape argument of the _hoistReference function

Rename dma test runners

Change kernelLevelTiling HACK comment to a TODO

Add DMA folder to targets with DMAs

Fix wrong deeployStateDir

Single source of truth for the tiling arena name

* Add unravelReference doc comment and fix the dealiasBuffer's

* Refactor type inference and minimal(Integer|Float)Type

* Revert extra inputs hack

* Add mchan check for both event- and poll-based event checking flags being set

* Fix HyperRectangle arg order

* Fix mchan check whether size is representable within 17 bits

* Fix init, deinit, wait on initialFuture in DoubleBuffering, rename gen to anydimAdapter

* Fix GEMM tile constraint serialization to check transA and transB

* Fix inherit from ABC in AsyncDma and AsyncDmaWaitingStrategy

* Fix use tileSizeInBytes to check whether it fits in memory

* Update changelog

* Add missing transferOpRepr abstract method from the BlockingAsyncDmaAdapter
@coderabbitai coderabbitai bot mentioned this pull request Oct 30, 2025
5 tasks
@coderabbitai coderabbitai bot mentioned this pull request Nov 26, 2025
5 tasks
@coderabbitai coderabbitai bot mentioned this pull request Dec 10, 2025
5 tasks
Labels

Feature Addition of new features

Projects

Status: Done
