Refactor tiling code generation #105

lukamac · 2025-07-08T13:41:13Z

The main contribution of this PR is abstracting the DMA logic into the AsyncDma class and showing that it works with the currently supported DMAs: PULP's mchan and L3 DMA, and Snitch's cluster DMA.
The main goal was to enable transfers of any dimensionality, which is implemented by the AnydimAsyncDmaTransferAdapter.

Performance comparison

Comparison of network execution on Siracusa emulated in GVSoC

Network	Conf	L1 size	Cycles before	Cycles after	Diff	Perc. diff
Simple regression	SB L2	64k	636738	637758	1020	0.2%
Simple regression	SB L3	64k	1007791	989804	-17987	-1.8%
Simple regression	DB L2	64k	648051	643213	-4838	-0.7%
Simple regression	DB L3	64k	954903	988656	33753	3.5%
MobileNetV2	SB L3	64k	Not runnable	198539808	n/a	n/a
MobileNetV2	DB L3	64k	Not runnable	194710802	n/a	n/a
miniMobileNetV2	SB L2	16k	120580	110003	-10577	-8.8%
miniMobileNetV2	SB L3	16k	392763	361312	-31451	-8.0%
miniMobileNetV2	DB L2	16k	122368	110997	-11371	-9.3%
miniMobileNetV2	DB L3	16k	362411	361283	-1128	-0.3%
microLlama/microLlama1	SB L2	10k	657015	519142	-137873	-21.0%
microLlama/microLlama1	SB L3	10k	4461938	3977350	-484588	-10.9%
microLlama/microLlama1	DB L2	10k	707464	542643	-164821	-23.3%
microLlama/microLlama1	DB L3	10k	3913082	4047015	133933	3.4%
CCT/CCT_1_16_16_8	SB L2	64k	493469	485747	-7722	-1.6%
CCT/CCT_1_16_16_8	SB L3	64k	1255862	1219939	-35923	-2.9%
CCT/CCT_1_16_16_8	DB L2	64k	513180	504340	-8840	-1.7%
CCT/CCT_1_16_16_8	DB L3	64k	1227566	1213998	-13568	-1.1%
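
The Diff and Perc. diff columns follow from the two cycle counts. A minimal sketch of that arithmetic, assuming the percentage is taken relative to the "before" count (the helper name `cycle_diff` is illustrative and not part of Deeploy):

```python
from typing import Tuple


def cycle_diff(before: int, after: int) -> Tuple[int, float]:
    """Return (absolute difference, percentage difference relative to `before`)."""
    diff = after - before
    return diff, 100.0 * diff / before


# Two rows taken from the table above.
for name, before, after in [
    ("Simple regression SB L2", 636738, 637758),
    ("microLlama1 DB L2", 707464, 542643),
]:
    diff, perc = cycle_diff(before, after)
    print(f"{name}: {diff:+d} cycles ({perc:+.1f}%)")
```

A negative value means the refactored code needs fewer cycles than before.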

I also wanted to test Snitch, but the program only prints 0 cycles. From a visual check of the code, there is no fundamental change except that we now emit fewer barriers.

Added

AsyncDma abstraction of DMAs
test runner per DMA and a script that tests all the DMAs
generic Single/DoubleBufferingTilingCodeGeneration classes
TilingVariableReplacementUpdate class that updates the variable replacement refs
TilingHoistingMixIn class that encapsulates all the hoisting helper functions of tiling
sorting of input memory allocations to allow references that live in the same memory level as the memory they are referencing
a function that tests the tiling solution for correctness which currently only tests buffer allocation for byte alignment
IntrospectiveCodeTransformation: _indexPointer(), indexVars(), dereferenceVars(). The *Vars functions index/dereference a list of variables (useful for tiling)
NetworkContext: unravelReference() that unravels a _ReferenceBuffer until the base buffer
NetworkContext: is_object() - helper function that determines whether the string represents a name of a local or global object
NetworkContext: is_buffer() - helper function that determines whether the string represents a name of a buffer
missing checks for environment variables
_permuteHyperRectangle helper function
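
The correctness test in the list above currently only checks byte alignment of the allocated buffers. A sketch of what such a check might look like; the function name, the `(name, address)` input format and the 4-byte default are assumptions made for illustration, not the PR's actual code:

```python
from typing import Iterable, List, Tuple


def misaligned_buffers(allocations: Iterable[Tuple[str, int]], alignment: int = 4) -> List[str]:
    """Return the names of buffers whose start address is not a multiple of `alignment`."""
    return [name for name, addr in allocations if addr % alignment != 0]


# Hypothetical allocation result: buffer name and start address in bytes.
allocs = [("input_0", 0), ("weights", 64), ("scratch", 130)]
print(misaligned_buffers(allocs))  # ['scratch']
```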

Changed

mchan HAL is now reduced to bare-bones
refactor of the IntrospectiveCodeTransformation to work on the Mako template and made imo clearer
refactor of memory allocation code transformation passes
_ReferenceBuffer accepts an optional offset argument to offset the reference
NetworkContext: hoistReference - accepts the actual buffer as reference instead of name, accepts shape, offset, and override_type arguments, and returns the actual buffer, not its name
_mangleNodeRep -> _mangleOpRepr - the canonical way we use is OperatorRepresentation. NodeRep and ParseDict are old iterations of that.
rename of permutation functions to follow this convention: permute is an action that permutes something, permutation is a function that generates a permutation
_permuteList to just _permute
removed manual buffer name mangling since we do it in the ExecutionBlock generate() function, simplifies templates
we now check that buffer shapes/hyperrectangles/tiling ranks match which required changing a few serializeTilingSolution functions to preserve the same shape rank
big refactor of the code generation part of the TilingExtension
port of PULPOpen tiling code generation
port of Snitch tiling code generation
PULPClusterTilingSB and PULPClusterTilingDB now allow for transfers of any rank (dimensionality)
PULP's final output diff is now calculated as absolute error, instead of just subtraction
common code generation code between testMVP/generateNetwork/... was extracted into a single generateTestNetwork function
in some functions, instead of passing the name of a buffer, the actual buffer is just passed
tile function allows overriding the optimizer with external tilingSolution and memoryMap
refactor of the permutation functions for clarity
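
The naming convention for the permutation helpers (a "permute" function applies a permutation, a "permutation" function generates one) can be illustrated as follows. Both function names and the NCHW-to-NHWC example are hypothetical and only mirror the convention, not the actual signatures in LoweringOptimizationPasses.py:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def permute(values: Sequence[T], perm: Sequence[int]) -> List[T]:
    """Action: reorder `values` so that result[i] == values[perm[i]]."""
    assert sorted(perm) == list(range(len(values))), "perm must be a permutation of the indices"
    return [values[i] for i in perm]


def nchw_to_nhwc_permutation(rank: int) -> List[int]:
    """Generator: build the permutation that moves the channel axis (1) to the end."""
    return [0, *range(2, rank), 1]


shape = [1, 16, 32, 32]                      # N, C, H, W
perm = nchw_to_nhwc_permutation(len(shape))  # [0, 2, 3, 1]
print(permute(shape, perm))                  # [1, 32, 32, 16]
```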

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR is reviewed and approved.
All checks are passing.
The CHANGELOG.md file has been updated.
If the docker was modified, change back its link after review.

Xeratec · 2025-08-23T09:10:33Z

@coderabbitai summary

coderabbitai · 2025-08-23T09:10:47Z

✅ Actions performed

Summary regeneration triggered.

coderabbitai · 2025-08-23T09:14:14Z

Summary by CodeRabbit

New Features
- Async DMA support across targets (L3, MCHAN, Snitch) with single/double-buffer tiling.
- Enhanced tiling/memory pipeline, improved IO discovery, and minimal type helpers.
Refactor
- Major DMA-aware tiling/codegen overhaul; new single/double-buffer generators.
- Updated permutation utilities and template/introspection flow.
- Removed legacy DMA headers/paths; unified offset/rectangle utilities.
Tests
- New DMA test runners and comprehensive matrix harness.
- Unified test network generator; revamped type-mapping.
- Stricter env checks and absolute diff reporting.
Chores
- Added CI job to run DMA tests.
- Updated platform includes and build guards.

Walkthrough

Adds DMA-aware tiling and codegen infrastructure (AsyncDma, single/double buffering), integrates DMA in PULP/Snitch pipelines, refactors tiling/memory mapping APIs, revises permutation/minimization utilities, updates reference/aliasing semantics, adjusts multiple templates to avoid name mangling, replaces/updates PULP DMA headers, introduces new DMA test runners and CI job, and updates tests for new type-mapping API.

Changes

Cohort / File(s)	Summary
CI `.github/workflows/CI.yml`	Adds job `deeploy-test-dmas` to install package and run `DeeployTest/testDmas.py` on dynamic runner/image.
Closure generation `Deeploy/CommonExtensions/CodeTransformationPasses/Closure.py`	Tweaks dynamic reference extraction (named arg `unrollStructs=True`) and removes dedup loops during closure struct gen.
Introspective transformation refactor `Deeploy/CommonExtensions/CodeTransformationPasses/IntrospectiveCodeTransformation.py`	Switches NodeTemplate→Template flow; compiles source via codegen; adds pointer index/deref helpers; revises dynamic-expr extraction signatures and typing.
Memory allocation pass refactor `Deeploy/CommonExtensions/CodeTransformationPasses/MemoryAllocation.py`	Reworks to memory-level buffers, renames ctor arg, adds static classifiers/topo sort, simplifies passthrough; updates imports/types.
Data types helpers `Deeploy/CommonExtensions/DataTypes.py`	Adds `minimalIntegerType` and `minimalFloatType`; typing imports updated.
Permutation utilities and usages `Deeploy/CommonExtensions/OptimizationPasses/TopologyOptimizationPasses/LoweringOptimizationPasses.py`	Introduces generic `_permute`, `_permuteHyperRectangle`, renames permutation helpers, tightens typings; updates call sites.
Types and references `Deeploy/DeeployTypes.py`	Overhauls `_ReferenceBuffer` (offsets), alias resolution, reference hoisting API, object/buffer checks, IO discovery, and op-repr mangling rename.
Engine coloring message `Deeploy/EngineExtension/NetworkDeployers/EngineColoringDeployer.py`	Error now lists uncolored node names and operations.
Name mangling removals (targets) `Deeploy/Targets/CortexM/Templates/CMSISUtils.py`, `Deeploy/Targets/Generic/Templates/DebugPrintTemplate.py`, `.../ITAMaxTemplate.py`, `Deeploy/Targets/MemPool/Templates/ITAMaxTemplate.py`, `.../ITATemplate.py`, `.../GemmTemplate.py`, `.../RQGemmTemplate.py`, `.../RQMatMulTemplate.py`	Replace `ctxt._mangle(...)` with raw names or capture returned transient buffer names where applicable.
Generic tiling constraints updates `Deeploy/Targets/Generic/TileConstraints/TransposeTileConstraint.py`, `.../iRMSNormTileConstraint.py`	Use `_permuteHyperRectangle`; simplify schedule serialization; construct weight cube directly.
Neureka tile constraints offset API `Deeploy/Targets/Neureka/TileConstraints/Neureka{Dense,Depthwise,Pointwise}Constraint.py`	Switch to `calculateFlatOffsetInBytes` for weight offsets.
PULP bindings and platform `Deeploy/Targets/PULPOpen/Bindings.py`, `.../Platform.py`	Integrates L3/MCHAN DMA, adds variable-replacement update pass, reorders pipeline, adds L3 memory generation; replaces `dory_dma.h` include with `mchan_siracusa.h`.
PULP cluster/L3 tiling refactor `Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py`, `.../PULPL3Tiling.py`	New constructors `(externalMemory, localMemory, dma)`; adopt new SB/DB base classes; inline class variants; sequential SB/DB in Snitch analog.
Removed legacy PULP tiling modules `Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTilingSB.py`, `.../PULPClusterTilingDB.py`, `.../PULPL3TilingSB.py`, `.../PULPL3TilingDB.py`	Delete old SB/DB tiling codegen modules.
PULP DMA implementations (new) `Deeploy/Targets/PULPOpen/DMA/L3Dma.py`, `.../MchanDma.py`	Add AsyncDma drivers with futures, templates, checks, and blocking adapter for L3; MCHAN 1D/2D commands and waiting strategy.
PULP auto-transpose DMA `Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py`	Rework to `_permuteHyperRectangle`, `minimizeRectangle`; adjust stride/shape handling and return values.
Snitch bindings and tiling `Deeploy/Targets/Snitch/Bindings.py`, `.../CodeTransformationPasses/SnitchClusterTiling.py`, `.../DMA/SnitchDma.py`, `.../CodeTransformationPasses/SnitchClusterTilingSB.py`	Add Snitch DMA; refactor tiling to SB/DB classes with `(externalMemory, localMemory, dma)`; remove old SB module.
Tiling core: DMA-based codegen (new) `Deeploy/TilingExtension/AsyncDma.py`, `.../CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py`, `.../DoubleBufferingTilingCodeGeneration.py`	Introduce AsyncDma/Future/primitives, blocking/anydim adapters, and SB/DB tiling codegen passes using DMA futures.
Tiling codegen refactor `Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py`	Rework to DMA-centered transfers, new init signature, helpers, multi-schedule support, and meta info handling.
Tiling hoisting mixin (new) `Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py`	Add `dictOfArrays`, hoisting/prefix utilities, multi-buffer reference hoisting, tile count/idx hoisting.
Tiling variable replacement refactor `Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py`	Arena-based allocations, updated apply flow, adds `TilingVariableReplacementUpdate`.
Tiling prototypes meta changes `Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py`	`TilingMetaInfo` fields updated (`numTiles`→str, add `totalNumTiles`, `tileIdxPtr`); adjust measurement arrays and loops.
Tiling utilities API changes `Deeploy/TilingExtension/TilingCodegen.py`	Replace minimization/offset APIs; add pad/stride/flat offset helpers; new `computeTileHyperRectangles`.
Tiler extension & scheduler `Deeploy/TilingExtension/TilerExtension.py`, `.../MemoryScheduler.py`, `.../TileConstraint.py`, `.../MemoryConstraints.py`	Separate memory map from tiling solution; add validation/annotation; use `_permute`; switch to `computeTileHyperRectangles`; broaden shape typing.
Tests: new DMA matrix and runners `DeeployTest/testDmas.py`, `.../testRunner_siracusa_mchandma.py`, `.../testRunner_siracusa_l3dma.py`, `.../testRunner_snitch_dma.py`	Add runners for MCHAN/L3/Snitch DMA with pipelines; `testDmas.py` iterates configurations and launches runners.
Tests: type-mapping API change `DeeployTest/testUtils/typeMapping.py`, usages in `DeeployTest/*`	Replace `inferInputType` with `inferTypeAndOffset`; add helpers (minimal type, dtype mapping). Update callers across tests.
Tests: code generation API `DeeployTest/testUtils/codeGenerate.py`, usages in `DeeployTest/*`	Consolidate generation into `generateTestNetwork`; update headers/impl generation signatures and verbosity handling.
Tests: tiling utils `DeeployTest/testUtils/tilingUtils.py`	Add `DBOnlyL3Tiler`, `DBTiler`, `SBTiler` with `multiBufferStrategy`.
Tests: platform mapping typing `DeeployTest/testUtils/platformMapping.py`	Strengthen `inputTypes` to `Dict[str, Type[Pointer]]`.
Tests: minor updates `DeeployTest/Platforms/Siracusa/src/deeploytest.c`, `.../testUtils/testRunner.py`, `.../testMVP.py`, `.../generateNetwork.py`, `.../testSlice_PULP.py`, `.../testSchedulingExtension.py`, `.../testPrintInputOutputTransformation.py`, `.../deeployStateEqualityTest.py`, `.../testTilerExtension.py`	Adjust diff calc to absolute; assert LLVM env; switch to new type-mapping and generation APIs; update flows and logs.
Target libraries (PULP) `TargetLibraries/PULPOpen/inc/mchan_siracusa.h` (add), `.../inc/mchan_v6.h` (add), `.../inc/mchan_v7.h` (add), `.../inc/dory_dma.h` (del), `.../inc/mchan.h` (del), `.../src/dory_dma.c` (del)	Replace DORY DMA with MCHAN v6/v7 headers and Siracusa config; remove dory DMA sources/headers.
Build guards `TargetLibraries/PULPOpen/CMakeLists.txt`, `cmake/snitch/snitch.cmake`	Fix typo; add fatal guard for `SNITCH_HOME` env.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Deployer
  participant TilingPass as TilingCodeGeneration (SB/DB)
  participant Adapter as AnydimAsyncDmaTransferAdapter
  participant DMA as AsyncDma
  participant Future

  User->>Deployer: apply()
  Deployer->>TilingPass: generateTilingLoop(tilingSchedules)
  TilingPass->>Adapter: transfer(external, local, shape, strides, dir, future)
  alt kernelRank == transferRank
    Adapter->>DMA: transfer(..., future)
  else kernelRank < transferRank
    Adapter->>Adapter: emit nested loops + offset ptrs
    Adapter->>DMA: transfer(inner-shape,..., future)
  else kernelRank > transferRank
    Adapter->>DMA: transfer(padded-shape,..., future)
  end
  DMA->>Future: get/init/wait/deinit (via waiting strategy)
  TilingPass-->>Deployer: ExecutionBlock (tile loop + DMA calls)
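
The rank handling in the diagram above can be sketched in plain Python. This is only an illustration of the idea (loop over the outer dimensions when the transfer has more dimensions than the DMA supports, pad with ones when it has fewer); `split_transfer` is a hypothetical helper, strides are assumed to be in elements, and the real adapter emits C code rather than computing offsets at generation time:

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def split_transfer(shape: Sequence[int], strides: Sequence[int],
                   kernel_rank: int) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Yield (flat element offset, inner shape) for each kernel-rank sub-transfer."""
    if len(shape) <= kernel_rank:
        # Fewer (or equal) dimensions than the DMA handles: pad with leading ones.
        yield 0, (1,) * (kernel_rank - len(shape)) + tuple(shape)
        return
    outer = len(shape) - kernel_rank
    inner_shape = tuple(shape[outer:])
    # One sub-transfer per index combination of the outer dimensions.
    for idx in product(*(range(d) for d in shape[:outer])):
        yield sum(i * s for i, s in zip(idx, strides)), inner_shape


# A 3D transfer on a DMA that natively supports 2D transfers.
for offset, inner in split_transfer((2, 3, 4), (12, 4, 1), kernel_rank=2):
    print(offset, inner)
```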

sequenceDiagram
  autonumber
  participant Pipeline as PULPOpen Bindings Pipeline
  participant TVR as TilingVariableReplacement
  participant TVRUpd as TilingVariableReplacementUpdate
  participant Cluster as PULPClusterTiling("L2","L1", MchanDma)
  participant L3 as PULPL3Tiling("L3","L2", l3DmaHack)
  participant MemGen as MemoryManagementGeneration

  Pipeline->>TVR: apply(arena alloc, tiled refs)
  TVR->>TVRUpd: apply(update refs per tile)
  TVRUpd->>Cluster: apply(SB/DB tiling with DMA)
  Cluster->>L3: apply(L3 tiling with DMA)
  L3->>MemGen: manage L2/L3/default allocations

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

[DeeployTest] Change order of typeMatching entries #68 — Adjusts integer type selection/order; intersects with this PR’s new minimal type inference and type-mapping refactor in DeeployTest/testUtils/typeMapping.py.

Suggested reviewers

Victor-Jung
Xeratec

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title “Refactor tiling code generation” succinctly captures the primary focus of the changeset, highlighting the core architectural overhaul of the TilingExtension and related DMA abstractions without extraneous detail, and is clear and specific enough for a reviewer to understand the main intent at a glance.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.
Description Check	✅ Passed	The pull request description clearly outlines the core contributions—such as the AsyncDma abstraction, Any-dimensional DMA adapter, and refactored tiling code generation—corresponding directly to the detailed changes presented in the summary, and it provides relevant performance data and lists of added and modified components that match the files changed in the PR.

Xeratec · 2025-08-25T09:13:44Z

@coderabbitai Provide further details about the modification to the tiling extension (All the files under Deeploy/TilingExtension/)

coderabbitai · 2025-08-25T09:14:49Z

@Xeratec I'll provide a comprehensive analysis of the tiling extension modifications. This represents a major architectural refactor of the entire tiling system.

Major Architectural Changes

1. Async DMA Framework Introduction (`AsyncDma.py`)

New Module: Introduces a complete asynchronous DMA transfer framework
Key Components:
- AsyncDma base class with transfer validation and rank-based templates
- Future abstraction for async operations with init/deinit/wait lifecycle
- Waiting strategies: PerTensorWaitingStrategy and TensorGroupWaitingStrategy
- AnydimAsyncDmaTransferAdapter for arbitrary dimensional transfers using nested loops
- BlockingDmaFromAsyncDmaAdapter to bridge async/blocking semantics

2. Tiling Code Generation Restructure

The core tiling generation has been completely rewritten with three new classes:

Base Class Refactor (`TilingCodeGeneration.py`)

From: Static, single-memory path tiling
To: DMA-driven, tile-aware, hoisted codegen with robust memory constraints
Key Changes:
- Now inherits from TilingHoistingMixIn for memory management
- Constructor signature: __init__(externalMemory, localMemory, dma, bufferCount)
- generateTilingLoop now accepts List[TilingSchedule] instead of single schedule
- Uses TensorMemoryConstraint instead of NodeMemoryConstraint
- Introduces transfer legalization and arbitrary-dimensional support

Single Buffering (`SingleBufferingTilingCodeGeneration.py`)

New Class: Extends TilingCodeGeneration for single-buffer tiling
Features:
- DMA-based transfer scheduling between external/local memories
- Future-based synchronization for ingress/egress paths
- Per-tile code generation with dynamic operator representation updates
- Integrated shape validation and transfer legalization

Double Buffering (`DoubleBufferingTilingCodeGeneration.py`)

New Class: Implements sophisticated double-buffering strategy
Features:
- Two-buffer scheme with modular arithmetic-based selection (TILING_I % 2)
- Simultaneous input/output transfers with async DMA coordination
- Buffer-chaining logic with future synchronization
- Integrated setup/teardown with DMA lifecycle management
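
To make the modular buffer selection concrete, here is a small simulation of the input side of a double-buffered tile loop. It is a sketch under assumptions: the event ordering and the function name are illustrative, output transfers are left out, and the generated code is C, not Python:

```python
from typing import List


def double_buffer_schedule(num_tiles: int) -> List[str]:
    """Return the order of events for a double-buffered tile loop (inputs only)."""
    events: List[str] = []
    if num_tiles == 0:
        return events
    events.append("load tile 0 -> buf 0")  # prologue: prefetch the first tile
    for i in range(num_tiles):
        buf = i % 2  # buffer holding the current tile
        if i + 1 < num_tiles:
            # Start fetching the next tile into the other buffer before computing.
            events.append(f"load tile {i + 1} -> buf {(i + 1) % 2}")
        events.append(f"wait tile {i} in buf {buf}")
        events.append(f"compute tile {i} from buf {buf}")
    return events


for event in double_buffer_schedule(3):
    print(event)
```

Because the load of tile i+1 is issued before the kernel for tile i runs, the transfer can overlap with computation as long as the DMA is asynchronous.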

3. Memory Management Revolution

Hoisting Infrastructure (`TilingHoistingMixIn.py`)

New Mixin: Provides comprehensive memory-hoisting utilities
Key Features:
- _hoistValues(): Integer sequences as ConstantBuffers with minimal type selection
- _hoistReference(): VariableBuffer references with shape/offset support
- _hoistTileNumAndIdxPtr(): Tile counting and indexing infrastructure
- _hoistOpReprUpdates(): Dynamic operator representation management
- _hoistMultibufferReferences(): Double-buffering setup with validation
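
"Minimal type selection" for hoisted integer sequences can be pictured as choosing the narrowest C integer type that holds every value. The sketch below is an assumption about the idea only; `minimal_integer_type` is a hypothetical stand-in and does not reproduce the signature or behavior of the `minimalIntegerType` helper added in DataTypes.py:

```python
from typing import Sequence


def minimal_integer_type(values: Sequence[int]) -> str:
    """Return the name of the narrowest C integer type that can hold every value."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if lo >= 0:
            if hi < 2 ** bits:
                return f"uint{bits}_t"
        elif lo >= -(2 ** (bits - 1)) and hi < 2 ** (bits - 1):
            return f"int{bits}_t"
    raise ValueError("values do not fit into 64 bits")


print(minimal_integer_type([0, 12, 255]))    # uint8_t
print(minimal_integer_type([0, 70000]))      # uint32_t
print(minimal_integer_type([-1, 100]))       # int8_t
```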

Arena-Based Memory (`TilingVariableReplacement.py`)

Deep Refactor: From pointer-level dereference to arena-based allocations
New Features:
- Arena-based allocation via _arenaAllocate()
- Explicit variable tiling with TilingVariableReplacementUpdate class
- Modular handling of transients, variable replacements, and tiled tensors
- Template-based update generation for post-tiling variable updates

4. Profiling and Instrumentation Overhaul (`TilingPrototypes.py`)

Unified Profiling: Converts per-event timing to unified measurement arrays
Key Changes:
- TilingMetaInfo gains totalNumTiles: int and tileIdxPtr: str
- numTiles changes from int to str for macro-driven access
- Measurement storage becomes parameter-driven with single arrays
- Loop indexing becomes macro-driven with string-typed tile counts

5. Shape and Stride Management (`TilingCodegen.py`)

Rectangle Minimization API: minimizeRectangleDims() → minimizeRectangle()
- From buffer-centric to shape-oriented approach
- Returns (HyperRectangle, Tuple[int, ...]) instead of two HyperRectangles
New Utilities:
- padShape(), padOffset(), padStride() for rank normalization
- stridesFromShape(), calculateFlatOffset() for memory layout calculations
- computeTileHyperRectangles() for tile generation with validation
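
As an illustration of what rank normalization and layout helpers of this kind compute, here is a minimal sketch assuming row-major layout and element (not byte) units. The snake_case functions are hypothetical stand-ins and do not reproduce the actual signatures in TilingCodegen.py:

```python
from typing import Sequence, Tuple


def strides_from_shape(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major element strides: the last dimension is contiguous."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def flat_offset(offset: Sequence[int], shape: Sequence[int]) -> int:
    """Element offset of a multi-dimensional index inside a buffer of `shape`."""
    return sum(o * s for o, s in zip(offset, strides_from_shape(shape)))


def pad_shape(shape: Sequence[int], rank: int) -> Tuple[int, ...]:
    """Left-pad a shape with ones up to `rank` dimensions."""
    assert rank >= len(shape), "cannot pad to a smaller rank"
    return (1,) * (rank - len(shape)) + tuple(shape)


print(strides_from_shape((2, 3, 4)))      # (12, 4, 1)
print(flat_offset((1, 2, 3), (2, 3, 4)))  # 23
print(pad_shape((3, 4), 4))               # (1, 1, 3, 4)
```

Multiplying the flat element offset by the element width in bytes gives a byte offset.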

6. Dual-Path Tiling Workflow (`TilerExtension.py`)

Manual Override Support: New MemoryMap type alias and dual-path workflow
API Changes:
- computeTilingSchedule() returns only TilingSolution
- New computeMemoryMap(), annotateMemoryLevel(), testTilingSolutionCorrectness()
- tile() method supports manual tilingSolution and memoryMap override
Validation: Built-in correctness testing for buffer alignment and memory constraints

7. Enhanced Type Safety and Constraints

Memory Constraints (MemoryConstraints.py): Shape annotation updated to Tuple[int, ...] for multi-dimensional support
Tile Constraints (TileConstraint.py): Updated to use computeTileHyperRectangles() and enhanced memory path validation
Memory Scheduler (MemoryScheduler.py): Migrated to new _permute() function for type-safe permutations

Impact and Benefits

Performance

Async DMA enables overlapped computation and data movement
Double buffering maximizes memory bandwidth utilization
Optimized transfer scheduling reduces idle time

Flexibility

Support for arbitrary dimensional transfers (1D, 2D, 3D, N-D)
Pluggable DMA backends with different capabilities
Manual tiling override for custom optimization strategies

Maintainability

Modular architecture with clear separation of concerns
Type-safe interfaces with comprehensive validation
Unified profiling and instrumentation framework

Scalability

Arena-based memory management for efficient allocation
Hoisting infrastructure for optimal memory usage
Support for complex multi-level memory hierarchies

This represents a complete modernization of the tiling system, moving from a basic static approach to a sophisticated, async, multi-buffered, and highly configurable framework suitable for complex heterogeneous computing scenarios.

Xeratec

@lukamac, I reviewed all your changes except for the refactoring of the TilingExtension. There are a lot of changes, and I am not super familiar with the previous code. It is clear that the newer code is nicer and better to understand, but I am struggling with the changes in TilingCodeGeneration.py and TilingVariableReplacement.py.

I would prefer an offline walkthrough through the remaining changes.

TargetLibraries/PULPOpen/src/dory_dma.c

Deeploy/CommonExtensions/DataTypes.py

Deeploy/DeeployTypes.py

Deeploy/Targets/Generic/TileConstraints/iRMSNormTileConstraint.py

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

DeeployTest/testRunner_L3Dma.py

Deeploy/Targets/Snitch/SnitchDma.py

Deeploy/Targets/PULPOpen/Dmas/L3Dma.py

DeeployTest/testUtils/dmaUtils.py

Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py

coderabbitai

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

Deeploy/CommonExtensions/DataTypes.py (1)
111-116: Fix type annotation mismatch: return a tuple, not a list.

sorted(...) returns a list; annotation declares a Tuple. Wrap with tuple(...).
-IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = (sorted((
-    *SignedIntegerDataTypes,
-    *UnsignedIntegerDataTypes,
-),
-                                                               key = lambda _type: _type.typeWidth))
+IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = tuple(sorted(
+    (*SignedIntegerDataTypes, *UnsignedIntegerDataTypes),
+    key = lambda _type: _type.typeWidth
+))
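
A standalone illustration of the mismatch, using plain integers instead of the Deeploy type classes:

```python
from typing import Tuple

# sorted() always returns a list, whatever the input container is.
print(type(sorted((32, 8, 16))).__name__)  # list

# Wrapping the result in tuple() makes the value match the Tuple annotation.
widths: Tuple[int, ...] = tuple(sorted((32, 8, 16, 8)))
print(widths)  # (8, 8, 16, 32)
```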
DeeployTest/testUtils/codeGenerate.py (1)

37-50: Remove buffer shape mutation and flatten outputs in _shapeBroadcast
In _shapeBroadcast, drop ctxt.lookup(f"{name}").shape = broadcastNum.shape and change the else branch to broadcastNum = value.reshape(-1) so that both global and local paths emit 1-D data without mutating the buffer’s stored shape. Retain ctxt.is_global for the global check.
Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)
177-178: Python slicing bug with -0 yields empty tuple.

When len(fixedPermStrides) == 0, rect.dims[:-0] is empty. Return full dims in that case.
-    return struct, remainderStrides, rect.dims[:-len(fixedPermStrides)]
+    drop = len(fixedPermStrides)
+    return struct, remainderStrides, rect.dims[:-drop] if drop > 0 else rect.dims
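
The slicing pitfall can be reproduced in isolation. The helper name `drop_trailing` is illustrative only:

```python
from typing import Tuple


def drop_trailing(dims: Tuple[int, ...], drop: int) -> Tuple[int, ...]:
    """Drop the last `drop` entries; `drop == 0` keeps everything."""
    return dims[:-drop] if drop > 0 else dims


dims = (4, 8, 16)
print(dims[:-0])               # () because -0 == 0, so this is dims[:0]
print(drop_trailing(dims, 0))  # (4, 8, 16)
print(drop_trailing(dims, 1))  # (4, 8)
```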
DeeployTest/testMVP.py (1)
272-279: Bug: signProp is hard-coded to False, disabling offsetting for sign-prop platforms

This disables the input/output offset path below, likely breaking platforms that rely on sign propagation (and deployers that don’t accept inputOffsets). Compute signProp from platform to keep behavior consistent with setupDeployer().

Apply this diff:
-    platform = deployer.Platform
-    signProp = False
+    platform = deployer.Platform
+    # Keep main() consistent with setupDeployer(): derive platform sign-prop preference
+    _, signProp = mapPlatform(args.platform)

♻️ Duplicate comments (43)

Deeploy/Targets/MemPool/Templates/ITATemplate.py (1)

325-326: Sanitize nodeName before interpolating into C identifiers (prevents invalid/broken codegen).

If nodeName contains '/', '-', '.', spaces, or starts with a digit, the generated identifiers will be invalid C and compilation will fail. Use a C-safe variant when forming ${data_in_array} and ${quant_params_array}.

Apply this diff here:
-        operatorRepresentation['data_in_array'] = f"{nodeName}_data_in_array"
-        operatorRepresentation['quant_params_array'] = f"{nodeName}_quant_params_array"
+        operatorRepresentation['data_in_array'] = f"{nodeName_safe}_data_in_array"
+        operatorRepresentation['quant_params_array'] = f"{nodeName_safe}_quant_params_array"
Add the helper once at module scope and derive nodeName_safe near its first use:
@@
-from typing import Dict, Tuple
+from typing import Dict, Tuple
+import re
 import numpy as np
@@
+def _c_ident(s: str) -> str:
+    s = re.sub(r'[^A-Za-z0-9_]', '_', str(s))
+    if not re.match(r'[A-Za-z_]', s):
+        s = f'_{s}'
+    return s
@@
-        nodeName = operatorRepresentation['nodeName']
+        nodeName = operatorRepresentation['nodeName']
+        nodeName_safe = _c_ident(nodeName)
Optional: consider using nodeName_safe consistently for other identifier-like names built from nodeName in this file for uniform safety.

DeeployTest/testUtils/testRunner.py (2)

324-336: Verify CI/export passes this new check.

Ensure workflows set LLVM_INSTALL_DIR or pass --toolchain_install_dir wherever TestRunner is used.

#!/bin/bash
# Find TestRunner invocations and check if toolchain dir is provided or env is set.
rg -nP '\bpython\s+.*testRunner.*\.py\b' .github/workflows -C3 || true
rg -n "LLVM_INSTALL_DIR" .github/workflows || true

325-327: Replace assert with explicit validation; handle toolchain and path.

Asserts are stripped with -O and Ruff flags the f-string. Validate deterministically and check existence.

-        assert self._args.toolchain_install_dir is not None, f"Environment variable LLVM_INSTALL_DIR is not set"
-        self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+        # Validate toolchain install dir (LLVM only). Avoid asserts in production.
+        if self._args.toolchain.upper() == "LLVM":
+            if not self._args.toolchain_install_dir:
+                raise ValueError("Missing toolchain install dir: set --toolchain_install_dir or LLVM_INSTALL_DIR.")
+            self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+            if not os.path.isdir(self._dir_toolchain):
+                raise FileNotFoundError(f"Toolchain directory not found: {self._dir_toolchain}")
+        else:
+            self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir) if self._args.toolchain_install_dir else ""

Deeploy/Targets/Generic/Templates/ITAMaxTemplate.py (1)

45-49: Use hoisted buffer’s actual name.

hoistTransientBuffer may mangle/alter the name; store and use the returned buffer name.

-        ctxt.hoistTransientBuffer(name, size)
-        operatorRepresentation['ctxtBuffer'] = name
+        buf = ctxt.hoistTransientBuffer(name, size)
+        operatorRepresentation['ctxtBuffer'] = getattr(buf, "name", name)
         operatorRepresentation['ctxtBufferSize'] = size
-
-        return ctxt, operatorRepresentation, [name]
+        return ctxt, operatorRepresentation, [operatorRepresentation['ctxtBuffer']]

DeeployTest/Platforms/Siracusa/src/deeploytest.c (1)

172-179: Avoid unsigned wraparound; widen before subtraction and fix printf casts.

Compute error in wider signed type; print with matching formats to avoid UB.

-        OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
-        OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
-        int error = expected - actual;
-        OUTPUTTYPE diff = (OUTPUTTYPE)(error < 0 ? -error : error);
+        OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
+        OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
+        int32_t error = (int32_t)expected - (int32_t)actual;
+        uint32_t diff = (uint32_t)(error < 0 ? -error : error);
 
         if (diff) {
           tot_err += 1;
-          printf("Expected: %4d  ", expected);
-          printf("Actual: %4d  ", actual);
-          printf("Diff: %4d at Index %12u in Output %u\r\n", diff, i, buf);
+          printf("Expected: %4d  ", (int)expected);
+          printf("Actual: %4d  ", (int)actual);
+          printf("Diff: %4u at Index %12u in Output %u\r\n", diff, (unsigned)i, (unsigned)buf);
         }
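
The wraparound concern can be modeled in Python by reducing the subtraction modulo 2^32, which is how C defines arithmetic on 32-bit unsigned operands. This is a sketch for illustration only; the function names are hypothetical, and for element types narrower than `int` C's integer promotion already widens the operands before subtracting:

```python
def unsigned_sub(expected: int, actual: int, bits: int = 32) -> int:
    """Model `expected - actual` on unsigned `bits`-wide operands (wraps modulo 2**bits)."""
    return (expected - actual) % (1 << bits)


def abs_error_widened(expected: int, actual: int) -> int:
    """Widen first, subtract as signed values, then take the magnitude."""
    error = expected - actual
    return -error if error < 0 else error


print(unsigned_sub(3, 5))       # 4294967294: the difference wrapped around
print(abs_error_widened(3, 5))  # 2
```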

Deeploy/EngineExtension/NetworkDeployers/EngineColoringDeployer.py (1)

65-70: Replace assert with deterministic exception; treat empty/None engine as uncolored.

Asserts can be stripped; raise a stable, informative error.

-        uncoloredNodes = [node for node in graph.nodes if "engine" not in node.attrs]
-        uncoloredOperations = set(node.op for node in uncoloredNodes)
-        assert len(
-            uncoloredNodes
-        ) == 0, f"Missing engine color for nodes {[node.name for node in uncoloredNodes]} with operations {uncoloredOperations}"
+        uncoloredNodes = [n for n in graph.nodes if not isinstance(n.attrs.get("engine"), str) or not n.attrs.get("engine")]
+        if uncoloredNodes:
+            nodeNames = sorted((n.name or "<unnamed>") for n in uncoloredNodes)
+            uncoloredOps = sorted({n.op for n in uncoloredNodes if n.op is not None})
+            raise ValueError(f"Missing engine color for nodes {nodeNames} with operations {uncoloredOps}")
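For illustration, a minimal sketch of the explicit check. It uses `types.SimpleNamespace` stand-ins for graph nodes; the helper names `find_uncolored` and `check_coloring` and the sample nodes are hypothetical, not part of Deeploy.

```python
from types import SimpleNamespace


def find_uncolored(nodes):
    """Return nodes whose 'engine' attribute is missing, empty, or not a string."""
    return [n for n in nodes if not isinstance(n.attrs.get("engine"), str) or not n.attrs.get("engine")]


def check_coloring(nodes):
    """Raise ValueError (which -O does not strip) if any node lacks an engine color."""
    uncolored = find_uncolored(nodes)
    if uncolored:
        names = sorted((n.name or "<unnamed>") for n in uncolored)
        ops = sorted({n.op for n in uncolored if n.op is not None})
        raise ValueError(f"Missing engine color for nodes {names} with operations {ops}")


# Hypothetical nodes, for illustration only.
nodes = [
    SimpleNamespace(name="conv0", op="Conv", attrs={"engine": "cluster"}),
    SimpleNamespace(name="add0", op="Add", attrs={"engine": ""}),
    SimpleNamespace(name="relu0", op="Relu", attrs={}),
]
```

Sorting the names and operations keeps the error message stable across runs, which makes failures easier to compare in CI logs.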

Deeploy/Targets/PULPOpen/Platform.py (1)

256-258: Switch to mchan_siracusa.h looks fine; verify there are no residual dory_dma usages and that headers resolve on all PULP variants.

Run the quick scan below and fix any remaining references if found.

#!/bin/bash
# Residual includes/usages of dory_dma
rg -nP 'dory_dma\.h|dory_dma_' -g '!**/build/**'

# Confirm new header is referenced and present
rg -nP 'mchan_siracusa\.h' -g '!**/build/**'
fd -HI 'mchan_siracusa.h'

# Optional: ensure include paths expose TargetLibraries/PULPOpen/inc
rg -nP 'include_paths|CFLAGS|CPPFLAGS' -g '!**/build/**'

TargetLibraries/PULPOpen/inc/mchan_v6.h (1)

40-47: Good: enforce mutual exclusivity for event vs. polled modes.

Deeploy/Targets/Generic/TileConstraints/iRMSNormTileConstraint.py (1)

31-31: Fix potential overflow and dtype of "size" replacement.

np.prod can exceed 16-bit; hardcoding uint16_t risks truncation. Derive the minimal adequate integer type after computing sizes and ensure Python ints.

Apply:

-from Deeploy.CommonExtensions.DataTypes import uint16_t
+from Deeploy.CommonExtensions.DataTypes import minimalIntegerType

-        replacements = {"size": []}
-        replacementTypes = {"size": PointerClass(uint16_t)}
+        replacements = {"size": []}

-        for cube in outputCubes:
-            newSize = np.prod(cube.dims)
-            replacements["size"].append(newSize)
+        for cube in outputCubes:
+            newSize = int(np.prod(cube.dims))
+            replacements["size"].append(newSize)

-        variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)
+        replacementTypes = {"size": PointerClass(minimalIntegerType(replacements["size"]))}
+        variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)
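To make the width selection concrete, here is a rough sketch of choosing the narrowest unsigned width for a list of sizes. The helper `smallest_uint_bits` is hypothetical and only illustrates the idea; it is not the `minimalIntegerType` helper referenced in the diff, whose exact signature is not shown here.

```python
def smallest_uint_bits(values):
    """Return the narrowest of 8/16/32/64 bits that holds every non-negative value."""
    values = [int(v) for v in values]  # also converts NumPy scalars to Python ints
    if any(v < 0 for v in values):
        raise ValueError("only non-negative sizes are supported in this sketch")
    needed = max((v.bit_length() for v in values), default=1)
    for bits in (8, 16, 32, 64):
        if needed <= bits:
            return bits
    raise OverflowError(f"value needs {needed} bits")


# A 512 x 256 tile has 131072 elements, which no longer fits in uint16_t.
tile_elements = 512 * 256
width = smallest_uint_bits([tile_elements])
```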

Also applies to: 79-80, 81-83, 95-97

DeeployTest/testRunner_siracusa_mchandma.py (1)

50-53: Replace asserts with explicit validation and use zip(strict=True).

Same as earlier feedback: asserts can be stripped with -O, and the messages have a minor grammar issue ("smaller then" should be "smaller than").

Apply:

-assert len(inputShape) == len(tileShape), \
-    f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.'
-assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \
-    f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}'
+if len(inputShape) != len(tileShape):
+    raise ValueError(
+        f"Input and tile shape must have the same dimensionality. Got {len(inputShape)}D vs {len(tileShape)}D.")
+if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)):
+    raise ValueError(
+        f"Each tile dimension must be <= the corresponding input one. Got tile={tileShape}, input={inputShape}.")

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)

39-41: Consider extracting rank-aware cube construction into a shared helper

GEMM/MatMul (and Snitch counterparts) now duplicate rank/prefix handling. A single utility (e.g., make_ranked_cube(offset2d, dims2d, batch_off, batch_size, b_off=0, b_size=1)) would de-duplicate and harden behavior across targets.

I can sketch this helper and migrate both PULPOpen and Snitch variants in a follow-up if desired.

Deeploy/Targets/PULPOpen/TileConstraints/MatMulTileConstraint.py (1)

115-119: Bug: transA/transB ignored in serialization (tiles A/B as non-transposed)

NSize and A/B sub-rectangles don’t account for transpose flags. This mis-tiles when transA/transB == 1.

Apply:

@@
-        buffA = ctxt.lookup(operatorRepresentation['A'])
-        buffB = ctxt.lookup(operatorRepresentation['B'])
+        buffA = ctxt.lookup(operatorRepresentation['A'])
+        buffB = ctxt.lookup(operatorRepresentation['B'])
+        transA = int(operatorRepresentation.get("transA", 0))
+        transB = int(operatorRepresentation.get("transB", 0))
@@
-        NSize = buffA.shape[-1]
+        NSize = buffA.shape[-2] if transA else buffA.shape[-1]
@@
-            AMatrixOffsets = (MOffset, NOffset)
-            AMatrixShape = (MSize, NSize)
+            if transA == 0:
+                AMatrixOffsets = (MOffset, NOffset)
+                AMatrixShape = (MSize, NSize)
+            else:
+                AMatrixOffsets = (NOffset, MOffset)
+                AMatrixShape = (NSize, MSize)
@@
-            BMatrixOffsets = (NOffset, OOffset)
-            BMatrixShape = (NSize, OSize)
+            if transB == 0:
+                BMatrixOffsets = (NOffset, OOffset)
+                BMatrixShape = (NSize, OSize)
+            else:
+                BMatrixOffsets = (OOffset, NOffset)
+                BMatrixShape = (OSize, NSize)

#!/bin/bash
# Check for other MatMul/GEMM serializers still using buffA.shape[-1] for N unconditionally
rg -n -C2 -P 'serializeTilingSolution\(|shape\[-1\]' Deeploy/Targets | rg -n -P 'MatMul|GEMM'

Also applies to: 148-153
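As a sanity check on the offset/shape swap for transposed operands, here is a small pure-Python sketch. The helpers `a_tile_rect`, `slice_2d`, and `transpose` are hypothetical names for illustration; the sketch assumes A is logically M x N and is stored as N x M when transA is set.

```python
def a_tile_rect(m_off, n_off, m_size, n_size, trans_a):
    """Return (offsets, shape) of the A tile expressed in A's stored layout."""
    if trans_a:
        return (n_off, m_off), (n_size, m_size)
    return (m_off, n_off), (m_size, n_size)


def slice_2d(mat, offsets, shape):
    """Extract a rectangular sub-matrix from a list-of-lists matrix."""
    (r0, c0), (rows, cols) = offsets, shape
    return [row[c0:c0 + cols] for row in mat[r0:r0 + rows]]


def transpose(mat):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


# Logical A is 3 x 4; the transposed variant stores it as 4 x 3.
logical_a = [[10 * r + c for c in range(4)] for r in range(3)]
stored_a_t = transpose(logical_a)

# Tile covering logical rows 1..2 and the full reduction dimension.
plain = slice_2d(logical_a, *a_tile_rect(1, 0, 2, 4, trans_a=0))
from_transposed = transpose(slice_2d(stored_a_t, *a_tile_rect(1, 0, 2, 4, trans_a=1)))
```

Both paths should yield the same logical tile; if the swap is omitted, the transposed case reads the wrong sub-rectangle.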

TargetLibraries/PULPOpen/inc/mchan_v7.h (2)

68-71: Make register pointers uintptr_t-safe and assert 32-bit target.

Prevents silent truncation when compiled on hosts with 64-bit pointers.

Apply this diff:

-#include "pmsis.h"
+#include "pmsis.h"
+#include <stdint.h>
+
+_Static_assert(sizeof(uintptr_t) == 4, "MCHAN assumes 32-bit addresses.");
@@
-static volatile uint32_t *const cmd_ptr =
-    (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x0);
-static volatile uint32_t *const status_ptr =
-    (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x4);
+static volatile uint32_t *const cmd_ptr =
+    (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x0u);
+static volatile uint32_t *const status_ptr =
+    (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x4u);

73-113: Inline helpers and cast pointers via uintptr_t before truncation.

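Mark the helpers static inline so that each including translation unit does not carry unused copies, and cast pointers through uintptr_t so the narrowing to the 32-bit registers is explicit.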

Apply this diff:

-static void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) {
+static inline void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
 }
@@
-static void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext,
+static inline void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext,
                                           uint32_t loc_size_1d,
                                           uint32_t loc_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)loc_size_1d;
   *cmd_ptr = (uint32_t)loc_stride_2d;
 }
@@
-static void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext,
+static inline void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext,
                                           uint32_t ext_size_1d,
                                           uint32_t ext_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)ext_size_1d;
   *cmd_ptr = (uint32_t)ext_stride_2d;
 }
@@
-static void mchan_transfer_2d_loc_strided_ext_strided(
+static inline void mchan_transfer_2d_loc_strided_ext_strided(
     uint32_t cmd, void *loc, void *ext, uint32_t loc_size_1d,
     uint32_t loc_stride_2d, uint32_t ext_size_1d, uint32_t ext_stride_2d) {
   // TODO: assert flags are set correctly
-  *cmd_ptr = (uint32_t)cmd;
-  *cmd_ptr = (uint32_t)loc;
-  *cmd_ptr = (uint32_t)ext;
+  *cmd_ptr = (uint32_t)cmd;
+  *cmd_ptr = (uint32_t)(uintptr_t)loc;
+  *cmd_ptr = (uint32_t)(uintptr_t)ext;
   *cmd_ptr = (uint32_t)ext_size_1d;
   *cmd_ptr = (uint32_t)ext_stride_2d;
   *cmd_ptr = (uint32_t)loc_size_1d;
   *cmd_ptr = (uint32_t)loc_stride_2d;
 }

DeeployTest/testDmas.py (1)

29-46: Harden subprocess call: build argv list; avoid shell=True.

Prevents shell injection and quoting issues; aligns with static analysis (S602). Also keep "-DNUM_CORES=8" as a single argv element to preserve current parsing.

-    cmd = [f"python {testRunner}", f"-t test{dma}", "-DNUM_CORES=8"]
-    cmd.append(f"--input-shape {' '.join(str(x) for x in inputShape)}")
-    cmd.append(f"--tile-shape {' '.join(str(x) for x in tileShape)}")
-    cmd.append(f"--node-count {nodeCount}")
-    cmd.append(f"--type {dataType}")
+    import sys, shlex
+    cmd = [
+        sys.executable, testRunner,
+        "-t", f"test{dma}",
+        "-DNUM_CORES=8",
+        "--input-shape", *[str(x) for x in inputShape],
+        "--tile-shape", *[str(x) for x in tileShape],
+        "--node-count", str(nodeCount),
+        "--type", dataType,
+    ]
     if doublebuffer:
         cmd.append("--doublebuffer")
 
-    full_cmd = " ".join(cmd)
+    full_cmd = shlex.join(cmd)
 
     print(f"Running command:\n{full_cmd}\n")
 
     try:
-        subprocess.run(full_cmd, shell = True, check = True)
+        subprocess.run(cmd, check = True)
     except subprocess.CalledProcessError:
         print(f"test{dma}: Failed test:" + cfg_str)
         print(f"Rerun with command:\n{full_cmd}")
-        exit(-1)
+        sys.exit(1)
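A self-contained sketch of the list-based invocation follows. The `build_cmd` helper, the runner file name, and the test name `MchanDma` are placeholders chosen for illustration; only `subprocess.run`, `shlex.join`, and `sys.executable` are standard-library facts.

```python
import shlex
import subprocess
import sys


def build_cmd(script, dma, input_shape, tile_shape, node_count, data_type, doublebuffer=False):
    """Build an argv list; each argument is one list element, so no shell quoting is needed."""
    cmd = [
        sys.executable, script,
        "-t", f"test{dma}",
        "-DNUM_CORES=8",
        "--input-shape", *map(str, input_shape),
        "--tile-shape", *map(str, tile_shape),
        "--node-count", str(node_count),
        "--type", data_type,
    ]
    if doublebuffer:
        cmd.append("--doublebuffer")
    return cmd


# Placeholder script and test names; nothing is executed here.
cmd = build_cmd("testRunner_siracusa_mchandma.py", "MchanDma", (10, 10), (5, 5), 1, "uint8_t")
printable = shlex.join(cmd)  # copy-pasteable form, used for log messages only

# A harmless child process shows that shell metacharacters stay inside one argument.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "a b; echo x"],
    check=True, capture_output=True, text=True,
)
```

Because no shell is involved, the string "a b; echo x" reaches the child as a single argument instead of being split or executed.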

DeeployTest/testRunner_snitch_dma.py (1)

88-93: Ensure float inputs are float32; drop unused variable.

np.random.rand returns float64; cast to float32. Remove unused np.iinfo.

-if not testRunner._args.skipgen:
-    if dtype == np.float32:
-        test_inputs = np.random.rand(*inputShape)
-    else:
-        info = np.iinfo(dtype)
-        test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)
+if not testRunner._args.skipgen:
+    if dtype == np.float32:
+        test_inputs = np.random.rand(*inputShape).astype(np.float32)
+    else:
+        test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)

DeeployTest/testRunner_siracusa_l3dma.py (2)

50-53: Replace asserts with explicit validation and use zip(strict=...)

Asserts can be stripped with -O, and zip silently truncates to the shorter input, so mismatched-length shapes won't be detected. Use explicit checks.

-assert len(inputShape) == len(tileShape), \
-    f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.'
-assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \
-    f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}'
+if len(inputShape) != len(tileShape):
+    raise ValueError(
+        f'Input and tile shape must have the same dimensionality. Received {len(inputShape)}D vs. {len(tileShape)}D.'
+    )
+# If Python < 3.10, drop strict=True.
+if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)):
+    raise ValueError(
+        f'Each tile dimension must be <= the corresponding input one. Received tiles {tileShape} > input {inputShape}.'
+    )

81-86: Fix float32 dtype and remove unused variable

np.random.rand yields float64; cast to match float32 tensors. Remove the unused info.

-if dtype == np.float32:
-    test_inputs = np.random.rand(*inputShape)
+if dtype == np.float32:
+    test_inputs = np.random.rand(*inputShape).astype(dtype)
 else:
-    info = np.iinfo(dtype)
     test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)

Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)

33-36: Replace asserts with runtime checks; fix device name and typos (duplicate).

Asserts can be stripped with -O and messages reference “Mchan” and “contigous”. Use explicit exceptions, correct device name to L3Dma, and “contiguous”.

-        assert strideExt[-1] == 1, \
-            "Mchan supports only contigous transfers of the innermost dimension for external memory"
-        assert strideLoc[0] == shape[1] and strideLoc[1] == 1, \
-            f"Mchan supports only contigous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}"
+        if strideExt[-1] != 1:
+            raise ValueError("L3Dma supports only contiguous transfers of the innermost dimension for external memory")
+        if not (len(shape) >= 2 and strideLoc[0] == shape[1] and strideLoc[1] == 1):
+            raise ValueError(
+                f"L3Dma supports only contiguous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}"
+            )

43-48: Pass DMA sizes in bytes, not elements (duplicate).

pi_cl_ram_copy_2d expects size/stride/length in bytes. Multiply by element width.

-        operatorRepresentation.update({
-            "ext2loc": 1 if direction == "ExternalToLocal" else 0,
-            "transfer_size": math.prod(shape),
-            "length": shape[1],
-            "stride": strideExt[0],
-        })
+        bytes_per_elem = externalBuffer._type.referencedType.typeWidth // 8
+        operatorRepresentation.update({
+            "ext2loc": 1 if direction == "ExternalToLocal" else 0,
+            "transfer_size": math.prod(shape) * bytes_per_elem,
+            "length": shape[1] * bytes_per_elem,
+            "stride": strideExt[0] * bytes_per_elem,
+        })
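To illustrate the element-to-byte conversion, here is a sketch under stated assumptions: the transfer is 2D, strides are given in elements, and the element width is a whole number of bytes. The helper `to_byte_params` is hypothetical, and whether the existing templates already scale by element size elsewhere is not visible from this diff.

```python
def to_byte_params(shape, stride_ext, type_width_bits):
    """Convert a 2D transfer described in elements into byte-based DMA parameters."""
    if type_width_bits % 8 != 0:
        raise ValueError("sub-byte element types are not handled in this sketch")
    bytes_per_elem = type_width_bits // 8
    rows, cols = shape
    return {
        "transfer_size": rows * cols * bytes_per_elem,
        "length": cols * bytes_per_elem,
        "stride": stride_ext[0] * bytes_per_elem,
    }


# A 4 x 8 tile of 32-bit elements cut from a tensor with 32 elements per row.
params = to_byte_params((4, 8), (32, 1), 32)
```

For 8-bit types the element and byte counts coincide, which is one way such a bug can go unnoticed in tests that only use int8 data.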

Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)

41-42: Bug: HyperRectangle ctor args swapped (offset vs dims).

Constructor expects (offset, dims); passing (dims, offset) breaks minimization/stride derivation.

-    maxTransferRect = HyperRectangle(maxTransferDims, inRect.offset)
+    maxTransferRect = HyperRectangle(offset=inRect.offset, dims=maxTransferDims)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)

78-88: Shared structure with SB path — consider factoring common pieces into base.

Ingress/egress scheduling, reference hoisting, and future orchestration duplicate SB logic. A small helper in the base could reduce drift.

132-140: Fix DMA prefetch future: not initialized/waited/deinitialized — race with first tile.

Reuse the main future so the existing init/wait/deinit covers the prefetch. The standalone initialFuture is never init'ed/waited/deinit'ed.

-        gen = AnydimAsyncDmaTransferAdapter(self.dma)
-
-        initialFuture = self.dma.getFuture(tensorName + "_init")
-        initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
-                                               stridesFromShape(externalBufferShape),
-                                               stridesFromShape(rectangles[0].dims), "ExternalToLocal",
-                                               initialFuture, math.prod(externalBufferShape))
+        gen = AnydimAsyncDmaTransferAdapter(self.dma)
+        # Reuse the same future so it's properly init'ed, waited, and deinit'ed
+        initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
+                                               stridesFromShape(externalBufferShape),
+                                               stridesFromShape(rectangles[0].dims), "ExternalToLocal",
+                                               future, math.prod(externalBufferShape))

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4)

111-118: Enforce tile-count invariants; keep single source of truth for counts

Validate ingress/egress lengths match and ensure totalNumTiles reflects the schedule length.

-        metaInfo = TilingMetaInfo(
+        if len(tilingSchedule.inputLoadSchedule) != len(tilingSchedule.outputLoadSchedule):
+            raise ValueError(
+                f"Tiling schedule ingress/egress length mismatch: "
+                f"{len(tilingSchedule.inputLoadSchedule)} vs {len(tilingSchedule.outputLoadSchedule)}"
+            )
+        metaInfo = TilingMetaInfo(
             nodeName = operatorRepresentation['nodeName'] + f"_{self.externalMemory}",
             nodeOps = operatorRepresentation['nodeOps'],
             numTiles = operatorRepresentation['numTiles'],
             totalNumTiles = len(tilingSchedule.outputLoadSchedule),
             tileIdxPtr = operatorRepresentation['tileIdxPtr'],
             tileIdxVar = "TILING_I",

Would you like a follow-up patch to assert equality with the value used by _hoistTileNumAndIdxPtr?

27-27: Replace Set-based futures with ordered, de-duplicated List; remove unused variable; ensure deterministic init/deinit/wait order

Sets make codegen order non-deterministic and break stable teardown ordering; also referenceUpdates is unused. Switch to List[Future], de-duplicate while preserving insertion order, and reverse deinit.

-from typing import Dict, List, Set, Tuple
+from typing import Dict, List, Tuple
@@
-            tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], Set[Future]]:
+            tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], List[Future]]:
         callStack: List[CodeSnippet] = []
-        referenceUpdates: List[CodeSnippet] = []
-        futures: Set[Future] = set()
+        futures: List[Future] = []
@@
-            future = self.dma.getFuture(tensorName)
-            futures.add(future)
+            future = self.dma.getFuture(tensorName)
+            if future not in futures:
+                futures.append(future)
@@
-        ingressDmaWaitStatements = [future.wait() for future in ingressFutures]
-        egressDmaWaitStatements = [future.wait() for future in egressFutures]
+        ingressDmaWaitStatements = [future.wait() for future in ingressFutures]
+        egressDmaWaitStatements = [future.wait() for future in egressFutures]
@@
-        setupStatements = self.dma.setup()
-        setupStatements += [f.init() for f in ingressFutures | egressFutures]
-
-        teardownStatements = self.dma.teardown()
-        teardownStatements.extend(f.deinit() for f in ingressFutures | egressFutures)
+        setupStatements = self.dma.setup()
+        allFutures: List[Future] = []
+        allFutures.extend(ingressFutures)
+        for f in egressFutures:
+            if f not in allFutures:
+                allFutures.append(f)
+        setupStatements += [f.init() for f in allFutures]
+
+        teardownStatements = self.dma.teardown()
+        teardownStatements.extend(f.deinit() for f in reversed(allFutures))

Also applies to: 49-53, 99-107
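The ordered de-duplication can also be written more compactly with `dict.fromkeys`, which preserves insertion order. This sketch uses strings as stand-ins for Future objects and assumes they are hashable (they were stored in a set before); `ordered_unique` is a hypothetical helper name.

```python
def ordered_unique(*groups):
    """Merge iterables, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(item for group in groups for item in group))


ingress = ["in_a", "in_b"]
egress = ["out", "in_a"]  # shares one future with ingress

all_futures = ordered_unique(ingress, egress)
init_order = all_futures
deinit_order = list(reversed(all_futures))  # tear down in reverse of setup
```

Unlike iterating over a set union, the resulting order is identical on every run, so the generated code stays byte-for-byte reproducible.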

56-63: Replace asserts with explicit checks; guard missing constraints/shape

Asserts can be stripped with -O, and indexing without .get can raise an uninformative KeyError. Raise clear exceptions and validate presence.

-            assert localBuffer._memoryLevel == self.localMemory
-            assert isinstance(localBuffer, _ReferenceBuffer)
-            externalBuffer = ctxt.lookup(localBuffer._referenceName)
-            assert isinstance(externalBuffer, VariableBuffer)
-            tensorMemoryConstraint = tensorMemoryConstraintDict[externalBuffer.name]
-            externalBufferShape = tensorMemoryConstraint.memoryConstraints[self.externalMemory].shape
-            assert externalBufferShape is not None
+            if localBuffer._memoryLevel != self.localMemory:
+                raise ValueError(
+                    f"Local buffer '{localBuffer.name}' expected in '{self.localMemory}', "
+                    f"got '{localBuffer._memoryLevel}'"
+                )
+            if not isinstance(localBuffer, _ReferenceBuffer):
+                raise TypeError(f"Expected _ReferenceBuffer for '{localBuffer.name}', got {type(localBuffer).__name__}")
+            externalBuffer = ctxt.lookup(localBuffer._referenceName)
+            if not isinstance(externalBuffer, VariableBuffer):
+                raise TypeError(
+                    f"Expected VariableBuffer for external reference '{localBuffer._referenceName}', "
+                    f"got {type(externalBuffer).__name__}"
+                )
+            tensorMemoryConstraint = tensorMemoryConstraintDict.get(externalBuffer.name)
+            if tensorMemoryConstraint is None:
+                raise KeyError(f"Missing TensorMemoryConstraint for '{externalBuffer.name}'")
+            memCnstr = tensorMemoryConstraint.memoryConstraints.get(self.externalMemory)
+            if memCnstr is None or memCnstr.shape is None:
+                raise KeyError(
+                    f"Missing shape for '{externalBuffer.name}' in memory level '{self.externalMemory}'"
+                )
+            externalBufferShape = memCnstr.shape

92-98: Use DmaDirection enum, not string literals

Passing strings breaks type expectations and templates.

-        ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls(
-            ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule,
-            nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", "ExternalToLocal")
+        ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls(
+            ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule,
+            nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", DmaDirection.ExternalToLocal)
@@
-        ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls(
-            ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule,
-            nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", "LocalToExternal")
+        ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls(
+            ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule,
+            nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", DmaDirection.LocalToExternal)

Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)

33-38: Replace assert; fix typo and stray f-string

Use explicit exceptions; “contiguous” spelling; drop unnecessary f-prefix.

     def checkTransfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, localBuffer: VariableBuffer,
                       shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...],
                       direction: DmaDirection) -> None:
         super().checkTransfer(ctxt, externalBuffer, localBuffer, shape, strideExt, strideLoc, direction)
-        assert strideLoc[1] == 1 and strideExt[1] == 1, f"Supports only contigous transfers in the innermost dimension"
+        if not (strideLoc[1] == 1 and strideExt[1] == 1):
+            raise ValueError("Supports only contiguous transfers in the innermost dimension")

39-51: Use DmaDirection enum and convert element counts to byte counts for DMA API

snrt expects sizes/strides in bytes. Multiply by element size and avoid string directions.

     def transferOpRepr(self, externalBuffer: VariableBuffer, localBuffer: VariableBuffer, shape: Tuple[int, ...],
                        strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...], direction: DmaDirection,
                        future: Future) -> OperatorRepresentation:
         _ = future
+        bytes_per_elem = localBuffer._type.referencedType.typeWidth // 8
         operatorRepresentation: OperatorRepresentation = {
-            "dest": localBuffer.name if direction == "ExternalToLocal" else externalBuffer.name,
-            "src": externalBuffer.name if direction == "ExternalToLocal" else localBuffer.name,
-            "repeat": shape[0],
-            "size": shape[1],
-            "stride_dest": strideLoc[0] if direction == "ExternalToLocal" else strideExt[0],
-            "stride_src": strideExt[0] if direction == "ExternalToLocal" else strideLoc[0],
+            "dest": localBuffer.name if direction == DmaDirection.ExternalToLocal else externalBuffer.name,
+            "src": externalBuffer.name if direction == DmaDirection.ExternalToLocal else localBuffer.name,
+            "repeat": shape[0],
+            "size": shape[1] * bytes_per_elem,
+            "stride_dest": (strideLoc[0] if direction == DmaDirection.ExternalToLocal else strideExt[0]) * bytes_per_elem,
+            "stride_src": (strideExt[0] if direction == DmaDirection.ExternalToLocal else strideLoc[0]) * bytes_per_elem,
         }
         return operatorRepresentation

Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)

55-57: Fix off-by-one error in size validation.

17 bits can encode values from 0 to 2^17-1. The assertion should use strict inequality.

Apply this fix:
-        assert mchanTransferSize <= 2**17, (
+        assert mchanTransferSize < 2**17, (
             "The Dma transfer size for mchan should be representable with 17 bits, "
             f"current number of bits required is {math.ceil(math.log2(mchanTransferSize))}")

61-64: Convert element counts to byte counts for DMA parameters.

The DMA hardware expects byte counts, but the code is passing element counts for size_1d and stride_2d.

You need to multiply by the element size in bytes. Add this before line 61:
# Get element size from buffer type
elementSizeInBytes = externalBuffer._type.referencedType.typeWidth // 8
Then update the assignments:
         if transferRank == 2:
-            operatorRepresentation["size_1d"] = shape[1]
-            operatorRepresentation["stride_2d"] = strideExt[0]
+            operatorRepresentation["size_1d"] = shape[1] * elementSizeInBytes
+            operatorRepresentation["stride_2d"] = strideExt[0] * elementSizeInBytes

Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py (1)

53-58: Improve arena allocation robustness and formatting.

The arena allocation method has several issues:

No bounds checking for the offset
F-string embedded in template string makes it less readable
Missing arena size validation

Consider this improved implementation:

 def _arenaAllocate(self, ctxt: NetworkContext, buffer: VariableBuffer, offset: int) -> VariableBuffer:
     arena = ctxt.lookup(self.arenaName)
-    buffer.allocTemplate = NodeTemplate(" \
-    ${type.typeName} ${name} = (${type.typeName}) " + f"((char*){str(arena._instance)} + {offset});")
+    # Add bounds checking if arena has size attribute
+    if hasattr(arena, 'size'):
+        assert 0 <= offset < arena.size, f"Offset {offset} out of bounds for arena {self.arenaName} (size={arena.size})"
+    
+    # Use cleaner template with placeholders
+    buffer.allocTemplate = NodeTemplate("""
+    ${type.typeName} ${name} = (${type.typeName}) ((char*)${arena_ptr} + ${offset});""")
+    buffer._arena_ptr = str(arena._instance)
+    buffer._offset = offset
     buffer.deallocTemplate = NodeTemplate("")
     return buffer

Also update the buffer's _bufferRepresentation method to include these fields:

def _bufferRepresentation(self) -> Dict:
    repr = super()._bufferRepresentation()
    if hasattr(self, '_arena_ptr'):
        repr['arena_ptr'] = self._arena_ptr
        repr['offset'] = self._offset
    return repr

Deeploy/TilingExtension/TilerExtension.py (1)

938-941: Fix off-by-one and avoid private field access for lifetime.

Outputs should be alive through the last step index len(schedule) - 1. Also use the public lifetime property.

-        for tensor in graph.outputs:
-            assert memoryBlockMap[tensor.name]._lifetime[-1] == len(
-                schedule), "Invalid memory map! Output buffer is not alive at the last step!"
+        for tensor in graph.outputs:
+            assert memoryBlockMap[tensor.name].lifetime[-1] == len(schedule) - 1, \
+                "Invalid memory map! Output buffer is not alive at the last step!"

DeeployTest/testUtils/dmaUtils.py (1)

354-370: Validate tileShape vs input shape before generating tiling.

Preempt invalid tilings early (rank, positivity, bounds, divisibility).

 def prepare_deployer_with_custom_tiling(deployer: NetworkDeployer, defaultMemory: str, targetMemory: str,
                                         tileShape: Tuple[int, ...], doublebuffer: bool) -> None:
     # Decomposed deployer.prepare() to enter a custom tiling solution
     deployer.frontEnd()
     super(TilerDeployerWrapper, deployer).bind()
 
+    inputShape = tuple(deployer.graph.inputs[0].shape)
+    assert len(tileShape) == len(inputShape), \
+        f"Tile rank {len(tileShape)} doesn't match input rank {len(inputShape)}"
+    for i, (t, s) in enumerate(zip(tileShape, inputShape)):
+        assert t > 0, f"Tile dim {i} must be > 0"
+        assert t <= s, f"Tile dim {i}: {t} exceeds input dim {s}"
+        assert s % t == 0, f"Input dim {i}={s} not divisible by tile dim {t}"
+
     tilingSolution, memoryMap = generate_tiling(
         ctxt = deployer.ctxt,
         memoryStart = defaultMemory,
         memoryOrder = [defaultMemory, targetMemory],
         memoryHierarchy = deployer.Platform.memoryHierarchy,
-        inputShape = deployer.graph.inputs[0].shape,
+        inputShape = inputShape,
         tileShape = tileShape,
         graph = deployer.graph,
         _type = deployer.inputTypes['input_0'].referencedType,
         doublebuffer = doublebuffer,
     )

Deeploy/CommonExtensions/CodeTransformationPasses/IntrospectiveCodeTransformation.py (2)

49-73: Update template source and refresh parse-tree cache after reconstruction.

Without syncing _source and re-keying parseTreeDict, subsequent mutations use stale trees.

     def _reconstructCode(template: Template, node: TemplateNode) -> Template:
@@
-        module = types.ModuleType(template.module_id)
-        code = compile(source, template.module_id, "exec")
-        exec(code, module.__dict__, module.__dict__)
+        module = types.ModuleType(template.module_id)
+        code = compile(source, template.module_id, "exec")
+        exec(code, module.__dict__, module.__dict__)
 
-        template._code = code
+        # Refresh source and parse-tree cache
+        old_hash = hash(getattr(template, "_source", source))
+        template._source = source
+        new_hash = hash(template._source)
+        d = IntrospectiveCodeTransformationMixIn.parseTreeDict
+        d.pop(old_hash, None)
+        d[new_hash] = node
+
+        template._code = code
         template.module = module
         template.callable_ = template.module.render_body
         return template

171-181: Don’t use assert for runtime type checks in library code.

Replace with explicit exception; asserts can be stripped with -O.

-            def _unrollStructReferences(val: Struct) -> List[str]:
-                assert isinstance(val, Struct)
+            def _unrollStructReferences(val: Struct) -> List[str]:
+                if not isinstance(val, Struct):
+                    raise TypeError(f"Expected Struct, got {type(val).__name__}")

Deeploy/TilingExtension/TilingCodegen.py (1)

247-251: Missing documentation and tests for stride calculation remain unaddressed

Based on the past review comment, there are still no unit tests or documentation for the stridesFromShape function. The function assumes row-major ordering, which should be documented and tested.

Would you like me to generate unit tests and documentation for this critical utility function to ensure correct stride calculations across different tensor shapes?
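As a rough idea of what such documentation and tests could pin down, here is a minimal sketch of a row-major stride computation. It assumes strides are counted in elements, not bytes; `strides_from_shape` and `flat_index` are illustrative re-implementations, not the actual `stridesFromShape` from `TilingCodegen.py`.

```python
from typing import List, Sequence


def strides_from_shape(shape: Sequence[int]) -> List[int]:
    """Row-major (C-order) strides in elements: the last dimension is contiguous."""
    strides = [1] * len(shape)
    # Walk from the second-to-last dimension outwards, accumulating the product
    # of all inner dimension sizes.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def flat_index(index: Sequence[int], strides: Sequence[int]) -> int:
    """Linear element offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))
```

Useful test cases include a rank-1 shape, an empty shape, and a check that the stride of dimension `i` equals the product of all dimension sizes after `i`.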

Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (1)

130-143: Potential template/snippet count mismatch can cause silent failures

The loop assumes each transfer produces the same number of snippets as templates, which could drift and cause misaligned templating if the adapter returns different counts for different tiles.

Add a validation check after generating snippets:

     for rect in transfers:
         snippets = gen.transfer(ctxt, externalBuffer,
                                localBuffer, rect.dims, stridesFromShape(externalBuffer.shape),
                                stridesFromShape(rect.dims), direction, future, math.prod(externalBuffer.shape))
+        if len(snippets) != len(templates):
+            raise RuntimeError(f"Adapter returned {len(snippets)} snippets, expected {len(templates)} "
+                              f"for rectangle {rect}")
         for i, snippet in enumerate(snippets):
             opReprUpdates[i].append(snippet.operatorRepresentation)

Deeploy/DeeployTypes.py (1)

539-565: Missing boundary checking for reference buffer offset

As noted in past reviews, there's no validation that the offset is within valid bounds of the referenced buffer. This could lead to out-of-bounds memory access at runtime.

The offset validation is critical for memory safety. Would you like me to implement the boundary checking logic that was discussed in the previous review, or would you prefer to track this as a separate issue for future implementation?
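A bounds check of this kind might look like the sketch below. It assumes the offset and both sizes are expressed in the same unit (for example bytes); `check_reference_offset` and its parameters are hypothetical and do not correspond to an existing function in `DeeployTypes.py`.

```python
def check_reference_offset(offset: int, view_size: int, buffer_size: int) -> None:
    """Raise ValueError unless [offset, offset + view_size) lies inside the buffer.

    All three arguments are assumed to use the same unit (e.g. bytes).
    """
    if offset < 0 or view_size < 0:
        raise ValueError(f"Offset ({offset}) and size ({view_size}) must be non-negative")
    if offset + view_size > buffer_size:
        raise ValueError(f"Reference [{offset}, {offset + view_size}) exceeds "
                         f"referenced buffer of size {buffer_size}")
```

Running the check at code-generation time, where the offset is a known constant, would turn a potential out-of-bounds access on the target into an early error on the host.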

Deeploy/TilingExtension/AsyncDma.py (4)

149-151: Use size_t for loop indices.

Loop bounds are sizes; prefer size_t to avoid narrowing/truncation.

-                templateStr += f"for (uint32_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"
+                templateStr += f"for (size_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"
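For context, loops like the one emitted here are one way to express a transfer whose rank exceeds what the DMA handles natively: iterate over the outer dimensions and issue one lower-rank transfer per iteration. The sketch below models that idea on the host, under the assumption that this is what the generated loop nest does; `split_transfer` is a hypothetical helper and not part of `AsyncDma.py`.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def split_transfer(shape: Sequence[int], stride_ext: Sequence[int], stride_loc: Sequence[int],
                   kernel_rank: int) -> Iterator[Tuple[int, int, Tuple[int, ...]]]:
    """Yield (external offset, local offset, inner shape) per kernel-rank sub-transfer."""
    outer = len(shape) - kernel_rank
    if outer < 0:
        raise ValueError("kernel_rank must not exceed the transfer rank")
    # One iteration per index combination of the outer dimensions, i.e. the
    # Python equivalent of the nested C for-loops.
    for idx in product(*(range(n) for n in shape[:outer])):
        ext_offset = sum(i * s for i, s in zip(idx, stride_ext))
        loc_offset = sum(i * s for i, s in zip(idx, stride_loc))
        yield ext_offset, loc_offset, tuple(shape[outer:])
```

For a `(2, 3, 4)` tile and a DMA that only supports rank-1 transfers, this yields six sub-transfers of four elements each.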

74-82: Replace asserts with exceptions; fix stray f-string.

Asserts may be stripped under -O and the first f-string has no placeholders. Raise ValueError instead.

-        assert transferRank == len(strideLoc) and transferRank == len(
-            strideExt), f"The shape and stride rank should match"
-        assert transferRank in self.supportedTransferRanks(
-        ), f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}"
+        if not (transferRank == len(strideLoc) and transferRank == len(strideExt)):
+            raise ValueError("The shape and stride rank should match")
+        if transferRank not in self.supportedTransferRanks():
+            raise ValueError(
+                f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}"
+            )

163-169: Widen offset type to size_t.

Offsets can exceed 32 bits on 64-bit targets; use size_t.

-            templateStr = f"const uint32_t {name} = "
+            templateStr = f"const size_t {name} = "

171-171: Fix non-standard void* arithmetic; cast through uintptr_t.

Standard C forbids arithmetic on void*, and the offset should be size_t/uintptr_t. Also ensure the generated TU includes <stdint.h> (for uintptr_t) and <stddef.h> (for size_t).

-    offsetPtrTemplate = NodeTemplate("void * const ${resultPtr} = (void *)${basePtr} + ${offset};")
+    offsetPtrTemplate = NodeTemplate(
+        "void * const ${resultPtr} = (void *)((uintptr_t)${basePtr} + ${offset});"
+    )

Action: verify the codegen path emits the necessary includes once per TU.
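To see what the suggested template renders to, the sketch below substitutes example values using Python's standard `string.Template`, which accepts the same `${name}` placeholders. It stands in for `NodeTemplate` purely for illustration; `render_offset_ptr`, `render_translation_unit`, and the buffer names are assumptions.

```python
from string import Template
from typing import List

# Same text as the suggested offsetPtrTemplate, rendered with the stdlib Template.
OFFSET_PTR_TEMPLATE = Template(
    "void * const ${resultPtr} = (void *)((uintptr_t)${basePtr} + ${offset});")

# Headers the generated C needs for uintptr_t and size_t.
REQUIRED_INCLUDES = ("#include <stddef.h>", "#include <stdint.h>")


def render_offset_ptr(result_ptr: str, base_ptr: str, offset: int) -> str:
    """Render one pointer-offset statement."""
    return OFFSET_PTR_TEMPLATE.substitute(resultPtr = result_ptr, basePtr = base_ptr, offset = offset)


def render_translation_unit(body_lines: List[str]) -> str:
    """Prepend the required includes exactly once, then the body."""
    return "\n".join([*REQUIRED_INCLUDES, "", *body_lines])
```

Calling `render_offset_ptr("tile_ptr", "buffer", 64)` gives `void * const tile_ptr = (void *)((uintptr_t)buffer + 64);`, which compiles as standard C once `<stdint.h>` is included.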

Deeploy/CommonExtensions/CodeTransformationPasses/MemoryAllocation.py

Deeploy/Targets/Generic/TileConstraints/TransposeTileConstraint.py

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py

DeeployTest/testUtils/codeGenerate.py

DeeployTest/testUtils/dmaUtils.py

DeeployTest/testUtils/typeMapping.py

TargetLibraries/PULPOpen/inc/mchan_v6.h

Xeratec · 2025-09-09T15:44:06Z

@lukamac Thanks for the changes and the high-quality work. I went over your changes and the (annoyingly many) comments from the AI. FYI, I have now disabled auto-review. However, I think some of the comments are valid. Please go over the open conversations and fix or resolve them. Afterwards, I am happy to merge this PR.

Please do not forget to extend the CHANGELOG.md!

Xeratec

LGTM, thanks a lot for the great work!

* Refactor mchan hal
* Refactor IntrospectiveCodeTransformation
* Refactor MemoryAllocation
* Add minimalIntegerType helper function
* Small refactor DeeployTypes
* Change Neureka tile constraints to new TilingCodegen function
* Small refactors
  Check for LLVM_INSTALL_DIR environment variable
  Fix typo
  Check for SNITCH_HOME environment variable and crash if not present
  Change test output difference to absolute difference
  Improve engine coloring error message
  Fix type hint
* Permutation refactor
* Refactor TransposeTileConstraint
* Remove manual name mangling from templates since it's automatically done in the ExecutionBlock.generate()
* Change serialize to produce same shape rank as original
* Refactor TilingExtension
* Port PULPOpen
* Port Snitch
* DeeployTest: Extract generic tiling code into tilingUtils.py
* DeeployTest: Extract common test generation code
* DeeployTest: Add Dma tests
* Apply Philip's comments
  Remove dory_dma.h
  Fix hoistReference doc comment
  Use the shape argument of the _hoistReference function
  Rename dma test runners
  Change kernelLevelTiling HACK comment to a TODO
  Add DMA folder to targets with DMAs
  Fix wrong deeployStateDir
  Single source of truth for the tiling arena name
* Add unravelReference doc comment and fix the dealiasBuffer's
* Refactor type inference and minimal(Integer|Float)Type
* Revert extra inputs hack
* Add mchan check for both event- and poll-based event checking flags being set
* Fix HyperRectangle arg order
* Fix mchan check whether size is representable within 17 bits
* Fix init, deinit, wait on initialFuture in DoubleBuffering, rename gen to anydimAdapter
* Fix GEMM tile constraint serialization to check transA and transB
* Fix inherit from ABC in AsyncDma and AsyncDmaWaitingStrategy
* Fix use tileSizeInBytes to check whether it fits in memory
* Update changelog
* Add missing transferOpRepr abstract method from the BlockingAsyncDmaAdapter

lukamac requested review from Victor-Jung and Xeratec as code owners July 8, 2025 13:41

Xeratec assigned lukamac Jul 8, 2025

Xeratec added the Feature Addition of new features label Jul 8, 2025

Xeratec added this to the Release 0.2.1 milestone Jul 8, 2025

Xeratec added this to Deeploy Jul 8, 2025

Xeratec moved this to In review in Deeploy Jul 8, 2025

Xeratec moved this from In review to In progress in Deeploy Jul 30, 2025

lukamac marked this pull request as draft July 31, 2025 11:18

lukamac changed the title from "PULPOpen: Allow for any-dimensional transfers between L1 and L2" to "Refactor tiling code generation" Jul 31, 2025

lukamac force-pushed the rewrite-pulp-dma branch 5 times, most recently from 32d9155 to e8869ed Compare August 5, 2025 16:43

lukamac force-pushed the rewrite-pulp-dma branch 2 times, most recently from 0f96aea to 2147f44 Compare August 8, 2025 08:13

lukamac marked this pull request as ready for review August 8, 2025 11:29

lukamac force-pushed the rewrite-pulp-dma branch from 2147f44 to 1c89a37 Compare August 8, 2025 11:34

Xeratec moved this from In progress to In review in Deeploy Aug 11, 2025

Xeratec reviewed Aug 25, 2025

Xeratec reviewed Aug 28, 2025

Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py (outdated, resolved)

lukamac added 3 commits September 1, 2025 11:45

Refactor mchan hal

da89fa9

Refactor IntrospectiveCodeTransformation

863fe6c

Refactor MemoryAllocation

0cfb2b1

lukamac force-pushed the rewrite-pulp-dma branch from ce73778 to f35377b Compare September 9, 2025 11:25

coderabbitai bot reviewed Sep 9, 2025

pulp-platform deleted a comment from coderabbitai bot Sep 9, 2025

lukamac added 8 commits September 9, 2025 18:26

Fix HyperRectangle arg order

c4cc70a

Fix mchan check whether size is representable within 17 bits

3f75b10

Fix init, deinit, wait on initialFuture in DoubleBuffering, rename ge…

b0e0953

…n to anydimAdapter

Fix GEMM tile constraint serialization to check transA and transB

ffb7316

Fix inherit from ABC in AsyncDma and AsyncDmaWaitingStrategy

4bce88c

Fix use tileSizeInBytes to check whether it fits in memory

09f742a

Update changelog

2ea6a12

Add missing transferOpRepr abstract method from the BlockingAsyncDmaA…

dc71c80

…dapter

Xeratec approved these changes Sep 10, 2025

Xeratec merged commit 05d6403 into pulp-platform:devel Sep 10, 2025
122 checks passed

github-project-automation bot moved this from In review to Done in Deeploy Sep 10, 2025

This was referenced Sep 11, 2025

Split CI Workflows by Platform and Task, Improve Formatting and Linting Reliability #108

Merged

Bug fixes, API Cleanup and Reduce Compiler Warning on PULP #112

Merged

This was referenced Sep 23, 2025

Refactor Logging for Improved Debugging #115

Merged

TinyViT on non-tiled Siracusa #117

Merged

Support Fully Asynchronous DMAs #114

Merged

This was referenced Oct 16, 2025

Remove memory-aware node bindings #123

Merged

Fix aliasing #125

Merged

coderabbitai bot mentioned this pull request Oct 30, 2025

Refactors and fixes #131

Open

5 tasks

coderabbitai bot mentioned this pull request Nov 18, 2025

Demo TinyViT compatibility with tiled Siracusa #124

Merged

5 tasks

coderabbitai bot mentioned this pull request Nov 26, 2025

CCT Attention Training on Siracusa #69

Merged

5 tasks

coderabbitai bot mentioned this pull request Dec 10, 2025

FP32 ReduceMean operator improvement #137

Merged

5 tasks

coderabbitai bot commented Aug 25, 2025

Major Architectural Changes

1. Async DMA Framework Introduction (AsyncDma.py)

2. Tiling Code Generation Restructure

Base Class Refactor (TilingCodeGeneration.py)

Single Buffering (SingleBufferingTilingCodeGeneration.py)

Double Buffering (DoubleBufferingTilingCodeGeneration.py)

3. Memory Management Revolution

Hoisting Infrastructure (TilingHoistingMixIn.py)

Arena-Based Memory (TilingVariableReplacement.py)

4. Profiling and Instrumentation Overhaul (TilingPrototypes.py)

5. Shape and Stride Management (TilingCodegen.py)

6. Dual-Path Tiling Workflow (TilerExtension.py)

7. Enhanced Type Safety and Constraints

Impact and Benefits

Performance

Flexibility

Maintainability

Scalability
