Refactor tiling code generation #105
Conversation
@coderabbitai summary

✅ Actions performed
Summary regeneration triggered.
Summary by CodeRabbit
Walkthrough
Adds DMA-aware tiling and codegen infrastructure (AsyncDma, single/double buffering), integrates DMA in the PULP/Snitch pipelines, refactors tiling/memory-mapping APIs, revises permutation/minimization utilities, updates reference/aliasing semantics, adjusts multiple templates to avoid name mangling, replaces/updates PULP DMA headers, introduces new DMA test runners and a CI job, and updates tests for the new type-mapping API.

Changes

Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant Deployer
participant TilingPass as TilingCodeGeneration (SB/DB)
participant Adapter as AnydimAsyncDmaTransferAdapter
participant DMA as AsyncDma
participant Future
User->>Deployer: apply()
Deployer->>TilingPass: generateTilingLoop(tilingSchedules)
TilingPass->>Adapter: transfer(external, local, shape, strides, dir, future)
alt kernelRank == transferRank
Adapter->>DMA: transfer(..., future)
else kernelRank < transferRank
Adapter->>Adapter: emit nested loops + offset ptrs
Adapter->>DMA: transfer(inner-shape,..., future)
else kernelRank > transferRank
Adapter->>DMA: transfer(padded-shape,..., future)
end
DMA->>Future: get/init/wait/deinit (via waiting strategy)
TilingPass-->>Deployer: ExecutionBlock (tile loop + DMA calls)
sequenceDiagram
autonumber
participant Pipeline as PULPOpen Bindings Pipeline
participant TVR as TilingVariableReplacement
participant TVRUpd as TilingVariableReplacementUpdate
participant Cluster as PULPClusterTiling("L2","L1", MchanDma)
participant L3 as PULPL3Tiling("L3","L2", l3DmaHack)
participant MemGen as MemoryManagementGeneration
Pipeline->>TVR: apply(arena alloc, tiled refs)
TVR->>TVRUpd: apply(update refs per tile)
TVRUpd->>Cluster: apply(SB/DB tiling with DMA)
Cluster->>L3: apply(L3 tiling with DMA)
L3->>MemGen: manage L2/L3/default allocations
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs
Suggested reviewers
@coderabbitai Provide further details about the modification to the tiling extension (All the files under |
@Xeratec I'll provide a comprehensive analysis of the tiling extension modifications. This represents a major architectural refactor of the entire tiling system.

Major Architectural Changes
1. Async DMA Framework Introduction (
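For orientation, here is a minimal sketch (not the actual Deeploy source) of the interface shape this framework implies: an AsyncDma with setup/teardown and per-tensor futures, and a Future with an init/wait/deinit lifecycle. The method names are taken from calls referenced throughout this review; the exact signatures are assumptions.

from abc import ABC, abstractmethod
from typing import List, Tuple


class Future(ABC):
    """Handle for one outstanding DMA transfer; each method returns a C statement."""

    @abstractmethod
    def init(self) -> str:
        """Statement that initializes the future/handle."""

    @abstractmethod
    def wait(self) -> str:
        """Statement that blocks until the transfer has completed."""

    @abstractmethod
    def deinit(self) -> str:
        """Statement that releases the future/handle."""


class AsyncDma(ABC):

    @abstractmethod
    def supportedTransferRanks(self) -> Tuple[int, ...]:
        """Ranks (1D, 2D, ...) the engine can express natively."""

    @abstractmethod
    def getFuture(self, tensorName: str) -> Future:
        """Create a future identifying one transfer of the given tensor."""

    @abstractmethod
    def setup(self) -> List[str]:
        """Statements emitted once before the tiling loop."""

    @abstractmethod
    def teardown(self) -> List[str]:
        """Statements emitted once after the tiling loop."""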
Xeratec
left a comment
@lukamac, I reviewed all your changes except for the refactoring of the TilingExtension. There are a lot of changes, and I am not super familiar with the previous code. It is clear that the newer code is nicer and easier to understand, but I am struggling with the changes in TilingCodeGeneration.py and TilingVariableReplacement.py.
I would prefer an offline walkthrough of the remaining changes.
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
Deeploy/CommonExtensions/DataTypes.py (1)
111-116: Fix type annotation mismatch: return a tuple, not a list.
sorted(...) returns a list; the annotation declares a Tuple. Wrap with tuple(...).

-IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = (sorted((
-    *SignedIntegerDataTypes,
-    *UnsignedIntegerDataTypes,
-),
-    key = lambda _type: _type.typeWidth))
+IntegerDataTypes: Tuple[Type[IntegerImmediate], ...] = tuple(sorted(
+    (*SignedIntegerDataTypes, *UnsignedIntegerDataTypes),
+    key = lambda _type: _type.typeWidth
+))

DeeployTest/testUtils/codeGenerate.py (1)
37-50: Remove buffer shape mutation and flatten outputs in _shapeBroadcast
In _shapeBroadcast, drop ctxt.lookup(f"{name}").shape = broadcastNum.shape and change the else branch to broadcastNum = value.reshape(-1) so that both global and local paths emit 1-D data without mutating the buffer's stored shape. Retain ctxt.is_global for the global check.

Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)
177-178: Python slicing bug: -0 yields an empty tuple.
When len(fixedPermStrides) == 0, rect.dims[:-0] is empty. Return the full dims in that case.

- return struct, remainderStrides, rect.dims[:-len(fixedPermStrides)]
+ drop = len(fixedPermStrides)
+ return struct, remainderStrides, rect.dims[:-drop] if drop > 0 else rect.dims

DeeployTest/testMVP.py (1)
272-279: Bug: signProp is hard-coded to False, disabling offsetting for sign-prop platforms.
This disables the input/output offset path below, likely breaking platforms that rely on sign propagation (and deployers that don't accept inputOffsets). Compute signProp from the platform to keep behavior consistent with setupDeployer().
Apply this diff:
- platform = deployer.Platform
- signProp = False
+ platform = deployer.Platform
+ # Keep main() consistent with setupDeployer(): derive platform sign-prop preference
+ _, signProp = mapPlatform(args.platform)
♻️ Duplicate comments (43)
Deeploy/Targets/MemPool/Templates/ITATemplate.py (1)
325-326: Sanitize nodeName before interpolating into C identifiers (prevents invalid/broken codegen).
If nodeName contains '/', '-', '.', spaces, or starts with a digit, the generated identifiers will be invalid C and compilation will fail. Use a C-safe variant when forming ${data_in_array} and ${quant_params_array}.
Apply this diff here:
- operatorRepresentation['data_in_array'] = f"{nodeName}_data_in_array"
- operatorRepresentation['quant_params_array'] = f"{nodeName}_quant_params_array"
+ operatorRepresentation['data_in_array'] = f"{nodeName_safe}_data_in_array"
+ operatorRepresentation['quant_params_array'] = f"{nodeName_safe}_quant_params_array"

Add the helper once at module scope and derive nodeName_safe near its first use:
@@
-from typing import Dict, Tuple
+from typing import Dict, Tuple
+import re
 import numpy as np
@@
+def _c_ident(s: str) -> str:
+    s = re.sub(r'[^A-Za-z0-9_]', '_', str(s))
+    if not re.match(r'[A-Za-z_]', s):
+        s = f'_{s}'
+    return s
@@
-    nodeName = operatorRepresentation['nodeName']
+    nodeName = operatorRepresentation['nodeName']
+    nodeName_safe = _c_ident(nodeName)

Optional: consider using nodeName_safe consistently for other identifier-like names built from nodeName in this file for uniform safety.
DeeployTest/testUtils/testRunner.py (2)
324-336: Verify CI/export passes this new check.
Ensure workflows set LLVM_INSTALL_DIR or pass --toolchain_install_dir wherever TestRunner is used.

#!/bin/bash
# Find TestRunner invocations and check if toolchain dir is provided or env is set.
rg -nP '\bpython\s+.*testRunner.*\.py\b' .github/workflows -C3 || true
rg -n "LLVM_INSTALL_DIR" .github/workflows || true
325-327: Replace assert with explicit validation; handle toolchain and path.
Asserts are stripped with -O and Ruff flags the f-string. Validate deterministically and check existence.

- assert self._args.toolchain_install_dir is not None, f"Environment variable LLVM_INSTALL_DIR is not set"
- self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+ # Validate toolchain install dir (LLVM only). Avoid asserts in production.
+ if self._args.toolchain.upper() == "LLVM":
+     if not self._args.toolchain_install_dir:
+         raise ValueError("Missing toolchain install dir: set --toolchain_install_dir or LLVM_INSTALL_DIR.")
+     self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir)
+     if not os.path.isdir(self._dir_toolchain):
+         raise FileNotFoundError(f"Toolchain directory not found: {self._dir_toolchain}")
+ else:
+     self._dir_toolchain = os.path.normpath(self._args.toolchain_install_dir) if self._args.toolchain_install_dir else ""

Deeploy/Targets/Generic/Templates/ITAMaxTemplate.py (1)
45-49: Use the hoisted buffer's actual name.
hoistTransientBuffer may mangle/alter the name; store and use the returned buffer name.

- ctxt.hoistTransientBuffer(name, size)
- operatorRepresentation['ctxtBuffer'] = name
+ buf = ctxt.hoistTransientBuffer(name, size)
+ operatorRepresentation['ctxtBuffer'] = getattr(buf, "name", name)
  operatorRepresentation['ctxtBufferSize'] = size
-
- return ctxt, operatorRepresentation, [name]
+ return ctxt, operatorRepresentation, [operatorRepresentation['ctxtBuffer']]

DeeployTest/Platforms/Siracusa/src/deeploytest.c (1)
172-179: Avoid unsigned wraparound; widen before subtraction and fix printf casts.
Compute the error in a wider signed type; print with matching format specifiers to avoid UB.

- OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
- OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
- int error = expected - actual;
- OUTPUTTYPE diff = (OUTPUTTYPE)(error < 0 ? -error : error);
+ OUTPUTTYPE expected = ((OUTPUTTYPE *)testOutputVector[buf])[i];
+ OUTPUTTYPE actual = ((OUTPUTTYPE *)compbuf)[i];
+ int32_t error = (int32_t)expected - (int32_t)actual;
+ uint32_t diff = (uint32_t)(error < 0 ? -error : error);
  if (diff) {
    tot_err += 1;
-   printf("Expected: %4d ", expected);
-   printf("Actual: %4d ", actual);
-   printf("Diff: %4d at Index %12u in Output %u\r\n", diff, i, buf);
+   printf("Expected: %4d ", (int)expected);
+   printf("Actual: %4d ", (int)actual);
+   printf("Diff: %4u at Index %12u in Output %u\r\n", diff, (unsigned)i, (unsigned)buf);
  }

Deeploy/EngineExtension/NetworkDeployers/EngineColoringDeployer.py (1)
65-70: Replace assert with a deterministic exception; treat empty/None engine as uncolored.
Asserts can be stripped; raise a stable, informative error.

- uncoloredNodes = [node for node in graph.nodes if "engine" not in node.attrs]
- uncoloredOperations = set(node.op for node in uncoloredNodes)
- assert len(
-     uncoloredNodes
- ) == 0, f"Missing engine color for nodes {[node.name for node in uncoloredNodes]} with operations {uncoloredOperations}"
+ uncoloredNodes = [n for n in graph.nodes if not isinstance(n.attrs.get("engine"), str) or not n.attrs.get("engine")]
+ if uncoloredNodes:
+     nodeNames = sorted((n.name or "<unnamed>") for n in uncoloredNodes)
+     uncoloredOps = sorted({n.op for n in uncoloredNodes if n.op is not None})
+     raise ValueError(f"Missing engine color for nodes {nodeNames} with operations {uncoloredOps}")

Deeploy/Targets/PULPOpen/Platform.py (1)
256-258: Switch to mchan_siracusa.h looks fine; verify there are no residual dory_dma usages and that headers resolve on all PULP variants.
Run the quick scan below and fix any remaining references if found.

#!/bin/bash
# Residual includes/usages of dory_dma
rg -nP 'dory_dma\.h|dory_dma_' -g '!**/build/**'
# Confirm new header is referenced and present
rg -nP 'mchan_siracusa\.h' -g '!**/build/**'
fd -HI 'mchan_siracusa.h'
# Optional: ensure include paths expose TargetLibraries/PULPOpen/inc
rg -nP 'include_paths|CFLAGS|CPPFLAGS' -n . -g '!**/build/**'

TargetLibraries/PULPOpen/inc/mchan_v6.h (1)
40-47: Good: enforce mutual exclusivity for event vs. polled modes.

Deeploy/Targets/Generic/TileConstraints/iRMSNormTileConstraint.py (1)
31-31: Fix potential overflow and dtype of the "size" replacement.
np.prod can exceed 16-bit; hardcoding uint16_t risks truncation. Derive the minimal adequate integer type after computing the sizes and ensure Python ints.
Apply:
-from Deeploy.CommonExtensions.DataTypes import uint16_t
+from Deeploy.CommonExtensions.DataTypes import minimalIntegerType

- replacements = {"size": []}
- replacementTypes = {"size": PointerClass(uint16_t)}
+ replacements = {"size": []}

- for cube in outputCubes:
-     newSize = np.prod(cube.dims)
-     replacements["size"].append(newSize)
+ for cube in outputCubes:
+     newSize = int(np.prod(cube.dims))
+     replacements["size"].append(newSize)

- variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)
+ replacementTypes = {"size": PointerClass(minimalIntegerType(replacements["size"]))}
+ variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes)

Also applies to: 79-80, 81-83, 95-97
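The minimalIntegerType helper referenced above is part of this PR's DataTypes refactor. As a rough illustration of the idea (pick the narrowest integer type that covers every value), a standalone sketch could look like the following; the candidate type list and the string return convention are assumptions, not the actual Deeploy implementation.

from typing import Iterable, Tuple

_UNSIGNED_RANGES: Tuple[Tuple[str, int, int], ...] = (
    ("uint8_t", 0, 2**8 - 1),
    ("uint16_t", 0, 2**16 - 1),
    ("uint32_t", 0, 2**32 - 1),
)
_SIGNED_RANGES: Tuple[Tuple[str, int, int], ...] = (
    ("int8_t", -2**7, 2**7 - 1),
    ("int16_t", -2**15, 2**15 - 1),
    ("int32_t", -2**31, 2**31 - 1),
)


def minimal_integer_type(values: Iterable[int]) -> str:
    # Choose signed candidates only if a negative value is present.
    values = list(values)
    lo, hi = min(values), max(values)
    candidates = _UNSIGNED_RANGES if lo >= 0 else _SIGNED_RANGES
    for name, tmin, tmax in candidates:
        if tmin <= lo and hi <= tmax:
            return name
    raise ValueError(f"No integer type covers range [{lo}, {hi}]")


assert minimal_integer_type([0, 300, 65535]) == "uint16_t"
assert minimal_integer_type([-1, 200]) == "int16_t"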
DeeployTest/testRunner_siracusa_mchandma.py (1)
50-53: Replace asserts with explicit validation and use zip(strict=True).
Same as earlier feedback; asserts can be stripped and the messages have a minor grammar issue.
Apply:

-assert len(inputShape) == len(tileShape), \
-    f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.'
-assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \
-    f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}'
+if len(inputShape) != len(tileShape):
+    raise ValueError(
+        f"Input and tile shape must have the same dimensionality. Got {len(inputShape)}D vs {len(tileShape)}D.")
+if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)):
+    raise ValueError(
+        f"Each tile dimension must be <= the corresponding input one. Got tile={tileShape}, input={inputShape}.")

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
39-41: Consider extracting rank-aware cube construction into a shared helper.
GEMM/MatMul (and Snitch counterparts) now duplicate rank/prefix handling. A single utility (e.g., make_ranked_cube(offset2d, dims2d, batch_off, batch_size, b_off=0, b_size=1)) would de-duplicate and harden behavior across targets.
I can sketch this helper and migrate both PULPOpen and Snitch variants in a follow-up if desired.
Deeploy/Targets/PULPOpen/TileConstraints/MatMulTileConstraint.py (1)
115-119: Bug: transA/transB ignored in serialization (tiles A/B as non-transposed).
NSize and the A/B sub-rectangles don't account for the transpose flags. This mis-tiles when transA/transB == 1.
Apply:
@@
-        buffA = ctxt.lookup(operatorRepresentation['A'])
-        buffB = ctxt.lookup(operatorRepresentation['B'])
+        buffA = ctxt.lookup(operatorRepresentation['A'])
+        buffB = ctxt.lookup(operatorRepresentation['B'])
+        transA = int(operatorRepresentation.get("transA", 0))
+        transB = int(operatorRepresentation.get("transB", 0))
@@
-        NSize = buffA.shape[-1]
+        NSize = buffA.shape[-2] if transA else buffA.shape[-1]
@@
-            AMatrixOffsets = (MOffset, NOffset)
-            AMatrixShape = (MSize, NSize)
+            if transA == 0:
+                AMatrixOffsets = (MOffset, NOffset)
+                AMatrixShape = (MSize, NSize)
+            else:
+                AMatrixOffsets = (NOffset, MOffset)
+                AMatrixShape = (NSize, MSize)
@@
-            BMatrixOffsets = (NOffset, OOffset)
-            BMatrixShape = (NSize, OSize)
+            if transB == 0:
+                BMatrixOffsets = (NOffset, OOffset)
+                BMatrixShape = (NSize, OSize)
+            else:
+                BMatrixOffsets = (OOffset, NOffset)
+                BMatrixShape = (OSize, NSize)

#!/bin/bash
# Check for other MatMul/GEMM serializers still using buffA.shape[-1] for N unconditionally
rg -n -C2 -P 'serializeTilingSolution\(|shape\[-1\]\)' Deeploy/Targets | rg -n -P 'MatMul|GEMM'

Also applies to: 148-153
TargetLibraries/PULPOpen/inc/mchan_v7.h (2)
68-71: Make register pointers uintptr_t-safe and assert 32-bit target.Prevents silent truncation when compiled on hosts with 64-bit pointers.
Apply this diff:
-#include "pmsis.h" +#include "pmsis.h" +#include <stdint.h> + +_Static_assert(sizeof(uintptr_t) == 4, "MCHAN assumes 32-bit addresses."); @@ -static volatile uint32_t *const cmd_ptr = - (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x0); -static volatile uint32_t *const status_ptr = - (volatile uint32_t *const)(MCHAN_BASE_ADDR + 0x4); +static volatile uint32_t *const cmd_ptr = + (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x0u); +static volatile uint32_t *const status_ptr = + (volatile uint32_t *const)((uintptr_t)MCHAN_BASE_ADDR + 0x4u);
73-113: Inline helpers and cast pointers via uintptr_t before truncation.Avoid multiple-definition bloat and UB when casting pointers to 32-bit regs.
Apply this diff:
-static void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) { +static inline void mchan_transfer_1d(uint32_t cmd, void *loc, void *ext) { // TODO: assert flags are set correctly - *cmd_ptr = (uint32_t)cmd; - *cmd_ptr = (uint32_t)loc; - *cmd_ptr = (uint32_t)ext; + *cmd_ptr = (uint32_t)cmd; + *cmd_ptr = (uint32_t)(uintptr_t)loc; + *cmd_ptr = (uint32_t)(uintptr_t)ext; } @@ -static void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext, +static inline void mchan_transfer_2d_loc_strided(uint32_t cmd, void *loc, void *ext, uint32_t loc_size_1d, uint32_t loc_stride_2d) { // TODO: assert flags are set correctly - *cmd_ptr = (uint32_t)cmd; - *cmd_ptr = (uint32_t)loc; - *cmd_ptr = (uint32_t)ext; + *cmd_ptr = (uint32_t)cmd; + *cmd_ptr = (uint32_t)(uintptr_t)loc; + *cmd_ptr = (uint32_t)(uintptr_t)ext; *cmd_ptr = (uint32_t)loc_size_1d; *cmd_ptr = (uint32_t)loc_stride_2d; } @@ -static void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext, +static inline void mchan_transfer_2d_ext_strided(uint32_t cmd, void *loc, void *ext, uint32_t ext_size_1d, uint32_t ext_stride_2d) { // TODO: assert flags are set correctly - *cmd_ptr = (uint32_t)cmd; - *cmd_ptr = (uint32_t)loc; - *cmd_ptr = (uint32_t)ext; + *cmd_ptr = (uint32_t)cmd; + *cmd_ptr = (uint32_t)(uintptr_t)loc; + *cmd_ptr = (uint32_t)(uintptr_t)ext; *cmd_ptr = (uint32_t)ext_size_1d; *cmd_ptr = (uint32_t)ext_stride_2d; } @@ -static void mchan_transfer_2d_loc_strided_ext_strided( +static inline void mchan_transfer_2d_loc_strided_ext_strided( uint32_t cmd, void *loc, void *ext, uint32_t loc_size_1d, uint32_t loc_stride_2d, uint32_t ext_size_1d, uint32_t ext_stride_2d) { // TODO: assert flags are set correctly - *cmd_ptr = (uint32_t)cmd; - *cmd_ptr = (uint32_t)loc; - *cmd_ptr = (uint32_t)ext; + *cmd_ptr = (uint32_t)cmd; + *cmd_ptr = (uint32_t)(uintptr_t)loc; + *cmd_ptr = (uint32_t)(uintptr_t)ext; *cmd_ptr = (uint32_t)ext_size_1d; *cmd_ptr = (uint32_t)ext_stride_2d; *cmd_ptr = (uint32_t)loc_size_1d; *cmd_ptr = (uint32_t)loc_stride_2d; }DeeployTest/testDmas.py (1)
29-46: Harden subprocess call: build argv list; avoid shell=True.Prevents shell injection and quoting issues; aligns with static analysis (S602). Also keep "-DNUM_CORES=8" as a single argv element to preserve current parsing.
- cmd = [f"python {testRunner}", f"-t test{dma}", "-DNUM_CORES=8"] - cmd.append(f"--input-shape {' '.join(str(x) for x in inputShape)}") - cmd.append(f"--tile-shape {' '.join(str(x) for x in tileShape)}") - cmd.append(f"--node-count {nodeCount}") - cmd.append(f"--type {dataType}") + import sys, shlex + cmd = [ + sys.executable, testRunner, + "-t", f"test{dma}", + "-DNUM_CORES=8", + "--input-shape", *[str(x) for x in inputShape], + "--tile-shape", *[str(x) for x in tileShape], + "--node-count", str(nodeCount), + "--type", dataType, + ] if doublebuffer: cmd.append("--doublebuffer") - full_cmd = " ".join(cmd) + full_cmd = shlex.join(cmd) print(f"Running command:\n{full_cmd}\n") try: - subprocess.run(full_cmd, shell = True, check = True) + subprocess.run(cmd, check = True) except subprocess.CalledProcessError: print(f"test{dma}: Failed test:" + cfg_str) print(f"Rerun with command:\n{full_cmd}") - exit(-1) + import sys as _sys + _sys.exit(1)DeeployTest/testRunner_snitch_dma.py (1)
88-93: Ensure float inputs are float32; drop unused variable.np.random.rand returns float64; cast to float32. Remove unused np.iinfo.
-if not testRunner._args.skipgen: - if dtype == np.float32: - test_inputs = np.random.rand(*inputShape) - else: - info = np.iinfo(dtype) - test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape) +if not testRunner._args.skipgen: + if dtype == np.float32: + test_inputs = np.random.rand(*inputShape).astype(np.float32) + else: + test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)DeeployTest/testRunner_siracusa_l3dma.py (2)
50-53: Replace asserts with explicit validation and use zip(strict=...)Asserts can be stripped with -O and misshape pairs won’t be detected. Use explicit checks.
-assert len(inputShape) == len(tileShape), \ - f'Input and tile shape should be of the same dimensionality. Received {len(inputShape)}D input shape vs. {len(tileShape)}D tile shape.' -assert all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape)), \ - f'Each tile shape dimension should be smaller then the corresponding input one. Received {tileShape} > {inputShape}' +if len(inputShape) != len(tileShape): + raise ValueError( + f'Input and tile shape must have the same dimensionality. Received {len(inputShape)}D vs. {len(tileShape)}D.' + ) +# If Python < 3.10, drop strict=True. +if not all(tileDim <= inDim for inDim, tileDim in zip(inputShape, tileShape, strict=True)): + raise ValueError( + f'Each tile dimension must be <= the corresponding input one. Received tiles {tileShape} > input {inputShape}.' + )
81-86: Fix float32 dtype and remove unused variable
np.random.randyields float64; cast to match float32 tensors. Remove the unusedinfo.-if dtype == np.float32: - test_inputs = np.random.rand(*inputShape) +if dtype == np.float32: + test_inputs = np.random.rand(*inputShape).astype(dtype) else: - info = np.iinfo(dtype) test_inputs = np.arange(stop = np.prod(inputShape), dtype = dtype).reshape(inputShape)Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
33-36: Replace asserts with runtime checks; fix device name and typos (duplicate).Asserts can be stripped with -O and messages reference “Mchan” and “contigous”. Use explicit exceptions, correct device name to L3Dma, and “contiguous”.
- assert strideExt[-1] == 1, \ - "Mchan supports only contigous transfers of the innermost dimension for external memory" - assert strideLoc[0] == shape[1] and strideLoc[1] == 1, \ - f"Mchan supports only contigous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}" + if strideExt[-1] != 1: + raise ValueError("L3Dma supports only contiguous transfers of the innermost dimension for external memory") + if not (len(shape) >= 2 and strideLoc[0] == shape[1] and strideLoc[1] == 1): + raise ValueError( + f"L3Dma supports only contiguous transfers for local memory. Received local shape: {shape}, stride: {strideLoc}" + )
43-48: Pass DMA sizes in bytes, not elements (duplicate).pi_cl_ram_copy_2d expects size/stride/length in bytes. Multiply by element width.
- operatorRepresentation.update({ - "ext2loc": 1 if direction == "ExternalToLocal" else 0, - "transfer_size": math.prod(shape), - "length": shape[1], - "stride": strideExt[0], - }) + bytes_per_elem = externalBuffer._type.referencedType.typeWidth // 8 + operatorRepresentation.update({ + "ext2loc": 1 if direction == "ExternalToLocal" else 0, + "transfer_size": math.prod(shape) * bytes_per_elem, + "length": shape[1] * bytes_per_elem, + "stride": strideExt[0] * bytes_per_elem, + })Deeploy/Targets/PULPOpen/CodeTransformationPasses/AutoTransposeUtils.py (1)
41-42: Bug: HyperRectangle ctor args swapped (offset vs dims).
The constructor expects (offset, dims); passing (dims, offset) breaks minimization/stride derivation.

- maxTransferRect = HyperRectangle(maxTransferDims, inRect.offset)
+ maxTransferRect = HyperRectangle(offset=inRect.offset, dims=maxTransferDims)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
78-88: Shared structure with SB path — consider factoring common pieces into base.Ingress/egress scheduling, reference hoisting, and future orchestration duplicate SB logic. A small helper in the base could reduce drift.
132-140: Fix DMA prefetch future: not initialized/waited/deinitialized — race with first tile.Reuse the main
futureso the existing init/wait/deinit covers the prefetch. The standaloneinitialFutureis never init'ed/waited/deinit'ed.- gen = AnydimAsyncDmaTransferAdapter(self.dma) - - initialFuture = self.dma.getFuture(tensorName + "_init") - initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims, - stridesFromShape(externalBufferShape), - stridesFromShape(rectangles[0].dims), "ExternalToLocal", - initialFuture, math.prod(externalBufferShape)) + gen = AnydimAsyncDmaTransferAdapter(self.dma) + # Reuse the same future so it's properly init'ed, waited, and deinit'ed + initialDmaTransferCalls = gen.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims, + stridesFromShape(externalBufferShape), + stridesFromShape(rectangles[0].dims), "ExternalToLocal", + future, math.prod(externalBufferShape))Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4)
111-118: Enforce tile-count invariants; keep single source of truth for countsValidate ingress/egress lengths match and ensure
totalNumTilesreflects the schedule length.- metaInfo = TilingMetaInfo( + if len(tilingSchedule.inputLoadSchedule) != len(tilingSchedule.outputLoadSchedule): + raise ValueError( + f"Tiling schedule ingress/egress length mismatch: " + f"{len(tilingSchedule.inputLoadSchedule)} vs {len(tilingSchedule.outputLoadSchedule)}" + ) + metaInfo = TilingMetaInfo( nodeName = operatorRepresentation['nodeName'] + f"_{self.externalMemory}", nodeOps = operatorRepresentation['nodeOps'], numTiles = operatorRepresentation['numTiles'], totalNumTiles = len(tilingSchedule.outputLoadSchedule), tileIdxPtr = operatorRepresentation['tileIdxPtr'], tileIdxVar = "TILING_I",Would you like a follow-up patch to assert equality with the value used by
_hoistTileNumAndIdxPtr?
27-27: Replace Set-based futures with ordered, de-duplicated List; remove unused variable; ensure deterministic init/deinit/wait orderSets make codegen order non-deterministic and break stable teardown ordering; also
referenceUpdatesis unused. Switch toList[Future], de-duplicate while preserving insertion order, and reverse deinit.-from typing import Dict, List, Set, Tuple +from typing import Dict, List, Tuple @@ - tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], Set[Future]]: + tileIdxVar: str, direction: DmaDirection) -> Tuple[NetworkContext, List[CodeSnippet], List[Future]]: callStack: List[CodeSnippet] = [] - referenceUpdates: List[CodeSnippet] = [] - futures: Set[Future] = set() + futures: List[Future] = [] @@ - future = self.dma.getFuture(tensorName) - futures.add(future) + future = self.dma.getFuture(tensorName) + if future not in futures: + futures.append(future) @@ - ingressDmaWaitStatements = [future.wait() for future in ingressFutures] - egressDmaWaitStatements = [future.wait() for future in egressFutures] + ingressDmaWaitStatements = [future.wait() for future in ingressFutures] + egressDmaWaitStatements = [future.wait() for future in egressFutures] @@ - setupStatements = self.dma.setup() - setupStatements += [f.init() for f in ingressFutures | egressFutures] - - teardownStatements = self.dma.teardown() - teardownStatements.extend(f.deinit() for f in ingressFutures | egressFutures) + setupStatements = self.dma.setup() + allFutures: List[Future] = [] + allFutures.extend(ingressFutures) + for f in egressFutures: + if f not in allFutures: + allFutures.append(f) + setupStatements += [f.init() for f in allFutures] + + teardownStatements = self.dma.teardown() + teardownStatements.extend(f.deinit() for f in reversed(allFutures))Also applies to: 49-53, 99-107
56-63: Replace asserts with explicit checks; guard missing constraints/shapeAsserts can be stripped with -O and indexing without
getcan KeyError. Raise clear exceptions and validate presence.- assert localBuffer._memoryLevel == self.localMemory - assert isinstance(localBuffer, _ReferenceBuffer) - externalBuffer = ctxt.lookup(localBuffer._referenceName) - assert isinstance(externalBuffer, VariableBuffer) - tensorMemoryConstraint = tensorMemoryConstraintDict[externalBuffer.name] - externalBufferShape = tensorMemoryConstraint.memoryConstraints[self.externalMemory].shape - assert externalBufferShape is not None + if localBuffer._memoryLevel != self.localMemory: + raise ValueError( + f"Local buffer '{localBuffer.name}' expected in '{self.localMemory}', " + f"got '{localBuffer._memoryLevel}'" + ) + if not isinstance(localBuffer, _ReferenceBuffer): + raise TypeError(f"Expected _ReferenceBuffer for '{localBuffer.name}', got {type(localBuffer).__name__}") + externalBuffer = ctxt.lookup(localBuffer._referenceName) + if not isinstance(externalBuffer, VariableBuffer): + raise TypeError( + f"Expected VariableBuffer for external reference '{localBuffer._referenceName}', " + f"got {type(externalBuffer).__name__}" + ) + tensorMemoryConstraint = tensorMemoryConstraintDict.get(externalBuffer.name) + if tensorMemoryConstraint is None: + raise KeyError(f"Missing TensorMemoryConstraint for '{externalBuffer.name}'") + memCnstr = tensorMemoryConstraint.memoryConstraints.get(self.externalMemory) + if memCnstr is None or memCnstr.shape is None: + raise KeyError( + f"Missing shape for '{externalBuffer.name}' in memory level '{self.externalMemory}'" + ) + externalBufferShape = memCnstr.shape
92-98: Use DmaDirection enum, not string literalsPassing strings breaks type expectations and templates.
- ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls( - ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule, - nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", "ExternalToLocal") + ctxt, ingressDmaTransferCalls, ingressFutures = self._generateTransferScheduleCalls( + ctxt, operatorRepresentation, tilingSchedule.inputLoadSchedule, + nodeMemoryConstraint.inputTensorMemoryConstraints, "TILING_I", DmaDirection.ExternalToLocal) @@ - ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls( - ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule, - nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", "LocalToExternal") + ctxt, egressDmaTransferCalls, egressFutures = self._generateTransferScheduleCalls( + ctxt, operatorRepresentation, tilingSchedule.outputLoadSchedule, + nodeMemoryConstraint.outputTensorMemoryConstraints, "TILING_I", DmaDirection.LocalToExternal)Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
33-38: Replace assert; fix typo and stray f-stringUse explicit exceptions; “contiguous” spelling; drop unnecessary f-prefix.
def checkTransfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, localBuffer: VariableBuffer, shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...], direction: DmaDirection) -> None: super().checkTransfer(ctxt, externalBuffer, localBuffer, shape, strideExt, strideLoc, direction) - assert strideLoc[1] == 1 and strideExt[1] == 1, f"Supports only contigous transfers in the innermost dimension" + if not (strideLoc[1] == 1 and strideExt[1] == 1): + raise ValueError("Supports only contiguous transfers in the innermost dimension")
39-51: Use DmaDirection enum and convert element counts to byte counts for DMA APIsnrt expects sizes/strides in bytes. Multiply by element size and avoid string directions.
def transferOpRepr(self, externalBuffer: VariableBuffer, localBuffer: VariableBuffer, shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...], direction: DmaDirection, future: Future) -> OperatorRepresentation: _ = future + bytes_per_elem = localBuffer._type.referencedType.typeWidth // 8 operatorRepresentation: OperatorRepresentation = { - "dest": localBuffer.name if direction == "ExternalToLocal" else externalBuffer.name, - "src": externalBuffer.name if direction == "ExternalToLocal" else localBuffer.name, - "repeat": shape[0], - "size": shape[1], - "stride_dest": strideLoc[0] if direction == "ExternalToLocal" else strideExt[0], - "stride_src": strideExt[0] if direction == "ExternalToLocal" else strideLoc[0], + "dest": localBuffer.name if direction == DmaDirection.ExternalToLocal else externalBuffer.name, + "src": externalBuffer.name if direction == DmaDirection.ExternalToLocal else localBuffer.name, + "repeat": shape[0], + "size": shape[1] * bytes_per_elem, + "stride_dest": (strideLoc[0] if direction == DmaDirection.ExternalToLocal else strideExt[0]) * bytes_per_elem, + "stride_src": (strideExt[0] if direction == DmaDirection.ExternalToLocal else strideLoc[0]) * bytes_per_elem, } return operatorRepresentationDeeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
55-57: Fix off-by-one error in size validation.
17 bits can encode values from 0 to 2^17 - 1. The assertion should use strict inequality.
Apply this fix:

- assert mchanTransferSize <= 2**17, (
+ assert mchanTransferSize < 2**17, (
      "The Dma transfer size for mchan should be representable with 17 bits, "
      f"current number of bits required is {math.ceil(math.log2(mchanTransferSize))}")
61-64: Convert element counts to byte counts for DMA parameters.
The DMA hardware expects byte counts, but the code is passing element counts for size_1d and stride_2d.
You need to multiply by the element size in bytes. Add this before line 61:

# Get element size from buffer type
elementSizeInBytes = externalBuffer._type.referencedType.typeWidth // 8

Then update the assignments:

  if transferRank == 2:
-     operatorRepresentation["size_1d"] = shape[1]
-     operatorRepresentation["stride_2d"] = strideExt[0]
+     operatorRepresentation["size_1d"] = shape[1] * elementSizeInBytes
+     operatorRepresentation["stride_2d"] = strideExt[0] * elementSizeInBytes

Deeploy/TilingExtension/CodeTransformationPasses/TilingVariableReplacement.py (1)
53-58: Improve arena allocation robustness and formatting.The arena allocation method has several issues:
- No bounds checking for the offset
- F-string embedded in template string makes it less readable
- Missing arena size validation
Consider this improved implementation:
def _arenaAllocate(self, ctxt: NetworkContext, buffer: VariableBuffer, offset: int) -> VariableBuffer: arena = ctxt.lookup(self.arenaName) - buffer.allocTemplate = NodeTemplate(" \ - ${type.typeName} ${name} = (${type.typeName}) " + f"((char*){str(arena._instance)} + {offset});") + # Add bounds checking if arena has size attribute + if hasattr(arena, 'size'): + assert 0 <= offset < arena.size, f"Offset {offset} out of bounds for arena {self.arenaName} (size={arena.size})" + + # Use cleaner template with placeholders + buffer.allocTemplate = NodeTemplate(""" + ${type.typeName} ${name} = (${type.typeName}) ((char*)${arena_ptr} + ${offset});""") + buffer._arena_ptr = str(arena._instance) + buffer._offset = offset buffer.deallocTemplate = NodeTemplate("") return bufferAlso update the buffer's
_bufferRepresentationmethod to include these fields:def _bufferRepresentation(self) -> Dict: repr = super()._bufferRepresentation() if hasattr(self, '_arena_ptr'): repr['arena_ptr'] = self._arena_ptr repr['offset'] = self._offset return reprDeeploy/TilingExtension/TilerExtension.py (1)
938-941: Fix off-by-one and avoid private field access for lifetime.
Outputs should be alive through the last step index len(schedule) - 1. Also use the public lifetime property.

- for tensor in graph.outputs:
-     assert memoryBlockMap[tensor.name]._lifetime[-1] == len(
-         schedule), "Invalid memory map! Output buffer is not alive at the last step!"
+ for tensor in graph.outputs:
+     assert memoryBlockMap[tensor.name].lifetime[-1] == len(schedule) - 1, \
+         "Invalid memory map! Output buffer is not alive at the last step!"

DeeployTest/testUtils/dmaUtils.py (1)
354-370: Validate tileShape vs input shape before generating tiling.Preempt invalid tilings early (rank, positivity, bounds, divisibility).
def prepare_deployer_with_custom_tiling(deployer: NetworkDeployer, defaultMemory: str, targetMemory: str, tileShape: Tuple[int, ...], doublebuffer: bool) -> None: # Decomposed deployer.prepare() to enter a custom tiling solution deployer.frontEnd() super(TilerDeployerWrapper, deployer).bind() + inputShape = tuple(deployer.graph.inputs[0].shape) + assert len(tileShape) == len(inputShape), \ + f"Tile rank {len(tileShape)} doesn't match input rank {len(inputShape)}" + for i, (t, s) in enumerate(zip(tileShape, inputShape)): + assert t > 0, f"Tile dim {i} must be > 0" + assert t <= s, f"Tile dim {i}: {t} exceeds input dim {s}" + assert s % t == 0, f"Input dim {i}={s} not divisible by tile dim {t}" + tilingSolution, memoryMap = generate_tiling( ctxt = deployer.ctxt, memoryStart = defaultMemory, memoryOrder = [defaultMemory, targetMemory], memoryHierarchy = deployer.Platform.memoryHierarchy, - inputShape = deployer.graph.inputs[0].shape, + inputShape = inputShape, tileShape = tileShape, graph = deployer.graph, _type = deployer.inputTypes['input_0'].referencedType, doublebuffer = doublebuffer, )Deeploy/CommonExtensions/CodeTransformationPasses/IntrospectiveCodeTransformation.py (2)
49-73: Update template source and refresh parse-tree cache after reconstruction.Without syncing _source and re-keying parseTreeDict, subsequent mutations use stale trees.
def _reconstructCode(template: Template, node: TemplateNode) -> Template: @@ - module = types.ModuleType(template.module_id) - code = compile(source, template.module_id, "exec") - exec(code, module.__dict__, module.__dict__) + module = types.ModuleType(template.module_id) + code = compile(source, template.module_id, "exec") + exec(code, module.__dict__, module.__dict__) - template._code = code + # Refresh source and parse-tree cache + old_hash = hash(getattr(template, "_source", source)) + template._source = source + new_hash = hash(template._source) + d = IntrospectiveCodeTransformationMixIn.parseTreeDict + if old_hash in d: + try: + del d[old_hash] + except KeyError: + pass + d[new_hash] = node + + template._code = code template.module = module template.callable_ = template.module.render_body return template
171-181: Don’t use assert for runtime type checks in library code.Replace with explicit exception; asserts can be stripped with -O.
- def _unrollStructReferences(val: Struct) -> List[str]: - assert isinstance(val, Struct) + def _unrollStructReferences(val: Struct) -> List[str]: + if not isinstance(val, Struct): + raise TypeError(f"Expected Struct, got {type(val).__name__}")Deeploy/TilingExtension/TilingCodegen.py (1)
247-251: Missing documentation and tests for stride calculation remain unaddressed.
Based on the past review comment, there are still no unit tests or documentation for the stridesFromShape function. The function assumes row-major ordering, which should be documented and tested.
Would you like me to generate unit tests and documentation for this critical utility function to ensure correct stride calculations across different tensor shapes?
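As a concrete starting point for those tests, here is a hedged sketch of row-major (C-order) stride derivation together with the kind of unit test asked for above. It mirrors what stridesFromShape is described to do in this PR; the real function lives in Deeploy/TilingExtension/TilingCodegen.py and its exact signature and return type are assumptions here.

from typing import Sequence, Tuple


def strides_from_shape(shape: Sequence[int]) -> Tuple[int, ...]:
    # Row-major layout: the innermost dimension is contiguous (stride 1) and each
    # outer stride is the product of all inner dimension sizes.
    strides = [1] * len(shape)
    for i in reversed(range(len(shape) - 1)):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def test_strides_from_shape() -> None:
    assert strides_from_shape([4]) == (1,)
    assert strides_from_shape([2, 3]) == (3, 1)
    assert strides_from_shape([2, 3, 5]) == (15, 5, 1)


test_strides_from_shape()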
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (1)
130-143: Potential template/snippet count mismatch can cause silent failures.
The loop assumes each transfer produces the same number of snippets as templates, which could drift and cause misaligned templating if the adapter returns different counts for different tiles.
Add a validation check after generating snippets:

  for rect in transfers:
      snippets = gen.transfer(ctxt, externalBuffer, localBuffer, rect.dims,
                              stridesFromShape(externalBuffer.shape),
                              stridesFromShape(rect.dims), direction, future,
                              math.prod(externalBuffer.shape))
+     if len(snippets) != len(templates):
+         raise RuntimeError(f"Adapter returned {len(snippets)} snippets, expected {len(templates)} "
+                            f"for rectangle {rect}")
      for i, snippet in enumerate(snippets):
          opReprUpdates[i].append(snippet.operatorRepresentation)

Deeploy/DeeployTypes.py (1)
539-565: Missing boundary checking for reference buffer offset.
As noted in past reviews, there's no validation that the offset is within valid bounds of the referenced buffer. This could lead to out-of-bounds memory access at runtime.
The offset validation is critical for memory safety. Would you like me to implement the boundary checking logic that was discussed in the previous review, or would you prefer to track this as a separate issue for future implementation?
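For reference, a minimal sketch of the kind of check being discussed, assuming the offset is expressed in elements of the referenced buffer; the function and argument names are illustrative stand-ins rather than the Deeploy API.

import math


def check_reference_offset(referencedShape, offset: int) -> None:
    # Offset is assumed to be an element index into the referenced buffer.
    numElements = math.prod(referencedShape)
    if not (0 <= offset < numElements):
        raise ValueError(f"Reference offset {offset} is out of bounds for a buffer of "
                         f"{numElements} elements (shape {tuple(referencedShape)})")


check_reference_offset((4, 16), 32)    # fine: 32 < 64
# check_reference_offset((4, 16), 64)  # would raise: one past the end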
Deeploy/TilingExtension/AsyncDma.py (4)
149-151: Use size_t for loop indices.
Loop bounds are sizes; prefer size_t to avoid narrowing/truncation.

- templateStr += f"for (uint32_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"
+ templateStr += f"for (size_t {iter} = 0; {iter} < ${'{'}end_{level}{'}'}; {iter}++) {{"
74-82: Replace asserts with exceptions; fix stray f-string.Asserts may be stripped under -O and the first f-string has no placeholders. Raise ValueError instead.
- assert transferRank == len(strideLoc) and transferRank == len( - strideExt), f"The shape and stride rank should match" - assert transferRank in self.supportedTransferRanks( - ), f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}" + if not (transferRank == len(strideLoc) and transferRank == len(strideExt)): + raise ValueError("The shape and stride rank should match") + if transferRank not in self.supportedTransferRanks(): + raise ValueError( + f"Unsupported transfer rank {transferRank}. Supported ranks are {self.supportedTransferRanks()}" + )
163-169: Widen offset type to size_t.
Offsets can exceed 32 bits on 64-bit targets; use size_t.

- templateStr = f"const uint32_t {name} = "
+ templateStr = f"const size_t {name} = "
171-171: Fix non-standard void* arithmetic; cast through uintptr_t.
Standard C forbids void* arithmetic, and the offset should be size_t/uintptr_t. Also ensure the generated TU includes <stdint.h> (for uintptr_t) and <stddef.h> (for size_t).

- offsetPtrTemplate = NodeTemplate("void * const ${resultPtr} = (void *)${basePtr} + ${offset};")
+ offsetPtrTemplate = NodeTemplate(
+     "void * const ${resultPtr} = (void *)((uintptr_t)${basePtr} + ${offset});"
+ )

Action: verify the codegen path emits the necessary includes once per TU.
@lukamac Thanks for the changes and the high-quality work. I went over your changes and the (annoyingly many) comments from the AI. FYI, I disabled auto-review now. However, there are some comments that I think are valid. Please go over the open conversations and fix/resolve them. Afterwards, I am happy to merge this PR. Please do not forget to extend the CHANGELOG.md!
Xeratec
left a comment
LGTM, thanks for the great work!
* Refactor mchan hal
* Refactor IntrospectiveCodeTransformation
* Refactor MemoryAllocation
* Add minimalIntegerType helper function
* Small refactor DeeployTypes
* Change Neureka tile constraints to new TilingCodegen function
* Small refactors:
  Check for LLVM_INSTALL_DIR environment variable
  Fix typo
  Check for SNITCH_HOME environment variable and crash if not present
  Change test output difference to absolute difference
  Improve engine coloring error message
  Fix type hint
* Permutation refactor
* Refactor TransposeTileConstraint
* Remove manual name mangling from templates since it's automatically done in the ExecutionBlock.generate()
* Change serialize to produce same shape rank as original
* Refactor TilingExtension
* Port PULPOpen
* Port Snitch
* DeeployTest: Extract generic tiling code into tilingUtils.py
* DeeployTest: Extract common test generation code
* DeeployTest: Add Dma tests
* Apply Philip's comments:
  Remove dory_dma.h
  Fix hoistReference doc comment
  Use the shape argument of the _hoistReference function
  Rename dma test runners
  Change kernelLevelTiling HACK comment to a TODO
  Add DMA folder to targets with DMAs
  Fix wrong deeployStateDir
  Single source of truth for the tiling arena name
* Add unravelReference doc comment and fix the dealiasBuffer's
* Refactor type inference and minimal(Integer|Float)Type
* Revert extra inputs hack
* Add mchan check for both event- and poll-based event checking flags being set
* Fix HyperRectangle arg order
* Fix mchan check whether size is representable within 17 bits
* Fix init, deinit, wait on initialFuture in DoubleBuffering, rename gen to anydimAdapter
* Fix GEMM tile constraint serialization to check transA and transB
* Fix inherit from ABC in AsyncDma and AsyncDmaWaitingStrategy
* Fix use tileSizeInBytes to check whether it fits in memory
* Update changelog
* Add missing transferOpRepr abstract method from the BlockingAsyncDmaAdapter
The main contribution of this PR is abstracting the DMA logic into the AsyncDma class and showing that it works with the currently supported DMAs: PULP's mchan and L3 DMA, and Snitch's cluster DMA.
The main goal was enabling any-dimensional transfers, which is realized as the AnydimAsyncDmaTransferAdapter.
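To make the adapter's role concrete, here is a compressed sketch of the rank-adaptation idea, following the three branches shown in the first sequence diagram of this PR (equal ranks, kernel rank below the transfer rank, kernel rank above it). The real adapter emits C code snippets through the DMA's templates; plain Python dispatch stands in for that here, and the helper names are illustrative rather than the actual API.

from typing import Sequence


def adapt_transfer(kernelRank: int, shape: Sequence[int], strides: Sequence[int]):
    transferRank = len(shape)
    if kernelRank == transferRank:
        # The DMA engine can express the transfer directly.
        return [("dma_transfer", tuple(shape), tuple(strides))]
    if kernelRank < transferRank:
        # Peel the outer dimensions into loops; each iteration issues one
        # kernel-rank transfer from an offset pointer.
        outer = shape[:-kernelRank]
        inner = shape[-kernelRank:]
        innerStrides = strides[-kernelRank:]
        return [("loop_nest", tuple(outer)),
                ("dma_transfer", tuple(inner), tuple(innerStrides))]
    # kernelRank > transferRank: pad with unit dimensions so the engine's expected
    # rank is met (the stride of a size-1 dimension is irrelevant).
    pad = kernelRank - transferRank
    return [("dma_transfer", (1,) * pad + tuple(shape), (0,) * pad + tuple(strides))]


print(adapt_transfer(2, (4, 8, 16), (128, 16, 1)))
# -> [('loop_nest', (4,)), ('dma_transfer', (8, 16), (16, 1))]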
Performance comparison
Comparison of network execution on Siracusa emulated in GVSoC
I also wanted to test Snitch, but the program only prints 0 cycles. Visually checking the code, there is no fundamental change except that we now emit fewer barriers.
Added
* _indexPointer(), indexVars(), dereferenceVars(). The *Vars functions index/dereference a list of variables (useful for tiling)
* unravelReference() that unravels a _ReferenceBuffer until the base buffer (see the sketch after this list)
* is_object() - helper function that determines whether the string represents a name of a local or global object
* is_buffer() - helper function that determines whether the string represents a name of a buffer
* _permuteHyperRectangle helper function
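A hedged sketch of the unravelReference() idea from the list above: follow the chain of _ReferenceBuffer objects until a non-reference base buffer is reached. The attribute and lookup names are modelled on how references appear elsewhere in this PR and are assumptions, not the actual implementation.

def unravel_reference(ctxt, buffer):
    # Walk _referenceName links until we hit a buffer that is not a reference.
    seen = set()
    while hasattr(buffer, "_referenceName"):
        if buffer.name in seen:  # guard against accidental reference cycles
            raise ValueError(f"Reference cycle detected at '{buffer.name}'")
        seen.add(buffer.name)
        buffer = ctxt.lookup(buffer._referenceName)
    return buffer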
Changed
* offset argument to offset the reference
* hoistReference - accepts the actual buffer as reference instead of name, accepts shape, offset, and override_type arguments, and returns the actual buffer, not its name
* _mangleNodeRep -> _mangleOpRepr - the canonical way we use is OperatorRepresentation. NodeRep and ParseDict are old iterations of that.
* permute is an action that permutes something, permutation is a function that generates a permutation
* _permuteList to just _permute
* serializeTilingSolution functions to preserve the same shape rank
* generateTestNetwork function
PR Merge Checklist
* Rebased on the latest devel commit and pointing to devel.
* CHANGELOG.md file has been updated.