Modular Code Generator: Complete Design Document #2036
Copilot wants to merge 6 commits into design-docs from
Conversation
Please do not merge until everyone has had a chance to comment, modify, and approve.
tbennun left a comment
This looks great! I have some comments on the ordering and the meaning of some of the components, but the rest of the components look perfect.
@copilot additionally, how would I create a new backend implementation (e.g., a new platform, or a different language like LLVM IR)? What would change in the design? Could you please add another document that explains how the modular code generation can be extended w.r.t. those two cases?
…, improve documentation Co-authored-by: tbennun <8348955+tbennun@users.noreply.github.com>
I've added a comprehensive extension guide.

For new platforms (e.g., neuromorphic processors):

For new languages (e.g., LLVM IR):

The guide demonstrates how the modular design makes extensions straightforward through clear interfaces and automatic registration, with comprehensive testing examples and best practices.
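To make the "clear interfaces and automatic registration" idea concrete, here is a minimal, self-contained sketch of how a new backend (such as an LLVM IR emitter) could plug into a registry-based design. All names here (`BACKEND_REGISTRY`, `register_backend`, `TargetBackend`) are illustrative assumptions, not DaCe's actual API.

```python
# Hypothetical registry-based backend extension sketch.
# Every name below is an assumption for illustration only.

BACKEND_REGISTRY = {}


def register_backend(name):
    """Class decorator that records a backend under a given name,
    so new backends are discovered without editing core code."""
    def wrapper(cls):
        BACKEND_REGISTRY[name] = cls
        return cls
    return wrapper


class TargetBackend:
    """Minimal interface every backend implements."""

    def generate(self, ops):
        raise NotImplementedError


@register_backend("llvm-ir")
class LLVMIRBackend(TargetBackend):
    """Toy 'new language' backend: emits one comment line per operation."""

    def generate(self, ops):
        return "\n".join(f"  ; op: {op}" for op in ops)


# A caller looks the backend up by name instead of hard-coding it.
backend = BACKEND_REGISTRY["llvm-ir"]()
print(backend.generate(["add", "mul"]))
```

A new platform backend would follow the same pattern: subclass the base interface, decorate it with a name, and the rest of the pipeline reaches it through the registry.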
Includes detailed examples for creating new platform backends (neuromorphic processors) and language backends (LLVM IR), with complete code implementations, testing strategies, and integration patterns for the modular pipeline architecture. Co-authored-by: tbennun <8348955+tbennun@users.noreply.github.com>
tbennun left a comment
Apply my suggestions and see comments.
@copilot please address my latest review
Addressed all the latest review comments, including fixing pass headers, updating the target hierarchy, replacing the neuromorphic example with a comprehensive OpenCL example, and removing the LLVM IR example. See commit 4f14e86.
…nsive OpenCL example, fix pass headers, update target hierarchy

- Remove separate CUDA/HIP file references from directory structure
- Update phase descriptions to simplified target refactoring steps
- Replace neuromorphic processor example with sophisticated OpenCL code generator showing:
  - Custom allocation passes with memory access pattern analysis
  - SDFG splitting for kernel generation (.cl files)
  - Target code generator interaction with custom passes
  - Advanced features like async memory transfers and multi-file generation
- Remove redundant LLVM IR example section
- Fix TargetCodeGenerationPass header and add explicit documentation about calling target code generators
- Update conclusion to reference OpenCL example instead of removed examples

Co-authored-by: tbennun <8348955+tbennun@users.noreply.github.com>
Made my pass on the proposal (and removed the example, which does not make sense in the new construction of the codegen; we will have to discuss that separately). I'll wait for others to comment before manually editing the doc again. @mcopik, please comment on the design document before next Monday if you can (based on your recent experiences), so that we can ensure the design is sound and enables you to work in a simpler and more verified manner. Thanks!
I think Phase 2 should involve a pass that decides on allocation lifetimes of objects directly at the SDFG level. If I finish the sketch in time, I would like to discuss some ideas about it at the DaCe meeting on 12.06.
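As a rough illustration of what "deciding allocation lifetimes at the SDFG level" could mean, the sketch below computes, for each array, the first and last state that touches it. The flat `(state, accessed_arrays)` representation and the function name are assumptions for illustration; a real pass would walk actual SDFG states and access nodes.

```python
# Toy lifetime analysis over an ordered list of states.
# Representation and names are illustrative assumptions, not DaCe's API.

def allocation_lifetimes(states):
    """states: ordered list of (state_name, accessed_arrays) pairs.
    Returns {array: (first_state, last_state)} — the span over which
    the array must stay allocated."""
    lifetimes = {}
    for name, arrays in states:
        for arr in arrays:
            first, _ = lifetimes.get(arr, (name, name))
            lifetimes[arr] = (first, name)
    return lifetimes


lt = allocation_lifetimes([("s0", {"A"}), ("s1", {"A", "B"}), ("s2", {"B"})])
print(lt)  # {'A': ('s0', 's1'), 'B': ('s1', 's2')}
```

An allocation pass could then insert allocate/deallocate markers at the boundaries of each span, before any codegen-specific work happens.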
Completely agreed, that's proposed as
Not right now, but I want to discuss some ideas at the DaCe meeting.
@tbennun A few simple comments (which might be naive):
Overall, it makes sense to me. I would add, as an explicit goal, a template/blueprint for adding a new target.
@mcopik I think it makes sense and I also like the SYCL example. Thanks!
I consider GPU stream assignment to be a scheduling transformation, just like the GPUOffloading/ToGPU transformation, and I think these transformations should be performed before the codegen-preprocessing passes are executed. Is the document only about code generation, or do we combine it with the transformations that should have been applied before GPU offloading? Like the scheduling passes, I think the allocation pass should also run before codegen (and the codegen-preprocessing passes), as these transformations are not codegen-specific. Many accelerator environments support offloading and stream assignment (SYCL for Intel GPUs, say, also supports streams). I think the following transformations could form the new Phase 1:
Phase 0 would involve transformations that make the SDFG more suitable for the target accelerator. Consider the pattern Map[Range(1..N)] -> Tasklet -> Map[Range(1..N)]: if possible, duplicating the tasklet across the maps and fusing them would make sense for GPUs but not for CPUs, yet this transformation is not GPU-specific; one could apply it to an SDFG that is not yet specialized for GPUs.

Then we would have Phase 2 (old Phase 1) for the analysis and set-up for codegen, and the new Phase 3 (old Phase 2) for the accelerator-specific lowering passes that depend on the language (SYCL, CUDA, etc.).

By the way, I have a design document for the offloading pass (not restricted to GPUs), and the order of preparation transformations could look like this before offloading. Would it make sense to integrate it into this document? Link to the Google doc: https://docs.google.com/document/d/13Pl4A8u5YJgTkZvhKCoJp8t693tvHeEYXfUvDtyonDI/edit?usp=sharing
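The phase ordering proposed in this thread (scheduling transformations first, codegen set-up second, accelerator-specific lowering last) can be sketched as an ordered pipeline. The pass names and the dict-based stand-in for an SDFG are assumptions for illustration only.

```python
# Sketch of the proposed phase ordering: scheduling transformations run
# before codegen-specific passes. Names are illustrative assumptions.

class Pass:
    def apply(self, sdfg):
        # Record the pass name so the execution order is observable.
        sdfg.setdefault("log", []).append(type(self).__name__)


# Phase 1: target-independent scheduling transformations
class GPUOffloading(Pass): pass
class StreamAssignment(Pass): pass
class AllocationPass(Pass): pass

# Phase 2: analysis and set-up for codegen
class CodegenPreprocessing(Pass): pass

# Phase 3: language-dependent lowering (SYCL, CUDA, ...)
class TargetLowering(Pass): pass


pipeline = [GPUOffloading(), StreamAssignment(), AllocationPass(),
            CodegenPreprocessing(), TargetLowering()]

sdfg = {}  # stand-in for a real SDFG object
for p in pipeline:
    p.apply(sdfg)
print(sdfg["log"])
```

The point of the sketch is only the ordering: everything before `CodegenPreprocessing` is a plain SDFG transformation and could be reused by any accelerator backend that supports offloading and streams.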
targets/              # Target-specific generators (simplified)
├── __init__.py
├── base.py           # Base target interface (from target.py)
├── openmp.py         # OpenMP backend (split from cpu.py)
It looks like we are adding an OpenMP-specific target, so should we have a corresponding OpenMPCodegenPipeline?
Added OpenMPCodegenPipeline example in the target-specific sub-pipelines section showing how OpenMP-specific code generation would be handled. See commit 66be6e8.
Since the target-specific pipelines were too detailed for a design doc, this is no longer necessary to describe. We do want to enforce creating openmp.py, though, because of the cross-target entanglement.
Addressed your feedback including: clarifying CUDA/GPU backend language and scope of refactoring, adding mention of CPU/C++ backend entanglement issue, changing timeline to indicate full refactoring rather than just renaming, and adding explicit goal for template/blueprint for new targets. See commit 66be6e8.
Excellent points about the scope of transformations vs. codegen-specific passes. You're right that AllocationPass, CopyToMapPass, and StreamAssignmentPass could be viewed as pre-codegen transformations rather than codegen passes. This touches on a fundamental architectural question about where the boundary lies between SDFG transformation and code generation. Your proposed phase structure makes sense and aligns with having a clear separation of concerns. I'd suggest this becomes part of the broader discussion about the design scope in the upcoming DaCe meeting.
#### 12. **SDFGSplittingPass**
- **Purpose**: Split complex SDFGs into multiple files if needed
- **Input**: SDFG with memlet lowering complete
- **Output**: List of SDFGs (one per target file)
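One way to read the "one SDFG per target file" output is a partitioning step that groups top-level regions by their assigned target. The sketch below illustrates only that grouping idea; the flat `(region, target)` representation and function name are assumptions, not the proposed pass's actual interface.

```python
# Illustrative sketch of the splitting idea: partition regions by target
# so each target gets its own output file. Names are assumptions.
from collections import defaultdict


def split_by_target(regions):
    """regions: list of (region_name, target) pairs.
    Returns {target: [region_name, ...]} — one entry per target file."""
    files = defaultdict(list)
    for name, target in regions:
        files[target].append(name)
    return dict(files)


result = split_by_target([("map0", "cuda"), ("map1", "openmp"), ("map2", "cuda")])
print(result)  # {'cuda': ['map0', 'map2'], 'openmp': ['map1']}
```

A real pass would additionally have to generate the "orchestration" glue that the reviewer below asks about, i.e., the host-side code that invokes each split-out piece in order.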
I'm not sure I follow that proposal exactly. This sounds like it would result in a ton of individual SDFGs duct-taped together. The way I read it, this is what I understand:
- Given an SDFG with a whole bunch of Maps, some executed on the GPU as kernels, others run via OpenMP, this would construct a separate SDFG for each map nest (and possibly a large "orchestration" SDFG?), and each of these separate SDFGs gets distributed to a respective file in which the code for it is being generated?
Am I completely misunderstanding, or is that the idea? If so, what is the exact idea behind this? How does it add value to the process?
All comments addressed in #2071. The work on this PR is superseded by and cherry-picked into that PR.
This PR provides a comprehensive design document for refactoring DaCe's code generation system from a monolithic structure into a modular, pass-based pipeline architecture using DaCe's existing `Pass` and `Pipeline` infrastructure.

Overview
The current code generation system is a complex monolithic subpackage that handles everything from analysis to code emission in a single traversal. This design document proposes breaking it down into discrete, composable passes that can be tested, verified, and extended independently.
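The "discrete, composable passes" idea, together with the `pipeline_results` dictionary mentioned among the deliverables, can be sketched as follows. The pass names and the dict-based SDFG stand-in are assumptions for illustration, not the document's actual interfaces.

```python
# Minimal sketch of a pass-based pipeline sharing results through a
# common pipeline_results dictionary. All names are illustrative.

class AnalysisPass:
    def apply(self, sdfg, pipeline_results):
        # Record a result that later passes can reuse instead of recomputing.
        pipeline_results["num_nodes"] = len(sdfg["nodes"])


class EmitPass:
    def apply(self, sdfg, pipeline_results):
        # Consume the earlier analysis result when emitting code.
        pipeline_results["code"] = f"// generated for {pipeline_results['num_nodes']} nodes"


def run_pipeline(sdfg, passes):
    pipeline_results = {}
    for p in passes:
        p.apply(sdfg, pipeline_results)
    return pipeline_results


results = run_pipeline({"nodes": ["a", "b", "c"]}, [AnalysisPass(), EmitPass()])
print(results["code"])  # // generated for 3 nodes
```

Because each pass only reads and writes `pipeline_results`, it can be unit-tested in isolation by handing it a hand-built dictionary, which is the testability benefit the proposal is after.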
Key Deliverables
1. Main Design Document (`doc/codegen/modular_codegen_design.md`)
   - `pipeline_results` dictionary for maximal information reuse
   - Separation of `codegen/compiler` (build tools) from `codegen/passes` (generation passes)
2. Implementation Examples (`doc/codegen/pass_implementation_examples.md`)
   - `MetadataCollectionPass`, `AllocationAnalysisPass`, `FrameCodeGenerationPass`
   - `generate_code()` API

Benefits
Proposed Architecture
Target Refactoring
Current issues addressed:
Proposed hierarchy:
Implementation Roadmap
14-week timeline with incremental phases:
Backward Compatibility
The design maintains full backward compatibility:
- `generate_code(sdfg)` API unchanged

This design document provides the complete foundation for transforming DaCe's code generation into a modern, modular, and extensible system while preserving all existing capabilities.
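A backward-compatible entry point could keep the existing signature and assemble the new pipeline internally. This is a hedged sketch of that wrapper pattern; `default_pipeline` and the dict-based SDFG stand-in are assumptions, not the actual DaCe implementation.

```python
# Sketch: stable public API wrapping a modular pipeline internally.
# Names are illustrative assumptions.

def default_pipeline():
    """Assemble the ordered list of passes; here a single toy emit pass."""
    return [lambda sdfg, res: res.update(code=f"// code for {sdfg['name']}")]


def generate_code(sdfg):
    """Public API kept stable: callers see the same call as before,
    while the modular passes run under the hood."""
    results = {}
    for pass_fn in default_pipeline():
        pass_fn(sdfg, results)
    return results["code"]


print(generate_code({"name": "example"}))  # // code for example
```

Keeping the wrapper thin means callers never observe the refactoring, while tests and extensions can target the individual passes directly.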
Fixes #2034.