@@ -71,61 +71,103 @@ Hyesun Hong,
7171
7272* PIM/PNM technology enables computation directly on memory
7373* Prevents data movement improving performance and reducing consumption
74- * PIM operates directly on memory banks by reading and storing on rows and columns
74+ * Operates directly on memory banks by reading and storing on rows and columns
7575* Aquabolt-XL is the first demonstrator
7676* Can be dropped in on any memory controller
7777* CXL-PNM is the CXL variant for PNM, can work with multiple PIM
7878
7979SYCL Extension for PIM/PNM
80- * Goals
81- * Seamlessly integrate PIM/PNM operation into SYCL
82- * Allow combination of xGPU and PIM/PNM in one device kernel
83- * Not specific to one hardware
84- * Design
85- * Vector operation seem like natural fit, but no convergence guarantee and vector size explicit
86- * Model as special function unit
87- * Aligns with trends to model special functional units inside accelerators
88- * Compiler automatic mapping often not possible
89- * joint_matrix
90- * Group functions
91- * Easy to use
92- * Can easily be combined with device code
93- * Give necessary convergence guarantees
94- * Recap of SYCL work-item, work-group and group functions
95- * Group functions must be encountered in converged control flow
80+ * Work in collaboration with Codeplay Software team
81+ * Goals
82+
83+ * Seamlessly integrate PIM/PNM operation into SYCL
84+ * Allow combination of xGPU and PIM/PNM in one device kernel
85+ * Not specific to one hardware
86+
87+ * Design
88+
89+ * Vector operations seem like a natural fit
90+ * no convergence guarantee and vector size explicit
91+
92+ * Model as special function unit
93+
94+ * Aligns with trends to model special functional units inside accelerators
95+ * Compiler automatic mapping often not possible
96+ * joint_matrix-like interface
97+
98+
99+ * Group functions
100+
101+ * Easy to use
102+ * Can easily be combined with device code
103+ * Give necessary convergence guarantees
104+
105+
106+ * Recap of SYCL work-item, work-group and group functions
107+
108+ * Group functions must be encountered in converged control flow
109+
96110* Extension
97- * Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
98- * Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process
111+
112+ * Extended group functions with additional overload of joint_reduce
113+ * and new joint_transform and joint_inner_product
114+ * Block size as template parameter, number of blocks as runtime parameter
115+ * allows calculation of number of elements to process
116+
99117* Extension for PNM
100- * Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
101- * PNM standalone has less opportunity for parallelism, also limited by memory controller
102- * -> Combine PNM and PIM, PNM generates commands for PIM blocks
118+
119+ * Added new overloads of joint_exclusive_scan,
120+ * joint_inclusive_scan, reduce_over_group
121+
122+ * PNM standalone has less opportunity for parallelism
123+
124+ * limited by memory controller
125+ * -> Combine PNM and PIM, PNM generates commands for PIM blocks
126+
103127* Two modes
128+
104129 * PIM mode: PIM blocks can operate independently, can choose number of blocks
105130 * PNM mode: Synchronized execution on multiple PIM blocks
131+
106132* Mapping
133+
107134 * Every PIM block is one work-item
108135 * PNM with attached PIM blocks forms one work-group
136+
109137* Execution
110- * Work-item operations map to PIM operation
111- * Group functions map to PNM operation
138+
139+ * Work-item operations map to PIM operation
140+ * Group functions map to PNM operation
141+
112142* Example
143+
113144 * work-item execution maps to PIM
114145 * group function maps to PNM
146+
115147* Conclusion
148+
116149 * Integrate support for PIM/PNM into SYCL
117150
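The mapping and execution notes above can be illustrated with a small host-side model. This is only a sketch in plain C++, not the proposed SYCL extension: `pim_block_sum` and `pnm_group_reduce` are hypothetical names, and the loop stands in for what would be independent PIM blocks. It assumes each PIM block acts as one work-item and the PNM acts as the work-group that combines the per-block results.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model, NOT the proposed SYCL extension API.
// One call corresponds to one work-item, i.e. one PIM block working
// on its own BlockSize elements.
template <std::size_t BlockSize>
int pim_block_sum(const int* block) {
    return std::accumulate(block, block + BlockSize, 0);
}

// Models a group function: the PNM combines the results of its attached
// PIM blocks (the work-group). The loop is sequential here; on real
// hardware the blocks would run in parallel.
template <std::size_t BlockSize>
int pnm_group_reduce(const std::vector<int>& data, std::size_t num_blocks) {
    // The caller must supply at least num_blocks full blocks.
    assert(data.size() >= num_blocks * BlockSize);
    int total = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        total += pim_block_sum<BlockSize>(data.data() + b * BlockSize);
    }
    return total;
}
```

For example, with the eight values 1 to 8 and a block size of 4, `pnm_group_reduce<4>(data, 2)` reduces two blocks and returns 36.
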
118151Q&A
119- * Are the proposed functions specific to PIM or could also be used with other HW?
120- * Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
121- * Can also map nicely to other types of hardware, for example vector processor
152+ * Are the proposed functions specific to PIM, or could they also be used with other HW?
153+
154+ * Can also be used with other hardware.
155+ * Semantics not PIM-specific, but translation of C++ to SYCL
156+ * Can also map nicely to other types of hardware, e.g. vector processor
157+
122158* Why have the user explicitly specify a block-size?
123- * Not a hardware detail
124- * Rather a promise by the user that data-blocks will always be at least that big
125- * Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
126- * Could num_blocks runtime parameter be replaced by iterator, requiring to be divisible by block-size
127- * Yes, that is possible, mainly a design question
128- * Current version might have additional implications regarding alignment
159+
160+ * Not a hardware detail
161+ * Rather a promise by the user that data-blocks
162+ will always be at least that big
163+ * Promise allows device compiler to perform optimizations,
164+ efficient looping inside PIM unit
165+
166+ * Could num_blocks runtime parameter be replaced by iterator?
167+
168+ * requires to be divisible by block-size
169+ * Yes, that is possible, mainly a design question
170+ * Current version might have additional implications regarding alignment
129171
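The two interface shapes discussed in the Q&A can be compared in a short sketch. This is plain C++ under stated assumptions, not the extension's actual signatures: `block_transform` and `block_transform_range` are hypothetical names. The first takes the block size as a template parameter and the number of blocks at runtime, so the element count is `BlockSize * num_blocks`. The second takes a pointer range and requires its length to be divisible by the block size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a joint_transform-like overload; the
// signature is an assumption made for illustration only.
// BlockSize is known at compile time, num_blocks only at runtime, so
// exactly BlockSize * num_blocks elements are processed.
template <std::size_t BlockSize, typename T, typename UnaryOp>
void block_transform(const T* in, T* out, std::size_t num_blocks, UnaryOp op) {
    static_assert(BlockSize > 0, "block size must be positive");
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const T* src = in + b * BlockSize;
        T* dst = out + b * BlockSize;
        // Fixed trip count: the compiler can unroll or vectorize this loop.
        for (std::size_t i = 0; i < BlockSize; ++i) {
            dst[i] = op(src[i]);
        }
    }
}

// Range-style alternative from the Q&A (also hypothetical): the number
// of blocks is derived from the range, whose length must be a multiple
// of BlockSize.
template <std::size_t BlockSize, typename T, typename UnaryOp>
void block_transform_range(const T* first, const T* last, T* out, UnaryOp op) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    assert(n % BlockSize == 0);
    block_transform<BlockSize>(first, out, n / BlockSize, op);
}
```

Both forms process the same elements; the difference is whether the caller passes the block count directly or the implementation derives it from the range.
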
130172
1311732023-06-05