_sources/autoapi/tilelang/language/allocate/index.rst.txt (4 additions, 4 deletions)
@@ -110,22 +110,22 @@ Module Contents
 .. py:function:: alloc_tmem(shape, dtype)

-   Allocate a Tensor Memory (TMEM) buffer for use with 5th generation Tensor Core operations (e.g., UMMA).
+   Allocate a Tensor Memory (TMEM) buffer for use with 5th generation Tensor Core operations (e.g., TCGEN5.MMA).

    TMEM is a dedicated on-chip memory introduced in Hopper GPUs, designed to reduce register pressure and enable asynchronous, single-threaded MMA operations. It is organized as a 2D array of 512 columns by 128 rows (lanes), with each cell being 32 bits. Allocation is performed in units of columns, and every lane of a column is allocated together.

    Key properties and requirements:
    - The number of columns allocated must be a power of 2 and at least 32.
    - TMEM allocations are dynamic and must be explicitly deallocated.
    - Both allocation and deallocation must be performed by the same warp.
-   - The base address of the TMEM allocation is stored in shared memory and used as the offset for UMMA accumulator tensors.
-   - Only UMMA and specific TMEM load/store instructions can access TMEM; all pre-processing must occur before data is loaded into TMEM, and all post-processing after data is retrieved.
+   - The base address of the TMEM allocation is stored in shared memory and used as the offset for TCGEN5.MMA accumulator tensors.
+   - Only TCGEN5.MMA and specific TMEM load/store instructions can access TMEM; all pre-processing must occur before data is loaded into TMEM, and all post-processing after data is retrieved.
    - The number of columns allocated should not increase between any two allocations in the execution order within the CTA.

    :param num_cols: Number of columns to allocate in TMEM. Must be a power of 2 and >= 32 but less than or equal to 512.
    :type num_cols: int

-   :returns: A TVM buffer object allocated in TMEM scope, suitable for use as an accumulator or operand in UMMA operations.
+   :returns: A TVM buffer object allocated in TMEM scope, suitable for use as an accumulator or operand in TCGEN5.MMA operations.
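To make the documented API concrete, here is a minimal usage sketch. Only the `alloc_tmem(shape, dtype)` signature and the column constraints come from the docstring above; the surrounding kernel scaffolding (`T.prim_func`, `T.Kernel`, `T.alloc_shared`, `T.copy`, `T.gemm`) is assumed tilelang idiom, so treat it as an illustrative sketch rather than a verified sm100 recipe.

```python
import tilelang.language as T

# A 128x128 fp32 accumulator occupies 128 TMEM columns:
# a power of 2, >= 32 and <= 512, as required above.
M, N, K = 128, 128, 64

@T.prim_func
def gemm_into_tmem(A: T.Tensor((M, K), "float16"),
                   B: T.Tensor((K, N), "float16"),
                   C: T.Tensor((M, N), "float32")):
    with T.Kernel(1, 1, threads=128) as (bx, by):
        A_s = T.alloc_shared((M, K), "float16")
        B_s = T.alloc_shared((K, N), "float16")
        # TCGEN5.MMA accumulator lives in TMEM; its base address is kept in
        # shared memory, and allocation/deallocation happen in the same warp
        # (explicit deallocation is assumed to be inserted by the compiler).
        acc = T.alloc_tmem((M, N), "float32")
        T.copy(A, A_s)
        T.copy(B, B_s)
        T.gemm(A_s, B_s, acc)   # accumulate into the TMEM buffer
        T.copy(acc, C)          # move results out of TMEM before any post-processing
```

On sm100 the `T.gemm` call may additionally need the `wg_wait` / `mbar` parameters from the gemm hunk further down; see the sketch after that hunk.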
<spanclass="sig-prename descclassname"><spanclass="pre">tilelang.language.allocate.</span></span><spanclass="sig-name descname"><spanclass="pre">alloc_tmem</span></span><spanclass="sig-paren">(</span><emclass="sig-param"><spanclass="n"><spanclass="pre">shape</span></span></em>, <emclass="sig-param"><spanclass="n"><spanclass="pre">dtype</span></span></em><spanclass="sig-paren">)</span><aclass="headerlink" href="#tilelang.language.allocate.alloc_tmem" title="Link to this definition">¶</a></dt>
617
-
<dd><p>Allocate a Tensor Memory (TMEM) buffer for use with 5th generation Tensor Core operations (e.g., UMMA).</p>
617
+
<dd><p>Allocate a Tensor Memory (TMEM) buffer for use with 5th generation Tensor Core operations (e.g., TCGEN5.MMA).</p>
618
618
<p>TMEM is a dedicated on-chip memory introduced in Hopper GPUs, designed to reduce register pressure and enable asynchronous, single-threaded MMA operations. It is organized as a 2D array of 512 columns by 128 rows (lanes), with each cell being 32 bits. Allocation is performed in units of columns, and every lane of a column is allocated together.</p>
619
619
<dlclass="simple">
620
620
<dt>Key properties and requirements:</dt><dd><ulclass="simple">
621
621
<li><p>The number of columns allocated must be a power of 2 and at least 32.</p></li>
622
622
<li><p>TMEM allocations are dynamic and must be explicitly deallocated.</p></li>
623
623
<li><p>Both allocation and deallocation must be performed by the same warp.</p></li>
624
-
<li><p>The base address of the TMEM allocation is stored in shared memory and used as the offset for UMMA accumulator tensors.</p></li>
625
-
<li><p>Only UMMA and specific TMEM load/store instructions can access TMEM; all pre-processing must occur before data is loaded into TMEM, and all post-processing after data is retrieved.</p></li>
624
+
<li><p>The base address of the TMEM allocation is stored in shared memory and used as the offset for TCGEN5.MMA accumulator tensors.</p></li>
625
+
<li><p>Only TCGEN5.MMA and specific TMEM load/store instructions can access TMEM; all pre-processing must occur before data is loaded into TMEM, and all post-processing after data is retrieved.</p></li>
626
626
<li><p>The number of columns allocated should not increase between any two allocations in the execution order within the CTA.</p></li>
627
627
</ul>
628
628
</dd>
@@ -632,7 +632,7 @@ <h2>Module Contents<a class="headerlink" href="#module-contents" title="Link to
 <dd class="field-odd"><p><strong>num_cols</strong> (<em>int</em>) – Number of columns to allocate in TMEM. Must be a power of 2 and >= 32 but less than or equal to 512.</p>
autoapi/tilelang/language/gemm/index.html (2 additions, 2 deletions)
@@ -501,8 +501,8 @@ <h2>Module Contents<a class="headerlink" href="#module-contents" title="Link to
 <li><p><strong>k_pack</strong> (<em>int</em><em>, </em><em>optional</em>) – Number of k dimensions packed into a single warp. Defaults to 1.</p></li>
 <li><p><strong>wg_wait</strong> (<em>int</em><em>, </em><em>optional</em>) – Warp group wait count. Defaults to 0.
 On hopper it is equivalent to <cite>wgmma.wait_group.sync.aligned <wg_wait></cite> if wg_wait is not -1
-On sm100 (datacenter blackwell), <cite>wg_wait</cite> can only be 0 or -1. <cite>mbarrier_wait(UTCMMA barrier)</cite> will be appended if wg_wait is 0.</p></li>
-<li><p><strong>mbar</strong> (<em>tir.Buffer</em><em>, </em><em>optional</em>) – mbarrier for UTCMMA synchronization</p></li>
+On sm100, <cite>wg_wait</cite> can only be 0 or -1. <cite>mbarrier_wait(TCGEN5MMA barrier)</cite> will be appended if wg_wait is 0.</p></li>
+<li><p><strong>mbar</strong> (<em>tir.Buffer</em><em>, </em><em>optional</em>) – mbarrier for TCGEN5MMA synchronization</p></li>
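Because the `wg_wait` / `mbar` semantics above are easy to misread, here is a hedged sketch of how they might be passed to `T.gemm` on sm100. Only `k_pack`, `wg_wait`, and `mbar` are taken from this diff; `T.alloc_barrier` and the rest of the kernel scaffolding are assumptions about tilelang's API, not a verified recipe.

```python
import tilelang.language as T

M, N, K, block_K = 128, 128, 256, 64

@T.prim_func
def gemm_wg_wait(A: T.Tensor((M, K), "float16"),
                 B: T.Tensor((K, N), "float16"),
                 C: T.Tensor((M, N), "float32")):
    with T.Kernel(1, 1, threads=128) as (bx, by):
        A_s = T.alloc_shared((M, block_K), "float16")
        B_s = T.alloc_shared((block_K, N), "float16")
        acc = T.alloc_tmem((M, N), "float32")   # TCGEN5MMA accumulator in TMEM
        bar = T.alloc_barrier(1)                # hypothetical mbarrier allocator
        for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
            T.copy(A[0, k * block_K], A_s)
            T.copy(B[k * block_K, 0], B_s)
            # On sm100, wg_wait may only be 0 or -1; with wg_wait=0 an
            # mbarrier_wait on the TCGEN5MMA barrier is appended after the MMA,
            # so acc is safe to read once the loop finishes.
            T.gemm(A_s, B_s, acc, k_pack=1, wg_wait=0, mbar=bar)
        T.copy(acc, C)
```

With `wg_wait=-1` no wait is appended, and the caller is responsible for synchronizing on the mbarrier before reading the accumulator.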