@@ -127,7 +127,7 @@ def AMDGPU_ScaledExtPacked816Op
127127 FixedVectorOfShapeAndType<[4], F8E8M0FNU>:$scale,
128128 ConfinedAttr<I32Attr, [IsValidBlockSize]>:$blockSize,
129129 ConfinedAttr<I32Attr, [IntMinValue<0>, IntMaxValue<1>]>:$firstScaleLane,
130- ConfinedAttr<I32Attr, [IntMinValue<0>, IntMaxValue<2>]>:$firstScaleByte)>,
130+ ConfinedAttr<I32Attr, [IntMinValue<0>, IntMaxValue<3>]>:$firstScaleByte)>,
131131 Results<(
132132 outs AnyTypeOf<[FixedVectorOfShapeAndType<[8], F32>,
133133 FixedVectorOfShapeAndType<[8], F16>,
@@ -139,17 +139,21 @@ def AMDGPU_ScaledExtPacked816Op
139139 let summary = "Extend a vector of packed floating point values";
140140
141141 let description = [{
142- The scales applied to the input microfloats are stored in two bytes which
142+ The scales applied to the input microfloats are stored in bytes which
143143 come from the `scales` input provided in a *half* of the wave identified
144- by `firstScaleLane`. The pair of bytes used is selected by
145- `firstScaleByte`. The 16 vectors in consecutive lanes starting from
144+ by `firstScaleLane`. The bytes used are selected by `firstScaleByte` and depend
145+ on the type of `source`. The 16 vectors in consecutive lanes starting from
146146 `firstScaleLane` (which we'll call the scale vectors) will be used by both
147- halves of the wave (with lane L reading from L % 16'th scale vector), but
148- each half will use a different byte.
147+ halves of the wave (with lane L reading from L % 16'th scale vector).
148+
149+ When `source` is F4E2M1FN, F6E2M3FN, or F6E3M2FN, each half of the
150+ wave will use a different byte: the first is `firstScaleByte` and
151+ the second is `firstScaleByte` + 1. When the block size is 32,
152+ `firstScaleByte` can be either 0 or 2, selecting halves of the scale vectors.
153+ Lanes 0-15 will read from `firstScaleByte` and lanes 16-31 will read
154+ from `firstScaleByte` + 1.
155+
149156
150- When the block size is 32, `firstScaleByte` can be either 0 or 2,
151- selecting halves of the scale vectors. Lanes 0-15 will read from
152- `firstScaleByte` and lanes 16-31 will read from `firstScaleByte` + 1.
153157 For example:
154158 ```mlir
155159 // Input: 8-element vector of F8E4M3FN, converting to F32
@@ -165,7 +169,8 @@ def AMDGPU_ScaledExtPacked816Op
165169 : vector<16xf6E2M3FN>, vector<4xf8E8M0FNU> -> vector<16xf16>
166170 ```
167171
168- However, when the block size is 16, `firstScaleByte` can be 0 or 1.
172+ When `source` is either F4E2M1FN, F6E2M3FN, or F6E3M2FN and
173+ the block size is 16, `firstScaleByte` can be 0 or 1.
169174 Lanes 0-15 read from the `firstScaleByte`th element of the scale vectors,
170175 while lanes 16-31 read from `firstScaleByte` + 2.
171176 For example:
@@ -187,6 +192,16 @@ def AMDGPU_ScaledExtPacked816Op
187192 instructions use for matrix scales. These selection operands allow
188193 one to choose portions of the matrix to convert.
189194
195+ When `source` is either F8E4M3FN or F8E5M2 and `blockSize` is 32,
196+ then the same byte will be used by both halves of the wave.
197+ In this case, `firstScaleByte` can be any value from 0 to 3.
198+
199+ When `source` is either F8E4M3FN or F8E5M2 and `blockSize` is 16,
200+ the following combinations are allowed:
201+ * `firstScaleLane(0), firstScaleByte(0)`
202+ * `firstScaleLane(1), firstScaleByte(2)`
203+ All other combinations are reserved.
204+
190205 Available on gfx1250+.
191206 }];
192207
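The selection rules in the updated description can be summarized as a small lookup. The following is a minimal Python sketch, not part of the patch: the function names `is_valid_selection` and `half_wave_bytes` are hypothetical, and it assumes that `IsValidBlockSize` (not shown in this diff) accepts only 16 and 32, since those are the only block sizes the description discusses. For F8 sources with block size 16 the description lists the allowed combinations but does not say which byte each half of the wave reads, so the sketch returns `None` there rather than guess.

```python
# Hypothetical helper names; this only mirrors the prose of the op description.
SUB_BYTE_SOURCES = {"F4E2M1FN", "F6E2M3FN", "F6E3M2FN"}
BYTE_SOURCES = {"F8E4M3FN", "F8E5M2"}


def is_valid_selection(source, block_size, first_scale_lane, first_scale_byte):
    """Return True if the combination is allowed by the description."""
    # Assumption: only block sizes 16 and 32 are valid.
    if block_size not in (16, 32) or first_scale_lane not in (0, 1):
        return False
    if source in SUB_BYTE_SOURCES:
        # Block size 32 selects byte 0 or 2; block size 16 selects byte 0 or 1.
        allowed = (0, 2) if block_size == 32 else (0, 1)
        return first_scale_byte in allowed
    if source in BYTE_SOURCES:
        if block_size == 32:
            return 0 <= first_scale_byte <= 3
        # Block size 16: every other combination is reserved.
        return (first_scale_lane, first_scale_byte) in {(0, 0), (1, 2)}
    return False


def half_wave_bytes(source, block_size, first_scale_byte):
    """Return (byte for lanes 0-15, byte for lanes 16-31), or None if unstated."""
    if source in SUB_BYTE_SOURCES:
        # Second half reads firstScaleByte + 1 (block 32) or + 2 (block 16).
        offset = 1 if block_size == 32 else 2
        return (first_scale_byte, first_scale_byte + offset)
    if source in BYTE_SOURCES and block_size == 32:
        # Both halves of the wave share the same byte.
        return (first_scale_byte, first_scale_byte)
    return None  # The description does not state the bytes for this case.
```

A check like this could be useful when writing verifier tests, for example to confirm that `firstScaleByte = 3` is accepted only for F8 sources with block size 32, which is the case that motivates raising `IntMaxValue` from 2 to 3.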