Commit eeca341

Add Vectorize VecMatMult Windows instruction (#2425)
Signed-off-by: Ang, Yee Teng <[email protected]>
1 parent 47c7588 commit eeca341

1 file changed, +42 -0 lines changed
  • DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult

DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/README.md

Lines changed: 42 additions & 0 deletions
@@ -257,6 +257,48 @@ The compiler may be able to perform additional optimizations if it can optimize
```
2. Run the program, and record the execution time.

### On Windows*
#### Step 1. Establish a Performance Baseline
1. Open an Intel oneAPI command window.
2. Change to the sample directory.
3. Compile the sources with the following command.
```
ifx /real-size:64 -O1 src\matvec.f90 src\driver.f90 -o MatVector
```
4. Run the generated `MatVector.exe`.
```
MatVector.exe
```
5. Record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

#### Step 2. Generate a Vectorization Report
1. Compile the sources with the following command.
```
ifx /real-size:64 -O2 /Qopt-report=1 src\matvec.f90 src\driver.f90 -o MatVector
```
2. Run the generated `MatVector.exe` again.
3. Record the new execution time.
4. Recompile your project with the **/Qopt-report=2** option.
```
ifx /real-size:64 -O2 /Qopt-report=2 src\matvec.f90 src\driver.f90 -o MatVector
```
5. Run `MatVector.exe` and record the new execution time.

#### Step 3. Improve Performance by Aligning Data
1. Recompile the program with the `ALIGNED` macro defined (`-D ALIGNED`) to ensure consistently aligned data.
```
ifx /real-size:64 /Qopt-report=2 -D ALIGNED src\matvec.f90 src\driver.f90 -o MatVector
```
2. Run `MatVector.exe` again, and record the new execution time.

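
Why consistent alignment matters can be sketched numerically. The snippet below is an illustration in Python, not the sample's Fortran source: it assumes a column-major matrix of 8-byte reals, a 32-byte alignment target, a hypothetical matrix shape, and hypothetical helpers `padded_leading_dimension` and `column_offsets`. If the first column starts on a 32-byte boundary, each later column also starts on one only when the column length in bytes is a multiple of 32, which padding the leading dimension achieves.

```python
# Illustration only: how padding the leading dimension keeps every column
# of a column-major array on the same alignment boundary.
# Assumed values (not taken from the sample source): 8-byte reals, 32-byte alignment.

ELEMENT_BYTES = 8      # size of one double-precision real
ALIGN_BYTES = 32       # assumed alignment target


def padded_leading_dimension(rows: int) -> int:
    """Round rows up to the next multiple of ALIGN_BYTES // ELEMENT_BYTES."""
    elems_per_block = ALIGN_BYTES // ELEMENT_BYTES
    return -(-rows // elems_per_block) * elems_per_block


def column_offsets(leading_dim: int, cols: int) -> list[int]:
    """Byte offset of the start of each column, relative to an aligned base."""
    return [j * leading_dim * ELEMENT_BYTES for j in range(cols)]


rows, cols = 47, 4     # hypothetical matrix shape
unpadded = column_offsets(rows, cols)
padded = column_offsets(padded_leading_dimension(rows), cols)

print([off % ALIGN_BYTES for off in unpadded])  # misalignment varies by column
print([off % ALIGN_BYTES for off in padded])    # every column starts aligned
```

With an unpadded column length, the compiler cannot assume one alignment for every column, so it may have to generate extra code to handle the misaligned cases.
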
#### Step 4. Improve Performance with Interprocedural Optimization
1. Recompile the program using the `/Qipo` option to enable interprocedural optimization.
```
ifx /real-size:64 /Qopt-report=2 -D ALIGNED /Qipo src\matvec.f90 src\driver.f90 -o MatVector
```
2. Run the program, and record the execution time.

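
With an execution time recorded at each step, the comparison against the baseline is a simple ratio. The Python sketch below shows one way to tabulate it. The `speedup` helper is hypothetical and the timing values are made-up placeholders, not measurements from this sample.

```python
# Sketch: compare recorded execution times against the Step 1 baseline.
# All timing values below are made-up placeholders; substitute your own.

def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new build is than the baseline."""
    if baseline_seconds <= 0 or new_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / new_seconds


recorded = {                      # placeholder values in seconds
    "Step 1: -O1 baseline": 10.0,
    "Step 2: -O2": 5.0,
    "Step 3: ALIGNED": 4.0,
    "Step 4: /Qipo": 2.5,
}

baseline = recorded["Step 1: -O1 baseline"]
for label, seconds in recorded.items():
    print(f"{label:<24} {seconds:6.2f} s  {speedup(baseline, seconds):5.2f}x")
```
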
### Additional Exercises

The previous examples made use of double-precision arrays. You can build the same examples with single-precision arrays by changing the command-line option **-real-size 64** to **-real-size 32** (on Windows, **/real-size:64** to **/real-size:32**). The non-vectorized versions of the loop execute only slightly faster than the double-precision versions; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 32-byte vector register operates on eight single-precision data elements at once instead of four double-precision data elements.
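
The element counts quoted above follow from dividing the register width by the element size. A minimal Python check, using a hypothetical `simd_lanes` helper:

```python
# Sketch: elements processed per packed SIMD instruction for a given register width.

def simd_lanes(register_bytes: int, element_bytes: int) -> int:
    """Number of elements that fit in one vector register."""
    return register_bytes // element_bytes


REGISTER_BYTES = 32                      # 32-byte (256-bit) register, as in the text

print(simd_lanes(REGISTER_BYTES, 4))     # single precision, 4-byte elements: 8
print(simd_lanes(REGISTER_BYTES, 8))     # double precision, 8-byte elements: 4
```
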
