Kernel Runtime increase with long verifications #2944
Description
I noticed that my kernel runtimes were much higher when my verification ran long and I called the kernel multiple times.
To demonstrate this, I took the vector_reduce_add example and made minimal modifications.
Makefile:
- add "--warmup 1 --iters 5" to the run target
test.cpp:
- print the NPU runtime for each run
- add std::this_thread::sleep_for(std::chrono::milliseconds(10000)); as the last instruction in the for loop (to simulate a long verification)
I also added a check that run.wait(); returns the expected value, but that should not be needed.
Then run: use_placed=1 make run
With the sleep the output is:
./vector_reduce_add.exe -x build/final.xclbin -i build/insts.bin -k MLIR_AIE --warmup 1 --iters 5
NPU time: 119us.
NPU time: 136242us.
NPU time: 134709us.
NPU time: 136238us.
NPU time: 121016us.
Avg NPU time: 105665us.
Min NPU time: 119us.
Max NPU time: 136242us.
PASS!
Without the sleep it is:
./vector_reduce_add.exe -x build/final.xclbin -i build/insts.bin -k MLIR_AIE --warmup 1 --iters 5
NPU time: 132us.
NPU time: 107us.
NPU time: 117us.
NPU time: 148us.
NPU time: 117us.
Avg NPU time: 124.2us.
Min NPU time: 107us.
Max NPU time: 148us.
PASS!
I assume that I am measuring some kind of reset, because if I introduce a buffer (initialized to 0) into the IRON design I even get different results.
Any help or clarification would be appreciated.
I am using the latest wheels and the driver from the upstream packages on an AMD Ryzen 7 8700G.
Here are the files to reproduce; only the Makefile and test.cpp were modified:
CMakeLists.txt
test.cpp
vector_reduce_add_placed.py
vector_reduce_add.py
Makefile.txt
(had to rename the Makefile to Makefile.txt so I can upload it)