
Kernel Runtime increase with long verifications #2944

@ilot95

Description


I noticed that my kernel runtimes were much higher when my verification ran long and I called the kernel multiple times.

To demonstrate this, I took the vector_reduce_add example and made some minimal modifications.
Makefile:
add `--warmup 1 --iters 5` to the run target
test.cpp:
print the NPU runtime for each run
add `std::this_thread::sleep_for(std::chrono::milliseconds(10000));` as the last instruction of the for loop (to simulate a long verification)
I also added a check that `run.wait();` returns the right value, but that should not be needed.

Then run: use_placed=1 make run

With the sleep the output is:

./vector_reduce_add.exe -x build/final.xclbin -i build/insts.bin -k MLIR_AIE --warmup 1 --iters 5
NPU time: 119us.
NPU time: 136242us.
NPU time: 134709us.
NPU time: 136238us.
NPU time: 121016us.
Avg NPU time: 105665us.
Min NPU time: 119us.
Max NPU time: 136242us.
PASS!

Without the sleep it is:

./vector_reduce_add.exe -x build/final.xclbin -i build/insts.bin -k MLIR_AIE --warmup 1 --iters 5
NPU time: 132us.
NPU time: 107us.
NPU time: 117us.
NPU time: 148us.
NPU time: 117us.
Avg NPU time: 124.2us.
Min NPU time: 107us.
Max NPU time: 148us.
PASS!

I assume that I am measuring some kind of reset, because if I introduce a buffer (initialized to 0) into the IRON design, I even get a different result.
Any help or clarification would be appreciated.

I am using the latest wheels and the driver from upstream packages on an AMD Ryzen 7 8700G.

Here are the files to reproduce. Only the Makefile and test.cpp were modified:
CMakeLists.txt
test.cpp
vector_reduce_add_placed.py
vector_reduce_add.py
Makefile.txt
(had to rename the Makefile to Makefile.txt so I can upload it)
