C++ Instruction Set Simulator for RISC-V RV32IMC & custom SIMD instructions with cache and branch predictor models, C/ASM workloads, and Python analysis tools
- RISC-V ISA simulator
- Getting the project
- Overview
- Example use-case: Dhrystone
- Hardware model sweeps
- Building RISC-V programs
- GTest
- Building the Simulator
- Custom packed SIMD ISA
Project relies on a few external libraries and tools. Clone recursively with
git clone --recurse-submodules git@github.com:AleksandarLilic/ama-riscv-sim.gitSubmodules are pulled automatically with --recurse-submodules:
- GCC >= 10 (C++17,
gnu++17) - Make
- RISC-V GNU toolchain (
riscv64-unknown-elf-gcc,objcopy,objdump,size) - set viaRV_GNU_LATESTenv var - bin2hex (
riscv64-unknown-elf-bin2hex) - binary to hex conversion for RTL simulation; not part of the standard toolchain
Python 3 packages:
- matplotlib
- numpy
- pandas
- ruamel.yaml
- plotly
- pyelftools
Other tools:
- Perl
- Graphviz
- Google Test (
libgtest-devor equivalent) - lcov and genhtml (for coverage reports)
- Optional: valgrind
To check that everything is available and working as expected:
- build a single test
- build the simulator
- run the test and check the logs
# build test
cd sw/baremetal/dhrystone
make DHRY_ITERS=1000 # override number of iterations for testing only
cd -
# build ISA sim
cd src
make -j8
# optionally, use non-default gcc
# make CXX=g++-12 -j8
# create a separate workdir
cd .. && mkdir workdir && cd workdir
# run test, profile from the beginning, log executed instructions
../src/build/ama-riscv-sim ../sw/baremetal/dhrystone/dhrystone.elf --prof_pc_start 0x40000 -tl
# check the execution log
vim dhrystone_dhrystone_out/exec.logSimulator stdout
Running: ../sw/baremetal/dhrystone/dhrystone.elf
Profiling start PC: 0x40000
Logging enabled
SIMULATION STARTED
=== UART START ===
...
=== UART END ===
SIMULATION FINISHED
Instruction Counters: executed: 521118, profiled: 521118
0x051e tohost : 0x00000001
Profiler - Inst:
All: 521118
Control: B: 63985(12.28%), JAL: 21746(4.17%), JALR: 22705(4.36%)
Memory: MEM: 166065(31.87%) - L/S: 87869/78196(16.86%/15.01%)
Compute: ALU: 245609(47.13%), MUL: 1000(0.19%), DIV: 0(0.00%)
Bitmanip: Zbb: 0(0.00%)
SIMD arith: ALU: 0(0.00%), WMUL: 0(0.00%), DOT: 0(0.00%)
SIMD data fmt: WIDEN: 0(0.00%), NARROW: 0(0.00%), TXP: 0(0.00%)
SIMD vector-scalar: DUP: 0(0.00%), VINS: 0(0.00%), VEXT: 0(0.00%)
Hint: SCP: 0(0.00%)
Other: NOP: 0(0.00%), Misc: 8(0.00%)
Profiler - Sparsity:
ANY: 355615/20719(5.83%)
ALU: 238689/16132(6.76%)
MEM_L: 87869/4583(5.22%)
MEM_S: 78196/0(0.00%)
SIMD_DOT: 0/0(0.00%)
SIMD_ALU: 0/0(0.00%)
SIMD_DATA_FMT: 0/0(0.00%)
Profiler - Stack:
Peak usage: 368 B
Accesses: 80424(48.43%) - L/S: 40458/39966(24.36%/24.07%)
Profiler - Perf:
Event: inst, Samples: 521118
Profiler - Fusion:
LEA opportunities: 1000
Branch stats:
Unique branches: 110
bpred
bimodal (12 B): P: 55284, M: 8701, ACC: 86.40%, MPKI: 16.70
icache (S/W: 32/2, D/T/M: 4096/41/33 B):
Ref: 529819, H: 526687(526687/0), M: 3132(3132/0), R: 3068, HR: 99.41%; CT (R/W): core 2.0/0.0 MB, mem 195.8/0.0 KB
dcache (S/W: 16/4, D/T/M: 4096/49/41 B):
Ref: 162717, H: 162511(86166/76345), M: 206(29/177), R: 142, WB: 142, HR: 99.87%; CT (R/W): core 297/279 KB, mem 12.9/8.9 KB
Simulation performance: 0.42 MIPS (521118 instructions in 1.24s)
Outputs:
ls dhrystone_dhrystone_out
callstack_folded_inst.txt exec.log hw_stats.json inst_profile.json trace.bin uart.logSome simulator options depend on the build switches. This doc assumes default build switches which is the standalone mode with all available features included
Core functionalities:
- Standalone ISA simulator capable of running applications targeting
RV32IMC_zicsr_zifencei_zicntr_zihpm_xsimd(rv drom) - API for single step execution, aimed at DPI verification environments
Profilers:
- Records the folded callstack based on the user-specified event source (
callstack_folded_inst.txt) - Records breakdown of executed instruction (
inst_profile.json) - Provides stack usage stats (
inst_profile.jsonandstdout) - Records execution trace (
trace.bin) - Records register file usage (
rf_usage.bin) - Provides sparsity check for the custom SIMD extension (
stdout) - Can be run in the timed environment (e.g. DPI) and use its time source for instruction cycles and callstack profiler source
Logging:
- Records each executed instruction, the current callstack, and registers & memory locations changed by the instruction (
exec.log) - Optionally records entire architectural state after each executed instruction
- Optionally records the entire execution (from reset to the end of simulation), instead of just the profiling range
Hardware models:
- Provides L1I and L1D caches as both statistical (metadata only) or functional (with data storage) models.
- number of sets - parametrizable from the CLI
- number of ways - parametrizable from the CLI
- replacement policy - only LRU is available
- Records separate cache stats for the user defined region of interest (ROI)
- Provides various branch predictor models
- The user specified one is completely configurable from the CLI
- The rest are hardcoded (provides variety), and can optionally be run (in parallel)
- Any number of branch predictors can be run in parallel, but only the user specified one will drive the L1I cache - the 'active' predictor
- Provides branch stats for the number of unique branches (
stdout) and performance of each of the predictors for the given unique branch (branches.csv) - Records runtime hardware statistics in the same region as the profilers (
hw_stats.jsonandstdout)
Analysis scripts:
- FlameGraphs
- Detailed execution visualization
- Performance estimates
- Hardware models parameters sweep
Emulator is used through DPI for verification of the SystemVerilog implementation: ama-riscv
The only required user argument is a path to the RISC-V executable, every other argument can use its default value
Full usage available in examples/ama-riscv-sim.help
Example use-case which includes generated log files from the simulator and the applicable analysis outputs are available under examples/dhrystone_dhrystone_out. The stdout redirected to a file is also available
The following paragraphs will go into detail about each of the logs, analysis, and visualization
ISA sim and Dhrystone are assumed to have been built as described in the Quick start, but the provided elf at examples/dhrystone.elf can also be used
To generate all available outputs, run
../src/build/ama-riscv-sim ../sw/baremetal/dhrystone/dhrystone.elf --prof_pc_start 0x40000 -tl --rf_usage --bp_dump_csvOutputs:
ls dhrystone_dhrystone_out
branches.csv callstack_folded_inst.txt exec.log hw_stats.json inst_profile.json rf_usage.bin trace.bin uart.logA more common way of profiling is to only focus on one part of the workload at the time. In case of Dhrystone, the following will profile only a single loop, 500th iteration
../src/build/ama-riscv-sim ../sw/baremetal/dhrystone/dhrystone.elf --prof_pc_start 41570 --prof_pc_stop 41644 --prof_pc_single_match 500Dropping --prof_pc_single_match 500 would profile all iterations
Cache and branch predictor models are still running while profiling is inactive, but don't record any stats
Another useful option when analyzing cache behavior is to profile a specific memory region. For example, mlp benefits from having high hit rate for the input image in the first layer, and its hit rate can be analyzed with
../src/build/ama-riscv-sim ../sw/baremetal/mlp/w8a8.elf --roi_start 0x00017900 --roi_size 256After the simulations finishes, the stdout has one more line with ROI stats:
...
ROI: (0x17900 - 0x17a00): R: 4100, H: 4096, M: 4, E: 4, WB: 0, HR: 99.90%; CT (R/W): core 16.0/0.0 KB, mem 256.0/0.0 B
...
Memory location of the image might not always be at this address. Search the *.dasm for input_img_0 symbol start.
Perf events can also be changed, which can help when analyzing e.g. tricky branch prediction section (--perf_event bp_mispredict), or heavy traffic and cache misses (--perf_event dcache_miss). Complete list of events is listed in the Usage chapter.
Execution log, saved as exec.log, logs instructions as they are executed. Compact version (default options) is saved and it includes:
- the current callstack
- instruction count during profiling (omitted when profiling is inactive)
- PC
- Instruction and its disassembly
- modified register(s), memory locations, CSRs (instruction dependent)
Snippets taken from the examples/dhrystone_dhrystone_out/exec.log
_start;
1: 40000: 00000093 addi x1,x0,0 x1 : 0x00000000
2: 40004: 00000113 addi x2,x0,0 x2 : 0x00000000
3: 40008: 00000193 addi x3,x0,0 x3 : 0x00000000
...
33: 40080: 07028293 addi x5,x5,112 x5 : 0x000400ec
34: 40084: 30529073 csrrw x0,mtvec,x5 0x0305 mtvec : 0x000400ec
35: 40088: 00005197 auipc x3, 0x5 x3 : 0x00045088
...
clear_bss_w_loop;
47: 400b8: 00052023 sw x0,0(x10) mem[4412c] <- x0 (0x00000000)
48: 400bc: 00450513 addi x10,x10,4 x10: 0x00044130
49: 400c0: fff68693 addi x13,x13,-1 x13: 0x00000a5d
50: 400c4: fe069ae3 bne x13,x0,400b8
...
call_main;
10664: 400dc: 5a1020ef jal x1,42e7c x1 : 0x000400e0
call_main;__libc_init_array;
10665: 42e7c: ff010113 addi x2,x2,-16 x2 : 0x0004fff0
10666: 42e80: 00812423 sw x8,8(x2) mem[4fff8] <- x8 (0x00000000)
10667: 42e84: 01212023 sw x18,0(x2) mem[4fff0] <- x18 (0x00000000)
10668: 42e88: 00000793 addi x15,x0,0 x15: 0x00000000
10669: 42e8c: 00000713 addi x14,x0,0 x14: 0x00000000
10670: 42e90: 00112623 sw x1,12(x2) mem[4fffc] <- x1 (0x000400e0)
10671: 42e94: 00912223 sw x9,4(x2) mem[4fff4] <- x9 (0x00000000)
10672: 42e98: 40e78933 sub x18,x15,x14 x18: 0x00000000
10673: 42e9c: 02e78263 beq x15,x14,42ec0
10674: 42ec0: 00000793 addi x15,x0,0 x15: 0x00000000
10675: 42ec4: 00000713 addi x14,x0,0 x14: 0x00000000
10676: 42ec8: 40e78933 sub x18,x15,x14 x18: 0x00000000
10677: 42ecc: 40295913 srai x18,x18,0x2 x18: 0x00000000
10678: 42ed0: 02e78063 beq x15,x14,42ef0
10679: 42ef0: 00c12083 lw x1,12(x2) x1 : 0x000400e0 <- mem[4fffc]
10680: 42ef4: 00812403 lw x8,8(x2) x8 : 0x00000000 <- mem[4fff8]
10681: 42ef8: 00412483 lw x9,4(x2) x9 : 0x00000000 <- mem[4fff4]
10682: 42efc: 00012903 lw x18,0(x2) x18: 0x00000000 <- mem[4fff0]
10683: 42f00: 01010113 addi x2,x2,16 x2 : 0x00050000
10684: 42f04: 00008067 jalr x0,0(x1) # ret
call_main;
10685: 400e0: 398010ef jal x1,41478 x1 : 0x000400e4
call_main;main;
10686: 41478: f9010113 addi x2,x2,-112 x2 : 0x0004ff90
10687: 4147c: 03000513 addi x10,x0,48 x10: 0x00000030
10688: 41480: 06112623 sw x1,108(x2) mem[4fffc] <- x1 (0x000400e4)
10689: 41484: 06812423 sw x8,104(x2) mem[4fff8] <- x8 (0x00000000)
10690: 41488: 06912223 sw x9,100(x2) mem[4fff4] <- x9 (0x00000000)
10691: 4148c: 07212023 sw x18,96(x2) mem[4fff0] <- x18 (0x00000000)
10692: 41490: 05312e23 sw x19,92(x2) mem[4ffec] <- x19 (0x00000000)
10693: 41494: 05412c23 sw x20,88(x2) mem[4ffe8] <- x20 (0x00000000)
10694: 41498: 05512a23 sw x21,84(x2) mem[4ffe4] <- x21 (0x00000000)
10695: 4149c: 05612823 sw x22,80(x2) mem[4ffe0] <- x22 (0x00000000)
10696: 414a0: 15c010ef jal x1,425fc x1 : 0x000414a4
call_main;main;malloc;
10697: 425fc: 00050593 addi x11,x10,0 x11: 0x00000030
10698: 42600: 8101a503 lw x10,-2032(x3) x10: 0x00043ff8 <- mem[44128]
10699: 42604: 0040006f jal x0,42608
call_main;main;_malloc_r;
10700: 42608: fd010113 addi x2,x2,-48 x2 : 0x0004ff60
10701: 4260c: 01312e23 sw x19,28(x2) mem[4ff7c] <- x19 (0x00000000)
...
A detailed version can be generated by adding --log_state. This will include the entire architectural state after each executed instruction, and all CSRs when any one is modified. It's indented such that it can be completely folded inside the code editor. When folded, it's as compact as the version without state log. Note that this will significantly increase runtime and log size.
_start;
1: 40000: 00000093 addi x1,x0,0 x1 : 0x00000000
PC: 40000
x0 : 0x00000000 x1 : 0x00000000 x2 : 0x00c0ffee x3 : 0x00c0ffee
x4 : 0x00c0ffee x5 : 0x00c0ffee x6 : 0x00c0ffee x7 : 0x00c0ffee
x8 : 0x00c0ffee x9 : 0x00c0ffee x10: 0x00c0ffee x11: 0x00c0ffee
x12: 0x00c0ffee x13: 0x00c0ffee x14: 0x00c0ffee x15: 0x00c0ffee
x16: 0x00c0ffee x17: 0x00c0ffee x18: 0x00c0ffee x19: 0x00c0ffee
x20: 0x00c0ffee x21: 0x00c0ffee x22: 0x00c0ffee x23: 0x00c0ffee
x24: 0x00c0ffee x25: 0x00c0ffee x26: 0x00c0ffee x27: 0x00c0ffee
x28: 0x00c0ffee x29: 0x00c0ffee x30: 0x00c0ffee x31: 0x00c0ffee
...
Folded callstack is saved as callstack_folded_inst.txt and is ready to be used with script/FlameGraph/flamegraph.pl. If other perf events are used, the output will be saved under the appropriate name, e.g. callstack_folded_bp_mispredict.txt
Example snippet from the folded callstack
...
call_main;main;__udivsi3; 22000
call_main;main;__divsi3; 2000
call_main;main;Proc_1;Proc_6; 20000
call_main;main;Proc_1;Proc_7; 4000
call_main;main;Proc_8; 25000
call_main;main;Proc_7; 8000
...
Profiled instructions summary is saved as inst_profile.json. Execution of each supported instruction is counted. Control flow instructions are further broken down based on the direction, and if taken or not. Finally, stack usage is logged together with number of profiled instructions
{
"add": {"count": 26500},
"sub": {"count": 4274},
"sll": {"count": 8},
...
"beq": {"count": 25642, "breakdown": {"taken": 9070, "taken_fwd": 3056, "taken_bwd": 6014, "not_taken": 16572, "not_taken_fwd": 11857, "not_taken_bwd": 4715}},
"bne": {"count": 24472, "breakdown": {"taken": 6610, "taken_fwd": 0, "taken_bwd": 6610, "not_taken": 17862, "not_taken_fwd": 15105, "not_taken_bwd": 2757}},
...
"_max_sp_usage": 368,
"_profiled_instructions": 521118
}Execution trace, saved as trace.bin, contains trace_entry struct for each executed instruction, and it's needed as an input for the analysis scripts (below)
Similarly, register file usage is saved as rf_usage.bin. As each instruction is executed, appropriate counters are incremented based on the used register(s), and the type of usage (destination or source)
Stats for three hardware models are available as hw_stats.json
{
"icache": {
"references": 529819,
"hits": {"reads": 526687, "writes": 0},
"misses": {"reads": 3132, "writes": 0},
"replacements": 3068,
"writebacks": 0,
"ct_core": {"reads": 2119276, "writes": 0},
"ct_mem": {"reads": 200448, "writes": 0},
"size": {"data": 4096, "tags": 41, "metadata": 33, "sets": 32, "ways": 2, "line_size": 64}
},
"dcache": {
"references": 162717,
"hits": {"reads": 86166, "writes": 76345},
"misses": {"reads": 29, "writes": 177},
"replacements": 142,
"writebacks": 142,
"ct_core": {"reads": 304181, "writes": 285631},
"ct_mem": {"reads": 13184, "writes": 9088},
"size": {"data": 4096, "tags": 49, "metadata": 41, "sets": 16, "ways": 4, "line_size": 64}
},
"bpred": {
"type": "bimodal",
"branches": 63985,
"predicted": 55284,
"predicted_fwd": 36173,
"predicted_bwd": 19111,
"mispredicted": 8701,
"mispredicted_fwd": 4412,
"mispredicted_bwd": 4289,
"accuracy": 86.40,
"mpki": 16.70,
"size": 12
},
"profiled_inst": 521118,
"_done": true
} Additionally, detailed breakdowns of all branches and predictor stats for each are available in branches.csv
PC,Direction,Funct3,Funct3_mn,Taken,Not_Taken,All,Taken%,P_bimodal,P_bimodal%,Best,P_best,P_best%,Pattern
4020c,F,1,bne,0,1000,1000,0.0,1000,100.0,bimodal,1000,100.0,1000N
40230,F,1,bne,0,1000,1000,0.0,999,99.9,bimodal,999,99.9,1000N
40234,F,1,bne,0,1000,1000,0.0,1000,100.0,bimodal,1000,100.0,1000N
40250,F,1,bne,0,1000,1000,0.0,999,99.9,bimodal,999,99.9,1000N
...
411d4,B,6,bltu,0,1000,1000,0.0,999,99.9,bimodal,999,99.9,1000N
411dc,F,6,bltu,1000,1000,2000,50.0,1000,50.0,bimodal,1000,50.0,1T 1N 1T 1N 1T 1N 1T 1N 1T ...
411f0,B,1,bne,1000,1000,2000,50.0,1000,50.0,bimodal,1000,50.0,1T 1N 1T 1N 1T 1N 1T 1N 1T ...
...Collection of custom and open source tools are provided for profiling, analysis, and visualization
Similar to the GNU Profiler gprof, flat profile script provides samples/time spent in all executed functions, and prints it to the stdout
./script/prof_stats.py -t examples/dhrystone_dhrystone_out/callstack_folded_inst.txt -e inst --plotOutput (also available under examples/dhrystone_dhrystone_out/prof_stats_exec.txt):
Profile - Inst
% cumulative self total symbol
16.54 16.54 86172 86172 strcpy
12.15 28.68 63291 510433 main
10.94 39.62 57000 57000 strcmp
10.36 49.98 54000 90000 Proc_1
6.00 55.98 31272 41316 _write
5.37 61.36 28000 90000 Func_2
4.88 66.24 25428 25428 __udivsi3
4.80 71.03 25000 25000 Proc_8
4.12 75.15 21475 90087 mini_vpprintf
3.84 78.99 20000 23000 Proc_6
3.83 82.82 19968 61284 __puts_uart
2.88 85.70 15000 15000 Func_1
2.30 88.00 12000 12000 Proc_7
2.04 90.04 10616 10616 clear_bss_w_loop
1.93 91.97 10044 10044 send_byte_uart0
1.73 93.70 9000 9000 Proc_2
1.73 95.42 9000 9000 Proc_3
1.73 97.15 9000 9000 Proc_4
total_samples : 521118
(Showing top 18 of 40 entries after filtering - Threshold: 1%)
A lightweight wrapper is provided for chart formatting and annotation around the open source FlameGraph scripts
./script/get_flamegraph.py examples/dhrystone_dhrystone_out/callstack_folded_inst.txtOpen the generated interactive flamegraph_inst.svg in the web browser
Similarly, a lightweight wrapper around the open source gprof2dot tool is provided for generating Call Graphs from the same folded callstack
Generate DOT and PNG with
./script/get_call_graph.py examples/dhrystone_dhrystone_out/callstack_folded_inst.txt --pngWrapper additionally provides a few other options for saving the output call graph, other than the mandatory .dot format. Full usage available in examples/get_call_graph.help
The .dot format can be opened directly with
xdot callstack_folded_inst.dot
Main analysis script. Full usage available in examples/run_analysis.help
Register file usage. Full usage available in examples/rf_usage.help
Note
Running any of the below commands with -b will open (or host with --host) interactive session in the browser
Get timeline plot
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --timelineGet stats trace (adjust window sizes as needed)
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --stats_trace --win_size_stats 512 --win_size_hw 64 --save_decoded_traceGet execution breakdown
./script/run_analysis.py -i examples/dhrystone_dhrystone_out/inst_profile.jsonGet execution histograms
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --pc_hist --add_cache_lines
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --dmem_hist --add_cache_linesGet execution trace
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --pc_trace --add_cache_lines
./script/run_analysis.py -t examples/dhrystone_dhrystone_out/trace.bin --dasm sw/baremetal/dhrystone/dhrystone.dasm --dmem_trace --add_cache_linesOptionally, save symbols found in dasm with --save_symbols
Note
Any time ./script/run_analysis.py is invoked with --dasm arg, backannotated dasm will be saved, e.g dhrystone.prof.dasm
Generate register file usage plots and save csv
./script/rf_usage.py examples/dhrystone_dhrystone_out/rf_usage.bin --save_csvTimeline
Stats Trace
Execution breakdown
Execution histograms
Execution trace
Register file usage
Backannotation of disassembly
Adding --print_symbols to any of the commands using -t will print all found symbols to stdout
Symbols found in ../sw/baremetal/dhrystone/dhrystone.dasm in 'text' section:
0x430EC - 0x433D8: _free_r (0)
0x42FAC - 0x430E8: _malloc_trim_r (0)
0x42F08 - 0x42FA8: strcpy (86.2k)
0x42E7C - 0x42F04: __libc_init_array (20)
0x42E20 - 0x42E78: _sbrk_r (30)
0x42E1C - 0x42E1C: __malloc_unlock (2)
0x42E18 - 0x42E18: __malloc_lock (2)
0x42608 - 0x42E14: _malloc_r (198)
0x425FC - 0x42604: malloc (6)
0x425D4 - 0x425F8: _sbrk (17)
0x42298 - 0x425D0: mini_vpprintf (21.5k)
0x42224 - 0x42294: _puts (0)
0x42018 - 0x42220: mini_pad (819)
0x41EB0 - 0x42014: mini_itoa (2.46k)
0x41E88 - 0x41EAC: mini_strlen (500)
0x41E2C - 0x41E84: trap_handler (0)
0x41E00 - 0x41E28: timer_interrupt_handler (0)
0x41DEC - 0x41DFC: get_cpu_time (10)
0x41D98 - 0x41DE8: mini_printf (1.26k)
0x41D64 - 0x41D94: __puts_uart (20.0k)
0x41D14 - 0x41D60: _write (31.3k)
0x41CFC - 0x41D10: send_byte_uart0 (10.0k)
0x41CD4 - 0x41CF8: time_s (20)
0x41C14 - 0x41CD0: Proc_6 (20.0k)
0x41C08 - 0x41C10: Func_3 (3.00k)
0x41B8C - 0x41C04: Func_2 (28.0k)
0x41B6C - 0x41B88: Func_1 (15.0k)
0x41B08 - 0x41B68: Proc_8 (25.0k)
0x41AF8 - 0x41B04: Proc_7 (12.0k)
0x41478 - 0x41AF4: main (63.3k)
0x41468 - 0x41474: Proc_5 (4.00k)
0x41444 - 0x41464: Proc_4 (9.00k)
0x412F4 - 0x41440: Proc_1 (54.0k)
0x412D0 - 0x412F0: Proc_3 (9.00k)
0x412A8 - 0x412CC: Proc_2 (9.00k)
0x4125C - 0x412A4: __clzsi2 (0)
0x4122C - 0x41258: __modsi3 (0)
0x411F8 - 0x41228: __umodsi3 (228)
0x411B0 - 0x411F4: __hidden___udivsi3 (25.4k)
0x411A8 - 0x411AC: __divsi3 (2.00k)
0x41184 - 0x411A4: __mulsi3 (32)
0x41074 - 0x41180: __floatsisf (0)
0x41004 - 0x41070: __fixsfsi (0)
0x40CB8 - 0x41000: __mulsf3 (0)
0x40944 - 0x40CB4: __divsf3 (0)
0x4037C - 0x40940: __udivdi3 (194)
0x40200 - 0x40378: strcmp (57.0k)
0x400EC - 0x401FC: trap_entry (0)
0x400E8 - 0x400E8: forever (0)
0x400DC - 0x400E4: call_main (2)
0x400CC - 0x400D8: clear_bss_b_loop (0)
0x400C8 - 0x400C8: clear_bss_b_check (1)
0x400B8 - 0x400C4: clear_bss_w_loop (10.6k)
0x40000 - 0x400B4: _start (46)
It also backannotates the disassembly and saves it as dhrystone.prof.dasm
00040000 <_start>:
1 40000: 00000093 addi x1,x0,0
1 40004: 00000113 addi x2,x0,0
1 40008: 00000193 addi x3,x0,0
1 4000c: 00000213 addi x4,x0,0
1 40010: 00000293 addi x5,x0,0
1 40014: 00000313 addi x6,x0,0
...
000400b8 <clear_bss_w_loop>:
2654 400b8: 00052023 sw x0,0(x10)
2654 400bc: 00450513 addi x10,x10,4
2654 400c0: fff68693 addi x13,x13,-1
2654 400c4: fe069ae3 bne x13,x0,400b8 <clear_bss_w_loop>
...
00041468 <Proc_5>:
1000 41468: 04100693 addi x13,x0,65
1000 4146c: 82d186a3 sb x13,-2003(x3) # 44145 <Ch_1_Glob>
1000 41470: 8201a823 sw x0,-2000(x3) # 44148 <Bool_Glob>
1000 41474: 00008067 jalr x0,0(x1)
...
By default, it uses script/hw_perf_metrics_v2.yaml
This microarchitecture description is based on the ama-riscv implementation
Entires other than *_mhz and *_name specify the number of cycles/stages it takes for instruction or hardware block to produce the result. Value of 1 implies a single pipeline stage, while 0 would be equivalent of a fully combinational path
cpu_frequency_mhz: 100
pipeline: 5
# bp and caches
bp_hit: 1
bp_miss: 3
icache_hit: 1
icache_miss: 5
dcache_hit: 1
dcache_miss: 5
dcache_writeback: 3
# pipeline latencies
jump_direct: 1
jump_indirect: 3 # latency to resolution
mem: 5
mul: 2
div: 6
dot: 2
dcache_load: 2 # on load-to-use, otherwise irrelevant
# names
icache_name: icache
dcache_name: dcache
bpred_name: bpredRun with positional arguments as
./script/perf_est_v2.py examples/dhrystone_dhrystone_out/inst_profile.json examples/dhrystone_dhrystone_out/hw_stats.json -e examples/dhrystone_dhrystone_out/exec.logSince the hardware stats are collected from the ISA model, and simple microarchitecure description, there is some uncertainty on how much time exactly it would take to execute the workload. Estimates are therefore provided as a range between best and worst case.
Performance estimate breakdown for:
examples/dhrystone_dhrystone_out/inst_profile.json
examples/dhrystone_dhrystone_out/hw_stats.json
<home_path>/sim/script/hw_perf_metrics_v2.yaml
Peak Stack usage: 368 bytes
Instructions executed: 521.1k
icache (32 sets, 2 ways, 4096B data): References: 529.8k, Hits: 526.7k, Misses: 3.13k, Hit Rate: 99.41%, MPKI: 6.01
DMEM inst: 166.1k - L/S: 87.9k/78.2k (31.87% instructions)
dcache (16 sets, 4 ways, 4096B data): References: 162.7k, Hits: 162.5k, Misses: 206, Writebacks: 142, Hit Rate: 99.87%, MPKI: 0.40
Branches: 63985 (12.28% instructions)
Branch Predictor (bimodal): Predicted: 55.3k, Mispredicted: 8.70k, Accuracy: 86.40%, MPKI: 16.7
Pipeline stalls (max):
Bad spec: 17.4k
FE bound: 61.1k - ICache: 15.7k (AMAT: 1.02), Core: 45.4k
BE bound: 6.30k - DCache: 1.46k (AMAT: 1.01), Core: 4.84k
Estimated HW performance at 100MHz:
Best: Cycles: 599.6k, CPI: 1.151 (IPC: 0.869), Time: 6.00ms, MIPS: 86.9
Worst: Cycles: 605.9k, CPI: 1.163 (IPC: 0.860), Time: 6.06ms, MIPS: 86.0
Estimated Cycles range: 6.30k cycles, midpoint: 602.7k, ratio: 1.04%
ISA simulator provides two types of parametrizable hardware models - caches and branch predictors. It's useful to sweep across various configurations with chosen workloads and compare their performances, as well as size and hardware complexity tradeoffs. These results then drive the decision on which configuration to implement in hardware
Full usage available in examples/hw_model_sweep.help
Icache and Dcache share the config file for workloads, but have separate hardware parameters: script/hw_model_sweep_params_caches.json
Sweep for Icache can be run with
./script/hw_model_sweep.py -p ./script/hw_model_sweep_params_caches.json --save_stats --sweep icache --trackDcache only needs --sweep parameter change
./script/hw_model_sweep.py -p ./script/hw_model_sweep_params_caches.json --save_stats --sweep dcache --trackAdd --load_stats if the sweep has already been run and only charts need to be regenerated
With --save_stats, output stats of each workload, and a combined average, are saved as .json files:
- icache at examples/hw_sweeps/sweep_icache_workloads_searched_best.json
- dcache at examples/hw_sweeps/sweep_dcache_workloads_searched_best.json
Detailed per workload result plots are available as PDFs:
- icache at examples/hw_sweeps/sweep_icache_results.pdf
- dcache at examples/hw_sweeps/sweep_dcache_results.pdf
Icache average stats across all workloads

Dcache average stats across all workloads

Similarly to caches, branch predictor sweeps are also specified through a config file with appropriate parameters: script/hw_model_sweep_params_bp.json
Branch predictors are evaluated both for accuracy and MPKI (misses per 1k instructions).
Sweep for branch predictors can be run with
./script/hw_model_sweep.py -p ./script/hw_model_sweep_params_bp.json --save_stats --sweep bpred --track --bp_top_acc_thr 70 --save_pngThis also ignores all predictors with accuracy below 70%
With --save_stats, best and binned output stats of each workload, and a combined average, are saved as .json files, e.g.
- All workloads average, best predictors: examples/hw_sweeps/sweep_bpred_workloads_all_best.json
- All workloads average, predictors binned for size: examples/hw_sweeps/sweep_bpred_workloads_all_binned.json
Due to the large exploration space, it's impractical to run all workloads for all predictors. Instead, flow can (and does) use "skip_search" switch. Setting this to true will remove workload from the sweep search, and will only use it in the second, 'evaluation', pass. It's a tradeoff between accuracy and runtime. Leaving too few and/or unrepresentative workloads would result in poor performance once all workloads are evaluated
When workloads are skipped during search, an additional set of logs is available as sweep_*_workloads_searched_*.json that includes only stats of those predictors used for the sweep
Detailed per workload result plots are available as PDF at examples/hw_sweeps/sweep_bpred_results.pdf
Branch predictor stats across searched workloads only

Branch predictor stats across all workloads

Each program is in its own directory with the provided Makefile
Simulation ends when tohost LSB is set
The exact build command is test dependent, but all tests follow the same general structure with the Makefile recipes. A detailed description is available in the baremetal README
All tests and workloads are under sw/baremetal
Each workload can be built individually. A more common, and easier approach is to use the provided python build script, described in Gtest chapter below
Some of the standard benchmarks are available:
- Dhrystone
- CoreMark
- Embench
- STREAM-INT (STREAM benchmark modified to use
int32_tinstead offloat/double)
Quantized MLP NN is available under mlp in three flavors:
- w2a8.elf
- w4a8.elf
- w8a8.elf
NN is used as a showcase of Custom packed SIMD ISA capabilities and speed-up
Official RISC-V ISA tests are also provided
cd sw/riscv-isa-testsBuild all tests
# uses default DIR=riscv-tests/isa/rv32ui
make -j8To add CSR test
make -j8 DIR=modified_riscv-tests/isa/rv32mi/GTest harness is used to run all available tests, while keeping the PASS/FAIL status and sim output of each test
cd test
makePrepare RISC-V tests with
./prepare_riscv_tests.py --testlist testlist.json --isa_testsThis also produces gtest_testlist.txt which lists complete paths on disk for each generated .elf file to be used by GTest
Makefile can also be used to prepare all tests specified under test/testlist.json and --isa_tests (convenience wrapper around the above python script)
make prepare_testsGTest can then be run with
make -j8Test outputs are stored under *_out directory, while stdout is stored under *_dump.log, e.g. for Dhrystone: dhrystone_dhrystone_out and dhrystone_dhrystone_dump.log
By default, use
make -j8Options can be passed to make by using USER_DEFINES= argument, e.g.
make -j8 USER_DEFINES=-DRV32C
Optional build switches
-DDASM_EN - allows recording of the execution log, enabled by default with DEFINES make variable
-DRV32C- enables support for compressed RISC-V instructions (C ext.), disabled by default
-DUART_INPUT_EN - enables user interaction through UART, disabled by default
-DDPI - build targetting DPI environment, disabled by default
32-bit packed SIMD ISA with vector elements 16, 8, 4, and 2-bit wide
Includes instructions for using Dcache as scratchpad
Arithmetic & Logic
| Name | Mnemonic |
|---|---|
| Addition 16-bit | add16 rd, rs1, rs2 |
| Addition 8-bit | add8 rd, rs1, rs2 |
| Subtraction 16-bit | sub16 rd, rs1, rs2 |
| Subtraction 8-bit | sub8 rd, rs1, rs2 |
| Addition Signed Saturated 16-bit | qadd16 rd, rs1, rs2 |
| Addition Unsigned Saturated 16-bit | qadd16u rd, rs1, rs2 |
| Addition Signed Saturated 8-bit | qadd8 rd, rs1, rs2 |
| Addition Unsigned Saturated 8-bit | qadd8u rd, rs1, rs2 |
| Subtraction Signed Saturated 16-bit | qsub16 rd, rs1, rs2 |
| Subtraction Unsigned Saturated 16-bit | qsub16u rd, rs1, rs2 |
| Subtraction Signed Saturated 8-bit | qsub8 rd, rs1, rs2 |
| Subtraction Unsigned Saturated 8-bit | qsub8u rd, rs1, rs2 |
| Widening Multiply Signed 16-bit | wmul16 rd, rs1, rs2 |
| Widening Multiply Unsigned 16-bit | wmul16u rd, rs1, rs2 |
| Widening Multiply Signed 8-bit | wmul8 rd, rs1, rs2 |
| Widening Multiply Unsigned 8-bit | wmul8u rd, rs1, rs2 |
| Dot Product Signed 16-bit | dot16 rd, rs1, rs2 |
| Dot Product Unsigned 16-bit | dot16u rd, rs1, rs2 |
| Dot Product Signed 8-bit | dot8 rd, rs1, rs2 |
| Dot Product Unsigned 8-bit | dot8u rd, rs1, rs2 |
| Dot Product Signed 4-bit | dot4 rd, rs1, rs2 |
| Dot Product Unsigned 4-bit | dot4u rd, rs1, rs2 |
| Dot Product Signed 2-bit | dot2 rd, rs1, rs2 |
| Dot Product Unsigned 2-bit | dot2u rd, rs1, rs2 |
| Min Signed 16-bit | min16 rd, rs1, rs2 |
| Min Unsigned 16-bit | min16u rd, rs1, rs2 |
| Min Signed 8-bit | min8 rd, rs1, rs2 |
| Min Unsigned 8-bit | min8u rd, rs1, rs2 |
| Max Signed 16-bit | max16 rd, rs1, rs2 |
| Max Unsigned 16-bit | max16u rd, rs1, rs2 |
| Max Signed 8-bit | max8 rd, rs1, rs2 |
| Max Unsigned 8-bit | max8u rd, rs1, rs2 |
| Shift Left Logical Imm 16-bit | slli16 rd, rs1, imm |
| Shift Left Logical Imm 8-bit | slli8 rd, rs1, imm |
| Shift Right Logical Imm 16-bit | srli16 rd, rs1, imm |
| Shift Right Logical Imm 8-bit | srli8 rd, rs1, imm |
| Shift Right Arithmetic Imm 16-bit | srai16 rd, rs1, imm |
| Shift Right Arithmetic Imm 8-bit | srai8 rd, rs1, imm |
Data Format
| Name | Mnemonic |
|---|---|
| Widen from Signed 16-bit | widen16 rd, rs1 |
| Widen from Unsigned 16-bit | widen16u rd, rs1 |
| Widen from Signed 8-bit | widen8 rd, rs1 |
| Widen from Unsigned 8-bit | widen8u rd, rs1 |
| Widen from Signed 4-bit | widen4 rd, rs1 |
| Widen from Unsigned 4-bit | widen4u rd, rs1 |
| Widen from Signed 2-bit | widen2 rd, rs1 |
| Widen from Unsigned 2-bit | widen2u rd, rs1 |
| Narrow from 32-bit | narrow32 rd, rs1, rs2 |
| Narrow from 16-bit | narrow16 rd, rs1, rs2 |
| Narrow from 8-bit | narrow8 rd, rs1, rs2 |
| Narrow from 4-bit | narrow4 rd, rs1, rs2 |
| Saturating Narrow from Signed 32-bit | qnarrow16 rd, rs1, rs2 |
| Saturating Narrow from Unsigned 32-bit | qnarrow16u rd, rs1, rs2 |
| Saturating Narrow from Signed 16-bit | qnarrow8 rd, rs1, rs2 |
| Saturating Narrow from Unsigned 16-bit | qnarrow8u rd, rs1, rs2 |
| Saturating Narrow from Signed 8-bit | qnarrow4 rd, rs1, rs2 |
| Saturating Narrow from Unsigned 8-bit | qnarrow4u rd, rs1, rs2 |
| Saturating Narrow from Signed 4-bit | qnarrow2 rd, rs1, rs2 |
| Saturating Narrow from Unsigned 4-bit | qnarrow2u rd, rs1, rs2 |
| Transpose across pair 16-bit | txp16 rd, rs1, rs2 |
| Transpose across pair 8-bit | txp8 rd, rs1, rs2 |
| Transpose across pair 4-bit | txp4 rd, rs1, rs2 |
| Transpose across pair 2-bit | txp2 rd, rs1, rs2 |
Vector-Scalar
| Name | Mnemonic |
|---|---|
| Duplicate Scalar Register 16-bit | dup16 rd, rs1 |
| Duplicate Scalar Register 8-bit | dup8 rd, rs1 |
| Duplicate Scalar Register 4-bit | dup4 rd, rs1 |
| Duplicate Scalar Register 2-bit | dup2 rd, rs1 |
| Insert Scalar into Vector 16-bit | vins16 rd, rs1, idx |
| Insert Scalar into Vector 8-bit | vins8 rd, rs1, idx |
| Insert Scalar into Vector 4-bit | vins4 rd, rs1, idx |
| Insert Scalar into Vector 2-bit | vins2 rd, rs1, idx |
| Extract Scalar from Vector Signed 16-bit | vext16 rd, rs1, idx |
| Extract Scalar from Vector Unsigned 16-bit | vext16u rd, rs1, idx |
| Extract Scalar from Vector Signed 8-bit | vext8 rd, rs1, idx |
| Extract Scalar from Vector Unsigned 8-bit | vext8u rd, rs1, idx |
| Extract Scalar from Vector Signed 4-bit | vext4 rd, rs1, idx |
| Extract Scalar from Vector Unsigned 4-bit | vext4u rd, rs1, idx |
| Extract Scalar from Vector Signed 2-bit | vext2 rd, rs1, idx |
| Extract Scalar from Vector Unsigned 2-bit | vext2u rd, rs1, idx |
Cache Ops
| Name | Mnemonic |
|---|---|
| SCP Load cache line | scp.ld rd, rs1 |
| Release SCP cache line | scp.rel rd, rs1 |
Coming soon
Apply provided patch file. Adds assembler and disassembler support for the custom xsimd packed SIMD extension - registers the vendor extension and defines encodings/opcodes for all custom SIMD instructions
# assuming the already compiled toolchain
# go to binutils dir
cd <toolchain_workdir>/riscv-gnu-toolchain/binutils
# optionally check patch
git apply --stat <workdir>/ama-riscv-sim/binutils.patch
git apply --check <workdir>/ama-riscv-sim/binutils.patch
# apply patch
git apply <workdir>/ama-riscv-sim/binutils.patch
# rebuild just binutils
cd <toolchain_workdir>/riscv-gnu-toolchain/build-binutils-newlib
make -j8 && make install -j8Patch taken from
RISC-V GNU toolchain repo: https://github.com/riscv/riscv-gnu-toolchain
Pulls binutils from: https://sourceware.org/git/binutils-gdb.git
commit c7f28aad0c99d1d2fec4e52ebfa3735d90ceb8e9 (HEAD, tag: binutils-2_42)
Author: Nick Clifton <nickc@redhat.com>
Date: Mon Jan 29 14:51:43 2024 +0000
Update version number to 2.42
Apply provided patch file. Adds GCC compiler recognition of the xsimd extension, allowing it to be specified in -march strings (e.g. -march=rv32imc_xsimd1p0)
# assuming the already compiled toolchain
# go to gcc dir
cd <toolchain_workdir>/riscv-gnu-toolchain/gcc
# optionally check patch
git apply --stat <workdir>/ama-riscv-sim/gcc-14.patch
git apply --check <workdir>/ama-riscv-sim/gcc-14.patch
# apply patch
git apply <workdir>/ama-riscv-sim/gcc-14.patch
# rebuild gcc
cd <toolchain_workdir>/riscv-gnu-toolchain/build-gcc-newlib-stage2
make -j8 && make install -j8Patch taken from
RISC-V GNU toolchain repo: https://github.com/riscv/riscv-gnu-toolchain
Pulls GCC from: https://github.com/gcc-mirror/gcc
Patch targets GCC 14
commit 897cd794d341a3bdd3195e90ebeea054ac80bf65 (grafted, HEAD -> releases/gcc-14, origin/releases/gcc
-14)
Author: GCC Administrator <gccadmin@gcc.gnu.org>
Date: Fri Aug 9 00:22:36 2024 +0000
Daily bump









