Benchmarks
You can learn the following topics from this page:
- How to create or modify the source code of a benchmark to execute on the LegoSim
- How to create the configuration file (YAML format) of a novel benchmark
Because the software stacks of heterogeneous systems are complex, LegoSim cannot be expected to provide a standard software stack for every experimental platform. Hence, task partitioning and task management must be done manually. Each simulator should have its own executable file. For example, different executable files should be provided to SniperSim and GPGPUSim. However, imported simulators can share the same executable file if they perform the same task on different datasets, as in Same-Task-Multiple-Data (STMD).
Take the matmul benchmark as an example. In the matmul benchmark, the CPU generates the source matrices and sends the data to the GPGPUs, while three GPGPUs perform the matrix multiplication. Each GPGPU handles a part of the data. The matmul benchmark provides two executable files, one for SniperSim and one for GPGPUSim. All GPGPUSim simulation processes share the same executable file.
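The manual partitioning described above can be sketched as follows. This is a minimal illustration, not the matmul benchmark's actual code: the `RowRange` struct and the `partitionRows` function are hypothetical names, and the even row-wise split is only an assumption about how each GPGPU might be given its part of the data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type: half-open range [begin, end) of matrix rows
// assigned to one GPGPU process.
struct RowRange {
    int64_t begin;
    int64_t end;
};

// Split `rows` rows as evenly as possible among `workers` GPGPU processes.
// The first (rows % workers) workers receive one extra row.
std::vector<RowRange> partitionRows(int64_t rows, int64_t workers) {
    assert(rows >= 0 && workers > 0);
    std::vector<RowRange> ranges;
    const int64_t base = rows / workers;
    const int64_t extra = rows % workers;
    int64_t begin = 0;
    for (int64_t i = 0; i < workers; ++i) {
        const int64_t len = base + (i < extra ? 1 : 0);
        ranges.push_back({begin, begin + len});
        begin += len;
    }
    return ranges;
}
```

With 100 rows and three GPGPUs, `partitionRows(100, 3)` yields the ranges [0, 34), [34, 67), and [67, 100).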
Several APIs should be added to benchmarks so that processes can communicate and synchronize. On real systems, a complete software stack (such as CUDA) inserts similar API calls.
Communication
syscall_return_t sendMessage(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y, void* __addr, int64_t __nbyte);
syscall_return_t receiveMessage(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y, void* __addr, int64_t __nbyte);
__dst_x and __dst_y specify the destination address. __src_x and __src_y specify the source address. __addr specifies the pointer to the data array. __nbyte determines the number of bytes in the data array.
Functions return the result of the operation. 0 means the transmission operation succeeds, while 1 means the transmission operation fails.
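A minimal sketch of how a benchmark might wrap these two calls is shown below. So that the sketch compiles without LegoSim, it defines in-memory stand-ins for sendMessage and receiveMessage and assumes that `syscall_return_t` is an integer type; in a real benchmark these come from apis_c.h and the linked library. The helpers `sendBlock` and `recvBlock` are hypothetical names, and the stand-in receive does not wait for data, which the real call may do.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <tuple>
#include <vector>

// ASSUMPTION: stand-ins so the sketch is self-contained. A real benchmark
// includes apis_c.h and links libinterchiplet_c.a instead of defining these.
typedef int64_t syscall_return_t;  // assumed integer type: 0 = success, 1 = failure
using Key = std::tuple<int64_t, int64_t, int64_t, int64_t>;
static std::map<Key, std::vector<char>> g_mailbox;  // in-memory mock of the interconnect

syscall_return_t sendMessage(int64_t dst_x, int64_t dst_y, int64_t src_x, int64_t src_y,
                             void* addr, int64_t nbyte) {
    if (addr == nullptr || nbyte < 0) return 1;
    const char* p = static_cast<const char*>(addr);
    g_mailbox[Key(dst_x, dst_y, src_x, src_y)] = std::vector<char>(p, p + nbyte);
    return 0;
}

syscall_return_t receiveMessage(int64_t dst_x, int64_t dst_y, int64_t src_x, int64_t src_y,
                                void* addr, int64_t nbyte) {
    auto it = g_mailbox.find(Key(dst_x, dst_y, src_x, src_y));
    // The mock fails when nothing matching was sent; it never blocks.
    if (it == g_mailbox.end() || static_cast<int64_t>(it->second.size()) != nbyte) return 1;
    if (nbyte > 0) std::memcpy(addr, it->second.data(), static_cast<std::size_t>(nbyte));
    g_mailbox.erase(it);
    return 0;
}

// Hypothetical helper: send a block of values and report success as a bool.
bool sendBlock(int64_t dst_x, int64_t dst_y, int64_t src_x, int64_t src_y,
               std::vector<int64_t>& data) {
    const int64_t nbyte = static_cast<int64_t>(data.size() * sizeof(int64_t));
    return sendMessage(dst_x, dst_y, src_x, src_y, data.data(), nbyte) == 0;
}

// Hypothetical helper: receive into a buffer that is already sized to the
// expected number of elements.
bool recvBlock(int64_t dst_x, int64_t dst_y, int64_t src_x, int64_t src_y,
               std::vector<int64_t>& data) {
    const int64_t nbyte = static_cast<int64_t>(data.size() * sizeof(int64_t));
    return receiveMessage(dst_x, dst_y, src_x, src_y, data.data(), nbyte) == 0;
}
```

Both sides pass the same destination and source addresses and the same byte count; the sender and the receiver differ only in which function they call.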
Lock and unlock
syscall_return_t lock(int64_t __uid, int64_t __src_x, int64_t __src_y);
syscall_return_t unlock(int64_t __uid, int64_t __src_x, int64_t __src_y);
__uid specifies one unique ID of the mutex. __uid should not be the same as any one address in the system. __src_x and __src_y specify the source address.
Functions return the result of the operation. 0 means the operation succeeds, while 1 means the operation fails.
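One way to pair every lock with its unlock is a small scope guard, sketched below. The `ChipletLockGuard` class is a hypothetical name and not part of LegoSim. The `lock` and `unlock` functions here are in-memory stand-ins with an assumed integer `syscall_return_t`, included only so the sketch compiles on its own; the stand-in reports failure when the mutex is already held, whereas the real call may wait instead.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// ASSUMPTION: stand-ins so the sketch is self-contained. A real benchmark
// includes apis_c.h and links libinterchiplet_c.a instead of defining these.
typedef int64_t syscall_return_t;  // assumed integer type: 0 = success, 1 = failure
static std::set<int64_t> g_held;   // mutex IDs currently held in this mock

syscall_return_t lock(int64_t uid, int64_t /*src_x*/, int64_t /*src_y*/) {
    // The mock cannot block, so it fails if the mutex is already held.
    return g_held.insert(uid).second ? 0 : 1;
}

syscall_return_t unlock(int64_t uid, int64_t /*src_x*/, int64_t /*src_y*/) {
    return g_held.erase(uid) == 1 ? 0 : 1;
}

// Hypothetical scope guard: acquires the mutex in the constructor and
// releases it in the destructor, but only if the acquisition succeeded.
class ChipletLockGuard {
public:
    ChipletLockGuard(int64_t uid, int64_t src_x, int64_t src_y)
        : uid_(uid), src_x_(src_x), src_y_(src_y),
          locked_(lock(uid, src_x, src_y) == 0) {}
    ~ChipletLockGuard() {
        if (locked_) unlock(uid_, src_x_, src_y_);
    }
    ChipletLockGuard(const ChipletLockGuard&) = delete;
    ChipletLockGuard& operator=(const ChipletLockGuard&) = delete;
    bool locked() const { return locked_; }

private:
    int64_t uid_;
    int64_t src_x_;
    int64_t src_y_;
    bool locked_;
};
```

The guard releases the mutex on every exit path of the enclosing scope, so an early return cannot leave it held.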
Barrier
syscall_return_t barrier(int64_t __uid, int64_t __src_x, int64_t __src_y, int64_t __count = 0);
__uid specifies one unique ID of the barrier. __uid should not be the same as any one address in the system. __src_x and __src_y specify the source address. __count specifies the number of threads for the barrier.
Functions return the result of the operation. 0 means the operation succeeds, while 1 means the operation fails.
Launch
syscall_return_t launch(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y);
syscall_return_t waitLaunch(int64_t __dst_x, int64_t __dst_y, int64_t* __src_x, int64_t* __src_y);
__dst_x and __dst_y specify the destination address. __src_x and __src_y specify the source address. waitLaunch returns the source of the launch command through __src_x and __src_y.
Functions return the result of the operation. 0 means the operation succeeds, while 1 means the operation fails.
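The launch handshake can be sketched as follows: one process calls launch, and the target process calls waitLaunch to learn who launched it. The `launch` and `waitLaunch` functions below are in-memory stand-ins with an assumed integer `syscall_return_t`, included only so the sketch compiles without LegoSim. The `Launcher` struct and the `acceptLaunch` helper are hypothetical names, and the stand-in fails when no launch is pending, whereas the real call may wait.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <utility>

// ASSUMPTION: stand-ins so the sketch is self-contained. A real benchmark
// includes apis_c.h and links libinterchiplet_c.a instead of defining these.
typedef int64_t syscall_return_t;  // assumed integer type: 0 = success, 1 = failure
using Addr = std::pair<int64_t, int64_t>;
static std::map<Addr, std::deque<Addr>> g_pending;  // destination -> pending launch sources

syscall_return_t launch(int64_t dst_x, int64_t dst_y, int64_t src_x, int64_t src_y) {
    g_pending[Addr(dst_x, dst_y)].push_back(Addr(src_x, src_y));
    return 0;
}

syscall_return_t waitLaunch(int64_t dst_x, int64_t dst_y, int64_t* src_x, int64_t* src_y) {
    auto it = g_pending.find(Addr(dst_x, dst_y));
    // The mock fails when no launch is pending; it never blocks.
    if (it == g_pending.end() || it->second.empty() || !src_x || !src_y) return 1;
    *src_x = it->second.front().first;
    *src_y = it->second.front().second;
    it->second.pop_front();
    return 0;
}

// Hypothetical result type: the address of the process that issued the launch.
struct Launcher {
    int64_t x;
    int64_t y;
    bool ok;
};

// Hypothetical worker-side helper: accept a launch aimed at (my_x, my_y)
// and report the launching address written back by waitLaunch.
Launcher acceptLaunch(int64_t my_x, int64_t my_y) {
    int64_t sx = -1;
    int64_t sy = -1;
    const bool ok = waitLaunch(my_x, my_y, &sx, &sy) == 0;
    return Launcher{sx, sy, ok};
}
```

The address written back through the pointer arguments tells the launched process which source address to use in a subsequent receiveMessage call.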
APIs on CPUs are implemented by system calls. The mapping between APIs and system calls is listed below:
| API | System Call ID |
|---|---|
| launch | SYSCALL_LAUNCH |
| waitLaunch | SYSCALL_WAITLAUNCH |
| lock | SYSCALL_LOCK |
| unlock | SYSCALL_UNLOCK |
| barrier | SYSCALL_BARRIER |
| sendMessage | SYSCALL_REMOTE_WRITE |
| receiveMessage | SYSCALL_REMOTE_READ |
The declaration of these APIs is provided in $SIMULATOR_ROOT/interchiplet/includes/apis_c.h. The implementation of these APIs is compiled into a static library $SIMULATOR_ROOT/interchiplet/lib/libinterchiplet_c.a, which should be linked to the benchmark.
TODO: A more flexible way to specify the source and the destination address.
Communication
cudaError_t sendMessage(int __dst_x, int __dst_y, int __src_x, int __src_y, void* __addr, int __nbyte);
cudaError_t receiveMessage(int __dst_x, int __dst_y, int __src_x, int __src_y, void* __addr, int __nbyte);
__dst_x and __dst_y specify the destination address. __src_x and __src_y specify the source address. __addr specifies the pointer to the data array. __nbyte determines the number of bytes in the data array.
Lock and unlock
cudaError_t lock(int __uid, int __src_x, int __src_y);
cudaError_t unlock(int __uid, int __src_x, int __src_y);
__uid specifies one unique ID of the mutex. __uid should not be the same as any one address in the system. __src_x and __src_y specify the source address.
Barrier
cudaError_t barrier(int __uid, int __src_x, int __src_y, int __count = 0);
__uid specifies one unique ID of the barrier. __uid should not be the same as any one address in the system. __src_x and __src_y specify the source address. __count specifies the number of threads for the barrier.
Launch
cudaError_t launch(int __dst_x, int __dst_y, int __src_x, int __src_y);
cudaError_t waitLaunch(int __dst_x, int __dst_y, int* __src_x, int* __src_y);
__dst_x and __dst_y specify the destination address. __src_x and __src_y specify the source address. waitLaunch returns the source of the launch command through __src_x and __src_y.
The operation result is returned by __res. 0 means the transmission operation succeeds, while 1 means the transmission operation fails.
The declaration of these APIs is provided in $SIMULATOR_ROOT/interchiplet/includes/apis_cu.h. The implementation of these APIs is provided by the CUDA environment.
TODO: A more flexible way to specify the source and the destination address.
The execution process is controlled by a YAML configuration file. One benchmark must have at least one YAML configuration file. More configuration files can be created to describe different configurations of one benchmark.
The example structure of the YAML file is as follows:
# Phase 1 configuration.
phase1:
  # Process 0
  - cmd: "$BENCHMARK_ROOT/bin/matmul_cu"
    args: ["0", "1"]
    log: "gpgpusim.0.1.log"
    is_to_stdout: false
    pre_copy: "$SIMULATOR_ROOT/gpgpu-sim/configs/tested-cfgs/SM2_GTX480/*"
    clock_rate: 1
  # Process 1
  - cmd: "$BENCHMARK_ROOT/bin/matmul_cu"
    args: ["1", "0"]
    log: "gpgpusim.1.0.log"
    is_to_stdout: false
    pre_copy: "$SIMULATOR_ROOT/gpgpu-sim/configs/tested-cfgs/SM2_GTX480/*"
    clock_rate: 1
  ......
# Phase 2 configuration.
phase2:
  # Process 0
  - cmd: "$SIMULATOR_ROOT/popnet/popnet"
    args: ["-A", "2", "-c", "2", "-V", "3", "-B", "12", "-O", "12", "-F", "4", "-L", "1000", "-T", "10000000", "-r", "1", "-I", "../bench.txt", "-R", "0"]
    log: "popnet.log"
    is_to_stdout: false
    clock_rate: 1
In the above configuration file, the first-level tags are:
- phase1 provides the configuration for processes in Phase 1.
- phase2 provides the configuration for processes in Phase 2.
Both phase1 and phase2 accept a list of process configuration structures. Each structure corresponds to one parallel simulator process.
Configuration structures provide the following tags:
- cmd specifies the command of the simulator. A string is accepted. The environment variables $BENCHMARK_ROOT and $SIMULATOR_ROOT are supported to describe the path of the simulator.
- args specifies the arguments of the simulator. A list of strings is accepted. The environment variables $BENCHMARK_ROOT and $SIMULATOR_ROOT are also supported to specify the path of related files. cmd and args are combined into the shell command that executes one simulator.
- log specifies the name of the log file. A string is accepted. Neither an absolute path nor a relative path is supported. The log file is stored in the sub-directory of each simulation process.
- is_to_stdout specifies whether the standard output and standard error output of this simulator process are redirected to the standard output of interchiplet.
- pre_copy provides a list of files that should be copied to the sub-directory of this simulation process before calling the simulator. A string is accepted. If there are multiple files to copy, they are separated by spaces.
- clock_rate provides a floating-point number as the ratio between the chiplet clock (the clock of the simulator) and the system clock (the clock of interchiplet).
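How the tags above turn into something executable can be sketched as follows. This is only an illustration of the description, not interchiplet's actual code: `buildCommand` and `splitPreCopy` are hypothetical helpers, and the sketch joins arguments with single spaces without any shell quoting, which is a simplification.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: combine cmd and args into one shell command line.
// Simplification: arguments are joined with spaces and are not quoted.
std::string buildCommand(const std::string& cmd, const std::vector<std::string>& args) {
    std::string line = cmd;
    for (const std::string& arg : args) {
        line += ' ';
        line += arg;
    }
    return line;
}

// Hypothetical helper: split a pre_copy string on whitespace into the
// individual file patterns to copy.
std::vector<std::string> splitPreCopy(const std::string& pre_copy) {
    std::istringstream in(pre_copy);
    std::vector<std::string> files;
    std::string file;
    while (in >> file) {
        files.push_back(file);
    }
    return files;
}
```

For the first process in the example file, `buildCommand("$BENCHMARK_ROOT/bin/matmul_cu", {"0", "1"})` produces `$BENCHMARK_ROOT/bin/matmul_cu 0 1`, with the environment variable still to be expanded.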
TODO: Change pre_copy to pre_cmd.
The following environment variables are supported when writing a benchmark configuration file.
- $BENCHMARK_ROOT specifies the root path of the benchmark, determined by the location of the YAML configuration file.
- $SIMULATOR_ROOT specifies the root path of LegoSim, set by setup_env.sh.
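The substitution of these two variables can be sketched as plain string replacement. This is a minimal sketch under stated assumptions, not interchiplet's implementation: `replaceAll` and `expandRoots` are hypothetical helpers, and the root paths used in the usage note are made-up examples.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replace every occurrence of `name` in `text` with `value`.
std::string replaceAll(std::string text, const std::string& name, const std::string& value) {
    assert(!name.empty());
    std::size_t pos = 0;
    while ((pos = text.find(name, pos)) != std::string::npos) {
        text.replace(pos, name.size(), value);
        pos += value.size();  // continue after the inserted value
    }
    return text;
}

// Hypothetical helper: expand $BENCHMARK_ROOT and $SIMULATOR_ROOT in a
// cmd, args, or pre_copy string taken from the configuration file.
std::string expandRoots(const std::string& text,
                        const std::string& benchmark_root,
                        const std::string& simulator_root) {
    const std::string step = replaceAll(text, "$BENCHMARK_ROOT", benchmark_root);
    return replaceAll(step, "$SIMULATOR_ROOT", simulator_root);
}
```

For example, with an assumed benchmark root of `/home/user/matmul`, the string `$BENCHMARK_ROOT/bin/matmul_cu` expands to `/home/user/matmul/bin/matmul_cu`.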