
Imported Simulator


You can learn the following topics from this page:

  • How to import a new simulator.
  • Functions and APIs provided for imported simulators.
  • What has been modified in SniperSim and GPGPUSim.

The tasks to import a new simulator

  1. Provide an implementation of APIs in benchmarks. See Benchmark for the API list.
  2. Issue the CYCLE command to report the end cycle if this simulator's execution cycle counts toward the total execution cycle of the simulation. In general, CPU simulators should issue CYCLE commands.
  3. Issue the PIPE command before opening Pipes and reading/writing data in the functional model.
  4. Issue READ/WRITE commands in the timing model and adjust the execution cycle when the SYNC command is received. A combined sketch follows this list.
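
Putting these tasks together, the write path of an imported simulator reduces to the following minimal sketch. It uses the utility APIs from pipe_comm.h (described under "Utility APIs" below); the argument lists of pipeSync and writeSync are assumptions, and advance_to is a placeholder for the simulator-specific clock adjustment.

#include <cstdint>
#include "pipe_comm.h"

InterChiplet::PipeComm global_pipe_comm;

// Placeholder for the simulator-specific clock adjustment (a Sleep instruction
// in SniperSim, chiplet_direct_set_cycle in GPGPUSim).
void advance_to(uint64_t end_cycle);

// One write from chiplet (src_x, src_y) to (dst_x, dst_y). In a real port the
// functional and timing halves live in different places; they are shown
// together here only for brevity.
void write_example(int src_x, int src_y, int dst_x, int dst_y,
                   void* data, int nbytes, uint64_t cur_cycle) {
    // Functional model: PIPE before opening the pipe, then transfer the data.
    InterChiplet::SyncProtocol::pipeSync(src_x, src_y, dst_x, dst_y);
    char* fileName = InterChiplet::SyncProtocol::pipeName(src_x, src_y, dst_x, dst_y);
    global_pipe_comm.write_data(fileName, data, nbytes);

    // Timing model: WRITE with the current cycle, then adopt the SYNC cycle.
    uint64_t end_cycle = InterChiplet::SyncProtocol::writeSync(
        cur_cycle, src_x, src_y, dst_x, dst_y, nbytes);
    advance_to(end_cycle);
}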

SniperSim

SniperSim is a trace-based CPU simulator that can achieve high speed and reasonable accuracy.

APIs

The read and write APIs are implemented as system calls. The following system call numbers are assigned to these two APIs.

SYSCALL_SEND_TO_GPU = 508,          // Send data to GPU
SYSCALL_READ_FROM_GPU = 509,        // Read data from GPU

These system calls take the following arguments: source address, destination address, a pointer to the data array, and the amount of data in bytes.
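
On the benchmark side, these can be invoked through the ordinary Linux syscall interface, which SniperSim intercepts. A minimal sketch (the exact encoding of the source and destination addresses is an assumption; the helper name is illustrative):

#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical sketch: send nbytes of data from src_addr to the GPU at dst_addr.
static void send_to_gpu(long src_addr, long dst_addr, void* data, long nbytes) {
    syscall(508 /* SYSCALL_SEND_TO_GPU */, src_addr, dst_addr, data, nbytes);
}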

Handle Read/Write System Calls

SniperSim provides separate functional and timing models, so the APIs are handled separately in each model.

In the functional model, system calls are handled in file $SIMULATOR_ROOT/snipersim/sift/recorder/syscall_modeling.cc. The flow chart is as follows:

flowchart TD

subgraph Write Syscall
A1[Issue PIPE command]
B1[Wait for SYNC command]
C1[Open PIPE]
D1[Write data to PIPE]
end

A1-->B1-->C1-->D1
B1-->B1

subgraph Read Syscall
A2[Issue PIPE command]
B2[Wait for SYNC command]
C2[Open PIPE]
D2[Read data from PIPE]
end

A2-->B2-->C2-->D2
B2-->B2
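In code, the read branch of this flow maps onto the utility APIs described under "Utility APIs" below. A sketch, assuming pipeSync takes the chiplet coordinates and that data and nbytes come from the syscall arguments:

// Functional-model read syscall, sketched: PIPE and SYNC first, then the pipe
// is opened (once, on first use) and the data is read.
InterChiplet::SyncProtocol::pipeSync(src_x, src_y, dst_x, dst_y);
char* fileName = InterChiplet::SyncProtocol::pipeName(src_x, src_y, dst_x, dst_y);
global_pipe_comm.read_data(fileName, data, nbytes);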

In the timing model, system calls are handled in file $SIMULATOR_ROOT/snipersim/common/core/syscall_model.cc. The flow chart is as follows:

flowchart TD

subgraph Write Syscall
A1[Get current execution cycle]
B1[Issue WRITE command]
C1[Wait for SYNC command]
D1[Sleep core until cycle specified by SYNC command]
end

A1-->B1-->C1-->D1
C1-->C1

subgraph Read Syscall
A2[Get current execution cycle]
B2[Issue READ command]
C2[Wait for SYNC command]
D2[Sleep core until cycle specified by SYNC command]
end

A2-->B2-->C2-->D2
C2-->C2
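The end_time used by the sleep logic below is the cycle returned by the SYNC command; it can be obtained through writeSync (a sketch, argument list assumed):

// Timing model: report the WRITE and receive the synchronized end cycle.
uint64_t end_time = InterChiplet::SyncProtocol::writeSync(
    cur_cycle, src_x, src_y, dst_x, dst_y, nbytes);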

SniperSim is not a cycle-driven simulator, so the execution cycle cannot be changed by modifying the value of some variable. Instead, a Sleep instruction is injected into the timing model; its duration equals the gap between the cycle at which the READ/WRITE command is issued and the cycle specified by the corresponding SYNC command.

// Update simulator time.
ComponentPeriod time_wake_period = *(Sim()->getDvfsManager()->getGlobalDomain()) * end_time;
SubsecondTime time_wake = time_wake_period.getPeriod();
SubsecondTime sleep_end_time;
Sim()->getSyscallServer()->handleSleepCall(m_thread->getId(), time_wake, start_time, sleep_end_time);

// Sleep core until specified time.
if (m_thread->reschedule(sleep_end_time, core))
    core = m_thread->getCore();

core->getPerformanceModel()->queuePseudoInstruction(new SyncInstruction(sleep_end_time, SyncInstruction::SLEEP));

Issue CYCLE command

Because the CPU always controls the flow of benchmarks, the CPU's execution cycle plays a vital role in the total execution cycle of the simulation. The CYCLE command is issued in file $SIMULATOR_ROOT/snipersim/common/core/core.cc.
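
A minimal sketch of the call site, assuming the final cycle count has already been computed as end_cycle:

// Report the CPU's final execution cycle once simulation finishes.
InterChiplet::SyncProtocol::sendCycleCmd(end_cycle);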

GPGPUSim

GPGPUSim is a cycle-accurate simulator that models the architecture of NVIDIA GPGPUs.

APIs

Unfortunately, the CUDA environment has no equivalent of CPU system calls, so the APIs are emulated through a trick in GPGPUSim: the ADDC instruction with an immediate operand is used to create a pseudo-syscall.

addc.u32 %0, %1, %2;

If the second source operand is 0, the instruction appends the unsigned 32-bit integer in the first source operand to the list of system call arguments. If the second source operand is 1, the instruction performs the function specified by the first source operand. The source code to generate the read system call is as follows:

asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(__dst_x) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(__dst_y) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(__src_x) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(__srx_y) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(lo_data_ptr) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(hi_data_ptr) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(byte_size) , "r"(InterChiplet::CUDA_SYSCALL_ARG));
*__res += t_res;
asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(InterChiplet::SYSCALL_SEND_TO_GPU), "r"(InterChiplet::CUDA_SYSCALL_CMD));
*__res += t_res;

Accumulating the results into the return value __res prevents the compiler from removing instructions whose destination operands are otherwise unused.
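
The opposite transfer direction follows the same pattern; only the final command word changes (a sketch built from the sequence above):

asm("addc.u32 %0, %1, %2;" : "=r"(t_res) : "r"(InterChiplet::SYSCALL_READ_FROM_GPU), "r"(InterChiplet::CUDA_SYSCALL_CMD));
*__res += t_res;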

Handle Read/Write System Calls

System calls are handled in file $SIMULATOR_ROOT/gpgpu-sim/src/cuda-sim/instructions.cc. The flow chart is as follows:

flowchart TD

subgraph Write Syscall
A1[Read data from GPU memory]
B1[Issue PIPE command]
C1[Wait for SYNC command]
D1[Open PIPE]
E1[Write data to PIPE]
F1[Get current execution cycle]
G1[Send WRITE command]
H1[Wait for SYNC command]
I1[Lazily adjust the clock cycle]
end

A1-->B1-->C1-->D1-->E1-->F1-->G1-->H1-->I1
C1-->C1
H1-->H1

subgraph Read Syscall
B2[Issue PIPE command]
C2[Wait for SYNC command]
D2[Open PIPE]
E2[Read data from PIPE]
A2[Write data to GPU memory]
F2[Get current execution cycle]
G2[Send READ command]
H2[Wait for SYNC command]
I2[Lazily adjust the clock cycle]
end

B2-->C2-->D2-->E2-->A2-->F2-->G2-->H2-->I2
C2-->C2
H2-->H2

The data pointer provided by CUDA lies within the CUDA address space rather than the host address space. Hence, the data cannot be read or written directly from the memory location provided by CUDA APIs. Instead, the value must be read from or written to CUDA memory through the interface provided by GPGPUSim.

// Decode the CUDA address into the corresponding memory_space object.
memory_space_t space;
space.set_type(global_space);
memory_space *mem = NULL;
addr_t addr = data_ptr;
decode_space(space, thread, dst, mem, addr);
// Write the received data into CUDA global memory.
mem->write(addr, nbytes, interdata, thread, pI);

GPGPUSim is a cycle-driven simulator whose cycle loop can be found in files $SIMULATOR_ROOT/gpgpu-sim/src/gpgpu-sim/gpu-sim.h and $SIMULATOR_ROOT/gpgpu-sim/src/gpgpu-sim/gpu-sim.cc. The variable gpgpu_sim::gpu_sim_cycle maintains the current execution cycle. It cannot be modified directly while system calls are handled; instead, the target execution cycle is recorded and written to gpgpu_sim::gpu_sim_cycle after the simulator has handled all events in the current cycle. The following variables and functions are added to gpgpu_sim:

gpu-sim.h

class gpgpu_sim : public gpgpu_t {
...
  // Directly set GPU cycle.
  void chiplet_direct_set_cycle(long long int end_time);
...
}

gpu-sim.cc

// Directly set GPU cycle.
bool g_chiplet_directly_set_cycle = false;
unsigned long long g_chiplet_directly_set_cycle_val = 0;

...

void gpgpu_sim::cycle() {
...
    gpu_sim_cycle++;
    // Directly set GPU cycle.
    if (g_chiplet_directly_set_cycle)
    {
      std::cout << "Directly set cycle to " << g_chiplet_directly_set_cycle_val << std::endl;
      gpu_sim_cycle = g_chiplet_directly_set_cycle_val;
      g_chiplet_directly_set_cycle = false;
    }
...
}


// Directly set cycle
void gpgpu_sim::chiplet_direct_set_cycle(long long int end_time)
{
  g_chiplet_directly_set_cycle_val = end_time;
  g_chiplet_directly_set_cycle = true;
}

By calling gpgpu_sim::chiplet_direct_set_cycle, the target execution cycle end_time is recorded. The source code that adjusts the execution cycle in GPGPUSim is shown below:

thread->get_gpu()->chiplet_direct_set_cycle(timeEnd - thread->get_gpu()->gpu_tot_sim_cycle);

GPGPUSim uses two variables to record execution cycles: gpu_sim_cycle and gpu_tot_sim_cycle. Their sum represents the actual consumed cycles, which should be replaced by the cycle value in the SYNC command. Hence, the value passed to chiplet_direct_set_cycle is offset by gpu_tot_sim_cycle, as shown in the call above.
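
The arithmetic behind this offset:

// reported cycle  = gpu_sim_cycle + gpu_tot_sim_cycle
// target (SYNC)   = timeEnd
// therefore set     gpu_sim_cycle = timeEnd - gpu_tot_sim_cycle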

Issue CYCLE command

Tasks on the GPU are triggered by CPUs in the system: CPUs prepare the data required by the tasks and accept the generated results. The execution cycle of the CPUs therefore reflects the execution cycle of the GPUs through the synchronization performed by data transmission, and GPGPUSim does not issue CYCLE commands.

Utility APIs

$SIMULATOR_ROOT/interchiplet/includes/pipe_comm.h provides utility APIs to handle the synchronization protocol.

The following APIs exist in pipe_comm to issue commands:

  • InterChiplet::SyncProtocol::sendCycleCmd sends CYCLE command.
  • InterChiplet::SyncProtocol::pipeSync sends a PIPE command and waits for a SYNC command. The function returns the cycle specified by the SYNC command.
  • InterChiplet::SyncProtocol::writeSync sends a WRITE command and waits for a SYNC command. The function returns the cycle specified by the SYNC command.
  • InterChiplet::SyncProtocol::readSync sends a READ command and waits for a SYNC command. The function returns the cycle specified by the SYNC command.

To reduce the overhead of opening, closing, reading, and writing Pipes, pipe_comm.h abstracts these operations into the interface class InterChiplet::PipeComm. PipeComm holds a list of opened pipes and one data buffer for each opened pipe, so that each pipe is opened only once during one simulator process. Meanwhile, many small reads are coalesced into fewer reads of the data buffer's size.

The usage of InterChiplet::PipeComm is as follows:

InterChiplet::PipeComm global_pipe_comm;    // It is suggested to declare one global PipeComm instance.

char * fileName = InterChiplet::SyncProtocol::pipeName(src_x, src_y, dst_x, dst_y);    // Use the API to get the file name of the pipe.
global_pipe_comm.write_data(fileName, interdata, nbytes);    // Write data to the Pipe.
global_pipe_comm.read_data(fileName, interdata, nbytes);     // Read data from the Pipe.

Generate Patch and Apply Patch

Although imported simulators need minor changes to support the synchronization protocol, it is still suggested that third-party simulators be imported as git submodules. This keeps the repository clean and respects the open-source spirit. The minor modifications should be stored in a dedicated diff file for each simulator.

patch.sh creates a diff patch for every simulator. It also copies the modified files to .cache. Copying files from .cache back into a simulator's directory is forbidden because the copy operation cannot be reverted; however, the files in .cache can be used as a reference when recovering from git conflicts.

apply_patch.sh applies the diff patches to all simulators. Run git reset in each simulator before apply_patch.sh to avoid git conflicts.

When adding a new simulator, add its path to patch.sh and apply_patch.sh.
