wangeddie67 edited this page May 6, 2024 · 12 revisions

Welcome to the Chiplet_Heterogeneous_newVersion wiki! You can find an overview of the Chiplet Simulator on this page.

Details can be found in the pages below:

Overview

Chiplet Simulator is an open-source project that addresses the challenge of simulating a huge system built from heterogeneous chiplets.

A cycle-accurate model is not feasible for a system of this scale because of the limits of both host-machine performance and memory space. A parallel cycle-accurate model is not attractive either: the speedup stops growing with further parallelism because of the frequent synchronization operations across all parallel threads or processes.

Traditionally, integrating a new kind of IP or chiplet means a huge effort in coding, debugging, and correlation. On the other hand, there are excellent open-source models of CPUs, GPUs, NoCs, DRAMs, and so on that have been accepted by a great deal of research. It is not necessary to redo all of this work from scratch.

To simulate a huge chiplet system, simulation speed and development speed are far more critical than whether the simulation result is cycle-coherent with a real system. Both academia and industry are seeking a balanced solution. Chiplet Simulator provides a loosely coupled parallel architecture to address the challenges of faster simulation, faster development, and faster reconfiguration.

To speed up parallel simulation, the frequency of synchronization between simulator processes is reduced to the frequency of software synchronization in the benchmark. In other words, simulator processes only synchronize when the benchmark sends/reads data or requires shared resources. Meanwhile, the scope of each synchronization is limited to the processes involved in that software operation.

As an extreme example, if two simulation processes never communicate with each other and never require shared resources, these two simulation processes can run in parallel all the time.

For another example, when a CPU wants to send data to a GPU, the software on the CPU writes data to a range of memory and then writes another memory location to signal the GPU that the data is ready. The software on the GPU polls that memory location until it sees the signal, and only then does the GPU read the data. In Chiplet Simulator, this sequence of operations is abstracted as one kind of transaction (a data transaction). Chiplet Simulator focuses on the end cycle of a transaction rather than the duration of each operation within it. In this way, the number of essential synchronization operations drops.
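The idea can be sketched as follows. This is a minimal illustration, not the simulator's real API: the class and function names (`DataTransaction`, `receiver_resume_cycle`) are hypothetical, and the write/flag/poll/read sequence is collapsed into a single transaction characterized only by its end cycle.

```python
# Hypothetical sketch of the "data transaction" abstraction: the CPU-side
# write/signal sequence and the GPU-side poll/read sequence are collapsed
# into one transaction; only its end cycle matters for synchronization.

from dataclasses import dataclass

@dataclass
class DataTransaction:
    src: str          # sending component, e.g. "cpu0"
    dst: str          # receiving component, e.g. "gpu0"
    start_cycle: int  # cycle at which the sender issues the data
    latency: int      # modeled transfer latency

    @property
    def end_cycle(self) -> int:
        # The receiver may not consume the data before this cycle.
        return self.start_cycle + self.latency

def receiver_resume_cycle(txn: DataTransaction, poll_cycle: int) -> int:
    # The receiver resumes at whichever comes later: its own polling
    # point or the cycle at which the transaction completes.
    return max(poll_cycle, txn.end_cycle)

txn = DataTransaction("cpu0", "gpu0", start_cycle=100, latency=40)
print(receiver_resume_cycle(txn, poll_cycle=90))   # polled early: 140
print(receiver_resume_cycle(txn, poll_cycle=200))  # data ready: 200
```

Because only the end cycle is exchanged, the two processes need a single synchronization point per transaction instead of one per memory operation.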

To speed up development and reconfiguration, Chiplet Simulator integrates different kinds of open-source simulators to avoid the heavy cost of developing a model for each IP. Although minor modifications to third-party simulators are still necessary to support the synchronization protocol, the effort is negligible compared with creating one's own models. Through simple configuration files, Chiplet Simulator can run one benchmark on different platforms with different kinds of IPs or chiplets.

Modeling Target Architecture

Chiplet Simulator aims to model a system consisting of heterogeneous IPs and chiplets.

The target architecture divides the entire system into Processing Components (PComp) and Shared Components (SComp). In general, PComps are master devices that can generate requests to shared components. Optionally, but usually, a PComp executes some kind of ISA and contains its own memory system. CPUs (or CPU clusters), GPUs, and NPUs are typical PComps. SComps are shared by PComps and respond to their requests; they include NoCs, DRAM controllers, and some kinds of accelerators.

An example target architecture composed of chiplets is shown below:

Another example target architecture composed of IPs is shown below:

In terms of execution time, PComps control the tasks and flows in the system and therefore play the major role in determining simulation time. SComps affect performance through the time they take to respond to requests from PComps. Take a DRAM controller as an example: CPUs/GPUs send read/write requests to the DRAM controller. If the DRAM is slow to access external memory, the CPUs/GPUs need longer to wait for responses, which usually means a longer benchmark execution time. Hence, the simulation result is reasonable as long as the time cost of the SComps is reasonably reflected in the PComps.

Simulator Architecture

As shown above, the target system combines different kinds of components, many of which are already described by well-established simulators. Each component in the target system corresponds to one simulator process, and these processes execute in parallel to increase simulation speed.

Simulation cycle iteration

The SComp simulators take traces from the PComp simulators as input, and the PComp simulators take delay information back from the SComp simulators. The accuracy of the simulation is controlled by the quality of both the stimulus traces and the delay information. Hence, Chiplet Simulator defines an iteration flow so that the mutual impact between PComps and SComps can converge to a realistic value.

The iteration flow is shown in the figure below:

Each iteration is divided into two phases. Phase 1 simulates PComps and Phase 2 simulates SComps.

In Phase 1 of the first iteration, the PComps are simulated in parallel. An algorithm is applied to estimate the delay of requests from the PComps. Meanwhile, interchiplet receives protocol commands from the simulator processes and generates traces as stimulus for the SComps. The maximum execution cycle among all PComps is counted as the execution cycle of the benchmark in the first iteration.

In Phase 2 of the first iteration, the SComps are also simulated in parallel, driven by the traces generated in Phase 1. This simulation produces the delay information of each request.

In Phase 1 of the second iteration, the PComps are simulated again as in the first iteration, this time taking the delay information generated in Phase 2 of the previous iteration.

At the end of Phase 1 of the second and later iterations, the execution cycle recorded in the current iteration is compared with the execution cycle recorded in the previous iteration. If the error ratio is below a specified threshold, the execution cycle is considered convergent and the simulation stops. Otherwise, simulation continues with Phase 2, just as in the previous iteration.

In case of non-convergence, the simulation flow stops after a specified number of iterations, which is called the timeout.
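The iteration flow above can be sketched as a loop with a convergence check and a timeout. This is an illustrative outline, not the simulator's real code: `simulate_pcomps` and `simulate_scomps` are hypothetical stand-ins for the parallel Phase 1 and Phase 2 runs, and the threshold/timeout values are examples.

```python
# Hypothetical sketch of the two-phase iteration flow with convergence
# and timeout checks. simulate_pcomps(delay_info) -> (exec_cycle, traces);
# simulate_scomps(traces) -> delay_info. Names are illustrative.

def run_iterations(simulate_pcomps, simulate_scomps,
                   threshold=0.05, timeout=10):
    delay_info = None   # no SComp feedback before the first iteration
    prev_cycle = None
    for iteration in range(1, timeout + 1):
        # Phase 1: simulate PComps; the slowest PComp sets the benchmark
        # execution cycle, and traces are collected for the SComps.
        exec_cycle, traces = simulate_pcomps(delay_info)
        if prev_cycle is not None:
            error = abs(exec_cycle - prev_cycle) / prev_cycle
            if error < threshold:
                return exec_cycle, iteration   # converged: stop here
        prev_cycle = exec_cycle
        # Phase 2: replay the traces on SComps to refresh request delays.
        delay_info = simulate_scomps(traces)
    return prev_cycle, timeout                 # stopped by timeout
```

With a fake PComp run that reports 1000, 1100, then 1110 cycles and a 5% threshold, the loop stops at the third iteration, since |1110 − 1100| / 1100 is below the threshold.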

Multiple-process Multi-thread Structure

Chiplet Simulator is multi-process, multi-threaded software. The software architecture is shown below:

As the main process, interchiplet controls the flow of the entire simulation. It creates as many threads as there are simulator processes. Each thread corresponds to one simulator process and handles the inter-process communication and synchronization.

To avoid file conflicts between simulator processes, an individual folder is provided for each process in each iteration, named proc_r{iteration}_p{phase}_t{thread}. For example, proc_r2_p1_t4 is the working directory of the 4th thread in Phase 1 of the second iteration. These folders are referred to as sub-directories.
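The naming scheme can be expressed as a one-line format function; the helper name below is illustrative, but the format string mirrors the proc_r{iteration}_p{phase}_t{thread} pattern from the text.

```python
# Minimal sketch of the per-process sub-directory naming scheme.

def subdir_name(iteration: int, phase: int, thread: int) -> str:
    # Matches the pattern proc_r{iteration}_p{phase}_t{thread}.
    return f"proc_r{iteration}_p{phase}_t{thread}"

print(subdir_name(2, 1, 4))  # proc_r2_p1_t4
```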

Each thread of interchiplet connects to the standard input and output of one simulation process through a pipe. The content of standard output and standard error is redirected to a log file.

Chiplet Simulator provides a synchronization protocol and a set of APIs to handle inter-process communication. A minor modification is necessary to apply this protocol; details can be found in the related pages.

The threads in interchiplet are responsible for handling this protocol. Each thread gets a protocol command from its simulation process through standard output and issues another protocol command as the response through standard input.

interchiplet maintains one mutex and one shared data structure among these threads. The threads run in parallel as long as they only handle plain output from the simulation processes, so the redirection effort is completely hidden. When a thread receives a protocol command, it must lock the mutex before further handling, so that all protocol commands are handled atomically.
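This thread structure can be sketched as below. It is an illustrative model, not the real interchiplet code: the `CMD` prefix, the handler names, and the shared dictionary are assumptions, and real output would arrive over pipes rather than from in-memory lists. The point it demonstrates is that plain log lines need no lock, while protocol commands are serialized through the single shared mutex.

```python
# Hypothetical sketch of interchiplet's per-process threads: each thread
# drains its simulator's output in parallel, but holds a shared mutex
# while handling a protocol command so commands are processed atomically.

import threading

state_lock = threading.Lock()   # the single mutex shared by all threads
shared_state = {"pending": []}  # the shared data structure

def handle_output_line(line: str) -> None:
    if line.startswith("CMD"):
        # Protocol command: serialize handling through the mutex.
        with state_lock:
            shared_state["pending"].append(line)
    # Plain log output needs no lock, so redirection stays fully parallel.

def process_thread(lines):
    # Stand-in for reading one simulator process's standard output.
    for line in lines:
        handle_output_line(line)

threads = [
    threading.Thread(target=process_thread,
                     args=(["log: boot", "CMD send cpu0 gpu0"],)),
    threading.Thread(target=process_thread,
                     args=(["CMD read gpu0", "log: done"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(shared_state["pending"]))
# ['CMD read gpu0', 'CMD send cpu0 gpu0']
```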
