QEMU+gem5 Co-simulation: Memory Sharing Architecture

1. Background

In the QEMU+gem5 MI300X co-simulation, the GPU device model (gem5) and the host system (QEMU/KVM) run as two separate processes. The GPU needs to access two types of memory:

VRAM (local video memory): GPU-private, stores textures, buffers, GART page tables, etc.
GTT (Graphics Translation Table / System Memory): Host physical memory regions mapped by the GPU, used for ring buffers, fences, IH cookies, DMA buffers, etc.

Both types of memory must be shared between QEMU and gem5 — otherwise gem5 cannot read commands written by the driver, and QEMU cannot see results written back by the GPU.

Key Takeaway

Both VRAM and Guest RAM (where GTT pages reside) are already shared via shared memory for bidirectional visibility. The GART page table itself lives in VRAM and is also shared. gem5 reads GART PTEs directly from shared VRAM, then accesses Guest RAM directly through the Ruby memory system's shared backstore for DMA operations.

2. Overall Architecture

+----------------------------+                    +-----------------------------+
|  QEMU  (Q35 + KVM)         |                    |  gem5  (Docker)             |
|                            |                    |                             |
|  Guest Linux               |                    |  MI300X GPU Model           |
|  amdgpu driver             |                    |    Shader / CU / SDMA       |
|                            |                    |    PM4 / IH / Ruby caches   |
|  +--------+  +---------+   |  vfio-user (Unix)  |  +------------+ +--------+  |
|  | BAR0   |  | BAR5    |<---(MMIO/CFG/Doorbell)--->|MI300XVfio  | |GPU core|  |
|  | (VRAM) |  | (MMIO)  |   |                    |  |User bridge | |        |  |
|  +---+----+  +---------+   |                    |  +-----+------+ +--------+  |
|      |                     |                    |        |                    |
+------+---------------------+                    +--------+--------------------+
       |                                                   |
       v                                                   v
  /dev/shm/mi300x-vram (16 GiB)                      mmap same file
  (VRAM: GPU data + GART page tables)              (vramShmemPtr)
       |                                                   |
       v                                                   v
  /dev/shm/cosim-guest-ram (8 GiB)                    mmap same file
  (Guest RAM: ring buffers, fences,                (system->getPhysMem())
   GTT pages, kernel/user data)

2.1 Three Sharing Channels

Channel	File/Socket	Size	Purpose	Access Method
VRAM Shared Memory	`/dev/shm/mi300x-vram`	16 GiB	GPU VRAM + GART page tables	mmap (zero-copy)
Guest RAM Shared Memory	`/dev/shm/cosim-guest-ram`	8 GiB	Host physical memory (GTT pages)	QEMU: mmap; gem5: Ruby memory system direct access to shared backstore
vfio-user Socket	`/tmp/gem5-mi300x.sock`	—	MMIO/config space/doorbell via vfio-user message passing; DMA via `vfu_dma_transfer()` or shared memory direct access; interrupts via irq_fd (eventfd -> KVM)	vfio-user protocol (single connection)

3. VRAM Sharing (BAR0)

3.1 Initialization Flow

QEMU side (vfio-user backend):

QEMU uses the built-in vfio-user-pci device to connect to gem5's vfio-user server. BAR0 is exposed to QEMU through the vfio-user DMA region mapping mechanism — QEMU no longer directly opens the VRAM shared memory file, but instead obtains the BAR mapping through the vfio-user protocol.

gem5 side (mi300x_vfio_user.cc:setupVramShm):

shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666);
ftruncate(shmemFd, vramSize);
shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0);

// Key: pass the shared pointer to the GART translator
gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr;
gpuDevice->getVM().vramShmemSize = vramSize;

3.2 VRAM Content Layout

Offset 0x000000000  +------------------------------+
                    |  GPU Data Area               |
                    |  - hipMalloc allocations     |
                    |  - Kernel args, textures     |
                    |  - Driver internal allocs    |
                    |                              |
                    |        ...                   |
                    |                              |
Offset ~0x3EE600000 +------------------------------+
(ptBase)            |  GART Page Table (PTEs)      |
                    |  8 bytes per PTE             |
                    |  Maps GPU VA -> phys addr    |
Offset 0x400000000  +------------------------------+
(16 GiB)

3.3 Access Patterns

Scenario	Writer	Reader	Path
GPU buffer allocation	Driver (via BAR0 write)	gem5 (via vramShmemPtr)	Shared memory direct access
GART PTE writes	Driver (via BAR0 write)	gem5 GART translator	memcpy from vramShmemPtr
IP Discovery table	gem5 initialization	Driver (via BAR0 read)	Shared memory direct access

Zero-copy: Since QEMU's BAR0 and gem5's vramShmemPtr both mmap the same /dev/shm file, data written by the driver to BAR0 is immediately visible to gem5 with no socket communication required.

4. Guest RAM Sharing (GTT Pages)

4.1 What GTT Really Is

In AMD GPUs, GTT = GART = Graphics Address Remapping Table. It is a single-level page table (VMID 0) that maps GPU virtual addresses to host physical addresses. The host physical memory pages being mapped are the so-called "GTT pages."

Typical GTT page contents:

Data Structure	Description	Access Direction
PM4 Ring Buffer	GFX command queue	Driver writes -> GPU reads
SDMA Ring Buffer	DMA command queue	Driver writes -> GPU reads
IH Ring Buffer	Interrupt handler queue	GPU writes -> Driver reads
Fence values	Completion signals	GPU writes -> Driver reads
MQD (Map Queue Descriptor)	Queue descriptors	Driver writes -> GPU reads
User DMA buffers	hipMemcpy src/dst	Bidirectional

4.2 Guest RAM Sharing Initialization

QEMU side (command-line arguments):

-object memory-backend-file,id=mem0,size=8G,\
        mem-path=/dev/shm/cosim-guest-ram,share=on
-numa node,memdev=mem0

share=on ensures the file mapping uses MAP_SHARED, making QEMU's modifications to guest memory visible to other processes.

gem5 side (mi300_cosim.py):

system.shared_backstore = args.shmem_host_path     # "/cosim-guest-ram"
system.auto_unlink_shared_backstore = True
system.memories[0].shared_backstore = args.shmem_host_path

gem5's PhysicalMemory uses the same POSIX shared memory file as its backing store, achieving memory sharing with QEMU. MI300XVfioUser also sets gpuDevice->getVM().vramShmemPtr to enable the GART translator to correctly access shared VRAM.

4.3 Why GTT Needs No Extra Sharing Mechanism

GTT pages reside in Guest RAM. Guest RAM is already shared between QEMU and gem5 via /dev/shm/cosim-guest-ram. Therefore:

Driver writes to ring buffer -> writes to Guest RAM -> /dev/shm/cosim-guest-ram -> gem5 can read
gem5 writes fence -> Ruby memory controller writes to Guest RAM -> /dev/shm/cosim-guest-ram -> driver can read
Physical addresses in GART PTEs -> offsets within Guest RAM -> accessible by both sides

vfio-user backend: Both VRAM and Guest RAM are accessed via zero-copy mmap. gem5's SDMA/PM4 DMA operations go through the Ruby memory system which directly accesses the shared backstore memory, with no socket relay needed.

5. GART Translation Flow

5.1 Driver Writes GART PTEs

amdgpu driver (guest)
  |
  +- amdgpu_gart_map(): compute PTE value
  |   pte = (phys_addr >> 12) << 12 | flags
  |
  +- write to BAR0 + ptBase + (gpu_page * 8)
  |   |
  |   +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram
  |       +- data immediately appears in shared memory
  |
  +- TLB invalidate: write VM_INVALIDATE_ENG17 register
      +- MMIO -> vfio-user -> gem5 -> invalidateTLBs()

5.2 gem5 Reads GART PTEs

// amdgpu_vm.cc: GARTTranslationGen::translate()

// Step 1: compute PTE offset within VRAM
gart_addr = bits(transformedAddr, 63, 12);  // GPU VA page number
pte_table_offset = gart_addr - (ptStart * 8);

// Step 2: read PTE directly from shared VRAM (zero-copy)
pte_vram_offset = gartBase() + pte_table_offset;
memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(uint64_t));

// Step 3: extract physical address
if (pte != 0) {
    paddr = (bits(pte, 47, 12) << 12) | bits(vaddr, 11, 0);
    //  paddr points to Guest RAM (GTT page) or VRAM
}

5.3 PTE Format

63    52 51  48 47              12 11  6 5  2 1   0
+-------+------+-----------------+------+----+---+---+
| Flags | BlkF | Physical Page   | Rsvd |Frag|Sys| V |
|       |      | (PA >> 12)      |      |    |   |   |
+-------+------+-----------------+------+----+---+---+

Bit 0: Valid     -- PTE is valid
Bit 1: System   -- 1=system memory (Guest RAM), 0=local VRAM
Bit 47:12       -- physical page number

5.4 Address Classification

After GART translation yields a physical address, gem5 determines where it points:

Physical address paddr
  |
  +- Within fbBase ~ fbTop range?
  |   +- YES -> VRAM address
  |       +- Access directly via vramShmemPtr (zero-copy)
  |
  +- Within sysAddrL ~ sysAddrH range?
  |   +- YES -> Guest RAM address (GTT page)
  |       +- Access via Ruby memory system (shared memory direct access)
  |
  +- Neither?
      +- Sink (paddr=0, safely discarded)

6. DMA Flow

6.1 vfio-user Backend: Shared Memory Direct Access

With the vfio-user backend, gem5 accesses Guest RAM directly through the Ruby memory system's shared backstore (/dev/shm/cosim-guest-ram), with no socket-based DMA operations needed.

gem5 GPU model (PM4/SDMA/IH)
  |
  |  Needs to read ring buffer commands / write fence values
  |
  v  Ruby memory system request
  |
  +- Address translated by GART -> Guest physical address
  |
  +- Ruby memory controller accesses PhysicalMemory
  |   |
  |   +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED)
  |       +- read/write directly hits shared memory
  |       +- QEMU sees changes immediately (same mmap file)
  |
  +- Done (no socket round-trip needed)

Key advantages:

Zero-copy: DMA reads and writes operate directly on shared memory with no serialization/deserialization
Low latency: Eliminates the socket request-response round-trip overhead
Simplified architecture: No custom DMA protocol needed; Ruby's memory system natively supports shared backstores

Interrupts are delivered via vfio-user's irq_fd mechanism (eventfd -> KVM), with no custom interrupt messages needed.

6.2 Legacy Backend: Socket DMA Protocol

The following describes the legacy custom cosim socket backend (MI300XGem5Cosim), retained for reference.

6.2.1 gem5 Reads from Guest RAM (ring buffers / fences)

gem5 GPU model (PM4/SDMA/IH)
  |
  |  Needs to read ring buffer commands from Guest RAM
  |
  v cosimBridge->sendDmaRead(guestPhysAddr, length)
  |
  +- Construct DmaRead message (32-byte header)
  |   { type=DmaRead, addr=guestPhysAddr, data=length }
  |
  +- sendAll(eventFd, &msg, 32)        -->  QEMU event thread
  |                                           |
  |                                           +- pci_dma_read(addr, buf, len)
  |                                           |  (reads from /dev/shm/cosim-guest-ram)
  |                                           |
  |                                           +- sendAll(eventFd, &resp, 32)
  |  <------------------------------------------+- sendAll(eventFd, data, len)
  |
  +- memcpy(dest, recvBuf, length)     // data arrives at gem5

6.2.2 gem5 Writes to Guest RAM (fences / IH cookies)

gem5 GPU model
  |
  |  Needs to write fence value to Guest RAM
  |
  v cosimBridge->sendDmaWrite(guestPhysAddr, length, data)
  |
  +- Construct DmaWrite message + data payload
  |   { type=DmaWrite, addr=guestPhysAddr, data=length, size=length }
  |
  +- sendAll(eventFd, &msg, 32)        -->  QEMU event thread
  +- sendAll(eventFd, data, length)    -->    |
  |                                           +- pci_dma_write(addr, buf, len)
  |                                           |  (writes to /dev/shm/cosim-guest-ram)
  |                                           |
  +- Done (DMA writes don't wait for response)  +- Driver can see data immediately

6.3 vfio-user vs Legacy Backend Comparison

Dimension	vfio-user Backend	Legacy Socket Backend
Guest RAM DMA	Ruby memory system direct access to shared backstore	Socket request-response protocol
VRAM access	mmap zero-copy (same)	mmap zero-copy (same)
Interrupts	irq_fd (eventfd -> KVM)	Socket messages
MMIO	vfio-user message passing	Custom socket protocol
QEMU-side device	Built-in `vfio-user-pci`	Custom `mi300x_gem5.c`
Address translation	gem5-internal GART translation	QEMU-side `pci_dma_read/write`

The reasons the legacy backend routed Guest RAM DMA through the socket (address translation, event-driven simulation, IOMMU compatibility, etc.) no longer apply with vfio-user: gem5's Ruby memory controllers directly access the shared backstore memory, and GART address translation is performed internally within gem5.

7. Sink Mechanism

7.1 Problem Scenario

In co-simulation mode, some GART PTEs may be zero (uninitialized) or point to VRAM-internal addresses. If gem5 cannot translate these addresses, it throws a GenericPageTableFault, causing a DMA retry loop that hangs the simulation.

7.2 Solution

// amdgpu_vm.cc: GARTTranslationGen::translate()

if (pte == 0) {
    if (origAddr < vramShmemSize && vramShmemPtr) {
        // VRAM address -> map to sink (paddr=0)
        range.paddr = 0;
        warn_once("GART: VRAM address mapped to sink -- "
                  "VRAM write-backs are no-ops in cosim");
    } else if (vramShmemPtr) {
        // Unmapped GART page -> sink
        range.paddr = 0;
        warn_once("GART cosim: unmapped page -> sink");
    }
}

Sink semantics:

paddr=0 is always a valid physical address in gem5 (system RAM base)
DMA reads return zeros
DMA writes are silently discarded
Prevents the fault -> retry deadloop

8. Complete Data Flow Example

Using a HIP kernel dispatch to illustrate the full memory interaction:

1. hipMalloc(&d_a, N*sizeof(int))
   Driver -> allocates buffer in VRAM
   Writes GART PTEs to shared VRAM (BAR0)

2. hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice)
   Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer)
   Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5
   gem5 -> reads ring buffer (Guest RAM via shared memory)
   gem5 -> parses SDMA command -> GART translates source address -> Guest RAM
   gem5 -> reads source data (Guest RAM via shared memory)
   gem5 -> writes to VRAM destination (shared memory direct write)

3. kernel<<<1, N>>>(d_a, d_b, d_c, N)
   Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer)
   Driver -> writes Doorbell -> gem5
   gem5 -> reads PM4 command (Guest RAM via shared memory)
   gem5 -> launches shader execution
   gem5 -> shader reads/writes VRAM (shared memory direct access)
   gem5 -> writes fence on completion (Guest RAM via Ruby memory write)
   gem5 -> sends MSI-X interrupt (irq_fd -> KVM)

4. hipDeviceSynchronize()
   Driver -> polls fence value (until Guest RAM value matches)
   +- fence written by gem5 via Ruby memory write to shared backstore

9. Known Limitations

9.1 DMA Buffer Size (Legacy Backend)

This limitation only applies to the legacy socket backend. The vfio-user backend accesses shared memory directly and has no such limit.

Maximum single DMA transfer is 4 MiB (COSIM_DMA_BUF_SIZE). Transfers exceeding this size must be chunked. In practice, the driver typically submits page-sized transfers, so this limit is rarely hit.

9.2 User-space Page Tables (VMID > 0)

VMID 0 (kernel mode) GART page tables are fully visible via shared VRAM. However, VMID > 0 (user mode) multi-level page tables are walked by VegaISA::Walker, which uses gem5's internal TLB/page walker rather than reading directly from shared memory.

The practical impact is limited: after the driver writes page tables, it sends TLB invalidate MMIOs. gem5 flushes its TLB upon receiving these, and subsequent walker traversals read from the correct physical addresses (which point to shared VRAM or Guest RAM).

9.3 VRAM Write-back Semantics

Some GART addresses in gem5 point back to VRAM itself (VRAM-to-VRAM DMA). These addresses are routed to the sink (paddr=0), and writes are silently discarded. For pure compute workloads, this does not affect correctness.

10. File Reference

File	Key Function/Region	Role
`gem5/src/dev/amdgpu/amdgpu_vm.cc:396-557`	`GARTTranslationGen::translate()`	Core GART translation logic
`gem5/src/dev/amdgpu/amdgpu_vm.hh`	`AMDGPUSysVMContext`, `vramShmemPtr`	GART data structures
`gem5/src/dev/amdgpu/mi300x_vfio_user.cc`	`setupVramShm()`	VRAM shared memory initialization (vfio-user backend)
`gem5/src/dev/amdgpu/mi300x_vfio_user.hh`	`MI300XVfioUser`	vfio-user server-side bridge
`gem5/src/dev/amdgpu/mi300x_gem5_cosim.cc`	`setupSharedMemory()`, `sendDmaRead/Write()`	Legacy socket backend (VRAM init + DMA)
`gem5/configs/example/gpufs/mi300_cosim.py`	`shared_backstore` config, `--cosim-backend`	Guest RAM sharing setup + backend selection
`gem5/src/dev/amdgpu/MI300XVfioUser.py`	SimObject definition	vfio-user backend Python binding

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QEMU+gem5 Co-simulation: Memory Sharing Architecture

1. Background

Key Takeaway

2. Overall Architecture

2.1 Three Sharing Channels

3. VRAM Sharing (BAR0)

3.1 Initialization Flow

3.2 VRAM Content Layout

3.3 Access Patterns

4. Guest RAM Sharing (GTT Pages)

4.1 What GTT Really Is

4.2 Guest RAM Sharing Initialization

4.3 Why GTT Needs No Extra Sharing Mechanism

5. GART Translation Flow

5.1 Driver Writes GART PTEs

5.2 gem5 Reads GART PTEs

5.3 PTE Format

5.4 Address Classification

6. DMA Flow

6.1 vfio-user Backend: Shared Memory Direct Access

6.2 Legacy Backend: Socket DMA Protocol

6.2.1 gem5 Reads from Guest RAM (ring buffers / fences)

6.2.2 gem5 Writes to Guest RAM (fences / IH cookies)

6.3 vfio-user vs Legacy Backend Comparison

7. Sink Mechanism

7.1 Problem Scenario

7.2 Solution

8. Complete Data Flow Example

9. Known Limitations

9.1 DMA Buffer Size (Legacy Backend)

9.2 User-space Page Tables (VMID > 0)

9.3 VRAM Write-back Semantics

10. File Reference

FilesExpand file tree

cosim-memory-architecture.md

Latest commit

History

cosim-memory-architecture.md

File metadata and controls

QEMU+gem5 Co-simulation: Memory Sharing Architecture

1. Background

Key Takeaway

2. Overall Architecture

2.1 Three Sharing Channels

3. VRAM Sharing (BAR0)

3.1 Initialization Flow

3.2 VRAM Content Layout

3.3 Access Patterns

4. Guest RAM Sharing (GTT Pages)

4.1 What GTT Really Is

4.2 Guest RAM Sharing Initialization

4.3 Why GTT Needs No Extra Sharing Mechanism

5. GART Translation Flow

5.1 Driver Writes GART PTEs

5.2 gem5 Reads GART PTEs

5.3 PTE Format

5.4 Address Classification

6. DMA Flow

6.1 vfio-user Backend: Shared Memory Direct Access

6.2 Legacy Backend: Socket DMA Protocol

6.2.1 gem5 Reads from Guest RAM (ring buffers / fences)

6.2.2 gem5 Writes to Guest RAM (fences / IH cookies)

6.3 vfio-user vs Legacy Backend Comparison

7. Sink Mechanism

7.1 Problem Scenario

7.2 Solution

8. Complete Data Flow Example

9. Known Limitations

9.1 DMA Buffer Size (Legacy Backend)

9.2 User-space Page Tables (VMID > 0)

9.3 VRAM Write-back Semantics

10. File Reference