In the QEMU+gem5 MI300X co-simulation, the GPU device model (gem5) and the host system (QEMU/KVM) run as two separate processes. The GPU needs to access two types of memory:
- VRAM (local video memory): GPU-private, stores textures, buffers, GART page tables, etc.
- GTT (Graphics Translation Table / System Memory): Host physical memory regions mapped by the GPU, used for ring buffers, fences, IH cookies, DMA buffers, etc.
Both types of memory must be shared between QEMU and gem5 — otherwise gem5 cannot read commands written by the driver, and QEMU cannot see results written back by the GPU.
Both VRAM and Guest RAM (where GTT pages reside) are already shared via shared memory for bidirectional visibility. The GART page table itself lives in VRAM and is also shared. gem5 reads GART PTEs directly from shared VRAM, then accesses Guest RAM directly through the Ruby memory system's shared backstore for DMA operations.
+----------------------------+ +-----------------------------+
| QEMU (Q35 + KVM) | | gem5 (Docker) |
| | | |
| Guest Linux | | MI300X GPU Model |
| amdgpu driver | | Shader / CU / SDMA |
| | | PM4 / IH / Ruby caches |
| +--------+ +---------+ | vfio-user (Unix) | +------------+ +--------+ |
| | BAR0 | | BAR5 |<---(MMIO/CFG/Doorbell)--->|MI300XVfio | |GPU core| |
| | (VRAM) | | (MMIO) | | | |User bridge | | | |
| +---+----+ +---------+ | | +-----+------+ +--------+ |
| | | | | |
+------+---------------------+ +--------+--------------------+
| |
v v
/dev/shm/mi300x-vram (16 GiB) mmap same file
(VRAM: GPU data + GART page tables) (vramShmemPtr)
| |
v v
/dev/shm/cosim-guest-ram (8 GiB) mmap same file
(Guest RAM: ring buffers, fences, (system->getPhysMem())
GTT pages, kernel/user data)
| Channel | File/Socket | Size | Purpose | Access Method |
|---|---|---|---|---|
| VRAM Shared Memory | /dev/shm/mi300x-vram | 16 GiB | GPU VRAM + GART page tables | mmap (zero-copy) |
| Guest RAM Shared Memory | /dev/shm/cosim-guest-ram | 8 GiB | Host physical memory (GTT pages) | QEMU: mmap; gem5: Ruby memory system direct access to shared backstore |
| vfio-user Socket | /tmp/gem5-mi300x.sock | — | vfio-user protocol (single connection) | MMIO/config space/doorbell via vfio-user message passing; DMA via vfu_dma_transfer() or shared-memory direct access; interrupts via irq_fd (eventfd -> KVM) |
QEMU side (vfio-user backend):
QEMU uses the built-in vfio-user-pci device to connect to gem5's vfio-user server. BAR0 is exposed to QEMU through the vfio-user DMA region mapping mechanism — QEMU no longer directly opens the VRAM shared memory file, but instead obtains the BAR mapping through the vfio-user protocol.
gem5 side (mi300x_vfio_user.cc:setupVramShm):
shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666);
ftruncate(shmemFd, vramSize);
shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0);
// Key: pass the shared pointer to the GART translator
gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr;
gpuDevice->getVM().vramShmemSize = vramSize;

Offset 0x000000000 +------------------------------+
| GPU Data Area |
| - hipMalloc allocations |
| - Kernel args, textures |
| - Driver internal allocs |
| |
| ... |
| |
Offset ~0x3EE600000 +------------------------------+
(ptBase) | GART Page Table (PTEs) |
| 8 bytes per PTE |
| Maps GPU VA -> phys addr |
Offset 0x400000000 +------------------------------+
(16 GiB)
| Scenario | Writer | Reader | Path |
|---|---|---|---|
| GPU buffer allocation | Driver (via BAR0 write) | gem5 (via vramShmemPtr) | Shared memory direct access |
| GART PTE writes | Driver (via BAR0 write) | gem5 GART translator | memcpy from vramShmemPtr |
| IP Discovery table | gem5 initialization | Driver (via BAR0 read) | Shared memory direct access |
Zero-copy: Since QEMU's BAR0 and gem5's vramShmemPtr both mmap the same /dev/shm file, data written by the driver to BAR0 is immediately visible to gem5 with no socket communication required.
In AMD GPUs, GTT = GART = Graphics Address Remapping Table. It is a single-level page table (VMID 0) that maps GPU virtual addresses to host physical addresses. The host physical memory pages being mapped are the so-called "GTT pages."
Typical GTT page contents:
| Data Structure | Description | Access Direction |
|---|---|---|
| PM4 Ring Buffer | GFX command queue | Driver writes -> GPU reads |
| SDMA Ring Buffer | DMA command queue | Driver writes -> GPU reads |
| IH Ring Buffer | Interrupt handler queue | GPU writes -> Driver reads |
| Fence values | Completion signals | GPU writes -> Driver reads |
| MQD (Map Queue Descriptor) | Queue descriptors | Driver writes -> GPU reads |
| User DMA buffers | hipMemcpy src/dst | Bidirectional |
QEMU side (command-line arguments):
-object memory-backend-file,id=mem0,size=8G,\
mem-path=/dev/shm/cosim-guest-ram,share=on
-numa node,memdev=mem0

share=on ensures the file mapping uses MAP_SHARED, making QEMU's modifications to guest memory visible to other processes.
gem5 side (mi300_cosim.py):
system.shared_backstore = args.shmem_host_path # "/cosim-guest-ram"
system.auto_unlink_shared_backstore = True
system.memories[0].shared_backstore = args.shmem_host_path

gem5's PhysicalMemory uses the same POSIX shared memory file as its backing store, achieving memory sharing with QEMU. MI300XVfioUser also sets gpuDevice->getVM().vramShmemPtr to enable the GART translator to correctly access shared VRAM.
GTT pages reside in Guest RAM. Guest RAM is already shared between QEMU and gem5 via /dev/shm/cosim-guest-ram. Therefore:
- Driver writes to ring buffer -> writes to Guest RAM -> /dev/shm/cosim-guest-ram -> gem5 can read
- gem5 writes fence -> Ruby memory controller writes to Guest RAM -> /dev/shm/cosim-guest-ram -> driver can read
- Physical addresses in GART PTEs -> offsets within Guest RAM -> accessible by both sides
vfio-user backend: Both VRAM and Guest RAM are accessed via zero-copy mmap. gem5's SDMA/PM4 DMA operations go through the Ruby memory system which directly accesses the shared backstore memory, with no socket relay needed.
amdgpu driver (guest)
|
+- amdgpu_gart_map(): compute PTE value
| pte = (phys_addr >> 12) << 12 | flags
|
+- write to BAR0 + ptBase + (gpu_page * 8)
| |
| +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram
| +- data immediately appears in shared memory
|
+- TLB invalidate: write VM_INVALIDATE_ENG17 register
+- MMIO -> vfio-user -> gem5 -> invalidateTLBs()
// amdgpu_vm.cc: GARTTranslationGen::translate()
// Step 1: compute PTE offset within VRAM
gart_addr = bits(transformedAddr, 63, 12); // GPU VA page number
pte_table_offset = (gart_addr - ptStart) * 8; // entry index * 8 bytes per PTE
// Step 2: read PTE directly from shared VRAM (zero-copy)
pte_vram_offset = gartBase() + pte_table_offset;
memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(uint64_t));
// Step 3: extract physical address
if (pte != 0) {
paddr = (bits(pte, 47, 12) << 12) | bits(vaddr, 11, 0);
// paddr points to Guest RAM (GTT page) or VRAM
}

 63   52 51  48 47             12 11   6 5  2  1   0
+-------+------+-----------------+------+----+---+---+
| Flags | BlkF | Physical Page | Rsvd |Frag|Sys| V |
| | | (PA >> 12) | | | | |
+-------+------+-----------------+------+----+---+---+
Bit 0: Valid -- PTE is valid
Bit 1: System -- 1=system memory (Guest RAM), 0=local VRAM
Bit 47:12 -- physical page number
After GART translation yields a physical address, gem5 determines where it points:
Physical address paddr
|
+- Within fbBase ~ fbTop range?
| +- YES -> VRAM address
| +- Access directly via vramShmemPtr (zero-copy)
|
+- Within sysAddrL ~ sysAddrH range?
| +- YES -> Guest RAM address (GTT page)
| +- Access via Ruby memory system (shared memory direct access)
|
+- Neither?
+- Sink (paddr=0, safely discarded)
With the vfio-user backend, gem5 accesses Guest RAM directly through the Ruby memory system's shared backstore (/dev/shm/cosim-guest-ram), with no socket-based DMA operations needed.
gem5 GPU model (PM4/SDMA/IH)
|
| Needs to read ring buffer commands / write fence values
|
v Ruby memory system request
|
+- Address translated by GART -> Guest physical address
|
+- Ruby memory controller accesses PhysicalMemory
| |
| +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED)
| +- read/write directly hits shared memory
| +- QEMU sees changes immediately (same mmap file)
|
+- Done (no socket round-trip needed)
Key advantages:
- Zero-copy: DMA reads and writes operate directly on shared memory with no serialization/deserialization
- Low latency: Eliminates the socket request-response round-trip overhead
- Simplified architecture: No custom DMA protocol needed; Ruby's memory system natively supports shared backstores
Interrupts are delivered via vfio-user's irq_fd mechanism (eventfd -> KVM), with no custom interrupt messages needed.
The following describes the legacy custom cosim socket backend (MI300XGem5Cosim), retained for reference.
gem5 GPU model (PM4/SDMA/IH)
|
| Needs to read ring buffer commands from Guest RAM
|
v cosimBridge->sendDmaRead(guestPhysAddr, length)
|
+- Construct DmaRead message (32-byte header)
| { type=DmaRead, addr=guestPhysAddr, data=length }
|
+- sendAll(eventFd, &msg, 32) --> QEMU event thread
| |
| +- pci_dma_read(addr, buf, len)
| | (reads from /dev/shm/cosim-guest-ram)
| |
| +- sendAll(eventFd, &resp, 32)
| <------------------------------------------+- sendAll(eventFd, data, len)
|
+- memcpy(dest, recvBuf, length) // data arrives at gem5
gem5 GPU model
|
| Needs to write fence value to Guest RAM
|
v cosimBridge->sendDmaWrite(guestPhysAddr, length, data)
|
+- Construct DmaWrite message + data payload
| { type=DmaWrite, addr=guestPhysAddr, data=length, size=length }
|
+- sendAll(eventFd, &msg, 32) --> QEMU event thread
+- sendAll(eventFd, data, length) --> |
| +- pci_dma_write(addr, buf, len)
| | (writes to /dev/shm/cosim-guest-ram)
| |
+- Done (DMA writes don't wait for response) +- Driver can see data immediately
| Dimension | vfio-user Backend | Legacy Socket Backend |
|---|---|---|
| Guest RAM DMA | Ruby memory system direct access to shared backstore | Socket request-response protocol |
| VRAM access | mmap zero-copy (same) | mmap zero-copy (same) |
| Interrupts | irq_fd (eventfd -> KVM) | Socket messages |
| MMIO | vfio-user message passing | Custom socket protocol |
| QEMU-side device | Built-in vfio-user-pci | Custom mi300x_gem5.c |
| Address translation | gem5-internal GART translation | QEMU-side pci_dma_read/write |
The reasons the legacy backend routed Guest RAM DMA through the socket (address translation, event-driven simulation, IOMMU compatibility, etc.) no longer apply with vfio-user: gem5's Ruby memory controllers directly access the shared backstore memory, and GART address translation is performed internally within gem5.
In co-simulation mode, some GART PTEs may be zero (uninitialized) or point to VRAM-internal addresses. If gem5 cannot translate these addresses, it throws a GenericPageTableFault, causing a DMA retry loop that hangs the simulation.
// amdgpu_vm.cc: GARTTranslationGen::translate()
if (pte == 0) {
if (origAddr < vramShmemSize && vramShmemPtr) {
// VRAM address -> map to sink (paddr=0)
range.paddr = 0;
warn_once("GART: VRAM address mapped to sink -- "
"VRAM write-backs are no-ops in cosim");
} else if (vramShmemPtr) {
// Unmapped GART page -> sink
range.paddr = 0;
warn_once("GART cosim: unmapped page -> sink");
}
}

Sink semantics:
- paddr=0 is always a valid physical address in gem5 (system RAM base)
- DMA reads return zeros
- DMA writes are silently discarded
- Prevents the fault -> retry dead loop
Using a HIP kernel dispatch to illustrate the full memory interaction:
1. hipMalloc(&d_a, N*sizeof(int))
Driver -> allocates buffer in VRAM
Writes GART PTEs to shared VRAM (BAR0)
2. hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice)
Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer)
Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5
gem5 -> reads ring buffer (Guest RAM via shared memory)
gem5 -> parses SDMA command -> GART translates source address -> Guest RAM
gem5 -> reads source data (Guest RAM via shared memory)
gem5 -> writes to VRAM destination (shared memory direct write)
3. kernel<<<1, N>>>(d_a, d_b, d_c, N)
Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer)
Driver -> writes Doorbell -> gem5
gem5 -> reads PM4 command (Guest RAM via shared memory)
gem5 -> launches shader execution
gem5 -> shader reads/writes VRAM (shared memory direct access)
gem5 -> writes fence on completion (Guest RAM via Ruby memory write)
gem5 -> sends MSI-X interrupt (irq_fd -> KVM)
4. hipDeviceSynchronize()
Driver -> polls fence value (until Guest RAM value matches)
+- fence written by gem5 via Ruby memory write to shared backstore
Maximum single DMA transfer is 4 MiB (COSIM_DMA_BUF_SIZE); transfers exceeding this size must be chunked. In practice, the driver typically submits page-sized transfers, so this limit is rarely hit. The limit applies only to the legacy socket backend; the vfio-user backend accesses shared memory directly and has no such cap.
VMID 0 (kernel mode) GART page tables are fully visible via shared VRAM. However, VMID > 0 (user mode) multi-level page tables are walked by VegaISA::Walker, which uses gem5's internal TLB/page walker rather than reading directly from shared memory.
The practical impact is limited: after the driver writes page tables, it sends TLB invalidate MMIOs. gem5 flushes its TLB upon receiving these, and subsequent walker traversals read from the correct physical addresses (which point to shared VRAM or Guest RAM).
Some GART addresses in gem5 point back to VRAM itself (VRAM-to-VRAM DMA). These addresses are routed to the sink (paddr=0), and writes are silently discarded. For pure compute workloads, this does not affect correctness.
| File | Key Function/Region | Role |
|---|---|---|
| gem5/src/dev/amdgpu/amdgpu_vm.cc:396-557 | GARTTranslationGen::translate() | Core GART translation logic |
| gem5/src/dev/amdgpu/amdgpu_vm.hh | AMDGPUSysVMContext, vramShmemPtr | GART data structures |
| gem5/src/dev/amdgpu/mi300x_vfio_user.cc | setupVramShm() | VRAM shared memory initialization (vfio-user backend) |
| gem5/src/dev/amdgpu/mi300x_vfio_user.hh | MI300XVfioUser | vfio-user server-side bridge |
| gem5/src/dev/amdgpu/mi300x_gem5_cosim.cc | setupSharedMemory(), sendDmaRead/Write() | Legacy socket backend (VRAM init + DMA) |
| gem5/configs/example/gpufs/mi300_cosim.py | shared_backstore config, --cosim-backend | Guest RAM sharing setup + backend selection |
| gem5/src/dev/amdgpu/MI300XVfioUser.py | SimObject definition | vfio-user backend Python binding |