A complete workflow from compilation to running HIP GPU compute.
```text
+---------------------------------+         +------------------------------+
|        QEMU (Q35 + KVM)         |         |    gem5 (inside Docker)      |
|  +---------------------------+  |         |  +------------------------+  |
|  | Guest Linux (Ubuntu 24.04)|  |         |  |    MI300X GPU Model    |  |
|  |  amdgpu driver            |  |         |  |  - Shader + CU         |  |
|  |  ROCm 7.0 / HIP runtime   |  |         |  |  - PM4 / SDMA Engines  |  |
|  +-----------+---------------+  |         |  |  - Ruby Cache Hierarchy|  |
|              | MMIO/Doorbell    |         |  +----------+-------------+  |
|  +-----------v---------------+  |  vfio-  |  +----------v-------------+  |
|  | vfio-user-pci (built-in)  |<------------->| MI300XVfioUser Server  |  |
|  +---------------------------+  |  user   |  +------------------------+  |
+---------------------------------+         +------------------------------+
                |                                          |
                v                                          v
    /dev/shm/cosim-guest-ram                    /dev/shm/mi300x-vram
 (Guest Physical Memory, Shared)                 (GPU VRAM, Shared)
```
- QEMU is responsible for: CPU execution, Linux kernel boot, PCIe enumeration, and amdgpu driver loading
- gem5 is responsible for: the MI300X GPU compute model (Shader, CU, cache, DMA engines)
- They communicate via the vfio-user protocol over a Unix domain socket. QEMU uses its built-in `vfio-user-pci` device, while gem5 runs `MI300XVfioUser` as a vfio-user server, transparently handling MMIO, doorbell, and PCI config accesses. Bulk data is shared via shared memory.
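The shared-memory data plane can be illustrated with a small sketch. Two independent mappings of the same `/dev/shm`-style file (here standing in for the QEMU and gem5 sides; the file name is illustrative, not one of the real shm paths) see each other's writes with no copying:

```python
import mmap, os, tempfile

def shm_roundtrip(payload: bytes, size: int = 4096) -> bytes:
    """Write via one mapping of a shared file, read via another."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "cosim-shm-demo")  # stand-in for e.g. mi300x-vram
    with open(path, "wb") as f:
        f.truncate(size)
    # "gem5 side": e.g. the MI300XVfioUser server creating VRAM backing store.
    fd_a = os.open(path, os.O_RDWR)
    side_a = mmap.mmap(fd_a, size)
    # "QEMU side": an independent mapping of the same file.
    fd_b = os.open(path, os.O_RDWR)
    side_b = mmap.mmap(fd_b, size)
    side_a[0:len(payload)] = payload       # one side writes...
    seen = bytes(side_b[0:len(payload)])   # ...the other observes it immediately
    side_a.close(); side_b.close()
    os.close(fd_a); os.close(fd_b)
    os.remove(path); os.rmdir(tmpdir)
    return seen

print(shm_roundtrip(b"PM4!"))  # b'PM4!'
```

This is the same mechanism QEMU's `memory-backend-file,share=on` uses for the guest RAM file, which is why gem5 can service DMA directly from guest physical memory.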
| Requirement | Description |
|---|---|
| Host OS | Linux x86_64 with KVM support (verified on WSL2 6.6.x) |
| Docker | Daemon running, current user in docker group |
| KVM | /dev/kvm accessible |
| Disk Space | At least 120 GB (55G disk image + build artifacts) |
| Memory | 16 GB or more recommended (gem5 compilation and runtime are memory-intensive) |
| Tools | git, screen, unzip |
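The requirements above can be sanity-checked before starting. A convenience sketch (not part of the repository's scripts) that verifies the required tools are on PATH and `/dev/kvm` is accessible:

```python
import os, shutil

def preflight(tools=("git", "screen", "unzip"), kvm="/dev/kvm"):
    """Report missing tools and whether the KVM device node is usable."""
    missing = [t for t in tools if shutil.which(t) is None]
    kvm_ok = os.access(kvm, os.R_OK | os.W_OK)  # /dev/kvm must be read/write
    return missing, kvm_ok

missing, kvm_ok = preflight()
print("missing tools:", missing or "none")
print("KVM accessible:", kvm_ok)
```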
```text
/home/zevorn/cosim/
  gem5/                              # gem5 source (cosim branch)
    build/VEGA_X86/gem5.opt          # gem5 binary
    configs/example/gpufs/
      mi300_cosim.py                 # cosim config script
    scripts/
      run_mi300x_fs.sh               # orchestration script
      cosim_launch.sh                # cosim one-click launch script
      Dockerfile.run                 # runtime Docker image
  gem5-resources/                    # disk images, kernels, GPU apps
    src/x86-ubuntu-gpu-ml/
      disk-image/x86-ubuntu-rocm70   # 55G raw disk image
      vmlinux-rocm70                 # kernel
  docs/                              # documentation
  qemu/                              # QEMU source (only needed for legacy backend)
    build/qemu-system-x86_64         # QEMU binary
```
The gem5 binary links against Ubuntu 24.04 libraries and must be compiled in a compatible environment.
Note: The vfio-user backend requires `libjson-c-dev` (build-time) and `libjson-c5` (runtime). The `ghcr.io/gem5/gpu-fs:latest` image already includes this dependency. If building directly on the host, install `libjson-c-dev` first.
```shell
cd /home/zevorn/cosim/gem5

# Build using the gpu-fs image (amd64, includes all dependencies)
docker run --rm \
  -v "$(pwd):/gem5" -w /gem5 \
  -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \
  ghcr.io/gem5/gpu-fs:latest \
  bash -c "scons build/VEGA_X86/gem5.opt -j4 GOLD_LINKER=True --linker=gold 2>&1"
```

Note: Reduce parallelism (`-j1` or `-j2`) if running out of memory. Using the gold linker reduces memory usage during the linking stage.
Or via the orchestration script:

```shell
./scripts/run_mi300x_fs.sh build-gem5
```

Output: `build/VEGA_X86/gem5.opt` (approximately 1.1 GB).
```shell
cd scripts
docker build -t gem5-run:local -f Dockerfile.run .
```

This image is based on ghcr.io/gem5/gpu-fs with Python 3.12 support added, and is used for running gem5.
With the vfio-user backend, a stock QEMU 10.0+ build works out of the box (the vfio-user-pci device is built-in) -- no custom QEMU code is needed. Standard build:
```shell
# Any QEMU 10.0+ source tree works
mkdir -p qemu-build && cd qemu-build
/path/to/qemu/configure --target-list=x86_64-softmmu
make -j$(nproc)
```

Output: `qemu-system-x86_64`.
Legacy backend: If using `--cosim-backend=legacy`, the `cosim/qemu/` source containing the `mi300x-gem5` device is required. The build procedure is the same, but you must use the cosim branch QEMU source.
Alternatively, via the orchestration script:

```shell
cd /home/zevorn/cosim/gem5
./scripts/run_mi300x_fs.sh build-qemu
```

The disk image contains Ubuntu 24.04 + ROCm 7.0 + kernel 6.8.0-79-generic with amdgpu DKMS modules.

```shell
./scripts/run_mi300x_fs.sh build-disk
```

Or build it manually:

```shell
cd ../gem5-resources/src/x86-ubuntu-gpu-ml
./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64"
```

On Arch Linux the QEMU path is `/usr/sbin/`; other distributions may use `/usr/bin/`.
| Artifact | Path | Size |
|---|---|---|
| Disk Image | `../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | ~55 GB |
| Kernel | `../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | ~64 MB |
```shell
cd /home/zevorn/cosim/gem5
./scripts/cosim_launch.sh
```

This script automatically performs all the following steps (starts the gem5 container, waits for readiness, fixes permissions, starts QEMU) and enters the QEMU serial console in interactive mode.
Available options:

```shell
./scripts/cosim_launch.sh --help
./scripts/cosim_launch.sh --gem5-debug MI300XCosim    # enable gem5 debug output
./scripts/cosim_launch.sh --vram-size 32GiB           # custom VRAM size
./scripts/cosim_launch.sh --num-cus 80                # custom CU count
./scripts/cosim_launch.sh --cosim-backend=vfio-user   # use vfio-user backend (default)
./scripts/cosim_launch.sh --cosim-backend=legacy      # use legacy custom socket backend
```

To run the steps manually instead, first start the gem5 server container:

```shell
docker run -d --name gem5-cosim \
  -v /home/zevorn/cosim/gem5:/gem5 \
  -v /tmp:/tmp \
  -v /dev/shm:/dev/shm \
  -w /gem5 \
  -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \
  gem5-run:local \
  /gem5/build/VEGA_X86/gem5.opt --listener-mode=on \
  /gem5/configs/example/gpufs/mi300_cosim.py \
  --socket-path=/tmp/gem5-mi300x.sock \
  --shmem-path=/mi300x-vram \
  --shmem-host-path=/cosim-guest-ram \
  --dgpu-mem-size=16GiB \
  --num-compute-units=40 \
  --mem-size=8G

# Watch gem5 logs, wait for "listening" or "ready"
docker logs -f gem5-cosim
```

The following output indicates readiness:
```text
============================================================
gem5 MI300X co-simulation server ready
Socket: /tmp/gem5-mi300x.sock
VRAM SHM: /mi300x-vram
Host SHM: /cosim-guest-ram
VRAM size: 16GiB
Host RAM: 8GiB
CUs: 40
Waiting for QEMU to connect...
============================================================
```
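The wait-for-readiness step in `cosim_launch.sh` boils down to polling for the socket file before starting QEMU. A simplified sketch (the timeout and interval values are assumptions, not the script's actual settings):

```python
import os, time

def wait_for(path: str, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until `path` (e.g. /tmp/gem5-mi300x.sock) exists, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False

# Example: wait briefly for the gem5 vfio-user socket before launching QEMU.
if wait_for("/tmp/gem5-mi300x.sock", timeout=1):
    print("gem5 is ready; safe to start QEMU")
else:
    print("timed out waiting for gem5")
```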
Files created by Docker are owned by root; permissions must be fixed so QEMU can access them:

```shell
docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock
docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram
```

Then start QEMU:

```shell
# Foreground interactive mode (vfio-user backend, stock QEMU 10.0+)
qemu-system-x86_64 \
  -machine q35 -enable-kvm -cpu host \
  -m 8G -smp 4 \
  -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \
  -numa node,memdev=mem0 \
  -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \
  -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \
  -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \
  -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \
  -nographic -no-reboot
```

Important: The kernel command line must include `modprobe.blacklist=amdgpu` to prevent the PCI subsystem from auto-loading the driver before the VGA ROM is written to shared memory. The `cosim-gpu-setup.service` handles the correct initialization order (dd ROM → modprobe).

Note: With the vfio-user backend, there is no need to specify `shmem-path` or `vram-size` on the QEMU side. Shared memory is created and managed by the `MI300XVfioUser` server in gem5.
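The `-device vfio-user-pci` argument embeds a JSON object for the socket. If you script QEMU launches, composing that argument programmatically avoids shell-quoting mistakes; a convenience sketch (the helper name is hypothetical):

```python
import json, shlex

def vfio_user_device_arg(sock_path: str) -> str:
    """Build the vfio-user-pci -device argument with an embedded JSON socket spec."""
    sock = json.dumps({"type": "unix", "path": sock_path}, separators=(",", ":"))
    return f"vfio-user-pci,socket={sock}"

arg = vfio_user_device_arg("/tmp/gem5-mi300x.sock")
print(arg)
# Quote it for a shell command line:
print("-device", shlex.quote(arg))
```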
Or run in background screen mode:

```shell
screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \
  qemu-system-x86_64 \
  -machine q35 -enable-kvm -cpu host \
  -m 8G -smp 4 \
  -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \
  -numa node,memdev=mem0 \
  -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \
  -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \
  -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \
  -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \
  -nographic -no-reboot

# Attach to the screen session to view serial output
screen -r qemu-cosim
# Detach from screen: Ctrl-A D
```

After the guest Linux finishes booting (auto-login as root), the amdgpu driver is loaded as follows.
The disk image includes `cosim-gpu-setup.service`, which runs at boot and:

- Writes the VGA ROM to `0xC0000` via `dd` (required for gem5 `readROM()`)
- Symlinks the IP discovery firmware
- Runs `modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`

The service completes in ~40 seconds. After login, verify with `rocm-smi`.
To load the driver manually instead:

```shell
# 1. Load VGA ROM (REQUIRED before modprobe); seek=768 KiB = 0xC0000
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128

# 2. Symlink the IP discovery firmware
ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \
       /usr/lib/firmware/amdgpu/ip_discovery.bin

# 3. Load the amdgpu driver
modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
```

Key parameter notes:

- `ip_block_mask=0x67` (binary 0110_0111) enables GMC, IH, DCN, GFX, SDMA, and VCN, and disables PSP and SMU
- Using an incorrect mask (e.g., 0x6f) will cause PSP initialization to trigger a GPU reset, resulting in a kernel panic
- `ras_enable=0` is required to prevent a NULL pointer crash in `amdgpu_atom_parse_data_header` (the 3 KB cosim ROM has minimal ATOMBIOS data)
- The `dd` step is mandatory -- without it, the driver's BIOS discovery chain fails and `atom_context` is NULL
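The mask arithmetic can be checked in a few lines of Python. The bit-to-IP-block mapping is device-specific; the only claim taken from the notes above is that the extra bit in 0x6f is the PSP block:

```python
def set_bits(mask: int) -> list[int]:
    """Return the bit positions set in an ip_block_mask value."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

good, bad = 0x67, 0x6F
print(f"0x67 = {good:08b} -> bits {set_bits(good)}")  # bits [0, 1, 2, 5, 6]
print(f"0x6f = {bad:08b} -> bits {set_bits(bad)}")    # bits [0, 1, 2, 3, 5, 6]

# The only difference is bit 3 -- per the notes above, the PSP block,
# whose initialization triggers a GPU reset under cosim.
extra = set(set_bits(bad)) - set(set_bits(good))
print("extra bit enabled by 0x6f:", extra)  # {3}
```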
```shell
# Check dmesg - should see "amdgpu: DRM initialized" and "7 XCP partitions"
dmesg | grep -i amdgpu | tail -20

# Verify device recognition
rocm-smi

# Verify GPU capabilities
rocminfo | head -40
```

Expected output:
```text
# rocm-smi output
GPU[0] : Device Name: 0x74a0
GPU[0] : Partition: SPX

# rocminfo output
Name: gfx942
Compute Unit: 320
KERNEL_DISPATCH capable
```
Approximately 80 fence fallback timer warnings may appear during the loading process. This is normal -- the DRM subsystem uses a polling-mode timeout fallback mechanism when probing all ring buffers.
Write a simple vector addition program inside the guest:
```shell
cat > /tmp/vec_add.cpp << 'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vec_add(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 4;
    int ha[N] = {1, 2, 3, 4};
    int hb[N] = {10, 20, 30, 40};
    int hc[N] = {0};
    int *da, *db, *dc;
    hipMalloc(&da, N * sizeof(int));
    hipMalloc(&db, N * sizeof(int));
    hipMalloc(&dc, N * sizeof(int));
    hipMemcpy(da, ha, N * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(db, hb, N * sizeof(int), hipMemcpyHostToDevice);
    vec_add<<<1, N>>>(da, db, dc, N);
    hipMemcpy(hc, dc, N * sizeof(int), hipMemcpyDeviceToHost);
    printf("Result: %d %d %d %d\n", hc[0], hc[1], hc[2], hc[3]);
    bool pass = (hc[0]==11 && hc[1]==22 && hc[2]==33 && hc[3]==44);
    printf("%s\n", pass ? "PASSED!" : "FAILED!");
    hipFree(da); hipFree(db); hipFree(dc);
    return pass ? 0 : 1;
}
EOF
```

Compile and run:

```shell
# Compile (gfx942 = MI300X architecture)
/opt/rocm/bin/hipcc --offload-arch=gfx942 -o /tmp/vec_add /tmp/vec_add.cpp

# Run
/tmp/vec_add
```

Expected output:

```text
Result: 11 22 33 44
PASSED!
```
You can also use the square test program included in gem5-resources. First compile it on the host:
```shell
cd /home/zevorn/cosim/gem5
./scripts/run_mi300x_fs.sh build-app square
```

Then copy the compiled binary into the guest (via scp or by directly mounting the disk image) and run it inside the guest:

```shell
./square.default
```

To shut down:

```shell
# Normal shutdown
poweroff

# Or force quit QEMU: Ctrl-A X
```

Cleanup:

```shell
docker rm -f gem5-cosim
rm -f /dev/shm/mi300x-vram /dev/shm/cosim-guest-ram
rm -f /tmp/gem5-mi300x.sock
```

When using `cosim_launch.sh`, cleanup is performed automatically after exiting QEMU.
Check the gem5 logs:

```shell
docker logs gem5-cosim
```

Common causes:

- `gem5.opt` not compiled, or incorrect path
- Python module import failure (check PYTHONPATH)
- Shared memory creation permission issues

```text
Failed to connect to /tmp/gem5-mi300x.sock
```

- Confirm gem5 has finished initialization (look for "Waiting for QEMU to connect")
- Confirm socket permissions have been fixed (`chmod 777`)
```text
BUG: kernel NULL pointer dereference at psp_gpu_reset+0x43
```

- An incorrect `ip_block_mask` was used. Must use `0x67` (disables PSP+SMU), not `0x6f`

```text
GART translation for 0x3fff800000000 not found
```

- This is a fixed bug: unmapped GART pages are now routed to a sink address (paddr=0) and no longer cause crashes
- If this still occurs, confirm you are using the latest compiled gem5 binary

```text
error: cannot find ROCm device library
```

- Confirm ROCm is properly installed: `ls /opt/rocm/lib/`
- Use the correct architecture flag: `--offload-arch=gfx942`

For other issues:

- Check gem5 logs (`docker logs gem5-cosim`) for errors
- A small number of fence timeouts is normal; a large number may indicate issues with the DMA or interrupt path
| Parameter | Default | Description |
|---|---|---|
| `--socket-path` | `/tmp/gem5-mi300x.sock` | QEMU <-> gem5 communication socket (vfio-user protocol) |
| `--shmem-path` | `/mi300x-vram` | GPU VRAM shared memory name (under /dev/shm) |
| `--shmem-host-path` | `/cosim-guest-ram` | Guest RAM shared memory name |
| `--dgpu-mem-size` | `16GiB` | GPU VRAM size |
| `--num-compute-units` | `40` | Number of GPU compute units |
| `--mem-size` | `8GiB` | Guest physical memory size |
| `--cosim-backend` | `vfio-user` | Cosim backend type (vfio-user or legacy) |
| `ip_block_mask` | `0x67` | amdgpu driver IP block mask |
| `discovery` | `2` | Use IP discovery firmware |
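The `--*` options above are consumed by `mi300_cosim.py`. A minimal stand-in parser, useful for seeing how the defaults combine (purely illustrative; the real script may differ in help text and validation):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of mi300_cosim.py's command-line options."""
    p = argparse.ArgumentParser(description="MI300X cosim options (sketch)")
    p.add_argument("--socket-path", default="/tmp/gem5-mi300x.sock")
    p.add_argument("--shmem-path", default="/mi300x-vram")
    p.add_argument("--shmem-host-path", default="/cosim-guest-ram")
    p.add_argument("--dgpu-mem-size", default="16GiB")
    p.add_argument("--num-compute-units", type=int, default=40)
    p.add_argument("--mem-size", default="8GiB")
    p.add_argument("--cosim-backend", choices=["vfio-user", "legacy"],
                   default="vfio-user")
    return p

args = build_parser().parse_args(["--num-compute-units", "80"])
print(args.num_compute_units, args.cosim_backend)  # 80 vfio-user
```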
| File | Purpose |
|---|---|
| `scripts/cosim_launch.sh` | cosim one-click launch script |
| `scripts/run_mi300x_fs.sh` | Orchestration script (compile, build image, run) |
| `configs/example/gpufs/mi300_cosim.py` | gem5 cosim configuration |
| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5-side vfio-user server (default backend) |
| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | gem5-side legacy bridge (legacy backend) |
| `src/dev/amdgpu/amdgpu_device.cc` | GPU device model |
| `src/dev/amdgpu/amdgpu_vm.cc` | GPU address translation (GART, etc.) |
| `qemu/hw/misc/mi300x_gem5.c` | QEMU-side mi300x-gem5 PCIe device (legacy backend only) |
| Component | Version |
|---|---|
| Guest OS | Ubuntu 24.04.2 LTS |
| Guest Kernel | 6.8.0-79-generic |
| ROCm | 7.0.0 |
| amdgpu DKMS | Matches ROCm 7.0 |
| gem5 Build Target | VEGA_X86 |
| GPU Device | MI300X (gfx942, DeviceID 0x74A0) |
| Coherence Protocol | GPU_VIPER |
| QEMU | 10.0+ (vfio-user backend) or cosim branch (legacy backend) |