The shape of VM to come
Get qemu 
git clone https://gitlab.com/qemu-project/qemu.git
Install qemu
cd qemu
mkdir build
cd build
../configure --enable-slirp
make -j$(nproc)
sudo make install
Download the Debian 12.2.0 netinst image that we want to boot as a virtual machine:
wget https://www.debian.org/distrib/netinst/debian-12.2.0-amd64-netinst.iso
Create a disk image (qcow2 format) where the VM will store its data:
qemu-img create -f qcow2 mydisk.img 20G
Install Debian in the VM with qemu:
qemu-system-x86_64 -boot d -cdrom debian-12.2.0-amd64-netinst.iso -m 4G \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
-hda mydisk.img -accel kvm
Follow the instructions in the installer and you're done. -accel kvm speeds up the installation considerably (from 1h30 to 20 min in my case).
Let's say we want to run Debian with 8 GB of RAM:
qemu-system-x86_64 -hda mydisk.img -m 8G -accel kvm
A VM can use a lot of resources, which slows it down. We can lighten the load by disabling the graphical interface: open a terminal within the VM and run
sudo systemctl set-default multi-user.target
sudo reboot
Just in case, you can re-enable it with:
sudo systemctl set-default graphical.target
sudo reboot
Using qemu, let's configure the VM with 4 NUMA nodes of 4 CPUs each, with 4, 2, 1 and 1 GB of memory respectively:
qemu-system-x86_64 -hda mydisk.img -m 8G \
        -accel kvm \
       -smp cpus=16 \
       -object memory-backend-ram,size=4G,id=ram0 \
       -object memory-backend-ram,size=2G,id=ram1 \
       -object memory-backend-ram,size=1G,id=ram2 \
       -object memory-backend-ram,size=1G,id=ram3 \
       -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
       -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
       -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
       -numa node,nodeid=3,memdev=ram3,cpus=12-15
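The NUMA options above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below is only an illustration: `numa_args` is a hypothetical helper (not part of qemu), and it assumes every node gets the same number of CPUs and that sizes are whole gigabytes. It also prints the `-m` value, since the total given to `-m` has to match the sum of the per-node backends.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of qemu): print the memory-related
# arguments for NUMA nodes with a fixed CPU count and given sizes in GiB.
numa_args() {
    local cpus_per_node=$1; shift
    local node=0 total=0 size first last
    # One memory backend per node, named ram0, ram1, ...
    for size in "$@"; do
        printf -- '-object memory-backend-ram,size=%dG,id=ram%d\n' "$size" "$node"
        total=$(( total + size ))
        node=$(( node + 1 ))
    done
    node=0
    # One NUMA node per backend, with a contiguous CPU range.
    for size in "$@"; do
        first=$(( node * cpus_per_node ))
        last=$(( first + cpus_per_node - 1 ))
        printf -- '-numa node,nodeid=%d,memdev=ram%d,cpus=%d-%d\n' "$node" "$node" "$first" "$last"
        node=$(( node + 1 ))
    done
    # Totals that the rest of the command line has to agree with.
    printf -- '-m %dG\n' "$total"
    printf -- '-smp cpus=%d\n' $(( node * cpus_per_node ))
}

# Same layout as above: 4 CPUs per node, 4 + 2 + 1 + 1 GB.
numa_args 4 4 2 1 1
```

Running it with other sizes (for example `numa_args 2 1 1`) gives a quick way to try different topologies while keeping `-m` and `-smp` consistent.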
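Next, add an emulated NVDIMM (1 GB, backed by img/nvdimm.img) attached to a fifth, memory-only NUMA node (node 4):
qemu-system-x86_64 -hda img/mydisk.img -accel kvm \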
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -machine pc,nvdimm=on \
        -m 8G,slots=1,maxmem=9G \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4 \
        -object memory-backend-file,id=nvdimm1,share=on,mem-path=img/nvdimm.img,size=1G \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15 \
        -numa node,nodeid=4
By running the command: ndctl list -NRD we can list the active and enabled nvdimm devices:
{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":1073741824,
          "sector_size":512,
          "blockdev":"pmem0"
        }
      ]
    }
  ]
}
By default, the namespace (namespaceX.Y, here namespace0.0) is in raw mode, which means the nvdimm device acts as a memory disk that does not support DAX. We need to disable the namespace, recreate it in devdax mode, and finally reconfigure the resulting DAX device as system RAM with the following commands:
sudo ndctl disable-namespace namespace0.0
sudo ndctl create-namespace -m devdax
sudo daxctl reconfigure-device -m system-ram all --force
Node 4 is now configured as dax:
{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"devdax",
          "map":"dev",
          "size":1054867456,
          "uuid":"ed8bb2a9-41fb-48e0-a0b2-7dbf0d9ca9ba",
          "chardev":"dax0.0",
          "align":2097152
        }
      ]
    }
  ]
}
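Comparing the two listings, the namespace shrinks slightly when it moves from raw to devdax mode. The arithmetic below works out the difference from the two "size" values shown above. The second half is an assumption rather than something the output states: with "map":"dev" the per-page metadata is kept on the device itself, and the estimate assumes 4 KiB pages and a 64-byte descriptor per page.

```shell
#!/usr/bin/env bash
raw=1073741824      # "size" of namespace0.0 in raw mode
devdax=1054867456   # "size" of namespace0.0 in devdax mode

# Space no longer available to the namespace after the mode change.
overhead=$(( raw - devdax ))
echo "overhead: $(( overhead / 1024 / 1024 )) MiB"

# Assumption: 4 KiB pages, 64 bytes of metadata per page.
pages=$(( raw / 4096 ))
metadata=$(( pages * 64 ))
echo "estimated page metadata: $(( metadata / 1024 / 1024 )) MiB"
```

Under these assumptions, most of the missing space is accounted for by the page metadata, with the remainder going to alignment and namespace bookkeeping.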
To be safe, we work with a recent Linux kernel: 6.7.0-rc3+
First we need a CXL host bridge (a PCI Expander Bridge, i.e. pxb-cxl, "cxl.1" here), then we attach a root port (cxl-rp, "root_port13" here), then a Type 3 device.
In this case it is a pmem device, so it needs two "memory-backend-file" objects: one for the memory ("pmem0" here) and one for its label storage area (LSA, "cxl-lsa0" here). Finally, we need a Fixed Memory Window (FMW, i.e. cxl-fmw) to map that memory in the host:
qemu-system-x86_64 -hda img/mydisk.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=8G \
        -smp 4 \
        -object memory-backend-ram,size=4G,id=mem0 \
        -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
        -object memory-backend-file,id=pmem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,persistent-memdev=pmem0,lsa=cxl-lsa0,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
Let's build a machine with 2 sockets. Each socket has 2 CPUs, 2 CXL devices and 1 switch.
We need one PXB per socket, with 2 root ports (RP) each. Each socket gets one switch, so we need 1 upstream port and 2 downstream ports per socket. The two root ports used as upstream links for the switches both have to be attached on slot 0, so each socket needs its own chassis number to tell them apart.
In this case the devices are volatile memory (vmem) devices, so each socket needs two "memory-backend-ram" objects. Finally, we set 2 Fixed Memory Windows to map the memory of both sockets in the host:
qemu-system-x86_64 -hda img/mydisk.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 2G,slots=8,maxmem=10G \
        -smp cpus=4,cores=2,sockets=2 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=1 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=2 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=3 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=4 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=5 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=6 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G
Here, root_port1 and root_port3 are plugged on slot 0 of chassis 0 and chassis 1 respectively. The bus_nr values of the PXBs may trigger error messages if they are already in use; in that case, just change them to other values.
From the vm, list cxl memory devices with cxl list -M :
[
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem0",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  }
]
We can list decoders available with cxl list -D:
[
  {
    "root decoders":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-17985175553,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-22280142849,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      }
    ]
  }
]
We assemble a CXL region with the cxl create-region command. We need to select the decoder under which the region will be created; it must be the one that contains the CXL devices. Below, we first assemble mem1 and mem0, located under decoder0.1, with 2-way interleaving:
sudo cxl create-region -m -d decoder0.1 -t ram -w 2 mem1 mem0
Then, under decoder0.0, we create one region each for mem2 and mem3, with 1-way interleaving:
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem2
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem3
We can see they are now available with command: daxctl list
[
  {
    "chardev":"dax1.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax3.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax0.0",
    "size":536870912,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram"
  }
]
New DAX devices should appear under /sys/bus/dax/devices. By default, the new NUMA nodes appear offline. Run daxctl online-memory all to bring them online.
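The sizes reported by daxctl list follow directly from the interleave width: a region spans as many 256M devices as it has ways. A minimal sketch of that arithmetic, where `region_size` is a hypothetical helper written for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: expected region size in bytes for a given
# interleave width and per-device size in MiB.
region_size() {
    local ways=$1 dev_mib=$2
    echo $(( ways * dev_mib * 1024 * 1024 ))
}

region_size 2 256   # mem1 + mem0 under decoder0.1
region_size 1 256   # mem2 (or mem3) alone under decoder0.0
```

The first value matches the "size" of dax0.0 and the second matches dax1.0 and dax3.0 in the listing above.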
A last example, with 4 sockets: one with only CPUs, one with a CXL pmem device, one with 2 CXL devices to be 2-way interleaved, and one with 2 CXL devices to be 1-way interleaved. Note that the line with bus_nr=24 is highlighted.
<pre>
qemu-system-x86_64 -hda img/mydisk.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=10G \
        -smp cpus=8,cores=2,sockets=4 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
        -numa node,nodeid=3,cpus=6-7,memdev=ram3 \
        <b>-device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 </b>\
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device pxb-cxl,numa_node=3,bus_nr=40,bus=pcie.0,id=pxb-cxl.3 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=3 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=5 \
        -device cxl-rp,port=0,bus=pxb-cxl.3,id=root_port5,chassis=2,slot=0 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=7 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=9 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=10 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -device cxl-type3,bus=root_port5,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=pxb-cxl.3,cxl-fmw.2.size=512M
</pre>
TIP: How to identify which decoder corresponds to which device. When listing with cxl list -Dv (see the output below), look at the id field of each target. It corresponds to the bus number (bus_nr) of the PXB attached to a node. In our previous qemu script, the highlighted line shows that bus_nr=24 corresponds to numa_node=0.
```
"decoders:root0":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:18",
            "alias":"ACPI0016:02",
            "position":0,
            "id":24
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:20",
            "alias":"ACPI0016:01",
            "position":0,
            "id":32
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "size":536870912,
        "interleave_ways":1,
        "max_available_extent":536870912,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:28",
            "alias":"ACPI0016:00",
            "position":0,
            "id":40
          }
        ]
```
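In the listing above, the "id" of each target is the decimal bus_nr from the qemu command line, while the "target" name carries the same number in hexadecimal. A small sketch of the conversion, where `bus_to_target` is a hypothetical helper written for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pxb-cxl bus_nr (decimal) to the host bridge
# name shown in the "target" field of cxl list -Dv.
bus_to_target() {
    printf 'pci0000:%02x\n' "$1"
}

# The three bus_nr values used in the 4-socket qemu command.
for bus in 24 32 40; do
    printf 'bus_nr=%d -> %s\n' "$bus" "$(bus_to_target "$bus")"
done
```

This makes it easier to match a target such as pci0000:18 back to the pxb-cxl line that declared bus_nr=24.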
Let's select id 24. It is attached to decoder0.0. To identify which memory devices are below that decoder, run `cxl list -M`:
```
[
  {
    "memdev":"mem0",
    "pmem_size":268435456,
    "serial":0,
    "numa_node":3,
    "host":"0000:29:00.0"
  },
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem4",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  }
]
```
We can see that mem1 and mem4 are located in numa_node 0. So we can run `sudo cxl create-region -m -t ram -d decoder0.0 -w2 mem4 mem1`
with no doubt that it is the right decoder with the right memory devices.
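The same matching can be cross-checked from the "host" field of each memdev. The sketch below rests on an assumption that holds for the listings shown here but is not guaranteed in general: every device below a PXB gets a bus number at or above that PXB's bus_nr and below the next PXB's. `owning_bus` is a hypothetical helper; it extracts the bus from the host address and returns the largest bus_nr that does not exceed it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a memdev "host" address (domain:bus:dev.fn)
# and the list of pxb-cxl bus_nr values, print the largest bus_nr that is
# less than or equal to the device's bus number.
owning_bus() {
    local host=$1; shift
    local bus_hex bus best=-1 b
    bus_hex=$(printf '%s\n' "$host" | cut -d: -f2)
    bus=$(( 16#$bus_hex ))          # hex bus number to decimal
    for b in "$@"; do
        if [ "$b" -le "$bus" ] && [ "$b" -gt "$best" ]; then
            best=$b
        fi
    done
    echo "$best"
}

# Host addresses taken from the cxl list -M output above.
for host in 0000:1b:00.0 0000:1c:00.0 0000:23:00.0 0000:24:00.0 0000:29:00.0; do
    printf '%s -> bus_nr=%s\n' "$host" "$(owning_bus "$host" 24 32 40)"
done
```

With the values above, the two devices in numa_node 0 land under bus_nr=24, which is the target of decoder0.0.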