Description
| Author(s) | @terryzbai @midnightveil |
|---|---|
| Date of last update | 4 Feb 2026 |
(Any feedback is welcome! Please see the last section if you are not familiar with the PCIe concepts)
Overview
PCIe devices are hardware components that connect to a motherboard through a standard PCIe bus and are managed by software in a unified way. Each device on the PCIe bus presents its metadata in a configuration space header (4 KiB each) in the ECAM space, and uses this data structure to negotiate its configuration with the software system. Beyond the common configuration fields, each device requires certain MMIO regions or I/O ports for its extended capabilities, which should be allocated from the resource windows. Typical capabilities are: device-specific configuration registers, MSI-X interrupt configuration tables, and an I/O port as an alternative way of accessing the configuration space header.
The ECAM space and the resource windows are determined by the BIOS at boot time and do not overlap with RAM. Generally, these ranges stay "fixed" as long as few PCIe devices are plugged in or removed. The BIOS also allocates the required resources for each device, so the OS can simply use this configuration instead of redoing everything from scratch.
However, the current Microkit model requires all kernel objects to be specified at build time, before the BIOS-configured values can be read at run time, which forces the system designer to hard-code everything in the system description files.
Goal: a solution that eases the reuse of device drivers without having to hard-code everything while obeying the following rules:
- The device drivers do not touch ECAM at all
- The device drivers do not touch the interrupt redirection tables at all
Current Design
The current implementation (sDDF and microkit_sdf_gen) optimises the workflow without breaking the Microkit model by determining the per-device resource allocation at build time, and re-doing the mappings at run time. The workflow is:
- The device driver developer specifies the necessary PCI BARs in the config.json file and assumes the physical addresses will be patched into the device_resources data structure. See the example: https://github.com/au-ts/sddf/blob/ixgbe_rebased/drivers/blk/virtio/pci/config.json
- The system designer initialises a PCI subsystem in meta.py with the hard-coded ranges of the ECAM, MMIO window and I/O Port window, and adds the subsystem owning the target device as a client. See the example (lines 73 to 101 in 905962f):

    pcie_driver = ProtectionDomain("pcie_driver", "pcie_driver.elf", priority=252)
    # pci_system = Sddf.Pci(sdf, pcie_driver, ecam_paddr=0xb0000000, ecam_size=0x10000000, mmio_paddr=0xe0000000, mmio_size=0x10000000)
    pci_system = Sddf.Pci(sdf, pcie_driver,
                          ecam_paddr=board.pci.ecam_paddr, ecam_size=board.pci.ecam_size,
                          mmio_paddr=board.pci.mmio_paddr, mmio_size=board.pci.mmio_size,
                          ioport_paddr=board.pci.ioport_paddr, ioport_size=board.pci.ioport_size)
    pci_system.add_client(blk_system, device_id=0x1001, vendor_id=0x1af4, bus=0, dev=3, func=0)

    pds = [serial_driver, serial_virt_tx, blk_driver, blk_virt, client, pcie_driver]
    if need_timer:
        pds += [timer_driver]
    for pd in pds:
        sdf.add_pd(pd)

    assert blk_system.connect()
    assert serial_system.connect()
    assert pci_system.connect()
    assert serial_system.serialise_config(output_dir)
    assert blk_system.serialise_config(output_dir)
    assert pci_system.serialise_config(output_dir)
    if need_timer:
        assert timer_system.connect()
        assert timer_system.serialise_config(output_dir)

    with open(f"{output_dir}/{sdf_file}", "w+") as f:
        f.write(sdf.render())

- The sdfgen tooling allocates resources from the windows for each device, passes the created kernel objects to the CapDL initialiser, and serialises the configurations for all the involved drivers.
- The PCIe driver reads the configuration requests patched into the reserved data structure, and configures everything for the devices.
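The build-time allocation step can be pictured as a bump allocator over each resource window. The following is only a sketch and not the microkit_sdf_gen implementation: `WindowAllocator` and `align_up` are hypothetical names, the BAR sizes are made up for illustration, and the window base and size are taken from the commented-out line in the meta.py example.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


class WindowAllocator:
    """Hypothetical bump allocator over one PCIe resource window."""

    def __init__(self, base: int, size: int):
        self.base = base
        self.end = base + size
        self.next_free = base

    def alloc(self, size: int) -> int:
        # PCI BARs are power-of-two sized and must be naturally aligned.
        if size <= 0 or size & (size - 1):
            raise ValueError("BAR size must be a power of two")
        addr = align_up(self.next_free, size)
        if addr + size > self.end:
            raise MemoryError("resource window exhausted")
        self.next_free = addr + size
        return addr


# Assumed window: the MMIO range from the commented-out example line.
mmio = WindowAllocator(base=0xE0000000, size=0x10000000)
bar1 = mmio.alloc(0x1000)  # illustrative 4 KiB BAR
bar4 = mmio.alloc(0x4000)  # illustrative 16 KiB BAR, aligned to 16 KiB
```

A bump allocator never reclaims space, which is acceptable for a static system but would not be enough for the dynamic cases discussed later.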
The architecture diagram is like:
| specify:
| - ECAM range
config.json files | - PCIe MMIO Window range
| - PCIe I/O Port Window range
| - PCIe client device drivers
↓
┌─────────────────────┐
| microkit_sdf_gen │ --------> generate: pcie_driver_device_resources.data
└─────────────────────┘
|
| define SDF with:
| - Allocated memory regions from PCIe MMIO Window for device driver PDs
| - Allocated I/O Port from PCIe I/O Ports Window for device driver PDs
| - Allocated interrupt vectors for device driver PDs
| - ECAM memory region for PCIe driver PD
↓
┌─────────────┐
| Microkit │
└─────────────┘
Remaining issues of current design
- The ranges of PCIe resources still need to be hard-coded.
  These can only be read at run time from the ACPI tables, which are populated by the BIOS at boot time. If the values are taken from a previous run, no device may be inserted or removed between the two runs.
- <vendor_id, device_id> needs to be specified for the target device.
  There should instead be a mapping table in the PCIe driver between the <vendor_id, device_id> tuples and the device driver names.
- The I/O APIC interrupt mapping information needs to be read manually from the ACPI tables.
  In I/O APIC mode, the system designer needs to determine which pin is used and which interrupt vector this pin is mapped to.
- The assumption that the PCIe driver always finishes the configuration before the device drivers execute is very weak.
  Especially on multicore systems, a device driver running on a different core is likely to be scheduled before the PCIe-related resources are prepared. A notification channel should be enough to synchronise the configuration status.
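The mapping table mentioned in the second issue could be as simple as a dictionary keyed by the ID pair. This is a minimal sketch: `DRIVER_TABLE` and `find_driver` are hypothetical names, and the single entry reuses the IDs from the meta.py example with an assumed driver name.

```python
from typing import Optional

# Hypothetical table: (vendor_id, device_id) -> device driver name.
DRIVER_TABLE = {
    (0x1AF4, 0x1001): "blk_driver",  # IDs from the meta.py example; name assumed
}


def find_driver(vendor_id: int, device_id: int) -> Optional[str]:
    """Return the driver name for a device, or None if no driver is known."""
    return DRIVER_TABLE.get((vendor_id, device_id))


print(find_driver(0x1AF4, 0x1001))
```

With such a table, the PCIe driver could match devices found during ECAM enumeration instead of requiring the system designer to supply the IDs per client.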
Considerations for new design
Communication protocols
In a static architecture system, it does not matter whether the requested configurations are patched into the PCIe driver PD or passed via IPC, since this is a one-time action for each device driver.
For dynamic architecture systems, the PCIe driver should passively wait for configuration requests on an endpoint and allocate resources according to the IPC message.
Compatibility with ARM and RISC-V
The PCIe bus is also becoming popular on ARM and RISC-V boards as a way to attach more flexible components, such as NVMe, WiFi, and high-throughput NICs. On these two platforms, the ECAM space and the resource windows can be easily parsed from the device tree. To fully support them, MSI/MSI-X interrupts need to be supported for these architectures in the seL4 kernel.
Reusability for Djawula
Since the Djawula team is trying to adopt Microkit as well, the new PCIe driver should be as easy to use in Djawula as the other sDDF subsystems. In this case, a higher degree of dynamism is required to support the run-time creation of device drivers and resource reclamation for destroyed device drivers.
Solution Options
Option 1 (Add an ACPI driver)
@midnightveil proposed a new ACPI driver that owns all the caps, scans the ACPI tables starting from the RSDP given in the BootInfo structure, and maps the ECAM and memory windows to the PCIe driver. Microkit would also need to allow the definition of empty capability slots, so that the PCIe driver can create the requested objects and fill the CSlots with the real capabilities at run time.
For the ACPI tables parsing:
- The CapDL initialiser gives the RSDP and all the untyped capabilities to the ACPI driver.
- The ACPI driver parses the MCFG and DSDT tables to extract the ECAM, MMIO Window, I/O Ports Window, and I/O APIC routing tables.
- The ACPI driver passes the untypeds of the above resources as well as the I/O APIC routing tables to the PCIe driver.
For the PCIe MMIO BARs:
- The CapDL initialiser maps the intermediate page tables for PCIe MMIO memory regions but leaves the Page CSlots empty at the last layer of page tables. The CapDL initialiser also passes the device drivers' CNodes to the PCIe driver.
- The PCIe driver creates the Frame objects from the resource windows (untypeds), and maps them into the device drivers' page tables at run time.
- The PCIe driver notifies that the resources are ready.
This workflow also works for the I/O Ports.
For the IRQs:
- The CapDL initialiser gives the IRQControl capability to the PCIe driver.
- The PCIe driver creates the IRQHandler capabilities and binds them to the corresponding notifications of the device drivers.
- The PCIe driver notifies that the resources are ready.
The architecture diagram is like:
┌───────────────────────┐
│ CapDL Initialiser │
└───────────────────────┘
|
| pass:
| - All untyped memory objects
| - BootInfo(RSDP)
| - IRQControl Capability
↓
┌───────────────────────┐
│ ACPI Driver │
└───────────────────────┘
|
| pass:
| - Untyped memory for PCIe MMIO Window
| - Untyped memory for PCIe I/O Port Window
| - IRQControl Capability
↓
┌───────────────────────┐
│ PCIe Driver │
└───────────────────────┘
↑ |
request: | |
- resource type | | signal when configuration is ready
- resource size | |
| |
| ↓
┌───────────────────────┐
│ Device Driver │
└───────────────────────┘
According to @Willmish's introduction to Djawula's architecture design, a RootServer will receive the resource requests from all the other PDs and allocate resources for them. This RootServer seems able to do what the ACPI driver mentioned above should do, and much more. In this case, the PCIe driver should also be responsible for creating the page tables for the device drivers, since the CapDL initialiser is not involved in the later run-time resource management.
Option 2 (Offline ACPI parsing tool)
This solution consists of two separate tools:
- A tool that can run on bare-metal x86 machines and dump the ACPI tables to a persistent storage device, and
- A tool that parses the ACPI tables like what dtb library does for the ARM device tree files.
This solution is suitable only for static architecture systems and requires a pre-run on the machine to collect the resource information, which is a bit dodgy.
Option 3
Waiting for more ideas...
===================================================================
Related Concepts
ACPI Tables
The ACPI tables are populated by BIOS at boot time and store all the hardware configurations. The OS can find the pointer (which is called "RSDP") to the lookup table (which is called "RSDT") by scanning the BIOS area (0x000E0000-0x000FFFFF) in BIOS mode, or checking EFI_SYSTEM_TABLE in UEFI mode. The RSDT table gives the locations of other System Description tables. This means the locations of RSDP, RSDT and all other ACPI tables are unknown at build time.
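The BIOS-area scan described above can be sketched as follows. It assumes the ACPI 1.0 layout of the RSDP (20 bytes: signature, checksum, OEM ID, revision, RSDT address), that candidates sit on 16-byte boundaries, and that the region has already been copied into a `bytes` object. `find_rsdp` and `make_fake_rsdp` are hypothetical helpers, and the fake RSDP exists only so the example runs without real firmware.

```python
import struct
from typing import Optional

RSDP_SIGNATURE = b"RSD PTR "
RSDP_V1_SIZE = 20  # signature(8) checksum(1) oem_id(6) revision(1) rsdt_addr(4)


def find_rsdp(region: bytes) -> Optional[int]:
    """Return the offset of a valid RSDP in region, or None if none is found."""
    for offset in range(0, len(region) - RSDP_V1_SIZE + 1, 16):
        candidate = region[offset:offset + RSDP_V1_SIZE]
        # All 20 bytes, including the checksum byte, must sum to 0 mod 256.
        if candidate[:8] == RSDP_SIGNATURE and sum(candidate) & 0xFF == 0:
            return offset
    return None


def make_fake_rsdp(rsdt_address: int) -> bytes:
    """Build a synthetic RSDP with a correct checksum (for illustration only)."""
    body = bytearray(struct.pack("<8sB6sBI", RSDP_SIGNATURE, 0, b"SKETCH", 0, rsdt_address))
    body[8] = (-sum(body)) & 0xFF
    return bytes(body)


# Synthetic stand-in for the scanned BIOS area.
region = bytes(32) + make_fake_rsdp(0x7FE0000) + bytes(12)
offset = find_rsdp(region)
(rsdt_address,) = struct.unpack_from("<I", region, offset + 16)
```

On UEFI systems the scan is unnecessary because the RSDP is found through the EFI system table, as noted above.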
ECAM
A PCIe ECAM space is a memory-mapped area that is shared between the PCIe devices and CPUs. The configuration space header of each device function is addressable by knowing the eight-bit PCI bus, five-bit device, and three-bit function numbers for the device. The format looks like [domain:]bus:device.function, e.g., 0000:00:00.1.
Therefore, each device can present up to 8 "functions", such as multiple ports in a NIC. The header location can be calculated with the following formula.
header_address = base_addr + (bus << 20) + (device << 15) + (function << 12) + 0x00
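As a worked instance of the formula, the sketch below computes the header address for the virtio block device from the meta.py example (bus 0, device 3, function 0). The ECAM base is the value from the commented-out line in that example, and `ecam_header_address` is a hypothetical helper name.

```python
def ecam_header_address(base_addr: int, bus: int, device: int, function: int) -> int:
    """Address of the 4 KiB configuration space header of one device function."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return base_addr + (bus << 20) + (device << 15) + (function << 12)


# Assumed ECAM base, taken from the commented-out example line.
addr = ecam_header_address(0xB0000000, bus=0, device=3, function=0)
print(hex(addr))  # 0xb0018000
```

Each bus therefore covers 1 MiB of ECAM space, so the 0x10000000-byte ECAM size in the example spans all 256 buses.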
PCI BARs
PCIe Base Address Registers (BARs) are used to map the PCIe device's interfaces (memory and I/O ports) for communication between the CPU and devices. Each device function can have up to 6 BARs, and each BAR has a type (32-bit memory, 64-bit memory or I/O port), a base address and a size. Every capability has to be within one of these BARs.
For example, a virtio-pci device normally requires an I/O port (BAR0) for access to configuration, a memory-mapped BAR1 for the other four virtio-specific capabilities, and a memory-mapped BAR4 for the MSI-X tables.
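The BAR type and base address are encoded in the low bits of the register value, and the size is discovered by writing all ones and reading the register back. The sketch below decodes both; `decode_bar` and `mem32_bar_size` are hypothetical helper names and the register values are made up for illustration.

```python
def decode_bar(raw_low: int, raw_high: int = 0) -> dict:
    """Decode a BAR value; raw_high is the following BAR, used by 64-bit BARs."""
    if raw_low & 0x1:
        # Bit 0 set: I/O space BAR, bits 31:2 hold the port base.
        return {"kind": "io", "base": raw_low & 0xFFFFFFFC, "prefetchable": False}
    bar_type = (raw_low >> 1) & 0x3  # bits 2:1 select the memory BAR width
    base = raw_low & 0xFFFFFFF0      # bits 31:4 hold the memory base
    if bar_type == 0b10:
        kind = "mem64"
        base |= raw_high << 32       # upper half lives in the next BAR slot
    elif bar_type == 0b00:
        kind = "mem32"
    else:
        raise ValueError("reserved memory BAR type")
    return {"kind": kind, "base": base, "prefetchable": bool(raw_low & 0x8)}


def mem32_bar_size(readback: int) -> int:
    """Size of a 32-bit memory BAR from the value read after writing 0xFFFFFFFF."""
    mask = readback & 0xFFFFFFF0
    return (~mask + 1) & 0xFFFFFFFF


io_bar = decode_bar(0x0000C001)
mem_bar = decode_bar(0xFEBD000C, raw_high=0x1)
size = mem32_bar_size(0xFFFFC000)
```

A 64-bit memory BAR consumes two consecutive BAR slots, which is why a function with six slots may expose fewer than six usable regions.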
I/O APIC
Instead of having physical interrupt pins (INTA# to INTD#) like the legacy PCI slots, PCIe uses in-band messages (Assertion/Deassertion packets in Transaction Layer) to emulate the wires between the interrupt pins of PCIe devices and I/O APIC inputs.
The interrupt pin is determined by the BIOS/firmware for each function, and can be read at offset 0x3D of its configuration space. The value in the interrupt line register normally refers to the interrupt vector (in legacy PCI) or the corresponding I/O APIC input (in PCIe), but is not necessarily used.
In the QEMU monitor, info pci tells which pin is used and the corresponding I/O APIC input (the IRQ field). In other cases, the routing information can be read from the ACPI DSDT table. For example, the following _PRT entry gives the routing table for the 0.65.0 bus on makatea:
    Name (G05F, Package (0x04)
    {
        Package (0x04) { 0xFFFF, 0x00, 0x00, 0x28 }, // INTA# -> GSI 40
        Package (0x04) { 0xFFFF, 0x01, 0x00, 0x2C }, // INTB# -> GSI 44
        Package (0x04) { 0xFFFF, 0x02, 0x00, 0x2D }, // INTC# -> GSI 45
        Package (0x04) { 0xFFFF, 0x03, 0x00, 0x2E }  // INTD# -> GSI 46
    })
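Once extracted, a routing table like this one is easy to query. The sketch below transcribes the four entries by hand and looks up the GSI for a device and pin. It assumes the usual _PRT field order (address, pin, source, source index), where the high word of the address is the device number, a low word of 0xFFFF means any function, and a source of 0 means the source index is the GSI. `gsi_for_pin` is a hypothetical helper name.

```python
from typing import Optional

# (address, pin, source, source_index), transcribed from the G05F package.
PRT_G05F = [
    (0xFFFF, 0x00, 0x00, 0x28),  # INTA#
    (0xFFFF, 0x01, 0x00, 0x2C),  # INTB#
    (0xFFFF, 0x02, 0x00, 0x2D),  # INTC#
    (0xFFFF, 0x03, 0x00, 0x2E),  # INTD#
]


def gsi_for_pin(prt, device: int, pin: int) -> Optional[int]:
    """Return the GSI routed to a device's pin (0 = INTA# .. 3 = INTD#)."""
    for address, entry_pin, source, source_index in prt:
        # Only hard-wired entries (source == 0) carry the GSI directly.
        if (address >> 16) == device and entry_pin == pin and source == 0:
            return source_index
    return None


print(gsi_for_pin(PRT_G05F, device=0, pin=0))  # 40
```

Entries with a non-zero source refer to a link device instead and would need further resolution, which this sketch does not attempt.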
MSI
With I/O APIC-based interrupts, all functions on a PCIe bus need to share four interrupt pins, so the CPU has to waste time checking every device driver listening on the same interrupt. The shared redirection table also raises certain security risks.
Message Signalled Interrupts (MSI) allow each device function to have up to 32 interrupt vectors. The device raises an interrupt by writing data to a specific memory address, which delivers the interrupt to the target CPU.
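On x86, the memory write that raises an MSI has a fixed format. The sketch below builds the address and data values for the simplest case and assumes xAPIC physical destination mode, fixed delivery, edge triggering and no interrupt remapping. `msi_message` is a hypothetical helper name.

```python
def msi_message(dest_apic_id: int, vector: int) -> tuple:
    """Return the (address, data) pair a device writes to raise an x86 MSI."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("xAPIC destination ID must fit in 8 bits")
    if not 0x10 <= vector <= 0xFE:
        raise ValueError("vector out of the usable range")
    # Bits 31:20 are fixed to 0xFEE, bits 19:12 carry the destination APIC ID.
    address = 0xFEE00000 | (dest_apic_id << 12)
    # Bits 7:0 carry the vector; delivery mode 000 (fixed) and edge trigger are 0.
    data = vector
    return address, data


address, data = msi_message(dest_apic_id=1, vector=0x41)
print(hex(address), hex(data))  # 0xfee01000 0x41
```

These are the values that would be programmed into the function's MSI capability, which is why drivers that must not touch the configuration space need the PCIe driver to do it for them.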
MSI-X
In MSI mode, all interrupts of each device function can target only one processor, and the interrupt vectors have to be contiguous. MSI-X provides a more flexible vector configuration structure and up to 2048 interrupts for each device function.
The I/O APIC interrupts cannot be triggered with the MSI/MSI-X capabilities enabled, and "More than one MSI Capability structure per Function is prohibited, but a Function is permitted to have both an MSI and an MSI-X Capability structure."
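The flexibility of MSI-X comes from its per-vector table, which lives in one of the memory BARs. Each entry is 16 bytes: message address (low and high halves), message data, and a vector control word whose bit 0 masks the vector. The sketch below packs one entry; `pack_msix_entry` and `msix_entry_offset` are hypothetical helper names, and the address and data values are illustrative.

```python
import struct

MSIX_ENTRY_SIZE = 16   # bytes per table entry
MSIX_MAX_ENTRIES = 2048


def pack_msix_entry(address: int, data: int, masked: bool = False) -> bytes:
    """Pack one MSI-X table entry in little-endian layout."""
    return struct.pack(
        "<IIII",
        address & 0xFFFFFFFF,  # message address, low 32 bits
        address >> 32,         # message address, high 32 bits
        data & 0xFFFFFFFF,     # message data
        1 if masked else 0,    # vector control: bit 0 is the mask bit
    )


def msix_entry_offset(index: int) -> int:
    """Byte offset of an entry from the start of the MSI-X table."""
    if not 0 <= index < MSIX_MAX_ENTRIES:
        raise ValueError("MSI-X supports at most 2048 entries")
    return index * MSIX_ENTRY_SIZE


entry = pack_msix_entry(0xFEE01000, 0x41)
```

Because every entry carries its own address and data, each vector can target a different processor, unlike MSI.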