Virtual I/O Internals
Figure 1 (Vfio-pci driver bindings.png): VFIO group nodes are the unit of ownership that VFIO uses.
Figure 2 (IOCTL set VFIO container.png): IOCTL(GROUP, VFIO_GROUP_SET_CONTAINER, &CONTAINER) places the VFIO Group inside the VFIO Container.
Figure 4 (Ioctl-VFIO GROUP GET FD.png): Using IOCTL(GROUP2, VFIO_GROUP_GET_FD, "0000:01:00.0") to obtain the VFIO Group file descriptor.
Revision as of 19:42, 26 April 2022
The following document will attempt to detail the internals of a Virtual Function IO (VFIO) driven Mediated Device (Mdev).
RPC Mode | SR-IOV Mode |
---|---|
Host requires insight into the guest workload. | Host is ignorant of the guest workload. |
Error reporting. | No guest driver error reporting. |
In depth dynamic monitoring. | Basic dynamic monitoring. |
Software defined MMU guest separation. | Firmware defined MMU guest separation. |
Requires deferred instructions to be supported by host software (support libraries). | Guest is ignorant of host supported software such as support libraries. |
Both Modes
VFIO file descriptor
A VFIO device is exposed through a file descriptor in which each region of the IO device is represented by a range of file offsets.
In RPC Mode this structure is emulated, whereas in SR-IOV Mode it is mapped to a real PCI resource.
00:00.0 VGA compatible controller |
---|
Region 0 Bar0 (starts at offset 0) |
Region 1 Bar1 |
Region 2 Bar2 |
Region 3 Bar3 |
Region 4 Bar4 |
Region 5 Bar5 (IO port space) |
Expansion ROM |
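The table below shows how the file offsets are laid out internally for each BAR region: the first region starts at offset 0, and each subsequent region starts where the previous one ends, so the offsets grow by the sizes of all preceding regions as you progress through the file.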
<- File Offset -> | ||||
---|---|---|---|---|
0 -> A | A -> (A+B) | (A+B) -> (A+B+C) | (A+B+C) -> (A+B+C+D) | ... |
Region 0 (size A) | Region 1 (size B) | Region 2 (size C) | Region 3 (size D) | ... |
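The cumulative layout in the table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the example sizes are hypothetical, and on a real device the offset of each region should be taken from the offset field returned by VFIO_DEVICE_GET_REGION_INFO rather than computed.

```python
from itertools import accumulate


def region_file_offsets(sizes):
    """Return a (start, end) file offset pair for each consecutive region."""
    ends = list(accumulate(sizes))   # running totals: A, A+B, A+B+C, ...
    starts = [0] + ends[:-1]         # each region starts where the previous one ends
    return list(zip(starts, ends))


# Hypothetical sizes A, B, C, D (in bytes) for regions 0 to 3.
example_sizes = [0x1000, 0x4000, 0x100, 0x2000]
for index, (start, end) in enumerate(region_file_offsets(example_sizes)):
    print(f"Region {index}: {start:#x} -> {end:#x}")
```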
VFIO Interrupts
Guests communicate with the host via VFIO Interrupt Requests (IRQs). These are sent via an irqfd (IRQ File Descriptor). Similarly, the host receives these interrupts via eventfd (Event File Descriptor). The resulting data can be returned via a callback.
IRQs
Device properties discovered via IOCTL
VFIO_DEVICE_GET_INFO: struct vfio_device_info | Flag values |
---|---|
argsz | |
flags | VFIO_DEVICE_FLAGS_PCI, VFIO_DEVICE_FLAGS_PLATFORM, VFIO_DEVICE_FLAGS_RESET |
num_irqs | |
num_regions | |
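As a rough illustration of how the structure above is used, the sketch below prepares a request buffer and decodes a reply with Python's struct module. It assumes the leading field order from <linux/vfio.h> (argsz, flags, num_regions, num_irqs); the helper names and the sample reply values are hypothetical, and the ioctl call itself is left out so the example runs without a device.

```python
import struct

# Flag bits from <linux/vfio.h>.
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1
VFIO_DEVICE_FLAGS_PLATFORM = 1 << 2

# Leading fields of struct vfio_device_info, each a 32-bit unsigned integer:
# argsz, flags, num_regions, num_irqs.
DEVICE_INFO = struct.Struct("=IIII")


def new_device_info_request():
    """Return a zeroed buffer whose argsz field holds the buffer size."""
    buf = bytearray(DEVICE_INFO.size)
    struct.pack_into("=I", buf, 0, DEVICE_INFO.size)
    return buf


def decode_device_info(buf):
    """Decode a buffer that VFIO_DEVICE_GET_INFO has filled in."""
    _argsz, flags, num_regions, num_irqs = DEVICE_INFO.unpack_from(buf)
    return {
        "pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "platform": bool(flags & VFIO_DEVICE_FLAGS_PLATFORM),
        "reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }


# Hypothetical reply: a PCI device that supports reset, with 9 regions and 5 IRQs.
reply = DEVICE_INFO.pack(
    DEVICE_INFO.size, VFIO_DEVICE_FLAGS_PCI | VFIO_DEVICE_FLAGS_RESET, 9, 5
)
print(decode_device_info(reply))
```

The caller sets argsz before the call so the kernel knows how much of the structure the caller understands; the remaining fields are outputs.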
VFIO_DEVICE_GET_REGION_INFO: struct vfio_region_info | Flag values |
---|---|
argsz | |
cap_offset | |
flags | VFIO_REGION_INFO_FLAG_CAPS, VFIO_REGION_INFO_FLAG_MMAP, VFIO_REGION_INFO_FLAG_READ, VFIO_REGION_INFO_FLAG_WRITE |
index | |
offset | |
size | |
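The region flags determine how a region can be accessed through the device file descriptor. The following sketch decodes a struct vfio_region_info buffer and picks an access path; it assumes the field order from <linux/vfio.h> (argsz, flags, index, cap_offset, size, offset), and the function name, the returned labels, and the sample values are hypothetical.

```python
import struct

# Flag bits from <linux/vfio.h>.
VFIO_REGION_INFO_FLAG_READ = 1 << 0
VFIO_REGION_INFO_FLAG_WRITE = 1 << 1
VFIO_REGION_INFO_FLAG_MMAP = 1 << 2
VFIO_REGION_INFO_FLAG_CAPS = 1 << 3

# argsz, flags, index, cap_offset (32-bit each), then size and offset (64-bit each).
REGION_INFO = struct.Struct("=IIIIQQ")


def describe_region(buf):
    """Summarise a region and choose how user space would access it."""
    _argsz, flags, index, _cap_offset, size, offset = REGION_INFO.unpack_from(buf)
    if size == 0:
        access = "unimplemented"   # nothing to map or read
    elif flags & VFIO_REGION_INFO_FLAG_MMAP:
        access = "mmap"            # direct mapping of the region
    elif flags & (VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE):
        access = "read/write"      # every access goes through the file descriptor
    else:
        access = "none"
    return {"index": index, "offset": offset, "size": size, "access": access}


# Hypothetical reply for region 0: 16 MiB at file offset 0, readable, writable, mappable.
flags = (VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE
         | VFIO_REGION_INFO_FLAG_MMAP)
print(describe_region(REGION_INFO.pack(REGION_INFO.size, flags, 0, 0, 16 << 20, 0)))
```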
VFIO_DEVICE_GET_IRQ_INFO: struct vfio_irq_info | Flag values |
---|---|
argsz | |
count | |
flags | VFIO_IRQ_INFO_AUTOMASKED, VFIO_IRQ_INFO_EVENTFD, VFIO_IRQ_INFO_MASKABLE, VFIO_IRQ_INFO_NORESIZE |
index | |
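Note: VFIO_IRQ_INFO_AUTOMASKED indicates that the interrupt is masked automatically when it fires, which protects the host; the user must unmask it after servicing the interrupt in order to receive further interrupts.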
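The IRQ flags can be modelled as a bit field. The sketch below uses the bit positions from <linux/vfio.h>; the enum, the helper function, and the two sample flag combinations are hypothetical and only show how a consumer might decide whether an unmask step is needed after handling an interrupt.

```python
import enum


class IrqInfoFlags(enum.IntFlag):
    """Bit values of the flags field in struct vfio_irq_info."""
    EVENTFD = 1 << 0
    MASKABLE = 1 << 1
    AUTOMASKED = 1 << 2
    NORESIZE = 1 << 3


def needs_unmask_after_service(flags):
    """Return True when the interrupt stays masked until the user unmasks it."""
    return bool(IrqInfoFlags(flags) & IrqInfoFlags.AUTOMASKED)


# Hypothetical flag combinations for two interrupt indexes.
level_triggered = IrqInfoFlags.EVENTFD | IrqInfoFlags.MASKABLE | IrqInfoFlags.AUTOMASKED
message_signalled = IrqInfoFlags.EVENTFD | IrqInfoFlags.NORESIZE

print(needs_unmask_after_service(level_triggered))
print(needs_unmask_after_service(message_signalled))
```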
VFIO_DEVICE_SET_IRQS: struct vfio_irq_set | Flag values |
---|---|
argsz | |
count | |
data[] | |
flags | VFIO_IRQ_SET_ACTION_MASK, VFIO_IRQ_SET_ACTION_TRIGGER, VFIO_IRQ_SET_ACTION_UNMASK, VFIO_IRQ_SET_DATA_BOOL, VFIO_IRQ_SET_DATA_EVENTFD, VFIO_IRQ_SET_DATA_NONE |
index | |
start | |
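To make the variable-length data[] member concrete, the sketch below builds a struct vfio_irq_set request that attaches eventfds as the trigger for a range of interrupts. It assumes the header layout from <linux/vfio.h> (argsz, flags, index, start, count, followed by data[]); the function name and the file descriptor number in the example are hypothetical, and the buffer is only built, not passed to VFIO_DEVICE_SET_IRQS.

```python
import struct

# Flag bits from <linux/vfio.h>: one DATA type is combined with one ACTION.
VFIO_IRQ_SET_DATA_NONE = 1 << 0
VFIO_IRQ_SET_DATA_BOOL = 1 << 1
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_MASK = 1 << 3
VFIO_IRQ_SET_ACTION_UNMASK = 1 << 4
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5

IRQ_SET_HEADER = struct.Struct("=IIIII")  # argsz, flags, index, start, count


def pack_irq_set_eventfds(index, start, eventfds):
    """Build a request that makes each eventfd the trigger for one sub-interrupt."""
    data = struct.pack(f"={len(eventfds)}i", *eventfds)  # data[] holds 32-bit fds
    flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER
    header = IRQ_SET_HEADER.pack(
        IRQ_SET_HEADER.size + len(data),  # argsz covers the header plus data[]
        flags,
        index,
        start,
        len(eventfds),
    )
    return header + data


# Hypothetical: eventfd number 7 becomes the trigger for interrupt index 0, sub-index 0.
request = pack_irq_set_eventfds(index=0, start=0, eventfds=[7])
print(len(request), request.hex())
```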
Binding VFIO devices
Binding devices to the vfio-pci driver results in VFIO group nodes. A graphic of this can be seen in Figure 1.
Opening the file "/dev/vfio/vfio" creates a VFIO Container.
The ioctl call shown in Figure 2, IOCTL(GROUP, VFIO_GROUP_SET_CONTAINER, &CONTAINER), places the VFIO group inside the VFIO container. Once this has been done, IOCTL(CONTAINER, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) can be used to set an IOMMU type for the container, which places it in a state the user can interact with. After that, additional VFIO groups may be added to the container without setting the IOMMU type again, because newly added groups automatically inherit the container's IOMMU context. With the VFIO groups placed inside the container and the IOMMU type set, the user can map and unmap memory, which automatically inserts or removes the corresponding entries in the IOMMU and pins or unpins pages as necessary.
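The sequence described above can be sketched with Python's fcntl module. This is a minimal sketch, not a reference implementation: the request numbers are derived the way the _IO macro derives them on x86 and Arm (other architectures may encode them differently), the function names are hypothetical, and the group path passed to attach_group depends on the IOMMU group number of the bound device.

```python
import fcntl
import os
import struct

VFIO_TYPE = ord(";")
VFIO_BASE = 100


def _io(nr):
    """Mirror _IO(VFIO_TYPE, VFIO_BASE + nr) as encoded on x86 and Arm."""
    return (VFIO_TYPE << 8) | (VFIO_BASE + nr)


VFIO_SET_IOMMU = _io(2)
VFIO_GROUP_GET_STATUS = _io(3)
VFIO_GROUP_SET_CONTAINER = _io(4)
VFIO_IOMMU_MAP_DMA = _io(13)

VFIO_TYPE1_IOMMU = 1
VFIO_GROUP_FLAGS_VIABLE = 1 << 0
VFIO_DMA_MAP_FLAG_READ = 1 << 0
VFIO_DMA_MAP_FLAG_WRITE = 1 << 1


def attach_group(group_path, container_path="/dev/vfio/vfio"):
    """Open a container, place one group in it, and select the Type1 IOMMU."""
    container = os.open(container_path, os.O_RDWR)
    group = -1
    try:
        group = os.open(group_path, os.O_RDWR)
        # struct vfio_group_status: argsz, flags. The kernel fills in flags.
        status = bytearray(struct.pack("=II", 8, 0))
        fcntl.ioctl(group, VFIO_GROUP_GET_STATUS, status)
        if not struct.unpack("=II", status)[1] & VFIO_GROUP_FLAGS_VIABLE:
            raise RuntimeError("group is not viable; bind all of its devices first")
        # The argument is a pointer to the container fd, hence the packed int.
        fcntl.ioctl(group, VFIO_GROUP_SET_CONTAINER, struct.pack("=i", container))
        fcntl.ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)
    except BaseException:
        if group >= 0:
            os.close(group)
        os.close(container)
        raise
    return container, group


def pack_dma_map(vaddr, iova, size):
    """Build struct vfio_iommu_type1_dma_map: argsz, flags, vaddr, iova, size."""
    flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE
    return struct.pack("=IIQQQ", 32, flags, vaddr, iova, size)
```

With a device bound to vfio-pci, a caller would run attach_group on the group node under /dev/vfio/ and then pass the result of pack_dma_map to fcntl.ioctl(container, VFIO_IOMMU_MAP_DMA, ...) to map a page-aligned buffer.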
RPC Mode
Instruction Execution
RPC Mode moves instruction information across a virtual function (VF) interface using Remote Procedure Calls, generally issued as IOCTL system calls. GPU instructions passed from the guest as Remote Procedure Calls are just-in-time recompiled on the host for execution by a device driver.
IRQ remapping
Interrupt Requests (IRQs) must be remapped (trapped for virtualized execution) to protect the host from sensitive instructions which may affect global memory state.
Memory Management
Region Passthrough
Guests may be presented with emulated memory regions, which use indirect, emulated communication and require a VM-exit (slow), or with passthrough memory regions, which use direct communication and require no VM-exit (fast).
EPT Page Violations
Guest Memory Mapped IO (MMIO) accesses trigger Extended Page Table (EPT) violations, which are trapped by the host MMU. KVM services each EPT violation and forwards it to the QEMU VFIO PCI driver. QEMU then converts the request from KVM into a read or write access on the Mdev File Descriptor (FD). Reads and writes are then handled by the host GPU device driver via mediated callbacks (CBs) and VFIO-mdev.
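The step in which a trapped MMIO access becomes a read or write on the device file descriptor can be sketched as a positioned read or write at the region's file offset plus the register offset. The function names are hypothetical, 32-bit little-endian registers are assumed, and a temporary file stands in for the Mdev FD so the example runs without hardware.

```python
import os
import struct
import tempfile


def mmio_read32(device_fd, region_offset, register):
    """Read one 32-bit little-endian register through the device fd."""
    data = os.pread(device_fd, 4, region_offset + register)
    if len(data) != 4:
        raise OSError("short read from device region")
    return struct.unpack("<I", data)[0]


def mmio_write32(device_fd, region_offset, register, value):
    """Write one 32-bit little-endian register through the device fd."""
    written = os.pwrite(device_fd, struct.pack("<I", value), region_offset + register)
    if written != 4:
        raise OSError("short write to device region")


# Stand-in for the Mdev FD: an ordinary temporary file, not a real device.
with tempfile.TemporaryFile() as stand_in:
    fd = stand_in.fileno()
    mmio_write32(fd, region_offset=0x1000, register=0x10, value=0xCAFE)
    print(hex(mmio_read32(fd, region_offset=0x1000, register=0x10)))
```

On a real mediated device, each such read or write would arrive in the host driver's callbacks instead of touching file contents.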
Scheduling
Scheduling is handled by the host mdev driver.
RPC Mode Requirements:
Sensitive Instruction List.
Instruction Shim/Binary Translator.
HPA<->GPA Boundary Enforcement.
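One possible shape of HPA<->GPA boundary enforcement is a range check applied before any guest physical address (GPA) is translated to a host physical address (HPA). The sketch below is purely illustrative and assumes a single contiguous window per guest; the class, the function, and the addresses are hypothetical and do not describe any particular driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestWindow:
    """One contiguous block of host memory assigned to a guest (hypothetical)."""
    gpa_base: int  # first guest physical address of the window
    hpa_base: int  # host physical address backing gpa_base
    size: int      # length of the window in bytes


def translate(window, gpa, length):
    """Translate a guest range to a host address, rejecting out-of-window access."""
    if length <= 0:
        raise ValueError("length must be positive")
    offset = gpa - window.gpa_base
    # The whole range [gpa, gpa + length) must lie inside the window.
    if offset < 0 or offset + length > window.size:
        raise PermissionError("guest access outside its assigned window")
    return window.hpa_base + offset


window = GuestWindow(gpa_base=0x1000, hpa_base=0x8000_0000, size=0x1000)
print(hex(translate(window, 0x1800, 0x100)))
```

Checking the end of the range as well as its start matters: an access that begins inside the window but runs past its end would otherwise reach memory belonging to the host or to another guest.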
SR-IOV Mode
Instruction Execution
In SR-IOV Mode, instructions are communicated from a virtual function (VF) directly to the PCI BAR.
Memory Management
Guests are presented with passthrough memory regions by the device firmware.
Scheduling
Scheduling may be handled by the host mdev driver and/or the device firmware.
SR-IOV Mode Requirements:
Device SR-IOV support.
HPA<->GPA Boundary Enforcement.