|
| 1 | +# Switchtec Kernel Design Documentation |
| 2 | + |
| 3 | +This document aims to provide a jumping-off point for working with the |
| 4 | +kernel code for the switchtec driver. It describes some core concepts |
| 5 | +and landmarks to help get started hacking on the code. This document |
| 6 | +may not stay up to date, so when in doubt, consult the code. |
| 7 | + |
| 8 | +The Switchtec kernel module is divided into two parts: switchtec.ko and |
| 9 | +ntb_hw_switchtec.ko. The former enumerates management and NTB endpoints, |
| 10 | +configures them, and provides the interface to switchtec-user. The latter |
| 11 | +provides a driver for the Linux NTB stack. ntb_hw_switchtec.ko depends on |
| 12 | +switchtec.ko. |
| 13 | + |
| 14 | +## switchtec.ko |
| 15 | + |
| 16 | +The main Switchtec driver enumerates the devices in the standard way |
| 17 | +for Linux. How that is done is not covered in this document; for more |
| 18 | +information on Linux driver implementations, refer to [LDD3][1] or the |
| 19 | +kernel source code. |
| 20 | + |
| 21 | +### Userspace Interface |
| 22 | + |
| 23 | +Refer to the README file or switchtec_ioctl.h for more information on |
| 24 | +how the userspace interface is defined. The kernel module creates a |
| 25 | +character device for each switch that was enumerated. Reading and |
| 26 | +writing this device lets userspace issue MRPC commands, and a few IOCTLs |
| 27 | +are provided so userspace does not have to access the GAS directly |
| 28 | +(which requires full root permission and has security and stability |
| 29 | +implications). For the implementation of these commands refer to |
| 30 | +switchtec_fops in switchtec.c. |
| 31 | + |
| 32 | +Whenever a userspace application opens a switchtec char device, the |
| 33 | +kernel creates a switchtec_user structure. This structure is used for |
| 34 | +queueing MRPC commands: each application can have one MRPC command in |
| 35 | +flight at a time, and the kernel arbitrates between the applications |
| 36 | +on a first-in, first-out basis. |
| 37 | + |
| 38 | +When the application does a write, the kernel will queue the data to be |
| 39 | +sent to the firmware. If the queue is empty, it will immediately submit |
| 40 | +the command (see mrpc_queue_cmd). A read command will store how much data |
| 41 | +is to be read and block until the command has been completed. An event |
| 42 | +interrupt indicates when the command is completed, and the kernel will |
| 43 | +read the output data and store it in the switchtec_user structure (see |
| 44 | +mrpc_complete_cmd). If the read command has not yet set how much output |
| 45 | +data is expected, the kernel will read all of the data into the buffer |
| 46 | +(which may be slower than expected). Once the data is read, the completion |
| 47 | +in switchtec_user will signal the read command to return the data |
| 48 | +to userspace. |
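
The queueing behaviour described above can be modelled in plain userspace C. This is a simplified sketch, not the driver's code: the types `user_model` and `queue_model` and the functions `model_queue_cmd` and `model_complete_cmd` are hypothetical names chosen for illustration, and submitting a command to the firmware is reduced to a state change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; these are not the kernel's data structures. */
enum cmd_state { CMD_IDLE, CMD_QUEUED, CMD_RUNNING, CMD_DONE };

#define MAX_USERS 8

struct user_model {
    enum cmd_state state;   /* each user has at most one command */
};

struct queue_model {
    struct user_model *fifo[MAX_USERS];
    size_t head, count;
};

/* Start the command at the head of the queue unless it is already running. */
static void model_submit_head(struct queue_model *q)
{
    if (q->count > 0 && q->fifo[q->head]->state == CMD_QUEUED)
        q->fifo[q->head]->state = CMD_RUNNING;
}

/*
 * Model of a write: append the user's command to the FIFO.
 * Returns 0 on success, -1 if the user already has a command
 * outstanding or the queue is full.
 */
int model_queue_cmd(struct queue_model *q, struct user_model *u)
{
    assert(q != NULL && u != NULL);
    if (u->state != CMD_IDLE || q->count == MAX_USERS)
        return -1;
    u->state = CMD_QUEUED;
    q->fifo[(q->head + q->count) % MAX_USERS] = u;
    q->count++;
    /* Only starts the new command if the queue was empty before. */
    model_submit_head(q);
    return 0;
}

/* Model of completion: finish the head command and start the next one. */
void model_complete_cmd(struct queue_model *q)
{
    assert(q != NULL);
    if (q->count == 0)
        return;
    q->fifo[q->head]->state = CMD_DONE;
    q->head = (q->head + 1) % MAX_USERS;
    q->count--;
    model_submit_head(q);
}
```

In this model, a second user's command stays in `CMD_QUEUED` until `model_complete_cmd` retires the first, which mirrors the first-in, first-out arbitration described in the text.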
| 49 | + |
| 50 | +In case something unexpected happens, the kernel has a timeout on all |
| 51 | +MRPC commands (see mrpc_timeout_work). Usually the interrupt will occur |
| 52 | +before the timeout, but if it is missed, the timeout will prevent the |
| 53 | +queue from hanging. Note, however, that if the firmware never indicates |
| 54 | +the command is complete, the queue will still hang. |
| 55 | + |
| 56 | +### Interrupts |
| 57 | + |
| 58 | +The driver sets up space for up to four MSI-X or MSI interrupts but only |
| 59 | +registers a handler for the event interrupt as designated by the |
| 60 | +vep_vector_number in the GAS region. The NTB module will also register |
| 61 | +another interrupt handler for the doorbell and message vector. |
| 62 | + |
| 63 | +The event interrupt (switchtec_event_isr) first checks whether the MRPC |
| 64 | +event occurred and, if so, queues mrpc_work, which will call |
| 65 | +mrpc_complete_cmd. It then clears the EVENT_OCCURRED bit so the interrupt |
| 66 | +doesn't continue to trigger. |
| 67 | + |
| 68 | +Next, the interrupt will check all the link state events in all the |
| 69 | +ports and signal a link_notifier (typically used by the NTB driver) |
| 70 | +if such an event occurs. |
| 71 | + |
| 72 | +Finally, the interrupt will check all other event interrupts. If |
| 73 | +an event interrupt occurs, it wakes up any process that is polling |
| 74 | +on events (see switchtec_dev_poll). It then disables the interrupt |
| 75 | +for that event. In this way, it is expected that an application will |
| 76 | +enable the interrupt it's waiting for, then call poll in a loop, |
| 77 | +checking whether the expected interrupt occurred. poll will return any |
| 78 | +time any event occurs. |
| 79 | + |
| 80 | +### IOCTLs |
| 81 | + |
| 82 | +Several IOCTLs are provided for functions needed by |
| 83 | +switchtec-user. See the README for a description of these IOCTLs and |
| 84 | +switchtec_dev_ioctl for their implementation. |
| 85 | + |
| 86 | +### Sysfs |
| 87 | + |
| 88 | +There are a number of sysfs attributes provided so that userspace can |
| 89 | +easily enumerate and discover the available switchtec devices. The |
| 90 | +attributes in the system can easily be browsed in sysfs under |
| 91 | +/sys/class/switchtec. |
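
As a rough illustration of enumerating devices from userspace, the sketch below lists the entries of a sysfs class directory using the POSIX `opendir` and `readdir` calls. The function name `list_class_devices` is hypothetical, and the entry names it prints depend on the system; pass `/sys/class/switchtec` as the directory.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Print each entry under a sysfs class directory and return how many
 * were found, or -1 if the directory cannot be opened (for example
 * when the module is not loaded).
 */
int list_class_devices(const char *class_dir)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(class_dir != NULL);
    dir = opendir(class_dir);
    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        /* Skip the "." and ".." entries. */
        if (strcmp(ent->d_name, ".") == 0 ||
            strcmp(ent->d_name, "..") == 0)
            continue;
        printf("%s/%s\n", class_dir, ent->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

Each printed path is itself a directory whose files are the attributes documented in the ABI file.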
| 92 | + |
| 93 | +These attributes are documented in Documentation/ABI/sysfs-class-switchtec. |
| 94 | +See switchtec_device_attrs in switchtec.c for their implementation. |
| 95 | + |
| 96 | +## ntb_hw_switchtec.ko |
| 97 | + |
| 98 | +The ntb_hw_switchtec module enumerates all devices in the switchtec class |
| 99 | +and creates NTB interfaces for any devices that are NTB endpoints. |
| 100 | +See switchtec_ntb_ops for the implementation of all the NTB operations. |
| 101 | + |
| 102 | +### Shared Memory Window |
| 103 | + |
| 104 | +The Switchtec NTB driver reserves one of the LUT memory windows so it |
| 105 | +can be used to provide scratch pad registers and link detection. For |
| 106 | +now, the driver sets the size of all LUT windows to be fixed at 64KB. |
| 107 | +This size allows for the combined size of all LUT windows to be |
| 108 | +large enough that the alignment of the direct window that follows |
| 109 | +will be at least 2MB. |
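
The alignment argument can be checked with a little arithmetic. The sketch below assumes the direct window starts immediately after the LUT windows; the number of LUT windows is a hypothetical input rather than a value taken from the driver, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LUT_WINDOW_SIZE (64u * 1024u)         /* 64KB, from the text above */
#define DIRECT_ALIGN    (2u * 1024u * 1024u)  /* 2MB target alignment */

/* Offset at which a region placed right after n LUT windows would start. */
uint64_t offset_after_luts(unsigned int n_luts)
{
    return (uint64_t)n_luts * LUT_WINDOW_SIZE;
}

/* Largest power of two that divides offset (returns 0 for offset 0). */
uint64_t natural_alignment(uint64_t offset)
{
    return offset & (~offset + 1);
}
```

Under these assumptions, 32 windows of 64KB total exactly 2MB, so `natural_alignment(offset_after_luts(32))` equals `DIRECT_ALIGN`, while 16 windows would only give 1MB alignment.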
| 110 | + |
| 111 | +### Link Management |
| 112 | + |
| 113 | +The link is considered to be up when both sides have set up their shared |
| 114 | +memory window. Both sides must read a magic number and link status |
| 115 | +to determine that the link is up. When either side changes its |
| 116 | +link status, a specific message is sent telling the other side to check |
| 117 | +the current link state. The link state is also checked whenever the |
| 118 | +switch sends a link state change interrupt. |
| 119 | + |
| 120 | +### Memory Windows |
| 121 | + |
| 122 | +By default, the driver only provides direct memory windows to the |
| 123 | +upper layers. This is because the existing upper layers can get confused |
| 124 | +by a large number of LUT memory windows. The LUT memory windows can be |
| 125 | +enabled with the use_lut_mws parameter. |
| 126 | + |
| 127 | +### Crosslink |
| 128 | + |
| 129 | +The crosslink feature allows for an NTB system to be entirely symmetric |
| 130 | +such that two hosts can be identical and interchangeable. To do this, a |
| 131 | +special hostless partition is created in the middle of the two hosts. |
| 132 | +This is supported by the driver and only requires a special initialization |
| 133 | +procedure (see switchtec_ntb_init_crosslink). Crosslink also reserves another |
| 134 | +one of the LUT windows to be used to window the NTB register space inside |
| 135 | +the crosslink partition. Besides this, all other NTB operations function |
| 136 | +identically to regular NTB. |
| 137 | + |
| 138 | +[1]: https://lwn.net/Kernel/LDD3/ |