Remote Procedure Call Implementation
====================================

Traditionally, the C library abstracts over several functions that interface
with the platform's operating system through system calls. The GPU, however,
does not provide an operating system that can handle target-dependent
operations. Instead, we implemented remote procedure calls to interface with
the host's operating system while executing on a GPU.

We implemented remote procedure calls using unified virtual memory to create a
shared communication channel between the two processes. This memory is often
pinned memory that can be accessed asynchronously and atomically by multiple
processes simultaneously. This means that we can simply provide mutual
exclusion on a shared buffer to swap work back and forth between the host
system and the GPU. We can then use this to build a simple client-server
protocol on top of this shared memory.

This work treats the GPU as a client and the host as a server. The client
initiates a communication while the server listens for it. In order to
communicate between the host and the device, we simply maintain a buffer of
memory and two mailboxes. One mailbox is write-only while the other is
read-only. This exposes three primitive operations: using the buffer, giving
away ownership, and waiting for ownership. Together these form a half-duplex
transmission channel between the two sides. We decided to assign ownership of
the buffer to the client when the inbox and outbox bits are equal and to the
server when they are not.
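
As an illustration, the following is a minimal sketch of this ownership scheme.
The names are hypothetical rather than the library's implementation; it only
shows how the inbox and outbox bits decide which side may use the buffer and
how ownership is given away.

.. code-block:: c++

  #include <atomic>
  #include <cstdint>

  // Hypothetical sketch of the mailbox pair, not the library's implementation.
  struct Mailboxes {
    std::atomic<uint32_t> inbox;  // written by the other side, read-only here
    std::atomic<uint32_t> outbox; // written by this side, read-only there
  };

  // The client owns the buffer when the bits are equal; the server owns it
  // when they differ.
  bool client_owns_buffer(const Mailboxes &m) {
    return m.inbox.load(std::memory_order_acquire) ==
           m.outbox.load(std::memory_order_relaxed);
  }

  // Giving away ownership toggles the outbox bit.
  void give_away_ownership(Mailboxes &m) {
    uint32_t out = m.outbox.load(std::memory_order_relaxed);
    m.outbox.store(out ^ 1, std::memory_order_release);
  }

  // Waiting for ownership spins until the bits indicate ownership again; a
  // real implementation would add back-off or sleeping.
  void wait_for_ownership_as_client(const Mailboxes &m) {
    while (!client_owns_buffer(m))
      ;
  }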

In order to make this transmission channel thread-safe, we abstract ownership
of a given mailbox pair and buffer behind a port, which effectively acts as a
lock and an index into the allocated buffer slice. The server and device have
independent locks around the given port. In this scheme, the buffer can be
used to communicate intent and data generically with the server. We then
simply provide multiple copies of this protocol and expose them as multiple
ports.
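
The sketch below illustrates that idea under the same caveat: the names are
hypothetical, and it only shows a port acting as a lock paired with an index
into the shared buffer, with each side keeping its own lock array.

.. code-block:: c++

  #include <atomic>
  #include <cstdint>
  #include <optional>

  inline constexpr uint32_t NUM_PORTS = 64; // illustrative count only

  struct Port {
    uint32_t index; // which slice of the shared buffer this port refers to
  };

  // Each side (client and server) keeps its own array of per-port locks.
  struct PortLocks {
    std::atomic<uint32_t> lock[NUM_PORTS];
  };

  // Try to claim any free port on this side; returns std::nullopt if all of
  // them are currently held by other threads.
  std::optional<Port> try_open(PortLocks &locks) {
    for (uint32_t i = 0; i < NUM_PORTS; ++i) {
      uint32_t expected = 0;
      if (locks.lock[i].compare_exchange_strong(expected, 1))
        return Port{i};
    }
    return std::nullopt;
  }

  // Closing the port releases the lock so other threads can reuse the slice.
  void close(PortLocks &locks, Port port) {
    locks.lock[port.index].store(0, std::memory_order_release);
  }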

If this were a standard CPU system, this would be sufficient. However, GPUs
present unique architectural challenges. First, GPU threads execute in
lock-step with each other in groups typically called warps or wavefronts. We
need to target the smallest unit of independent parallelism, so the RPC
interface needs to handle an entire group of threads at once. This is done by
increasing the size of the buffer and adding a thread mask argument so the
server knows which threads are active when it handles the communication.
Second, GPUs generally have no forward progress guarantees. In order to
guarantee we do not encounter deadlocks while executing, it is required that
the number of ports matches the maximum amount of hardware parallelism on the
device. It is also very important that the thread mask remains consistent
while interfacing with the port.
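
For example, on a CUDA device the maximum amount of hardware parallelism can be
estimated from the device properties, as sketched below. This is only an
illustration of the figure the port count must cover; it is not the library's
own calculation.

.. code-block:: cuda

  #include <cuda_runtime.h>

  // Upper bound on the number of warps that can be resident on the device at
  // once, i.e. the maximum hardware parallelism the ports need to cover.
  int max_resident_warps(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor /
           prop.warpSize;
  }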

.. image:: ./rpc-diagram.svg
   :width: 75%
   :align: center

The above diagram outlines the architecture of the RPC interface. For clarity,
the following lists explain the operations performed by the client and the
server, respectively, when initiating a communication.

First, a communication from the perspective of the client:

* The client searches for an available port and claims the lock.
* The client checks that the port is still available to the current device and
  continues if so.
* The client writes its data to the fixed-size packet and toggles its outbox.
* The client waits until its inbox matches its outbox.
* The client reads the data from the fixed-size packet.
* The client closes the port and continues executing.

Now, the same communication from the perspective of the server:

* The server searches for an available port with pending work and claims the
  lock.
* The server checks that the port is still available to the current device.
* The server reads the opcode to perform the expected operation, in this
  case a receive and then send.
* The server reads the data from the fixed-size packet.
* The server writes its data to the fixed-size packet and toggles its outbox.
* The server closes the port and continues searching for ports that need to be
  serviced.

This architecture currently requires that the host periodically checks the RPC
server's buffer for ports with pending work. Note that a port can be closed
without waiting for its submitted work to be completed. This allows us to model
asynchronous operations that do not need to wait until the server has completed
them. If an operation requires more data than the fixed-size buffer can hold,
we simply send multiple packets back and forth in a streaming fashion.
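
The following sketch illustrates that streaming idea with hypothetical names;
the actual client and server interfaces for this are the ``send_n`` and
``recv_n`` routines shown in the examples below.

.. code-block:: c++

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <functional>

  constexpr size_t PACKET_SIZE = 64; // illustrative size only

  struct Packet {
    uint64_t size;                // number of valid bytes in this packet
    uint8_t payload[PACKET_SIZE]; // fixed-size data area shared with the server
  };

  // Streams an arbitrarily sized buffer as a sequence of fixed-size packets.
  // 'deliver' stands in for handing a packet over and waiting for ownership.
  void send_bytes(const void *data, size_t size,
                  const std::function<void(const Packet &)> &deliver) {
    const uint8_t *bytes = static_cast<const uint8_t *>(data);
    for (size_t offset = 0; offset < size; offset += PACKET_SIZE) {
      Packet packet;
      packet.size = std::min(PACKET_SIZE, size - offset);
      std::memcpy(packet.payload, bytes + offset, packet.size);
      deliver(packet);
    }
  }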

Server Library
--------------

The RPC server's basic functionality is provided by the LLVM C library. A static
library called ``libllvmlibc_rpc_server.a`` includes handling for the basic
operations, such as printing or exiting. This has a small API that handles
setting up the unified buffer and an interface to check the opcodes.

Some operations are too divergent to provide generic implementations for, such
as allocating device accessible memory. For these cases, we provide a callback
registration scheme to add a custom handler for any given opcode through the
port API. More information can be found in the installed header
``<install>/include/gpu-none-llvm/rpc_server.h``.

Client Example
--------------

The client API is not currently exported by the LLVM C library. This is
primarily because it is written in C++ and relies on internal data structures.
It uses a simple send and receive interface with a fixed-size packet. The
following example uses the RPC interface to call a function pointer on the
server.

This code first opens a port with the given opcode to facilitate the
communication. It then copies over the argument struct to the server using the
``send_n`` interface to stream arbitrary bytes. The next send operation provides
the server with the function pointer that will be executed. The final receive
operation is a no-op and simply forces the client to wait until the server is
done. It can be omitted if asynchronous execution is desired.

.. code-block:: c++

  void rpc_host_call(void *fn, void *data, size_t size) {
    rpc::Client::Port port = rpc::client.open<RPC_HOST_CALL>();
    port.send_n(data, size);
    port.send([=](rpc::Buffer *buffer) {
      buffer->data[0] = reinterpret_cast<uintptr_t>(fn);
    });
    port.recv([](rpc::Buffer *) {});
    port.close();
  }

Server Example
--------------

This example shows the server-side handling of the previous client example.
When the server is checked, if there are any ports with pending work it will
check the opcode and perform the appropriate action. In this case, the action
is to call a function pointer provided by the client.

In this example, the server simply runs forever in a separate thread for
brevity's sake. Because the client is a GPU potentially handling several
threads at once, the server needs to loop over all the active threads on the
GPU. We abstract this into the ``lane_size`` variable, which is simply the
device's warp or wavefront size. The identifier is simply the thread's index
into the current warp or wavefront. We allocate memory to copy the struct data
into, and then call the given function pointer with that copied data. The final
send simply signals completion and uses the implicit thread mask to delete the
temporary data.

.. code-block:: c++

  for(;;) {
    auto port = server.try_open(index);
    if (!port)
      continue;

    switch(port->get_opcode()) {
    case RPC_HOST_CALL: {
      uint64_t sizes[LANE_SIZE];
      void *args[LANE_SIZE];
      port->recv_n(args, sizes, [&](uint64_t size) { return new char[size]; });
      port->recv([&](rpc::Buffer *buffer, uint32_t id) {
        reinterpret_cast<void (*)(void *)>(buffer->data[0])(args[id]);
      });
      port->send([&](rpc::Buffer *, uint32_t id) {
        delete[] reinterpret_cast<uint8_t *>(args[id]);
      });
      break;
    }
    default:
      port->recv([](rpc::Buffer *) {});
      break;
    }
    port->close();
  }

CUDA Server Example
-------------------

The following code shows an example of using the exported RPC interface along
with the C library to manually configure a working server using the CUDA
language. Other runtimes can use the presence of the ``__llvm_libc_rpc_client``
symbol in the GPU executable as an indicator for whether or not the server can
be checked. These details should ideally be handled by the GPU language
runtime, but the following example shows how it can be used by a standard user.

.. code-block:: cuda

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  #include <gpu-none-llvm/rpc_server.h>

  [[noreturn]] void handle_error(cudaError_t err) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  [[noreturn]] void handle_error(rpc_status_t err) {
    fprintf(stderr, "RPC error: %d\n", err);
    exit(EXIT_FAILURE);
  }

  // The handle to the RPC client provided by the C library.
  extern "C" __device__ void *__llvm_libc_rpc_client;

  __global__ void get_client_ptr(void **ptr) { *ptr = __llvm_libc_rpc_client; }

  // Obtain the RPC client's handle from the device. The CUDA language cannot
  // look up the symbol directly like the driver API, so we launch a kernel to
  // read it.
  void *get_rpc_client() {
    void *rpc_client = nullptr;
    void **rpc_client_d = nullptr;

    if (cudaError_t err = cudaMalloc(&rpc_client_d, sizeof(void *)))
      handle_error(err);
    get_client_ptr<<<1, 1>>>(rpc_client_d);
    if (cudaError_t err = cudaDeviceSynchronize())
      handle_error(err);
    if (cudaError_t err = cudaMemcpy(&rpc_client, rpc_client_d, sizeof(void *),
                                     cudaMemcpyDeviceToHost))
      handle_error(err);
    return rpc_client;
  }

  // Routines to allocate mapped memory that both the host and the device can
  // access asynchronously to communicate with each other.
  void *alloc_host(size_t size, void *) {
    void *sharable_ptr;
    if (cudaError_t err = cudaMallocHost(&sharable_ptr, size))
      handle_error(err);
    return sharable_ptr;
  }

  void free_host(void *ptr, void *) {
    if (cudaError_t err = cudaFreeHost(ptr))
      handle_error(err);
  }

  // The device-side overload of the standard C function to call.
  extern "C" __device__ int puts(const char *);

  // Calls the C library function from the GPU C library.
  __global__ void hello() { puts("Hello world!"); }

  int main() {
    int device = 0;
    // Initialize the RPC interface to handle a single device.
    if (rpc_status_t err = rpc_init(/*num_device=*/1))
      handle_error(err);

    // Initialize the RPC server to run on the given device.
    if (rpc_status_t err =
            rpc_server_init(device, RPC_MAXIMUM_PORT_COUNT,
                            /*warp_size=*/32, alloc_host, /*data=*/nullptr))
      handle_error(err);

    // Initialize the RPC client by copying the buffer to the device's handle.
    void *rpc_client = get_rpc_client();
    if (cudaError_t err =
            cudaMemcpy(rpc_client, rpc_get_client_buffer(device),
                       rpc_get_client_size(), cudaMemcpyHostToDevice))
      handle_error(err);

    cudaStream_t stream;
    if (cudaError_t err = cudaStreamCreate(&stream))
      handle_error(err);

    // Execute the kernel.
    hello<<<1, 1, 0, stream>>>();

    // While the kernel is executing, check the RPC server for work to do.
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
      if (rpc_status_t err = rpc_handle_server(device))
        handle_error(err);

    // Shut down the server running on the given device.
    if (rpc_status_t err =
            rpc_server_shutdown(device, free_host, /*data=*/nullptr))
      handle_error(err);

    // Shut down the entire RPC server interface.
    if (rpc_status_t err = rpc_shutdown())
      handle_error(err);

    return EXIT_SUCCESS;
  }

The above code must be compiled in CUDA's relocatable device code mode and with
the advanced offloading driver to link in the library. Currently, this can be
done with the following invocation. Using LTO avoids the overhead normally
associated with relocatable device code linking.

.. code-block:: sh

  $> clang++ -x cuda rpc.cpp --offload-arch=native -fgpu-rdc -lcudart -lcgpu \
       -I<install-path>/include -L<install-path>/lib -lllvmlibc_rpc_server \
       -O3 -foffload-lto -o hello
  $> ./hello
  Hello world!

Extensions
----------