CUDA backend for asynchronous distributed multi-trainer segfaults (race condition)

When running the distributed test with a CUDA backend, there's a segfault.  It can be fixed easily with `CUDA_LAUNCH_BLOCKING`, but that's not ideal.  Below are the commands to repro:

Server:
```bash
$ bash examples/distributed/serve.sh
```
Client:
```bash
$ bash examples/distributed/client.sh
```