Skip to content

CUDA backend for asynchronous distributed multi-trainer segfaults (race condition) #56

@bwasti

Description

@bwasti

When running the distributed test with a CUDA backend, there's a segfault. It can be fixed easily with CUDA_LAUNCH_BLOCKING, but that's not ideal. Below are the commands to repro:

Server:

$ bash examples/distributed/serve.sh

Client:

$ bash examples/distributed/client.sh

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions