Problem: While a Docker container is running on the host, simple Open MPI jobs that use the TCP interface do not complete. For example, the osu_bcast broadcast test hangs after the 32768-byte message size.
Steps to reproduce:
$ spack install osu-micro-benchmarks ^openmpi~rsh fabric=ucx
$ spack load osu-micro-benchmarks
$ mpirun -n 2 osu_bcast
# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.32
2 4.32
4 4.36
8 4.30
16 4.30
32 4.32
64 4.33
128 4.10
256 4.30
512 5.72
1024 5.81
2048 6.07
4096 5.74
8192 6.67
16384 7.74
32768 13.65
<hangs>
Expected result:
$ mpirun -n 2 --mca oob_base_verbose 100 osu_bcast
# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 3.26
2 4.05
4 4.40
8 7.55
16 5.53
32 5.53
64 4.06
128 4.49
256 6.37
512 7.11
1024 5.92
2048 7.26
4096 6.74
8192 8.74
16384 10.93
32768 14.40
65536 33.09
131072 48.18
262144 70.30
524288 118.22
1048576 200.32
Verbose output (with --mca oob_base_verbose 100):
[histamine0:1785348] mca: base: components_register: registering framework oob components
[histamine0:1785348] mca: base: components_register: found loaded component tcp
[histamine0:1785348] mca: base: components_register: component tcp register function successful
[histamine0:1785348] mca: base: components_open: opening oob components
[histamine0:1785348] mca: base: components_open: found loaded component tcp
[histamine0:1785348] mca: base: components_open: component tcp open function successful
[histamine0:1785348] mca:oob:select: checking available component tcp
[histamine0:1785348] mca:oob:select: Querying component [tcp]
[histamine0:1785348] oob:tcp: component_available called
[histamine0:1785348] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init rejecting loopback interface lo
[histamine0:1785348] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 10.0.0.49 to our list of V4 connections
[histamine0:1785348] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 172.17.0.1 to our list of V4 connections
[histamine0:1785348] [[3819,0],0] TCP STARTUP
[histamine0:1785348] [[3819,0],0] attempting to bind to IPv4 port 0
[histamine0:1785348] [[3819,0],0] assigned IPv4 port 36725
[histamine0:1785348] mca:oob:select: Adding component to end
[histamine0:1785348] mca:oob:select: Found 1 active transports
[histamine0:1785348] [[3819,0],0]: get transports
[histamine0:1785348] [[3819,0],0]:get transports for component tcp
# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.45
2 4.61
4 4.66
8 4.63
16 4.02
32 4.06
64 4.07
128 4.10
256 4.13
512 5.82
1024 5.92
2048 6.27
4096 5.98
8192 6.69
16384 7.57
32768 14.08
<hangs>
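The verbose log above shows the oob tcp component adding two IPv4 addresses, 10.0.0.49 and 172.17.0.1, the latter being the default address of Docker's bridge network. As a rough diagnostic aid, the sketch below pulls those addresses out of the verbose output so the two cases (container running or not) can be compared. The helper name `list_oob_addrs` is hypothetical, and the pattern assumes the exact log wording shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the IPv4 addresses that the oob/tcp component
# reports adding, reading mpirun's verbose output from standard input.
# Assumes the log wording seen in this report ("oob:tcp:init adding ...").
list_oob_addrs() {
    sed -n 's/.*oob:tcp:init adding \([0-9.]*\) to our list of V4 connections.*/\1/p'
}

# Example using two lines copied from the verbose output in this report.
printf '%s\n' \
  '[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 10.0.0.49 to our list of V4 connections' \
  '[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 172.17.0.1 to our list of V4 connections' \
  | list_oob_addrs
```

On a live run, the verbose messages typically go to stderr, so something like `mpirun -n 2 --mca oob_base_verbose 100 osu_bcast 2>&1 | list_oob_addrs` would be needed. If the Docker bridge address appears only when a container is running, Open MPI's `oob_tcp_if_exclude` and `btl_tcp_if_exclude` MCA parameters are one way to keep that interface out of the list. Whether doing so resolves this hang is not confirmed by the report.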