Deadlock Running Simple All Gather XLA Graph Across Two Local GPU Devices #16315
Unanswered
xanderdunn asked this question in Q&A
Replies: 0 comments
I have a simple XLA HLO graph with a single all gather across two devices:
The graph for rank 0: rust_hlo_rank_0.test_all_gather_dim0.pb.zip
The graph for rank 1: rust_hlo_rank_1.test_all_gather_dim0.pb.zip
These graphs are identical.
I am attempting to run this graph distributed across two processes on the same GPU host with this script: run_xla_cpu_gpu.py
I followed the documentation here and here in writing this script.
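For reference, here is a minimal sketch of the multi-process setup I believe a script like this needs; the coordinator port and the `--rank` flag are assumptions on my part, not the actual contents of run_xla_cpu_gpu.py:

```python
# Minimal sketch of the multi-process setup (assumed, not the actual contents
# of run_xla_cpu_gpu.py). Each process is restricted to one GPU, e.g. by
# launching with CUDA_VISIBLE_DEVICES=0 / CUDA_VISIBLE_DEVICES=1, so the two
# ranks don't both claim all eight A100s.
import argparse
import jax

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)  # hypothetical flag
rank = parser.parse_args().rank

# Rank 0 hosts the coordination service; both ranks connect to it.
jax.distributed.initialize(
    coordinator_address="localhost:12345",  # assumed free port on this host
    num_processes=2,
    process_id=rank,
)

# Sanity print: each rank should report process_index == rank,
# one local device, and a global device count of 2.
print(jax.process_index(), jax.local_devices(), jax.device_count())
```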
So now I run two processes, one for each of the two ranks:
However, both processes hang indefinitely:
I'm guessing that I've misconfigured the distributed setup somehow, or that I'm misusing the jax API here in some way. I'm running on a machine with 8 A100s, and NCCL communicates between devices fine in other applications. My guess is that the graph execution hangs because each process can't find or can't communicate with the other device. Any ideas what might be wrong here?
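One way to narrow this down (a sketch of my own, assuming the jax.distributed.initialize setup above and one visible GPU per process) would be a pure-JAX all gather across the two processes; if this also hangs, the problem is in the distributed/NCCL setup rather than in the hand-built HLO graph:

```python
# Sketch: run a JAX-level all gather across the two processes to check the
# NCCL path independently of the raw XLA executable. Assumes
# jax.distributed.initialize has already run and each process sees one GPU.
import jax
import jax.numpy as jnp

# Leading axis = number of local devices (1 per process here); the pmap
# axis "i" spans all devices across both processes, so its size is 2.
x = jnp.ones((jax.local_device_count(), 4))
gathered = jax.pmap(
    lambda v: jax.lax.all_gather(v, axis_name="i"),
    axis_name="i",
)(x)

# If comms work, each rank should see shape (1, 2, 4) filled with 1.0.
print(jax.process_index(), gathered.shape)
```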
I notice the existence of both `execute_sharded*` functions on `Executable`, as well as a `DistributedRuntimeClient` in `xla_extension/__init__.py`. Is one of these appropriate here?
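For context, this is roughly the low-level path I have in mind (a sketch only; the file name is the unzipped attachment, and I'm assuming `backend.compile` in this JAX version still accepts a serialized `XlaComputation` rather than only MLIR):

```python
# Rough sketch of the low-level compile path (assumptions noted above).
from jax.lib import xla_bridge, xla_client

backend = xla_bridge.get_backend("gpu")

# The attachment is assumed to be a serialized HloModuleProto.
with open("rust_hlo_rank_0.test_all_gather_dim0.pb", "rb") as f:
    computation = xla_client.XlaComputation(f.read())
print(computation.as_hlo_text())  # confirm the all-gather and replica groups

options = xla_client.CompileOptions()
build_options = options.executable_build_options
build_options.num_replicas = 2    # one replica per rank
build_options.num_partitions = 1
executable = backend.compile(computation, options)
# Each process would then feed its local arguments to one of the
# executable.execute_sharded* methods mentioned above -- I'm unsure which.
```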
Note that the graph does run and return results on the CPU backend, but the result is incorrect, so maybe collective comms aren't actually supported there? On a correct all gather I would expect the output to be all 1.0, since both ranks have all-1.0 inputs.
Thanks!
I'm running jax 0.4.11 on CUDA 12 and cudnn 8.9.