[Bug]: Successfully launch SPU backend runtime but encountering Exception: ('remote exception', Exception('Traceback (most recent call last) #18

@warpoons

Issue Type

Build/Install

Modules Involved

SPU runtime

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

https://github.com/AntCPLab/OpenBumbleBee

OS Platform and Distribution

Ubuntu 24.04.1 LTS

Python Version

3.10

Compiler Version

GCC 11.4.0

Current Behavior?

Hi, dear developers and authors of SPU and BumbleBee,

I am trying to benchmark the MPC performance of a customized ViT model (not from Hugging Face) with BumbleBee, which I built locally from source. I can launch the SPU backend runtime using LOOPBACK, and node 0 prints the log shown below. However, when I run flax_vit_inference on node 1, it fails with "Exception: ('remote exception', Exception('Traceback (most recent call last)" (see the node 1 log for details). The remote traceback ends with "[libspu/mpc/factory.cc:55] Invalid protocol kind SEMI2K".

Best

Standalone code to reproduce the issue

node 0 run: 
bazel run -c opt //examples/python/utils:nodectl -- --config `pwd`/examples/python/ml/flax_myvit/2pc.json up

node 1 run: 
bazel run -c opt //examples/python/ml/flax_myvit:flax_vit_inference -- --config `pwd`/examples/python/ml/flax_myvit/2pc.json
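
Both commands load the same 2pc.json, whose contents are not included in this report. As a minimal sketch for checking which protocol that file requests, the helper below walks any JSON document and lists every key named "protocol". The key name, the nested layout of the sample string, and the helper names are assumptions made for illustration, not the actual contents of flax_myvit/2pc.json.

```python
import json


def find_keys(node, key, path=""):
    """Yield (path, value) for every occurrence of `key` in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}/{k}"
            if k == key:
                yield child, v
            # Keep descending so nested occurrences are also reported.
            yield from find_keys(v, key, child)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from find_keys(v, key, f"{path}/{i}")


def protocols_in_config(text):
    """Return all (path, value) pairs whose key is 'protocol' in a JSON string."""
    return list(find_keys(json.loads(text), "protocol"))


# Hypothetical config fragment; the real 2pc.json may be laid out differently.
sample = (
    '{"devices": {"SPU": {"config": {"runtime_config": '
    '{"protocol": "SEMI2K", "field": "FM64"}}}}}'
)

for path, value in protocols_in_config(sample):
    print(path, "=", value)
```

To inspect the real file, pass the text of 2pc.json to protocols_in_config instead of the sample string.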

Relevant log output

================= node 0 log: =================
INFO: Analyzed target //examples/python/utils:nodectl (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //examples/python/utils:nodectl up-to-date:
  bazel-bin/examples/python/utils/nodectl
INFO: Elapsed time: 0.272s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/examples/python/utils/nodectl --config /home/server2/Desktop/OpenBumbleBee-ndss/examples/python/ml/flax_myvit/2pc.json up
[2025-07-18 16:15:56,763] [ForkServerProcess-1] Starting grpc server at 127.0.0.1:61525
[2025-07-18 16:15:56,849] [ForkServerProcess-2] Starting grpc server at 127.0.0.1:61526


================= node 1 log: =================
INFO: Analyzed target //examples/python/ml/flax_myvit:flax_vit_inference (1 packages loaded, 3 targets configured).
INFO: Found 1 target...
Target //examples/python/ml/flax_myvit:flax_vit_inference up-to-date:
  bazel-bin/examples/python/ml/flax_myvit/flax_vit_inference
INFO: Elapsed time: 0.555s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/examples/python/ml/flax_myvit/flax_vit_inference --config /home/server2/Desktop/OpenBumbleBee-ndss/examples/python/ml/flax_myvit/2pc.json
Traceback (most recent call last):
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/examples/python/ml/flax_myvit/flax_vit_inference.py", line 38, in <module>
    ppd.init(conf["nodes"], conf["devices"])
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 1175, in init
    _CONTEXT = HostContext(nodes_def, devices_def)
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 1095, in __init__
    self.devices[name] = SPU(
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 1010, in __init__
    results = [future.result() for future in futures]
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 1010, in <listcomp>
    results = [future.result() for future in futures]
  File "/home/server2/anaconda3/envs/spu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/server2/anaconda3/envs/spu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/server2/anaconda3/envs/spu/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 247, in run
    return self._call(self._stub.Run, fn, *args, **kwargs)
  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_myvit/flax_vit_inference.runfiles/spulib/spu/utils/distributed_impl.py", line 240, in _call
    raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed_impl.py", line 326, in Run\n    ret_objs = fn(self, *args, **kwargs)\n  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed_impl.py", line 559, in builtin_spu_init\n    server._locals[f"{name}-rt"] = spu_api.Runtime(link, spu_config)\n  File "/home/server2/.cache/bazel/_bazel_server2/8fc9c5947c29740d3fcd2a7bd75108aa/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/api.py", line 35, in __init__\n    self._vm = libspu.RuntimeWrapper(link, config.SerializeToString())\nRuntimeError: what: \n\t[libspu/mpc/factory.cc:55] Invalid protocol kind SEMI2K\nstacktrace: \n#0 spu::RuntimeWrapper::RuntimeWrapper()+0x70da33243517\n#1 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x70da332439b6\n#2 pybind11::cpp_function::dispatcher()+0x70da33209c76\n#3 cfunction_call+0x4fd527\n\n\n'))
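
The final line above is hard to read because the remote traceback is printed as an escaped string inside the outer exception. The node 1 traceback shows the wrapper being raised as Exception("remote exception", result), so a minimal sketch for printing the inner text with real line breaks could look like this. The helper name and the shortened demo message are hypothetical, and the sketch assumes the inner object is an exception whose first argument is the traceback text.

```python
def remote_traceback(exc):
    """Return the remote traceback text carried by a ('remote exception', inner) error."""
    if len(exc.args) == 2 and exc.args[0] == "remote exception":
        inner = exc.args[1]
        # Assumption: the inner exception stores the traceback text as its first argument.
        if isinstance(inner, BaseException) and inner.args:
            return str(inner.args[0])
        return str(inner)
    # Not a wrapped remote error; fall back to the plain message.
    return str(exc)


# Shortened stand-in for the error in the log above.
demo = Exception(
    "remote exception",
    Exception(
        "Traceback (most recent call last):\n"
        "RuntimeError: what: \n"
        "\t[libspu/mpc/factory.cc:55] Invalid protocol kind SEMI2K\n"
    ),
)

print(remote_traceback(demo))
```

Wrapping the ppd.init call in try/except and printing remote_traceback(e) would show the remote error in the same multi-line form as a local traceback.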
