Sharding weak scaling test errors if running on single device #254

@luraess

Running the sharding weak scaling test on ALPS targeting a single GPU fails with the error `type ReactantState has no field connectivity`.

ERROR: LoadError: type ReactantState has no field connectivity
Stacktrace:
 [1] getproperty(x::ReactantState, f::Symbol)
   @ Base ./Base.jl:49
 [2] top-level scope
   @ /capstor/scratch/cscs/lraess/GB-25/sharding/runs/2026-02-23T11-14-58.674_uVu1/sharded_baroclinic_instability_simulation_run.jl:86
in expression starting at /capstor/scratch/cscs/lraess/GB-25/sharding/runs/2026-02-23T11-14-58.674_uVu1/sharded_baroclinic_instability_simulation_run.jl:86
┌ Debug: [GETPID 286099] Cleanup Backend State, Reactant.XLA.IFRTBackendState(true, Dict{String, Reactant.XLA.IFRT.Client}("cpu" => Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001b97e790), "cuda" => Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001c239990)), Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001c239990)), Reactant.XLA.State(0, 1, nothing, Reactant.XLA.DistributedRuntimeService(Ptr{Nothing} @0x000000001c3b3cd0), Reactant.XLA.DistributedRuntimeClient(Ptr{Nothing} @0x000000001b276f80), "nid005812:63939", "[::]:63939")
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/XLA.jl:111
┌ Debug: [GETPID 286099] Finalizing backend state, Reactant.XLA.IFRTBackendState(true, Dict{String, Reactant.XLA.IFRT.Client}("cpu" => Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001b97e790), "cuda" => Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001c239990)), Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001c239990))
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/XLA.jl:77
┌ Debug: [GETPID 286099] Freeing Client Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001b97e790)
└ @ Reactant.XLA.IFRT /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/IFRT/Client.jl:14
┌ Debug: [GETPID 286099] Freeing Client Reactant.XLA.IFRT.Client(Ptr{Nothing} @0x000000001c239990)
└ @ Reactant.XLA.IFRT /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/IFRT/Client.jl:14
┌ Debug: [GETPID 286099] Shutdown DistributedRuntimeClient
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/Distributed.jl:52
I0000 00:00:1771845448.636690  286099 client.cc:151] Distributed task shutdown initiated.
I0000 00:00:1771845448.636798  286099 coordination_service_agent.cc:393] Coordination agent has initiated Shutdown().
I0000 00:00:1771845448.637367  286775 coordination_service.cc:1373] Barrier(Shutdown::7348303068028219848::0) has passed with status: OK
I0000 00:00:1771845448.637472  286775 coordination_service.cc:1725] Shutdown barrier in coordination service has passed.
I0000 00:00:1771845448.637558  286099 coordination_service_agent.cc:411] Coordination agent has successfully shut down.
I0000 00:00:1771845448.637767  286099 client.cc:153] Distributed task shutdown result: OK
I0000 00:00:1771845448.637787  286797 coordination_service_agent.cc:288] Cancelling error polling because the service or the agent is shutting down.
┌ Debug: [GETPID 286099] Shutting down DistributedRuntimeService
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/Distributed.jl:100
I0000 00:00:1771845448.637840  286099 service.cc:115] Jax service shutting down
I0000 00:00:1771845448.649744  286775 coordination_service.cc:746] /job:jax_worker/replica:0/task:0 has disconnected from coordination service.
┌ Debug: [GETPID 286099] Freeing distributed runtime client
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/Distributed.jl:34
┌ Debug: [GETPID 286099] Freeing DistributedRuntimeService
└ @ Reactant.XLA /capstor/scratch/cscs/lraess/.julia/gh200/juliaup/depot/packages/Reactant/j2PDd/src/xla/Distributed.jl:91
srun: error: nid005812: task 0: Exited with exit code 1
srun: Terminating StepId=2742723.0
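
The stack trace points at line 86 of the run script, where `getproperty` is called on a `ReactantState` value with the field name `connectivity`, which that type does not define. Below is a minimal sketch of one possible guard. It assumes (the log does not confirm this) that the script reads `connectivity` from the architecture object and that only the multi-device architecture carries that field. `SingleDeviceArch`, `MultiDeviceArch`, and `connectivity_or_nothing` are hypothetical stand-ins, not names from Reactant or the test script.

```julia
# Hypothetical stand-ins for illustration only; these are not Reactant types.
struct SingleDeviceArch end        # plays the role of ReactantState: no `connectivity` field

struct MultiDeviceArch             # plays the role of a multi-device architecture
    connectivity::NamedTuple
end

# Return the connectivity when the architecture has one, otherwise `nothing`.
# `hasproperty` is part of Base (Julia 1.2 and later), so no packages are needed.
connectivity_or_nothing(arch) =
    hasproperty(arch, :connectivity) ? arch.connectivity : nothing

single = SingleDeviceArch()
multi  = MultiDeviceArch((west = 1, east = 2))

# Direct access on the single-device stand-in would throw a "has no field" error,
# analogous to the one in the report:
# single.connectivity

@show connectivity_or_nothing(single)
@show connectivity_or_nothing(multi)
```

Whether returning `nothing` is the right single-device behavior depends on what line 86 does with the value; the sketch only shows how to avoid the unconditional field access.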
