Description
Steps to Reproduce
Hi,
When I launch `parallax join` in the terminal on the orchestrator, I receive this message:
```
INFO:     Started server process [411]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:3001 (Press CTRL+C to quit)
INFO:     127.0.0.1:38594 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:38594 - "GET /model/list HTTP/1.1" 200 OK
INFO:     127.0.0.1:38594 - "GET /cluster/status HTTP/1.1" 200 OK
INFO:     127.0.0.1:35884 - "GET /assets/gradient-icon-CRwZKfVU.svg HTTP/1.1" 304 Not Modified
Jan 11 23:37:13.018 [scheduling] [INFO] scheduler.py:116 Scheduler initialized, min_nodes_bootstrapping 1, Layer allocation strategy dp, Request routing strategy rr.
Jan 11 23:37:13.018 [backend] [INFO] scheduler_manage.py:199 Nodes will automatically rejoin via heartbeat (node_update) mechanism
Jan 11 23:37:13.067 [backend] [INFO] scheduler_manage.py:264 Stored scheduler peer id: 12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5
INFO:     127.0.0.1:35888 - "POST /scheduler/init HTTP/1.1" 200 OK
Jan 11 23:37:28.292 [backend] [INFO] rpc_connection_handler.py:48 receive node_join request: {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'hardware': {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'num_gpus': 1, 'tflops_fp16': 104.8, 'gpu_name': 'NVIDIA GeForce RTX 5090 Laptop GPU', 'memory_gb': 23.9, 'memory_bandwidth_gbps': 1792.0, 'device': 'cuda'}, 'kvcache_mem_ratio': 0.25, 'param_mem_ratio': 0.65, 'max_concurrent_requests': 8, 'max_sequence_length': 2048, 'rtt_to_nodes': {'12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5': 0.383727}, 'status': 'joining', 'is_active': False, 'last_refit_time': 0.0}
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=False)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after join 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)
Capacity: (no registered pipelines)
(none)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:195 [Scheduler] Starting Bootstrap
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
Jan 11 23:37:28.293 [scheduling] [INFO] layer_allocation.py:961 [DPLayerAllocator] allocate_from_standby completed successfully
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:228 [Scheduler] Post Bootstrap Layer Assignments: [('12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 0, 28)]
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:238 [FixedRouter] register_pipelines with bootstrap success, number of pipelines: 1
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (Post Bootstrap)
Registered pipelines (1)
Capacity: total=8 cur=8 per_pipeline={0: (8, 8)}
pipeline 0 | stages=1
[00] 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 layers [ 0, 28) | load 0/8 | latency 0.03 ms | active False
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:395 Leaving node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (start=0, end=28)
Jan 11 23:41:02.468 [scheduling] [WARNING] node_management.py:93 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 left; removing pipeline_id=0 from registered pipelines and detaching 1 member(s): ['12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1']
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after leave 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)
Capacity: (no registered pipelines)
(none)
Jan 11 23:41:02.469 [scheduling] [WARNING] scheduler.py:754 Global rebalance triggered due to node leave
Jan 11 23:37:38.816 [backend] [INFO] rpc_connection_handler.py:84 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 not found in scheduler, auto-joining via node_update
Jan 11 23:37:38.816 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=True)
Exception in thread SchedulerEventLoop:
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/root/parallax/src/scheduling/scheduler.py", line 628, in _event_loop
    self._process_joins()
  File "/root/parallax/src/scheduling/scheduler.py", line 699, in _process_joins
    self.join(node)
  File "/root/parallax/src/scheduling/scheduler.py", line 348, in join
    self._maybe_expand_rr_pipelines()
  File "/root/parallax/src/scheduling/scheduler.py", line 160, in _maybe_expand_rr_pipelines
    ok = self.layer_allocator.allocate_from_standby()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/layer_allocation.py", line 949, in allocate_from_standby
    self.adjust_pipeline_layers(pl_nodes, assume_sorted=False)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 314, in adjust_pipeline_layers
    self.deallocate(node)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 184, in deallocate
    self.node_management.standby([node.node_id])
  File "/root/parallax/src/scheduling/node_management.py", line 139, in standby
    nodes_to_clear = self._standby_locked(node_ids)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/node_management.py", line 76, in _standby_locked
    raise ValueError(f"Node {nid} is not ACTIVE, current state: {prev_state}")
ValueError: Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 is not ACTIVE, current state: NodeState.STANDBY
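From the traceback, this looks like a node-state mismatch: the node leaves, then auto-rejoins via `node_update`, and `allocate_from_standby` tries to demote it to STANDBY while it is already STANDBY. A minimal sketch of that guard pattern, purely illustrative (the class and method names here are mine, not the actual Parallax implementation):

```python
from enum import Enum, auto


class NodeState(Enum):
    ACTIVE = auto()
    STANDBY = auto()


class ToyNodeManagement:
    """Toy state tracker mirroring the guard seen in the traceback."""

    def __init__(self):
        self.states = {}

    def activate(self, node_id):
        self.states[node_id] = NodeState.ACTIVE

    def standby(self, node_id):
        prev_state = self.states.get(node_id)
        if prev_state is not NodeState.ACTIVE:
            # Same check as node_management.py:76 in the log:
            # demoting a node that is not currently ACTIVE raises.
            raise ValueError(
                f"Node {node_id} is not ACTIVE, current state: {prev_state}"
            )
        self.states[node_id] = NodeState.STANDBY


mgmt = ToyNodeManagement()
mgmt.activate("node-A")
mgmt.standby("node-A")      # fine: ACTIVE -> STANDBY
try:
    mgmt.standby("node-A")  # second demotion reproduces the ValueError
except ValueError as e:
    print(e)  # Node node-A is not ACTIVE, current state: NodeState.STANDBY
```

If that reading is right, the rejoin path seems to hit `deallocate` on a node the earlier leave already moved to STANDBY, so the second demotion trips the guard.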
Can you help me? Thanks.
Expected Behavior
The node joins without error.
Actual Behavior
The join fails with the `ValueError` shown in the description.
Version
Latest as of 11 Jan.
Environment & Context
- I'm using the latest version.
- I have searched existing issues.