[Bug]: Error loading with RTX 5090 (model qwen 0.6B) #382

@Alieno79

Description

Steps to Reproduce

Hi,
When I launch `parallax join` in the orchestrator terminal, I receive this message:

```
INFO: Started server process [411]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:3001 (Press CTRL+C to quit)
INFO: 127.0.0.1:38594 - "GET / HTTP/1.1" 200 OK
INFO: 127.0.0.1:38594 - "GET /model/list HTTP/1.1" 200 OK
INFO: 127.0.0.1:38594 - "GET /cluster/status HTTP/1.1" 200 OK
INFO: 127.0.0.1:35884 - "GET /assets/gradient-icon-CRwZKfVU.svg HTTP/1.1" 304 Not Modified
Jan 11 23:37:13.018 [scheduling] [INFO] scheduler.py:116 Scheduler initialized, min_nodes_bootstrapping 1, Layer allocations trategy dp, Request routing strategy rr.
Jan 11 23:37:13.018 [backend] [INFO] scheduler_manage.py:199 Nodes will automatically rejoin via heartbeat (node_update) mechanism
Jan 11 23:37:13.067 [backend] [INFO] scheduler_manage.py:264 Stored scheduler peer id: 12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5
INFO: 127.0.0.1:35888 - "POST /scheduler/init HTTP/1.1" 200 OK
Jan 11 23:37:28.292 [backend] [INFO] rpc_connection_handler.py:48 receive node_join request: {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'hardware': {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'num_gpus': 1, 'tflops_fp16': 104.8, 'gpu_name': 'NVIDIA GeForce RTX 5090 Laptop GPU', 'memory_gb': 23.9, 'memory_bandwidth_gbps': 1792.0, 'device': 'cuda'}, 'kvcache_mem_ratio': 0.25, 'param_mem_ratio': 0.65, 'max_concurrent_requests': 8, 'max_sequence_length': 2048, 'rtt_to_nodes': {'12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5': 0.383727}, 'status': 'joining', 'is_active': False, 'last_refit_time': 0.0}
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=False)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after join 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)

Capacity: (no registered pipelines)
(none)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:195 [Scheduler] Starting Bootstrap
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
Jan 11 23:37:28.293 [scheduling] [INFO] layer_allocation.py:961 [DPLayerAllocator] allocate_from_standby completed successfully
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:228 [Scheduler] Post Bootstrap Layer Assignments: [('12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 0, 28)]
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:238 [FixedRouter] register_pipelines with bootstrap success, number of pipelines: 1
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (Post Bootstrap)
Registered pipelines (1)

Capacity: total=8 cur=8 per_pipeline={0: (8, 8)}
pipeline 0 | stages=1
[00] 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 layers [ 0, 28) | load 0/8 | latency 0.03 ms | active False
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:395 Leaving node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (start=0, end=28)
Jan 11 23:41:02.468 [scheduling] [WARNING] node_management.py:93 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 left; removing pipeline_id=0 from registered pipelines and detaching 1 member(s): ['12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1']
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after leave 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)

Capacity: (no registered pipelines)
(none)
Jan 11 23:41:02.469 [scheduling] [WARNING] scheduler.py:754 Global rebalance triggered due to node leave
Jan 11 23:37:38.816 [backend] [INFO] rpc_connection_handler.py:84 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 not found in scheduler, auto-joining via node_update
Jan 11 23:37:38.816 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=True)
Exception in thread SchedulerEventLoop:
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/root/parallax/src/scheduling/scheduler.py", line 628, in _event_loop
    self._process_joins()
  File "/root/parallax/src/scheduling/scheduler.py", line 699, in _process_joins
    self.join(node)
  File "/root/parallax/src/scheduling/scheduler.py", line 348, in join
    self._maybe_expand_rr_pipelines()
  File "/root/parallax/src/scheduling/scheduler.py", line 160, in _maybe_expand_rr_pipelines
    ok = self.layer_allocator.allocate_from_standby()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/layer_allocation.py", line 949, in allocate_from_standby
    self.adjust_pipeline_layers(pl_nodes, assume_sorted=False)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 314, in adjust_pipeline_layers
    self.deallocate(node)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 184, in deallocate
    self.node_management.standby([node.node_id])
  File "/root/parallax/src/scheduling/node_management.py", line 139, in standby
    nodes_to_clear = self._standby_locked(node_ids)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/node_management.py", line 76, in _standby_locked
    raise ValueError(f"Node {nid} is not ACTIVE, current state: {prev_state}")
ValueError: Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 is not ACTIVE, current state: NodeState.STANDBY
```
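From the traceback, it looks like the node leaves, gets auto-rejoined via `node_update`, and then `deallocate()` calls `node_management.standby()` on a node that is already in `NodeState.STANDBY`, which raises. Below is a minimal sketch of what an idempotent standby transition might look like; the class and method names mirror the traceback, but the body is hypothetical and not Parallax's actual implementation:

```python
from enum import Enum, auto
from threading import Lock


class NodeState(Enum):
    JOINING = auto()
    ACTIVE = auto()
    STANDBY = auto()


class NodeManagement:
    """Hypothetical sketch: standby() tolerates nodes already in STANDBY."""

    def __init__(self):
        self._lock = Lock()
        self._states = {}  # node_id -> NodeState

    def _standby_locked(self, node_ids):
        cleared = []
        for nid in node_ids:
            prev_state = self._states.get(nid)
            if prev_state == NodeState.STANDBY:
                # Node already left / was detached (e.g. it is auto-rejoining
                # after a leave); treat the transition as a no-op instead of
                # raising ValueError as in the traceback above.
                continue
            if prev_state != NodeState.ACTIVE:
                raise ValueError(f"Node {nid} is not ACTIVE, current state: {prev_state}")
            self._states[nid] = NodeState.STANDBY
            cleared.append(nid)
        return cleared

    def standby(self, node_ids):
        with self._lock:
            return self._standby_locked(node_ids)
```

With a guard like this, the second `standby()` call triggered by the auto-rejoin would be a harmless no-op rather than killing the `SchedulerEventLoop` thread. I'm not sure whether the real fix belongs here or in `deallocate()`, but this is the shape of the race as I read it.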

Can you help me? Thanks.

Expected Behavior

The node joins without error.

Actual Behavior

The error shown in the description.

Version

Latest as of 11 Jan.

Environment & Context

  • I'm using the latest version.
  • I have searched existing issues.

Labels

bug (Something isn't working)
