Description
Steps to Reproduce
Hi,
When I launch `parallax join` in the terminal on the orchestrator, I receive this message:
```
INFO:     Started server process [411]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:3001 (Press CTRL+C to quit)
INFO:     127.0.0.1:38594 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:38594 - "GET /model/list HTTP/1.1" 200 OK
INFO:     127.0.0.1:38594 - "GET /cluster/status HTTP/1.1" 200 OK
INFO:     127.0.0.1:35884 - "GET /assets/gradient-icon-CRwZKfVU.svg HTTP/1.1" 304 Not Modified
Jan 11 23:37:13.018 [scheduling] [INFO] scheduler.py:116 Scheduler initialized, min_nodes_bootstrapping 1, Layer allocation strategy dp, Request routing strategy rr.
Jan 11 23:37:13.018 [backend] [INFO] scheduler_manage.py:199 Nodes will automatically rejoin via heartbeat (node_update) mechanism
Jan 11 23:37:13.067 [backend] [INFO] scheduler_manage.py:264 Stored scheduler peer id: 12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5
INFO:     127.0.0.1:35888 - "POST /scheduler/init HTTP/1.1" 200 OK
Jan 11 23:37:28.292 [backend] [INFO] rpc_connection_handler.py:48 receive node_join request: {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'hardware': {'node_id': '12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 'num_gpus': 1, 'tflops_fp16': 104.8, 'gpu_name': 'NVIDIA GeForce RTX 5090 Laptop GPU', 'memory_gb': 23.9, 'memory_bandwidth_gbps': 1792.0, 'device': 'cuda'}, 'kvcache_mem_ratio': 0.25, 'param_mem_ratio': 0.65, 'max_concurrent_requests': 8, 'max_sequence_length': 2048, 'rtt_to_nodes': {'12D3KooWKweBVyhJygm754Z1YnpHi5JMo8n8MMf1aKhuSPkKabR5': 0.383727}, 'status': 'joining', 'is_active': False, 'last_refit_time': 0.0}
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=False)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after join 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)
Capacity: (no registered pipelines)
(none)
Jan 11 23:37:28.292 [scheduling] [INFO] scheduler.py:195 [Scheduler] Starting Bootstrap
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:28.292 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
Jan 11 23:37:28.293 [scheduling] [INFO] layer_allocation.py:961 [DPLayerAllocator] allocate_from_standby completed successfully
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:228 [Scheduler] Post Bootstrap Layer Assignments: [('12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1', 0, 28)]
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:238 [FixedRouter] register_pipelines with bootstrap success, number of pipelines: 1
Jan 11 23:37:28.293 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (Post Bootstrap)
Registered pipelines (1)
Capacity: total=8 cur=8 per_pipeline={0: (8, 8)}
pipeline 0 | stages=1
[00] 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 layers [ 0, 28) | load 0/8 | latency 0.03 ms | active False
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:395 Leaving node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (start=0, end=28)
Jan 11 23:41:02.468 [scheduling] [WARNING] node_management.py:93 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 left; removing pipeline_id=0 from registered pipelines and detaching 1 member(s): ['12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1']
Jan 11 23:41:02.468 [scheduling] [INFO] scheduler.py:538 Allocation snapshot (after leave 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1)
Registered pipelines (0)
Capacity: (no registered pipelines)
(none)
Jan 11 23:41:02.469 [scheduling] [WARNING] scheduler.py:754 Global rebalance triggered due to node leave
Jan 11 23:37:38.816 [backend] [INFO] rpc_connection_handler.py:84 Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 not found in scheduler, auto-joining via node_update
Jan 11 23:37:38.816 [scheduling] [INFO] scheduler.py:332 Joining node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 (kv_ratio=0.25, param_ratio=0.65, manual_assignment=False, bootstrapped=True)
Exception in thread SchedulerEventLoop:
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:810 [DPLayerAllocator] Starting allocate_from_standby with 1 nodes for 28 layers
Jan 11 23:37:38.816 [scheduling] [INFO] layer_allocation.py:827 [DPLayerAllocator] Sufficient resources: nodes=1, layers=28, total_cap=530
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/root/parallax/src/scheduling/scheduler.py", line 628, in _event_loop
    self._process_joins()
  File "/root/parallax/src/scheduling/scheduler.py", line 699, in _process_joins
    self.join(node)
  File "/root/parallax/src/scheduling/scheduler.py", line 348, in join
    self._maybe_expand_rr_pipelines()
  File "/root/parallax/src/scheduling/scheduler.py", line 160, in _maybe_expand_rr_pipelines
    ok = self.layer_allocator.allocate_from_standby()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/layer_allocation.py", line 949, in allocate_from_standby
    self.adjust_pipeline_layers(pl_nodes, assume_sorted=False)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 314, in adjust_pipeline_layers
    self.deallocate(node)
  File "/root/parallax/src/scheduling/layer_allocation.py", line 184, in deallocate
    self.node_management.standby([node.node_id])
  File "/root/parallax/src/scheduling/node_management.py", line 139, in standby
    nodes_to_clear = self._standby_locked(node_ids)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/parallax/src/scheduling/node_management.py", line 76, in _standby_locked
    raise ValueError(f"Node {nid} is not ACTIVE, current state: {prev_state}")
ValueError: Node 12D3KooWLoDa1RSAYGDPSDPp2fjuZt3txB1oYmYKhoNw3M7adAz1 is not ACTIVE, current state: NodeState.STANDBY
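From the traceback, this looks like a node-state mismatch: the node leaves, then auto-rejoins via `node_update`, and `allocate_from_standby` tries to demote it to STANDBY while it is already STANDBY. A minimal sketch of that guard pattern, purely illustrative (the class and method names here are mine, not the actual Parallax implementation):

```python
from enum import Enum, auto


class NodeState(Enum):
    ACTIVE = auto()
    STANDBY = auto()


class ToyNodeManagement:
    """Toy state tracker mirroring the guard seen in the traceback."""

    def __init__(self):
        self.states = {}

    def activate(self, node_id):
        self.states[node_id] = NodeState.ACTIVE

    def standby(self, node_id):
        prev_state = self.states.get(node_id)
        if prev_state is not NodeState.ACTIVE:
            # Same check as node_management.py:76 in the log:
            # demoting a node that is not currently ACTIVE raises.
            raise ValueError(
                f"Node {node_id} is not ACTIVE, current state: {prev_state}"
            )
        self.states[node_id] = NodeState.STANDBY


mgmt = ToyNodeManagement()
mgmt.activate("node-A")
mgmt.standby("node-A")      # fine: ACTIVE -> STANDBY
try:
    mgmt.standby("node-A")  # second demotion reproduces the ValueError
except ValueError as e:
    print(e)  # Node node-A is not ACTIVE, current state: NodeState.STANDBY
```

If that reading is right, the rejoin path seems to hit `deallocate` on a node the earlier leave already moved to STANDBY, so the second demotion trips the guard.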
Can you help me? Thanks.
Expected Behavior
The node joins without error.
Actual Behavior
The join fails with the `ValueError` shown in the description.
Version
Latest as of 11 Jan.
Environment & Context
- I'm using the latest version.
- I have searched existing issues.