@@ -90,3 +90,350 @@ sleap-rtc train data.slp --server ws://custom.com:8080
90 90 ### Backward Compatibility
91 91
92 92 If no configuration is provided, SLEAP-RTC defaults to the production signaling server, maintaining backward compatibility with existing deployments.
93+
94+ ## CLI Usage
95+
96+ SLEAP-RTC provides commands for running workers and clients for remote training and inference.
97+
98+ ### Worker Commands
99+
100+ Start a worker to process training or inference jobs:
101+
102+ ``` bash
103+ # Start a worker (creates a new room)
104+ sleap-rtc worker
105+
106+ # Join an existing room (for multi-worker scenarios)
107+ sleap-rtc worker --room-id <room_id> --token <token>
108+ ```
109+
110+ When a worker starts, it displays connection credentials:
111+
112+ ```
113+ ================================================================================
114+ Worker authenticated with server
115+ ================================================================================
116+
117+ Session string for DIRECT connection to this worker:
118+ eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ==
119+
120+ Room credentials for OTHER workers/clients to join this room:
121+ Room ID: room_abc123
122+ Token: token_xyz789
123+
124+ Use session string with --session-string for direct connection
125+ Use room credentials with --room-id and --token for worker discovery
126+ ================================================================================
127+ ```
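
The example session string above is Base64-encoded JSON. The following is a minimal sketch for inspecting one, assuming real session strings follow the same layout. The key meanings (`r` for room ID, `t` for token, `p` for peer ID) are inferred from this example only, and `decode_session_string` is a hypothetical helper, not part of SLEAP-RTC.

```python
import base64
import json


def decode_session_string(session_string: str) -> dict:
    """Decode a Base64-encoded JSON session string into a dict.

    Assumption: the string is plain Base64 over UTF-8 JSON, as in the
    example output above. The real format may differ.
    """
    raw = base64.b64decode(session_string)
    return json.loads(raw.decode("utf-8"))


# The example string shown in the worker output above.
example = "eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ=="
info = decode_session_string(example)
print(info)  # {'r': 'room_id', 't': 'token', 'p': 'peer_id'}
```

Because the decoded payload contains the room token, a session string is best treated as a credential and shared only with trusted clients.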
128+
129+ ### Client Commands
130+
131+ #### Training Client
132+
133+ Connect to a worker to run a training job:
134+
135+ ``` bash
136+ # Option 1: Direct connection using session string
137+ sleap-rtc client-train \
138+   --session-string <session_string> \
139+   --pkg-path /path/to/training_package.zip
140+
141+ # Option 2: Room-based discovery with interactive worker selection
142+ sleap-rtc client-train \
143+   --room-id <room_id> \
144+   --token <token> \
145+   --pkg-path /path/to/training_package.zip
146+
147+ # Option 3: Auto-select best worker by GPU memory
148+ sleap-rtc client-train \
149+   --room-id <room_id> \
150+   --token <token> \
151+   --pkg-path /path/to/training_package.zip \
152+   --auto-select
153+
154+ # Option 4: Connect to specific worker in room (skip discovery)
155+ sleap-rtc client-train \
156+   --room-id <room_id> \
157+   --token <token> \
158+   --worker-id <peer_id> \
159+   --pkg-path /path/to/training_package.zip
160+ ```
161+
162+ Additional options:
163+ - `--controller-port <port>`: ZMQ controller port (default: 9000)
164+ - `--publish-port <port>`: ZMQ publish port (default: 9001)
165+ - `--min-gpu-memory <MB>`: Filter workers by minimum GPU memory
166+
167+ #### Inference Client
168+
169+ Connect to a worker to run an inference job:
170+
171+ ``` bash
172+ # Option 1: Direct connection using session string
173+ sleap-rtc client-track \
174+   --session-string <session_string> \
175+   --pkg-path /path/to/inference_package.zip
176+
177+ # Option 2: Room-based discovery with interactive worker selection
178+ sleap-rtc client-track \
179+   --room-id <room_id> \
180+   --token <token> \
181+   --pkg-path /path/to/inference_package.zip
182+
183+ # Option 3: Auto-select best worker by GPU memory
184+ sleap-rtc client-track \
185+   --room-id <room_id> \
186+   --token <token> \
187+ --pkg-path /path/to/inference_package.zip \
188+ --auto-select
189+ ```
190+
191+ ## Connection Workflows
192+
193+ ### Two-Phase Connection Model
194+
195+ SLEAP-RTC supports a flexible two-phase connection workflow:
196+
197+ 1. **Phase 1: Join Room** - Client authenticates with the signaling server and joins a room
198+ 2. **Phase 2: Worker Discovery & Selection** - Client discovers available workers and selects one
199+
200+ This model provides several advantages:
201+ - **Visibility**: See all available workers before connecting
202+ - **Flexibility**: Choose workers based on capabilities (GPU memory, status, hostname)
203+ - **Resilience**: If a worker is busy, easily discover and select alternatives
204+ - **Multi-worker**: Support multiple workers in a single room for load balancing
205+
206+ ### Connection Mode 1: Session String (Direct Connection)
207+
208+ Use when you have a session string from a specific worker:
209+
210+ ``` bash
211+ # Worker displays session string on startup
212+ sleap-rtc worker
213+ # Copy the session string from output
214+
215+ # Client connects directly to that worker
216+ sleap-rtc client-train --session-string <session_string> --pkg-path package.zip
217+ ```
218+
219+ **When to use:**
220+ - Single worker scenarios
221+ - Direct connection to a specific known worker
222+ - Minimal configuration required
223+
224+ **Limitations:**
225+ - If the worker is busy, the connection is rejected
226+ - No worker discovery or selection capability
227+ - A new session string must be obtained if the worker restarts
228+
229+ ### Connection Mode 2: Room-Based Discovery (Interactive Selection)
230+
231+ Use when you want to see available workers and choose interactively:
232+
233+ ``` bash
234+ # Start multiple workers in the same room
235+ sleap-rtc worker # Worker 1 creates room, displays credentials
236+ sleap-rtc worker --room-id <room_id> --token <token> # Worker 2 joins
237+ sleap-rtc worker --room-id <room_id> --token <token> # Worker 3 joins
238+
239+ # Client discovers and selects worker interactively
240+ sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path package.zip
241+ ```
242+
243+ **Interactive selection displays:**
244+ ```
245+ Discovering workers in room...
246+ Found 3 available workers:
247+
248+ 1. Worker peer_abc123
249+ GPU: NVIDIA RTX 4090 (24576 MB)
250+ Status: available
251+ Hostname: gpu-server-1
252+
253+ 2. Worker peer_def456
254+ GPU: NVIDIA RTX 3090 (24576 MB)
255+ Status: available
256+ Hostname: gpu-server-2
257+
258+ 3. Worker peer_ghi789
259+ GPU: NVIDIA GTX 1080 Ti (11264 MB)
260+ Status: available
261+ Hostname: gpu-workstation
262+
263+ Select worker (1-3) or 'r' to refresh:
264+ ```
265+
266+ **When to use:**
267+ - Multiple workers are available
268+ - You want to see worker specifications before connecting
269+ - You need to verify worker status before job submission
270+ - You want to choose manually based on current availability
271+
272+ **Features:**
273+ - Real-time worker information (GPU model, memory, status, hostname)
274+ - Refresh capability to update the worker list
275+ - Only shows workers with status "available"
276+
277+ ### Connection Mode 3: Auto-Select (Automatic Best Worker)
278+
279+ Use when you want the system to automatically choose the best worker:
280+
281+ ``` bash
282+ sleap-rtc client-train \
283+   --room-id <room_id> \
284+   --token <token> \
285+   --pkg-path package.zip \
286+   --auto-select
287+ ```
288+
289+ **Behavior:**
290+ - Discovers all available workers in the room
291+ - Automatically selects worker with highest GPU memory
292+ - No user interaction required
293+ - Ideal for scripts and automated workflows
294+
295+ **When to use:**
296+ - Automated training pipelines
297+ - Scripts that need deterministic worker selection
298+ - Prefer best hardware without manual selection
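
The auto-select behavior described above (keep only available workers, then pick the one with the most GPU memory) can be sketched as follows. This is an illustration, not SLEAP-RTC's implementation: `WorkerInfo` and `auto_select` are hypothetical names, the optional `min_gpu_memory_mb` argument mirrors the `--min-gpu-memory` flag, and the tie-break (first worker listed wins) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerInfo:
    """Hypothetical record of what discovery reports for one worker."""
    peer_id: str
    gpu_model: str
    gpu_memory_mb: int
    status: str


def auto_select(workers: List[WorkerInfo],
                min_gpu_memory_mb: int = 0) -> Optional[WorkerInfo]:
    """Return the available worker with the most GPU memory, or None."""
    candidates = [
        w for w in workers
        if w.status == "available" and w.gpu_memory_mb >= min_gpu_memory_mb
    ]
    if not candidates:
        return None
    # max() returns the first maximal element, so ties go to the
    # worker listed first (an assumption about tie-breaking).
    return max(candidates, key=lambda w: w.gpu_memory_mb)


# Worker list taken from the interactive selection example.
workers = [
    WorkerInfo("peer_abc123", "NVIDIA RTX 4090", 24576, "available"),
    WorkerInfo("peer_def456", "NVIDIA RTX 3090", 24576, "available"),
    WorkerInfo("peer_ghi789", "NVIDIA GTX 1080 Ti", 11264, "available"),
]
best = auto_select(workers)
print(best.peer_id if best else "no suitable worker")
```

Note that the two top workers in this list report the same memory, so a script that needs a specific one should use `--worker-id` instead of relying on the tie-break.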
299+
300+ ### Connection Mode 4: Direct Worker in Room
301+
302+ Use when you know the specific worker peer-id you want:
303+
304+ ``` bash
305+ sleap-rtc client-train \
306+   --room-id <room_id> \
307+   --token <token> \
308+   --worker-id <peer_id> \
309+   --pkg-path package.zip
310+ ```
311+
312+ **Behavior:**
313+ - Skips worker discovery
314+ - Connects directly to specified worker by peer-id
315+ - Still uses room credentials for authentication
316+
317+ **When to use:**
318+ - You know the exact worker peer-id you need
319+ - Want to target a specific worker without discovery overhead
320+ - Scripted workflows with predetermined worker assignment
321+
322+ ## Multi-Worker Scenarios
323+
324+ ### Scenario 1: Load Balancing Across Multiple Workers
325+
326+ Set up multiple workers in a room for parallel job processing:
327+
328+ ``` bash
329+ # Terminal 1: Start Worker 1 (creates room)
330+ sleap-rtc worker
331+ # Save room_id and token from output
332+
333+ # Terminal 2: Start Worker 2 (joins same room)
334+ sleap-rtc worker --room-id <room_id> --token <token>
335+
336+ # Terminal 3: Start Worker 3 (joins same room)
337+ sleap-rtc worker --room-id <room_id> --token <token>
338+
339+ # Terminal 4: Client 1 discovers and selects a worker
340+ sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job1.zip
341+
342+ # Terminal 5: Client 2 discovers and selects a different worker
343+ sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job2.zip
344+ ```
345+
346+ **Result:** Each client can independently select from available workers, enabling parallel job execution.
347+
348+ ### Scenario 2: Heterogeneous Worker Pool
349+
350+ Workers with different GPU configurations can coexist in a room:
351+
352+ ``` bash
353+ # High-end worker (RTX 4090)
354+ sleap-rtc worker --room-id shared_room --token shared_token
355+
356+ # Mid-tier worker (RTX 3090)
357+ sleap-rtc worker --room-id shared_room --token shared_token
358+
359+ # Budget worker (GTX 1080 Ti)
360+ sleap-rtc worker --room-id shared_room --token shared_token
361+
362+ # Client auto-selects by GPU memory (the RTX 4090 and RTX 3090 both have 24576 MB, so the tie-break between them is not specified here)
363+ sleap-rtc client-train \
364+ --room-id shared_room \
365+ --token shared_token \
366+ --pkg-path large_job.zip \
367+ --auto-select
368+ ```
369+
370+ **Features:**
371+ - Clients can filter by `--min-gpu-memory` to ensure sufficient resources
372+ - Auto-select automatically chooses worker with most GPU memory
373+ - Interactive mode shows GPU specs for informed selection
374+
375+ ### Scenario 3: High-Availability Setup
376+
377+ If a worker becomes unavailable, clients can easily discover alternatives:
378+
379+ ``` bash
380+ # Client attempts connection to Worker 1 via session string
381+ sleap-rtc client-train --session-string <worker1_session> --pkg-path job.zip
382+ # ERROR: Worker is currently busy
383+
384+ # Client falls back to room-based discovery
385+ sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job.zip
386+ # SUCCESS: Discovers Worker 2 and Worker 3 are available, selects Worker 2
387+ ```
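
For scripted use, the fallback above could be automated along these lines. This is a sketch under stated assumptions: it assumes `sleap-rtc client-train` exits with a non-zero code when the worker rejects the connection, which the text above does not confirm, and `build_client_train_args` and `run_with_fallback` are hypothetical helpers. Only flags documented in this section are used, and the fallback adds `--auto-select` so no prompt is needed.

```python
import subprocess
from typing import Callable, List, Optional


def build_client_train_args(pkg_path: str, *,
                            session_string: Optional[str] = None,
                            room_id: Optional[str] = None,
                            token: Optional[str] = None,
                            auto_select: bool = False) -> List[str]:
    """Assemble a client-train command line from the documented flags."""
    args = ["sleap-rtc", "client-train"]
    if session_string is not None:
        args += ["--session-string", session_string]
    else:
        if room_id is None or token is None:
            raise ValueError("room_id and token are required without a session string")
        args += ["--room-id", room_id, "--token", token]
        if auto_select:
            args.append("--auto-select")
    args += ["--pkg-path", pkg_path]
    return args


def run_with_fallback(pkg_path: str, session_string: str,
                      room_id: str, token: str,
                      run: Optional[Callable[[List[str]], int]] = None) -> int:
    """Try the direct connection first, then fall back to room discovery.

    `run` executes a command and returns its exit code; it defaults to
    subprocess.run and is injectable so the logic can be tested offline.
    """
    if run is None:
        run = lambda argv: subprocess.run(argv).returncode
    # Assumption: a rejected or failed connection yields a non-zero exit code.
    code = run(build_client_train_args(pkg_path, session_string=session_string))
    if code == 0:
        return 0
    return run(build_client_train_args(pkg_path, room_id=room_id,
                                       token=token, auto_select=True))
```

A non-zero exit code could also mean the job itself failed, so a production script would likely inspect the client output before retrying on another worker.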
388+
389+ ## Worker Status and Safeguards
390+
391+ ### Worker Status Lifecycle
392+
393+ Workers maintain status to coordinate connections and prevent conflicts:
394+
395+ | Status      | Description                                 | Accepts New Connections? |
396+ |-------------|---------------------------------------------|--------------------------|
397+ | `available` | Worker is idle and ready to accept jobs     | ✅ Yes                   |
398+ | `reserved`  | Worker accepted connection, negotiating job | ❌ No                    |
399+ | `busy`      | Worker is actively processing a job         | ❌ No                    |
400+
401+ **Status transitions:**
402+ ```
403+ available → reserved → busy → available
404+ ↑ ↓
405+ └────────────────────────────┘
406+ ```
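
The status table and transition cycle above can be modelled as a small state machine. This is a sketch for illustration only: `WorkerStatus`, `ALLOWED_TRANSITIONS`, `accepts_new_connections`, and `transition` are hypothetical names, and only the three transitions shown in the diagram are allowed.

```python
from enum import Enum


class WorkerStatus(Enum):
    AVAILABLE = "available"
    RESERVED = "reserved"
    BUSY = "busy"


# Transitions taken from the cycle: available -> reserved -> busy -> available.
ALLOWED_TRANSITIONS = {
    WorkerStatus.AVAILABLE: {WorkerStatus.RESERVED},
    WorkerStatus.RESERVED: {WorkerStatus.BUSY},
    WorkerStatus.BUSY: {WorkerStatus.AVAILABLE},
}


def accepts_new_connections(status: WorkerStatus) -> bool:
    """Only an available worker accepts a new connection (see the table)."""
    return status is WorkerStatus.AVAILABLE


def transition(current: WorkerStatus, new: WorkerStatus) -> WorkerStatus:
    """Return the new status, or raise if the cycle does not allow the move."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {new.value}")
    return new


status = WorkerStatus.AVAILABLE
status = transition(status, WorkerStatus.RESERVED)  # client connected
status = transition(status, WorkerStatus.BUSY)      # job started
print(accepts_new_connections(status))              # False
```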
407+
408+ ### Busy Rejection Behavior
409+
410+ When a client attempts to connect to a busy or reserved worker (e.g., via session string), the worker will reject the connection:
411+
412+ **Client output:**
413+ ```
414+ Connecting to worker...
415+ ERROR: Worker is currently busy. Please use --room-id and --token to discover available workers.
416+ Connection rejected by worker.
417+ ```
418+
419+ **Worker output:**
420+ ```
421+ Received offer SDP
422+ Rejecting connection from peer_xyz789 - worker is busy
423+ Sent busy rejection to client peer_xyz789
424+ ```
425+
426+ **Why this matters:**
427+ - **Prevents job conflicts**: Multiple clients cannot interfere with each other's jobs
428+ - **Protects data integrity**: Ensures one job completes before starting another
429+ - **Clear error messages**: Clients receive actionable feedback
430+ - **Room-based alternative**: Rejection message suggests using room discovery to find available workers
431+
432+ ### Best Practices
433+
434+ 1. **Use room-based discovery for production**: More resilient to worker availability changes
435+ 2. **Session strings for development**: Convenient for testing with a single known worker
436+ 3. **Auto-select for automation**: Deterministic worker selection in scripts
437+ 4. **Check worker status**: Room-based discovery only shows "available" workers
438+ 5. **Multi-worker for availability**: Deploy multiple workers to handle concurrent jobs
439+ 6. **GPU filtering**: Use `--min-gpu-memory` to ensure workers have sufficient resources