Commit 63dbe3f

alicup29 and claude committed
docs: add comprehensive room-based connection documentation
Add detailed documentation covering all aspects of the new room-based connection feature to README.md and create testing guide.

Changes to README.md:
- CLI usage examples for workers and clients (training & inference)
- Four connection modes with detailed use cases and examples
  - Session string (direct connection)
  - Room-based discovery (interactive selection)
  - Auto-select (automatic best worker)
  - Direct worker in room (targeted connection)
- Multi-worker scenarios
  - Load balancing across multiple workers
  - Heterogeneous worker pools
  - High-availability setups
- Worker status lifecycle documentation
- Busy rejection behavior and safeguards
- Best practices for production deployments

New testing guide (TESTING.md):
- 12 comprehensive test scenarios with step-by-step instructions
- Expected outputs and validation checklists
- Manual testing procedures for integration validation
- Common issues and troubleshooting tips

This documentation completes Section 9 of the add-client-room-connection feature implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent b9a8e81 commit 63dbe3f

2 files changed: +838 -0 lines changed

README.md

Lines changed: 347 additions & 0 deletions
@@ -90,3 +90,350 @@ sleap-rtc train data.slp --server ws://custom.com:8080
### Backward Compatibility

If no configuration is provided, SLEAP-RTC defaults to the production signaling server, maintaining backward compatibility with existing deployments.

## CLI Usage

SLEAP-RTC provides commands for running workers and clients for remote training and inference.

### Worker Commands

Start a worker to process training or inference jobs:

```bash
# Start a worker (creates a new room)
sleap-rtc worker

# Join an existing room (for multi-worker scenarios)
sleap-rtc worker --room-id <room_id> --token <token>
```

When a worker starts, it displays connection credentials:

```
================================================================================
Worker authenticated with server
================================================================================

Session string for DIRECT connection to this worker:
eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ==

Room credentials for OTHER workers/clients to join this room:
Room ID: room_abc123
Token: token_xyz789

Use session string with --session-string for direct connection
Use room credentials with --room-id and --token for worker discovery
================================================================================
```
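
The sample session string above happens to be Base64-encoded JSON. The following Python sketch decodes it for inspection. It assumes that real session strings use the same encoding and the same `r`, `t`, and `p` keys, which is inferred from this single example and not confirmed by the README; `decode_session_string` is a hypothetical helper, not part of SLEAP-RTC.

```python
import base64
import json


def decode_session_string(session_string: str) -> dict:
    """Decode a Base64-encoded JSON session string into a dict.

    Assumption: the format is inferred from the README's sample string only.
    """
    raw = base64.b64decode(session_string)
    return json.loads(raw)


# The sample string printed in the worker output above.
example = "eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ=="
info = decode_session_string(example)
print(info)  # {'r': 'room_id', 't': 'token', 'p': 'peer_id'}
```

The placeholder values suggest that `r`, `t`, and `p` stand for the room ID, token, and peer ID, which would explain why a session string targets one specific worker.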

### Client Commands

#### Training Client

Connect to a worker to run a training job:

```bash
# Option 1: Direct connection using session string
sleap-rtc client-train \
  --session-string <session_string> \
  --pkg-path /path/to/training_package.zip

# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/training_package.zip

# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/training_package.zip \
  --auto-select

# Option 4: Connect to specific worker in room (skip discovery)
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --worker-id <peer_id> \
  --pkg-path /path/to/training_package.zip
```

Additional options:
- `--controller-port <port>`: ZMQ controller port (default: 9000)
- `--publish-port <port>`: ZMQ publish port (default: 9001)
- `--min-gpu-memory <MB>`: Filter workers by minimum GPU memory
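
When driving the client from a script, the four options can be assembled programmatically. The sketch below uses only the flags listed above; `build_client_train_args` is a hypothetical helper, and its validation rules (for example, that a session string is not combined with room flags) are assumptions rather than documented CLI behavior.

```python
from typing import List, Optional


def build_client_train_args(
    pkg_path: str,
    session_string: Optional[str] = None,
    room_id: Optional[str] = None,
    token: Optional[str] = None,
    worker_id: Optional[str] = None,
    auto_select: bool = False,
    min_gpu_memory: Optional[int] = None,
) -> List[str]:
    """Return an argv list for `sleap-rtc client-train` (hypothetical helper)."""
    args = ["sleap-rtc", "client-train"]
    if session_string is not None:
        # Option 1: direct connection; assumed not to mix with room flags.
        if room_id or token or worker_id or auto_select:
            raise ValueError("session string cannot be combined with room options")
        args += ["--session-string", session_string]
    else:
        if not (room_id and token):
            raise ValueError("room-based modes need both room_id and token")
        if worker_id and auto_select:
            raise ValueError("choose either worker_id or auto_select")
        args += ["--room-id", room_id, "--token", token]  # Option 2 baseline
        if worker_id:
            args += ["--worker-id", worker_id]  # Option 4
        if auto_select:
            args.append("--auto-select")  # Option 3
        if min_gpu_memory is not None:
            args += ["--min-gpu-memory", str(min_gpu_memory)]
    args += ["--pkg-path", pkg_path]
    return args


print(build_client_train_args("job.zip", room_id="room_abc123",
                              token="token_xyz789", auto_select=True))
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where `sleap-rtc` is installed.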

#### Inference Client

Connect to a worker to run an inference job:

```bash
# Option 1: Direct connection using session string
sleap-rtc client-track \
  --session-string <session_string> \
  --pkg-path /path/to/inference_package.zip

# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-track \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/inference_package.zip

# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-track \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/inference_package.zip \
  --auto-select
```

## Connection Workflows

### Two-Phase Connection Model

SLEAP-RTC supports a flexible two-phase connection workflow:

1. **Phase 1: Join Room** - Client authenticates with signaling server and joins a room
2. **Phase 2: Worker Discovery & Selection** - Client discovers available workers and selects one

This model provides several advantages:
- **Visibility**: See all available workers before connecting
- **Flexibility**: Choose workers based on capabilities (GPU memory, status, hostname)
- **Resilience**: If a worker is busy, easily discover and select alternatives
- **Multi-worker**: Support multiple workers in a single room for load balancing

### Connection Mode 1: Session String (Direct Connection)

Use when you have a session string from a specific worker:

```bash
# Worker displays session string on startup
sleap-rtc worker
# Copy the session string from output

# Client connects directly to that worker
sleap-rtc client-train --session-string <session_string> --pkg-path package.zip
```

**When to use:**
- Single worker scenarios
- Direct connection to a specific known worker
- Minimal configuration required

**Limitations:**
- If the worker is busy, the connection is rejected
- No worker discovery or selection capability
- A new session string must be obtained if the worker restarts

### Connection Mode 2: Room-Based Discovery (Interactive Selection)

Use when you want to see available workers and choose interactively:

```bash
# Start multiple workers in the same room
sleap-rtc worker  # Worker 1 creates room, displays credentials
sleap-rtc worker --room-id <room_id> --token <token>  # Worker 2 joins
sleap-rtc worker --room-id <room_id> --token <token>  # Worker 3 joins

# Client discovers and selects worker interactively
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path package.zip
```

**Interactive selection displays:**
```
Discovering workers in room...
Found 3 available workers:

1. Worker peer_abc123
   GPU: NVIDIA RTX 4090 (24576 MB)
   Status: available
   Hostname: gpu-server-1

2. Worker peer_def456
   GPU: NVIDIA RTX 3090 (24576 MB)
   Status: available
   Hostname: gpu-server-2

3. Worker peer_ghi789
   GPU: NVIDIA GTX 1080 Ti (11264 MB)
   Status: available
   Hostname: gpu-workstation

Select worker (1-3) or 'r' to refresh:
```

**When to use:**
- Multiple workers available
- Want to see worker specifications before connecting
- Need to verify worker status before job submission
- Want to manually choose based on current availability

**Features:**
- Real-time worker information (GPU model, memory, status, hostname)
- Refresh capability to update worker list
- Only shows workers with status "available"
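
The listing above amounts to a filter-then-format step: drop workers that are not "available", then number the rest. A minimal Python sketch of that idea follows. The `WorkerInfo` record and its field names are hypothetical and chosen only to mirror the fields shown in the menu; this is not the SLEAP-RTC implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerInfo:
    # Hypothetical record; field names are illustrative, not the real API.
    peer_id: str
    gpu_model: str
    gpu_memory_mb: int
    status: str
    hostname: str


def format_available_workers(workers: List[WorkerInfo]) -> str:
    """Render only workers whose status is "available" as a numbered menu."""
    available = [w for w in workers if w.status == "available"]
    lines = [f"Found {len(available)} available workers:", ""]
    for index, w in enumerate(available, start=1):
        lines += [
            f"{index}. Worker {w.peer_id}",
            f"   GPU: {w.gpu_model} ({w.gpu_memory_mb} MB)",
            f"   Status: {w.status}",
            f"   Hostname: {w.hostname}",
            "",
        ]
    lines.append(f"Select worker (1-{len(available)}) or 'r' to refresh:")
    return "\n".join(lines)


workers = [
    WorkerInfo("peer_abc123", "NVIDIA RTX 4090", 24576, "available", "gpu-server-1"),
    WorkerInfo("peer_def456", "NVIDIA RTX 3090", 24576, "busy", "gpu-server-2"),
]
# The busy worker is filtered out, so only peer_abc123 is listed.
print(format_available_workers(workers))
```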

### Connection Mode 3: Auto-Select (Automatic Best Worker)

Use when you want the system to automatically choose the best worker:

```bash
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path package.zip \
  --auto-select
```

**Behavior:**
- Discovers all available workers in the room
- Automatically selects the worker with the most GPU memory
- No user interaction required
- Ideal for scripts and automated workflows

**When to use:**
- Automated training pipelines
- Scripts that need deterministic worker selection
- When you prefer the best hardware without manual selection
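
The selection rule described above (most GPU memory among available workers) can be sketched in a few lines of Python. The tuple layout is a stand-in for real worker records, the `min_gpu_memory` argument mirrors the `--min-gpu-memory` flag, and the tie-break by peer ID is an assumption: the README does not say how SLEAP-RTC chooses between workers with equal memory.

```python
from typing import List, Optional, Tuple

# (peer_id, gpu_memory_mb, status) tuples stand in for real worker records.
Worker = Tuple[str, int, str]


def auto_select(workers: List[Worker], min_gpu_memory: int = 0) -> Optional[str]:
    """Pick the available worker with the most GPU memory, or None."""
    candidates = [
        w for w in workers
        if w[2] == "available" and w[1] >= min_gpu_memory
    ]
    if not candidates:
        return None
    # Assumption: ties are broken by peer ID so the result is deterministic.
    best = min(candidates, key=lambda w: (-w[1], w[0]))
    return best[0]


pool = [
    ("peer_ghi789", 11264, "available"),
    ("peer_def456", 24576, "available"),
    ("peer_abc123", 24576, "busy"),
]
print(auto_select(pool))  # peer_def456
print(auto_select(pool, min_gpu_memory=32768))  # None
```

An explicit tie-break matters for the "deterministic selection" use case: without one, two workers reporting the same memory could be picked in either order.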

### Connection Mode 4: Direct Worker in Room

Use when you know the specific worker peer-id you want:

```bash
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --worker-id <peer_id> \
  --pkg-path package.zip
```

**Behavior:**
- Skips worker discovery
- Connects directly to specified worker by peer-id
- Still uses room credentials for authentication

**When to use:**
- You know the exact worker peer-id you need
- Want to target a specific worker without discovery overhead
- Scripted workflows with predetermined worker assignment

## Multi-Worker Scenarios

### Scenario 1: Load Balancing Across Multiple Workers

Set up multiple workers in a room for parallel job processing:

```bash
# Terminal 1: Start Worker 1 (creates room)
sleap-rtc worker
# Save room_id and token from output

# Terminal 2: Start Worker 2 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>

# Terminal 3: Start Worker 3 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>

# Terminal 4: Client 1 discovers and selects a worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job1.zip

# Terminal 5: Client 2 discovers and selects different worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job2.zip
```

**Result:** Each client can independently select from available workers, enabling parallel job execution.

### Scenario 2: Heterogeneous Worker Pool

Workers with different GPU configurations can coexist in a room:

```bash
# High-end worker (RTX 4090)
sleap-rtc worker --room-id shared_room --token shared_token

# Mid-tier worker (RTX 3090)
sleap-rtc worker --room-id shared_room --token shared_token

# Budget worker (GTX 1080 Ti)
sleap-rtc worker --room-id shared_room --token shared_token

# Client auto-selects the worker with the most GPU memory
# (the RTX 4090 and RTX 3090 both report 24576 MB)
sleap-rtc client-train \
  --room-id shared_room \
  --token shared_token \
  --pkg-path large_job.zip \
  --auto-select
```

**Features:**
- Clients can filter by `--min-gpu-memory` to ensure sufficient resources
- Auto-select automatically chooses the worker with the most GPU memory
- Interactive mode shows GPU specs for informed selection

### Scenario 3: High-Availability Setup

If a worker becomes unavailable, clients can easily discover alternatives:

```bash
# Client attempts connection to Worker 1 via session string
sleap-rtc client-train --session-string <worker1_session> --pkg-path job.zip
# ERROR: Worker is currently busy

# Client falls back to room-based discovery
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job.zip
# SUCCESS: Discovers Worker 2 and Worker 3 are available, selects Worker 2
```
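
The manual fallback above can be automated from a script. The Python sketch below tries each command in order and stops at the first success. It assumes the client exits with a non-zero status when a worker rejects the connection, which the README does not state; `run_with_fallback` is a hypothetical helper, and the second attempt adds `--auto-select` so that it can run unattended.

```python
import subprocess
from typing import Callable, List, Sequence


def run_with_fallback(
    commands: Sequence[List[str]],
    run: Callable[[List[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> int:
    """Try each command in order; return the index of the first that exits 0.

    Assumption: the client exits non-zero when the worker rejects it.
    """
    for index, cmd in enumerate(commands):
        if run(cmd) == 0:
            return index
    raise RuntimeError("all connection attempts failed")


attempts = [
    # First choice: direct connection to a known worker.
    ["sleap-rtc", "client-train", "--session-string", "<worker1_session>",
     "--pkg-path", "job.zip"],
    # Fallback: room-based discovery, auto-selecting an available worker.
    ["sleap-rtc", "client-train", "--room-id", "<room_id>", "--token", "<token>",
     "--pkg-path", "job.zip", "--auto-select"],
]
# run_with_fallback(attempts)  # uncomment on a machine with sleap-rtc installed
```

The `run` parameter is injectable so the logic can be exercised without a live worker, for example with a stub that returns 1 for the session-string attempt.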

## Worker Status and Safeguards

### Worker Status Lifecycle

Workers maintain status to coordinate connections and prevent conflicts:

| Status      | Description                                      | Accepts New Connections? |
|-------------|--------------------------------------------------|--------------------------|
| `available` | Worker is idle and ready to accept jobs          | ✅ Yes                   |
| `reserved`  | Worker accepted connection, negotiating job      | ❌ No                    |
| `busy`      | Worker is actively processing a job              | ❌ No                    |

**Status transitions:**
```
available → reserved → busy → available
    ↑                            ↓
    └────────────────────────────┘
```
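
The lifecycle above can be modeled as a small state machine. The Python sketch below encodes only the transitions drawn in the diagram and treats everything else as invalid; the names `WorkerStatus`, `ALLOWED`, and `transition` are illustrative, and the real worker may permit additional transitions (for example on a failed negotiation) that the README does not describe.

```python
from enum import Enum


class WorkerStatus(Enum):
    AVAILABLE = "available"
    RESERVED = "reserved"
    BUSY = "busy"


# Only the transitions drawn in the diagram; any others are treated as invalid.
ALLOWED = {
    WorkerStatus.AVAILABLE: {WorkerStatus.RESERVED},
    WorkerStatus.RESERVED: {WorkerStatus.BUSY},
    WorkerStatus.BUSY: {WorkerStatus.AVAILABLE},
}


def transition(current: WorkerStatus, new: WorkerStatus) -> WorkerStatus:
    """Return the new status, or raise if the diagram does not allow the move."""
    if new not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.value} -> {new.value}")
    return new


def accepts_connections(status: WorkerStatus) -> bool:
    """Per the table, only an available worker accepts new connections."""
    return status is WorkerStatus.AVAILABLE


# Walk one full job cycle: available -> reserved -> busy -> available.
status = WorkerStatus.AVAILABLE
for step in (WorkerStatus.RESERVED, WorkerStatus.BUSY, WorkerStatus.AVAILABLE):
    status = transition(status, step)
print(status.value)  # available
```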

### Busy Rejection Behavior

When a client attempts to connect to a busy or reserved worker (e.g., via session string), the worker rejects the connection:

**Client output:**
```
Connecting to worker...
ERROR: Worker is currently busy. Please use --room-id and --token to discover available workers.
Connection rejected by worker.
```

**Worker output:**
```
Received offer SDP
Rejecting connection from peer_xyz789 - worker is busy
Sent busy rejection to client peer_xyz789
```

**Why this matters:**
- **Prevents job conflicts**: Multiple clients cannot interfere with each other's jobs
- **Protects data integrity**: Ensures one job completes before starting another
- **Clear error messages**: Clients receive actionable feedback
- **Room-based alternative**: Rejection message suggests using room discovery to find available workers
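
The safeguard above comes down to checking the worker's status before accepting an offer, and reserving the worker as soon as one is accepted. A minimal Python sketch of that guard follows. The `Worker` class, the `handle_offer` method, and the shape of the reply dict are hypothetical stand-ins, not SLEAP-RTC's actual classes or message format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    # Hypothetical stand-in for the real worker; not the sleap-rtc class.
    status: str = "available"
    log: List[str] = field(default_factory=list)

    def handle_offer(self, client_id: str) -> dict:
        """Accept the offer only when available; otherwise send a rejection."""
        if self.status != "available":
            self.log.append(
                f"Rejecting connection from {client_id} - worker is {self.status}"
            )
            return {"accepted": False, "reason": self.status}
        # Reserve immediately so a second offer arriving now is rejected.
        self.status = "reserved"
        return {"accepted": True, "reason": None}


worker = Worker()
first = worker.handle_offer("peer_aaa111")
second = worker.handle_offer("peer_xyz789")
print(first["accepted"], second["accepted"])  # True False
```

Moving to `reserved` before any job negotiation starts is what keeps two near-simultaneous clients from both being accepted.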

### Best Practices

1. **Use room-based discovery for production**: More resilient to worker availability changes
2. **Session strings for development**: Convenient for testing with a single known worker
3. **Auto-select for automation**: Deterministic worker selection in scripts
4. **Check worker status**: Room-based discovery only shows "available" workers
5. **Multi-worker for availability**: Deploy multiple workers to handle concurrent jobs
6. **GPU filtering**: Use `--min-gpu-memory` to ensure workers have sufficient resources
