Commit f38cd6f

jeremymanning and claude committed
Add remote training status monitoring script
Created check_remote_status.sh and supporting Python script to monitor training progress on remote GPU servers.

Features:
- Connects to remote server using saved credentials
- Reports completion status for all model variants (baseline, content, function, POS)
- Shows mean ± std of final training losses for completed models
- Displays current epoch, loss, and estimated time to completion for in-progress models
- Accurate ETA calculation based on actual runtime from training log timestamps

Implementation:
- Bash wrapper (check_remote_status.sh) handles SSH connection and cluster selection
- Python analyzer (check_training_status.py) parses model directories and loss logs
- Extracts training start time from log files for accurate elapsed time calculation
- Calculates per-epoch training time and estimates remaining duration
- Supports both local and remote execution

Documentation:
- Added comprehensive usage instructions to README.md
- Example output showing completed and in-progress model statistics
- Integrated with existing remote_train.sh credential system

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent b8825c8 commit f38cd6f

3 files changed: +768 -0 lines changed

README.md

Lines changed: 52 additions & 0 deletions
@@ -44,6 +44,7 @@ llm-stylometry/
│ └── check_outputs.py # Output validation script
├── run_llm_stylometry.sh # Shell wrapper for easy setup
├── remote_train.sh # Remote GPU server training script
├── check_remote_status.sh # Check training status on remote server
├── sync_models.sh # Download models from remote server
├── LICENSE # MIT License
├── README.md # This file
@@ -412,6 +413,57 @@ Once training is complete, use `sync_models.sh` **from your local machine** to d

**Note**: The script verifies models are complete before downloading. If training is in progress, it will show which models are missing and skip incomplete conditions.

#### Checking training status

Monitor training progress on your GPU server using `check_remote_status.sh` **from your local machine**:

```bash
# Check status on default cluster (tensor02)
./check_remote_status.sh

# Check status on specific cluster
./check_remote_status.sh --cluster tensor01
./check_remote_status.sh --cluster tensor02
```

The script provides a comprehensive status report including:

**For completed models:**
- Number of completed seeds per author (out of 10)
- Final training loss (mean ± std across all completed seeds)

**For in-progress models:**
- Current epoch and progress percentage
- Current training loss
- Estimated time to completion (based on actual runtime per epoch)
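
The estimate "based on actual runtime per epoch" can be illustrated with a short sketch. This is not the code from `check_training_status.py`; the function names `estimate_remaining` and `format_eta` are hypothetical, and the sketch assumes the training start time and current epoch have already been read from the log.

```python
from datetime import datetime, timedelta


def estimate_remaining(start: datetime, now: datetime,
                       current_epoch: int, total_epochs: int) -> timedelta:
    """Extrapolate remaining time from the average time per finished epoch.

    Hypothetical helper: assumes `start` is the training start timestamp
    taken from the log and `current_epoch` counts finished epochs.
    """
    if current_epoch <= 0:
        raise ValueError("need at least one finished epoch to estimate")
    per_epoch = (now - start) / current_epoch
    return per_epoch * max(total_epochs - current_epoch, 0)


def format_eta(delta: timedelta) -> str:
    """Render a duration in the 'Xd Yh Zm' style, e.g. '1d 1h 30m'."""
    minutes = int(delta.total_seconds()) // 60
    days, rem = divmod(minutes, 60 * 24)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours}h {mins}m"


if __name__ == "__main__":
    start = datetime(2025, 1, 1)
    # 100 of 500 epochs finished after 10 hours of training.
    eta = estimate_remaining(start, start + timedelta(hours=10), 100, 500)
    print(format_eta(eta))
```

Averaging over all finished epochs keeps the estimate stable, at the cost of reacting slowly if the per-epoch time changes partway through training.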

**Example output:**
```
================================================================================
POS VARIANT MODELS
================================================================================

AUSTEN
--------------------------------------------------------------------------------
Completed: 2/10 seeds
Final training loss: 1.1103 ± 0.0003 (mean ± std)
In-progress: 1 seeds
Seed 2: epoch 132/500 (26.4%) | loss: 1.2382 | ETA: 1d 1h 30m

--------------------------------------------------------------------------------
Summary: 16/80 complete, 8 in progress
Estimated completion: 1d 1h 30m (longest), 1d 0h 45m (average)
```
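
A line like `Final training loss: ... (mean ± std)` could be produced along these lines. This is a minimal sketch, not the script's own code: `summarize_losses` is a hypothetical name, and whether the real analyzer uses the sample or the population standard deviation is not stated, so the sample version used here is an assumption.

```python
import statistics


def summarize_losses(final_losses: list[float]) -> str:
    """Format final losses of completed seeds as 'mean ± std'.

    Assumption: sample standard deviation (n - 1 denominator); a single
    completed seed is reported with a std of 0.0 instead of raising.
    """
    mean = statistics.fmean(final_losses)
    std = statistics.stdev(final_losses) if len(final_losses) > 1 else 0.0
    return f"{mean:.4f} ± {std:.4f} (mean ± std)"


if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    print("Final training loss:", summarize_losses([1.0, 2.0]))
```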

**How it works:**
1. Connects to your GPU server using saved credentials (`.ssh/credentials_{cluster}.json`)
2. Analyzes all model directories and loss logs
3. Calculates statistics for completed models
4. Estimates remaining time based on actual training progress
5. Reports status for baseline and all variant models
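
Steps 2 to 5 amount to classifying each seed and then aggregating. The sketch below shows one way that could look once the epoch counts have been parsed. The `SeedStatus` class and `summary_line` function are hypothetical, as is the rule that a seed counts as complete when it reaches its final epoch; the actual directory layout and log format are not shown in this commit.

```python
from dataclasses import dataclass


@dataclass
class SeedStatus:
    """Hypothetical per-seed record built from a model's loss log."""
    author: str
    seed: int
    current_epoch: int
    total_epochs: int

    @property
    def complete(self) -> bool:
        # Assumption: a seed is complete once it reaches the final epoch.
        return self.current_epoch >= self.total_epochs

    @property
    def in_progress(self) -> bool:
        # Assumption: epoch 0 means training has not started yet.
        return 0 < self.current_epoch < self.total_epochs


def summary_line(statuses: list[SeedStatus], expected_total: int) -> str:
    """Build a 'Summary: X/Y complete, Z in progress' line."""
    done = sum(s.complete for s in statuses)
    running = sum(s.in_progress for s in statuses)
    return f"Summary: {done}/{expected_total} complete, {running} in progress"


if __name__ == "__main__":
    statuses = [
        SeedStatus("austen", 0, 500, 500),
        SeedStatus("austen", 1, 500, 500),
        SeedStatus("austen", 2, 132, 500),
        SeedStatus("austen", 3, 0, 500),
    ]
    print(summary_line(statuses, expected_total=10))
```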

**Prerequisites:** The script uses the same credentials file as `remote_train.sh`. If credentials aren't saved, you'll be prompted to enter them interactively.
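
The "use saved credentials, otherwise prompt" behaviour can be sketched as follows. This is an illustration only: the real lookup lives in the Bash wrapper, the helper names `credentials_path` and `load_credentials` are hypothetical, and the fields stored in the JSON file are not documented here, so the sketch treats the contents as an opaque dictionary. It also assumes `.ssh/` is resolved relative to the current directory.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def credentials_path(cluster: str, base_dir: Path = Path(".ssh")) -> Path:
    """Return the per-cluster file path, e.g. .ssh/credentials_tensor02.json."""
    return base_dir / f"credentials_{cluster}.json"


def load_credentials(cluster: str, base_dir: Path = Path(".ssh"),
                     prompt: Optional[Callable[[], dict]] = None) -> dict:
    """Load saved credentials, falling back to an interactive prompt.

    `prompt` is a hypothetical callback that asks the user for credentials
    and returns them as a dict; it is only called when no file exists.
    """
    path = credentials_path(cluster, base_dir)
    if path.is_file():
        with path.open() as fh:
            return json.load(fh)
    if prompt is None:
        raise FileNotFoundError(f"no saved credentials at {path}")
    return prompt()
```

Passing the prompt in as a callback keeps the file lookup testable without any terminal interaction.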

### Model Configuration

Each model uses the same architecture and hyperparameters (applies to baseline and all variants):
