README.md: 52 additions, 0 deletions
@@ -44,6 +44,7 @@ llm-stylometry/
│   └── check_outputs.py           # Output validation script
├── run_llm_stylometry.sh          # Shell wrapper for easy setup
├── remote_train.sh                # Remote GPU server training script
├── check_remote_status.sh         # Check training status on remote server
├── sync_models.sh                 # Download models from remote server
├── LICENSE                        # MIT License
├── README.md                      # This file
@@ -412,6 +413,57 @@ Once training is complete, use `sync_models.sh` **from your local machine** to d
**Note**: The script verifies models are complete before downloading. If training is in progress, it will show which models are missing and skip incomplete conditions.

#### Checking training status

Monitor training progress on your GPU server using `check_remote_status.sh` **from your local machine**:

```bash
# Check status on default cluster (tensor02)
./check_remote_status.sh

# Check status on specific cluster
./check_remote_status.sh --cluster tensor01
./check_remote_status.sh --cluster tensor02
```

The script provides a comprehensive status report including:

**For completed models:**
- Number of completed seeds per author (out of 10)
- Final training loss (mean ± std across all completed seeds)

**For in-progress models:**
- Current epoch and progress percentage
- Current training loss
- Estimated time to completion (based on actual runtime per epoch)
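
The two derived quantities in the report (mean ± std of the final loss, and the time estimate) can be illustrated with a minimal Python sketch. This is not the script's actual code: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the estimate assumes a constant runtime per epoch.

```python
from statistics import mean, stdev


def summarize_final_losses(final_losses):
    """Return (mean, std) of final training losses across completed seeds.

    Hypothetical helper. Uses the sample standard deviation, which is an
    assumption; the real script may use the population form instead.
    """
    if not final_losses:
        raise ValueError("no completed seeds")
    # Sample standard deviation needs at least two values; report 0.0 for one seed.
    spread = stdev(final_losses) if len(final_losses) > 1 else 0.0
    return mean(final_losses), spread


def estimate_remaining_seconds(elapsed_seconds, epochs_done, total_epochs):
    """Extrapolate remaining time from the observed average runtime per epoch."""
    if epochs_done <= 0:
        return None  # no epoch finished yet, so there is nothing to extrapolate from
    seconds_per_epoch = elapsed_seconds / epochs_done
    return seconds_per_epoch * max(total_epochs - epochs_done, 0)


if __name__ == "__main__":
    m, s = summarize_final_losses([3.0, 3.2, 3.4])
    print(f"final loss: {m:.3f} ± {s:.3f}")
    eta = estimate_remaining_seconds(elapsed_seconds=3600, epochs_done=20, total_epochs=100)
    print(f"estimated time remaining: {eta / 3600:.1f} h")
```

Because the estimate is a simple linear extrapolation, it will drift if later epochs run faster or slower than the ones already completed.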

The script:

1. Connects to your GPU server using saved credentials (`.ssh/credentials_{cluster}.json`)
2. Analyzes all model directories and loss logs
3. Calculates statistics for completed models
4. Estimates remaining time based on actual training progress
5. Reports status for baseline and all variant models

**Prerequisites:** The script uses the same credentials file as `remote_train.sh`. If credentials aren't saved, you'll be prompted to enter them interactively.
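
To check ahead of time whether you will be prompted, you can test for the per-cluster credentials file yourself. The sketch below is illustrative only: the helper names are hypothetical, and it assumes `.ssh/` is resolved relative to a base directory you supply (for example the repository root), which the README does not state.

```python
from pathlib import Path


def credentials_path(cluster, base_dir="."):
    """Build the per-cluster path `.ssh/credentials_{cluster}.json`.

    Assumption: `.ssh` lives under `base_dir`; adjust if the scripts
    resolve it somewhere else.
    """
    return Path(base_dir) / ".ssh" / f"credentials_{cluster}.json"


def needs_interactive_prompt(cluster, base_dir="."):
    """Return True when no saved credentials file exists for this cluster."""
    return not credentials_path(cluster, base_dir).is_file()


if __name__ == "__main__":
    for name in ("tensor01", "tensor02"):
        state = "will prompt" if needs_interactive_prompt(name) else "credentials saved"
        print(f"{name}: {state}")
```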

### Model Configuration

Each model uses the same architecture and hyperparameters (applies to baseline and all variants):