
Commit e31386a

Merge pull request #32 from jeremymanning/main
Fix duplicate epoch logging and add remote training status monitoring
2 parents: 70253d2 + f38cd6f

14 files changed: +1999 -31 lines

.ssh/credentials_tensor01.json

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+{
+  "server": "tensor01.dartmouth.edu",
+  "username": "f002d6b",
+  "password": "yaf1wue7gev_WQB.ueb"
+}
.ssh/credentials_tensor02.json

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+{
+  "server": "tensor02.dartmouth.edu",
+  "username": "f002d6b",
+  "password": "yaf1wue7gev_WQB.ueb"
+}
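
The commit doesn't reproduce how these credential files are consumed (the training and status tools elsewhere in the repository are shell scripts), but the pattern they imply is straightforward: read `.ssh/credentials_{cluster}.json` if it exists, otherwise prompt interactively. Below is a minimal Python sketch of that pattern, assuming the paramiko library for the SSH connection; every name in it is illustrative, not code from this commit.

```python
# Illustrative sketch only: load .ssh/credentials_{cluster}.json, falling back to
# interactive prompts, then open an SSH session. The paramiko dependency and all
# function names are assumptions, not taken from this repository.
import json
import getpass
from pathlib import Path

import paramiko  # assumed; the actual shell scripts may call ssh/sshpass directly


def load_credentials(cluster: str) -> dict:
    """Read saved credentials for a cluster, or prompt if no file is saved."""
    path = Path(".ssh") / f"credentials_{cluster}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {
        "server": input(f"Server for {cluster}: "),
        "username": input("Username: "),
        "password": getpass.getpass("Password: "),
    }


def connect(cluster: str = "tensor02") -> paramiko.SSHClient:
    """Open an SSH connection to the requested cluster (tensor02 is the documented default)."""
    creds = load_credentials(cluster)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(creds["server"], username=creds["username"], password=creds["password"])
    return client
```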

README.md

Lines changed: 52 additions & 0 deletions

@@ -44,6 +44,7 @@ llm-stylometry/
 │   └── check_outputs.py      # Output validation script
 ├── run_llm_stylometry.sh     # Shell wrapper for easy setup
 ├── remote_train.sh           # Remote GPU server training script
+├── check_remote_status.sh    # Check training status on remote server
 ├── sync_models.sh            # Download models from remote server
 ├── LICENSE                   # MIT License
 ├── README.md                 # This file

@@ -412,6 +413,57 @@ Once training is complete, use `sync_models.sh` **from your local machine** to d

 **Note**: The script verifies models are complete before downloading. If training is in progress, it will show which models are missing and skip incomplete conditions.

+#### Checking training status
+
+Monitor training progress on your GPU server using `check_remote_status.sh` **from your local machine**:
+
+```bash
+# Check status on default cluster (tensor02)
+./check_remote_status.sh
+
+# Check status on specific cluster
+./check_remote_status.sh --cluster tensor01
+./check_remote_status.sh --cluster tensor02
+```
+
+The script provides a comprehensive status report including:
+
+**For completed models:**
+- Number of completed seeds per author (out of 10)
+- Final training loss (mean ± std across all completed seeds)
+
+**For in-progress models:**
+- Current epoch and progress percentage
+- Current training loss
+- Estimated time to completion (based on actual runtime per epoch)
+
+**Example output:**
+```
+================================================================================
+POS VARIANT MODELS
+================================================================================
+
+AUSTEN
+--------------------------------------------------------------------------------
+Completed: 2/10 seeds
+Final training loss: 1.1103 ± 0.0003 (mean ± std)
+In-progress: 1 seeds
+Seed 2: epoch 132/500 (26.4%) | loss: 1.2382 | ETA: 1d 1h 30m
+
+--------------------------------------------------------------------------------
+Summary: 16/80 complete, 8 in progress
+Estimated completion: 1d 1h 30m (longest), 1d 0h 45m (average)
+```
+
+**How it works:**
+1. Connects to your GPU server using saved credentials (`.ssh/credentials_{cluster}.json`)
+2. Analyzes all model directories and loss logs
+3. Calculates statistics for completed models
+4. Estimates remaining time based on actual training progress
+5. Reports status for baseline and all variant models
+
+**Prerequisites:** The script uses the same credentials file as `remote_train.sh`. If credentials aren't saved, you'll be prompted to enter them interactively.
+
 ### Model Configuration

 Each model uses the same architecture and hyperparameters (applies to baseline and all variants):
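
To make the "How it works" steps in the README addition concrete, here is a minimal sketch of steps 2-4: scanning per-seed loss logs, computing mean ± std of final losses for completed seeds, and estimating remaining time from observed per-epoch runtime. The directory layout, the `epoch,loss,unix_timestamp` log format, and all names are assumptions for illustration; the actual logic ships in `check_remote_status.sh`, whose body is not shown in this excerpt of the commit.

```python
# Illustrative sketch of the status report's completed/in-progress bookkeeping.
# The layout (author_dir/seed_*/loss_log.csv) and log format are assumptions.
import statistics
from pathlib import Path

TOTAL_EPOCHS = 500  # matches the example output ("epoch 132/500")
TOTAL_SEEDS = 10    # matches "Completed: x/10 seeds"


def read_log(path: Path) -> list[tuple[int, float, float]]:
    """Parse assumed 'epoch,loss,unix_timestamp' rows from one seed's loss log."""
    rows = []
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        epoch, loss, ts = line.split(",")
        rows.append((int(epoch), float(loss), float(ts)))
    return rows


def summarize_author(author_dir: Path) -> str:
    """Report completed-seed statistics and per-seed ETAs for one author's models."""
    final_losses, lines = [], []
    for log in sorted(author_dir.glob("seed_*/loss_log.csv")):
        rows = read_log(log)
        if not rows:
            continue
        epoch, loss, _ = rows[-1]
        if epoch >= TOTAL_EPOCHS:
            final_losses.append(loss)  # completed seed: keep its final loss
        elif len(rows) > 1:
            # Seconds per epoch from actual timestamps, then a remaining-time estimate.
            sec_per_epoch = (rows[-1][2] - rows[0][2]) / (rows[-1][0] - rows[0][0])
            eta_hours = sec_per_epoch * (TOTAL_EPOCHS - epoch) / 3600
            lines.append(f"  {log.parent.name}: epoch {epoch}/{TOTAL_EPOCHS} "
                         f"({100 * epoch / TOTAL_EPOCHS:.1f}%) | loss: {loss:.4f} "
                         f"| ETA: {eta_hours:.1f}h")
    if final_losses:
        mean = statistics.mean(final_losses)
        std = statistics.stdev(final_losses) if len(final_losses) > 1 else 0.0
        lines.insert(0, f"Completed: {len(final_losses)}/{TOTAL_SEEDS} seeds "
                        f"| final loss: {mean:.4f} ± {std:.4f}")
    return "\n".join(lines)
```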
