3. Consolidate results into `data/model_results.pkl`

**Resume Training**: The `--resume` flag allows you to continue training from existing checkpoints:

- Models that have already met training criteria are automatically skipped
- Partially trained models with saved weights resume from their last checkpoint
- Models without weights are trained from scratch (even if loss logs exist)
- Random states are restored from checkpoints to ensure consistent training continuation

The training pipeline automatically handles data preparation, model training across available GPUs, and result consolidation. Individual model checkpoints and loss logs are saved in the `models/` directory.
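
A minimal sketch of that per-model decision, assuming illustrative file names (`training_complete`, `checkpoint.pt`) rather than the pipeline's actual layout:

```bash
# Sketch only: the completion marker and checkpoint file names below are
# assumptions for illustration, not the pipeline's real file layout.
decide_action() {
  local model_dir="$1"
  if [ -f "$model_dir/training_complete" ]; then
    echo "skip"      # training criteria already met
  elif [ -f "$model_dir/checkpoint.pt" ]; then
    echo "resume"    # saved weights exist: continue from the last checkpoint
  else
    echo "scratch"   # no weights: train from scratch, even if loss logs exist
  fi
}
```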

### Remote Training on GPU Server

#### Prerequisites: Setting up Git credentials on the server

Before using the remote training script, you need to set up Git credentials on your server once:

```bash
./remote_train.sh -fo              # Function variant (short flag)
./remote_train.sh --part-of-speech # POS variant

# Resume training from existing checkpoints
./remote_train.sh --resume         # Resume baseline
./remote_train.sh -r -co           # Resume content variant

# Kill existing training sessions
./remote_train.sh --kill           # Kill and exit
./remote_train.sh --kill --resume  # Kill and restart

# You'll be prompted for:
# - Server address (hostname or IP)
# - Username
```

**What this script does:** The `remote_train.sh` script connects to your GPU server via SSH and executes `run_llm_stylometry.sh --train -y` (with any variant flags you specify) in a `screen` session. This allows you to disconnect your local machine while the GPU server continues training.

The script will:

1. SSH into your GPU server
2. Update the repository in `~/llm-stylometry` (or clone it if it doesn't exist)
3. Start training in a `screen` session with the specified options
4. Exit, allowing your local machine to disconnect while training continues on the server
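
The mechanism can be sketched as follows; the helper below is hypothetical and only illustrates how the detached-session command might be assembled (`screen -dmS` starts a detached, named session — the `llm_training` name matches the monitoring instructions):

```bash
# Hypothetical helper (not the script's actual code): builds the command
# remote_train.sh would execute on the server over SSH.
build_remote_cmd() {
  local flags="$1"   # e.g. "--resume" or "-r -co"
  echo "cd ~/llm-stylometry && git pull && screen -dmS llm_training ./run_llm_stylometry.sh --train -y $flags"
}

# The script would then run something like:
#   ssh "$username@$server" "$(build_remote_cmd "--resume")"
```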

#### Monitoring training progress

To check on the training status, SSH into the server and reattach to the screen session:

```bash
# From your local machine
ssh username@server

# On the server, reattach to see live training output
screen -r llm_training

# To detach and leave training running, press Ctrl+A, then D
# To exit SSH while keeping training running
exit
```

#### Downloading results after training completes

Once training is complete, use `sync_models.sh` **from your local machine** to download the trained models and results:

```bash
# Download baseline models only (default)
./sync_models.sh

# Download specific variants
./sync_models.sh --content-only    # Content variant only
```

The script:

1. Checks which requested models are complete on the remote server (80 per condition)
2. Only syncs complete model sets
3. Uses rsync to download models with progress indication
4. Backs up existing local models before replacing them
5. Also syncs `model_results.pkl` if available

**Note**: The script verifies models are complete before downloading. If training is in progress, it will show which models are missing and skip incomplete conditions.

#### Checking training status

Monitor training progress on your GPU server using `check_remote_status.sh` **from your local machine**:

```bash
# Check status on default cluster (tensor02)
./check_remote_status.sh

# Check status on a specific cluster
./check_remote_status.sh --cluster tensor01
./check_remote_status.sh --cluster tensor02
```

The script provides a comprehensive status report including:

**For completed models:**

- Number of completed seeds per author (out of 10)
- Final training loss (mean ± std across all completed seeds)

**For in-progress models:**

- Current epoch and progress percentage
- Current training loss
- Estimated time to completion (based on actual runtime per epoch)
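
The estimate is simple arithmetic on the observed rate; an illustrative helper (variable names invented — the actual script derives per-epoch time from its logs, and the true end epoch also depends on when the loss criterion is met):

```bash
# Illustrative only: remaining epochs times observed seconds per epoch.
estimate_eta_min() {
  local current_epoch=$1 target_epoch=$2 sec_per_epoch=$3
  echo $(( (target_epoch - current_epoch) * sec_per_epoch / 60 ))
}

estimate_eta_min 350 500 12   # 150 remaining epochs at 12 s/epoch -> 30 min
```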

The script:

1. Connects to your GPU server using saved credentials (`.ssh/credentials_{cluster}.json`)
2. Analyzes all model directories and loss logs
3. Calculates statistics for completed models
4. Estimates remaining time based on actual training progress
5. Reports status for baseline and all variant models

**Prerequisites:** The script uses the same credentials file as `remote_train.sh`. If credentials aren't saved, you'll be prompted to enter them interactively.
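
The schema of `.ssh/credentials_{cluster}.json` isn't documented here; based on the two values the scripts prompt for (server address and username), a plausible minimal shape might be — field names are assumptions, not the scripts' actual keys:

```json
{
  "server": "tensor02",
  "username": "your-username"
}
```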

### Model Configuration

Each model uses the same architecture and hyperparameters (this applies to the baseline and all variants):

- GPT-2 architecture with custom dimensions
- 128 embedding dimensions
- 8 transformer layers
- 8 attention heads
- 1024 maximum sequence length
- Training on ~643,041 tokens per author
- Early stopping at loss ≤ 3.0 (after a minimum of 500 epochs)

**Note:** All analysis variants use identical training configurations, differing only in input text transformations. This ensures a fair comparison across conditions.