
Commit 285dd05

Drastically simplify Training section
Removed verbose remote training documentation (-111 lines).

Before: 380 lines. After: 253 lines. Reduction: -127 lines (-33%).
Total README reduction: 752 → 253 lines (-499 lines, -66%). Well below the 350-line target!

Changes:
- Removed detailed Git setup instructions
- Removed verbose sync_models explanations
- Removed monitoring/status examples
- Removed redundant variant flag listings
- Replaced tensor02/tensor01 with mycluster/gpucluster examples
- Users directed to script --help for details

Related to #35
1 parent 7e3b5db commit 285dd05

README.md

Lines changed: 15 additions & 207 deletions
````diff
@@ -133,223 +133,31 @@ The paper analyzes three linguistic variants (Supplemental Figures S1-S8):
 
 ## Training Models from Scratch
 
-### Local Training
+Training 320 models (baseline + 3 variants) requires a CUDA GPU. See `models/README.md` for details.
 
+**Local training:**
 ```bash
-# Train baseline models
-./run_llm_stylometry.sh --train
-
-# Train analysis variants
-./run_llm_stylometry.sh --train --content-only    # Content variant
-./run_llm_stylometry.sh --train --function-only   # Function variant
-./run_llm_stylometry.sh --train --part-of-speech  # POS variant
-
-# Short flags
-./run_llm_stylometry.sh -t -co   # Content variant
-./run_llm_stylometry.sh -t -fo   # Function variant
-./run_llm_stylometry.sh -t -pos  # POS variant
-
-# Resume training from existing checkpoints
-./run_llm_stylometry.sh --train --resume
-./run_llm_stylometry.sh -t -r -co  # Resume content variant
-
-# Limit GPU usage if needed
-./run_llm_stylometry.sh --train --max-gpus 4
-```
-
-Each training run will:
-1. Clean and prepare the data if needed
-2. Train 80 models (8 authors × 10 seeds)
-3. Consolidate results into `data/model_results.pkl`
-
-**Resume Training**: The `--resume` flag allows you to continue training from existing checkpoints:
-- Models that have already met training criteria are automatically skipped
-- Partially trained models with saved weights resume from their last checkpoint
-- Models without weights are trained from scratch (even if loss logs exist)
-- Random states are restored from checkpoints to ensure consistent training continuation
-
-The training pipeline automatically handles data preparation, model training across available GPUs, and result consolidation. Individual model checkpoints and loss logs are saved in the `models/` directory.
-
-### Remote Training on GPU Server
-
-#### Prerequisites: Setting up Git credentials on the server
-
-Before using the remote training script, you need to set up Git credentials on your server once:
-
-1. SSH into your server:
-```bash
-ssh username@server
+./run_llm_stylometry.sh --train      # Baseline (80 models)
+./run_llm_stylometry.sh --train -co  # Content-only variant
+./run_llm_stylometry.sh -t -r        # Resume from checkpoints
 ```
 
-2. Configure Git with your credentials:
-```bash
-# Set your Git user information (use your GitHub username)
-git config --global user.name "your-github-username"
-git config --global user.email "[email protected]"
+**Remote training:**
 
-# Enable credential storage
-git config --global credential.helper store
+Requires a GPU cluster with SSH access. Create `.ssh/credentials_mycluster.json`:
+```json
+{"server": "hostname", "username": "user", "password": "pass"}
 ```
 
-3. Clone the repository with your Personal Access Token:
+Then, from your local machine:
 ```bash
-# Replace <username> and <token> with your GitHub username and Personal Access Token
-# Get a token from: https://github.com/settings/tokens (grant 'repo' scope)
-git clone https://<username>:<token>@github.com/ContextLab/llm-stylometry.git
-
-# The credentials will be stored for future use
-cd llm-stylometry
-git pull  # This should work without prompting for credentials
+./remote_train.sh --cluster mycluster         # Train baseline
+./remote_train.sh -co --cluster mycluster -r  # Resume content variant
+./check_remote_status.sh --cluster mycluster  # Monitor progress
+./sync_models.sh --cluster mycluster -a       # Download when complete
 ```
 
-#### Using the remote training script
-
-Once Git credentials are configured on your server, run `remote_train.sh` **from your local machine** (not on the GPU server):
-
-```bash
-# Train baseline models
-./remote_train.sh
-
-# Train analysis variants
-./remote_train.sh --content-only    # Content variant
-./remote_train.sh -fo               # Function variant (short flag)
-./remote_train.sh --part-of-speech  # POS variant
-
-# Resume training from existing checkpoints
-./remote_train.sh --resume  # Resume baseline
-./remote_train.sh -r -co    # Resume content variant
-
-# Kill existing training sessions
-./remote_train.sh --kill           # Kill and exit
-./remote_train.sh --kill --resume  # Kill and restart
-
-# You'll be prompted for:
-# - Server address (hostname or IP)
-# - Username
-```
-
-**What this script does:** The `remote_train.sh` script connects to your GPU server via SSH and executes `run_llm_stylometry.sh --train -y` (with any variant flags you specify) in a `screen` session. This allows you to disconnect your local machine while the GPU server continues training.
-
-The script will:
-1. SSH into your GPU server
-2. Update the repository in `~/llm-stylometry` (or clone it if it doesn't exist)
-3. Start training in a `screen` session with the specified options
-4. Exit, allowing your local machine to disconnect while training continues on the server
-
-#### Monitoring training progress
-
-To check on the training status, SSH into the server and reattach to the screen session:
-
-```bash
-# From your local machine
-ssh username@server
-
-# On the server, reattach to see live training output
-screen -r llm_training
-
-# To detach and leave training running, press Ctrl+A, then D
-# To exit SSH while keeping training running
-exit
-```
-
-#### Downloading results after training completes
-
-Once training is complete, use `sync_models.sh` **from your local machine** to download the trained models and results:
-
-```bash
-# Download baseline models only (default)
-./sync_models.sh
-
-# Download specific variants
-./sync_models.sh --content-only             # Content variant only
-./sync_models.sh --baseline --content-only  # Baseline + content
-./sync_models.sh --all                      # All conditions (320 models)
-
-# You'll be prompted for:
-# - Server address
-# - Username
-```
-
-**Variant Flags:**
-- `-b, --baseline`: Sync baseline models (80 models)
-- `-co, --content-only`: Sync content-only variant (80 models)
-- `-fo, --function-only`: Sync function-only variant (80 models)
-- `-pos, --part-of-speech`: Sync POS variant (80 models)
-- `-a, --all`: Sync all conditions (320 models total)
-- Flags are stackable: `./sync_models.sh -b -co` syncs baseline + content
-
-**How it works:**
-1. Checks which requested models are complete on the remote server (80 per condition)
-2. Only syncs complete model sets
-3. Uses rsync to download models with progress indication
-4. Backs up existing local models before replacing them
-5. Also syncs `model_results.pkl` if available
-
-**Note**: The script verifies models are complete before downloading. If training is in progress, it will show which models are missing and skip incomplete conditions.
-
-#### Checking training status
-
-Monitor training progress on your GPU server using `check_remote_status.sh` **from your local machine**:
-
-```bash
-# Check status on default cluster (tensor02)
-./check_remote_status.sh
-
-# Check status on specific cluster
-./check_remote_status.sh --cluster tensor01
-./check_remote_status.sh --cluster tensor02
-```
-
-The script provides a comprehensive status report including:
-
-**For completed models:**
-- Number of completed seeds per author (out of 10)
-- Final training loss (mean ± std across all completed seeds)
-
-**For in-progress models:**
-- Current epoch and progress percentage
-- Current training loss
-- Estimated time to completion (based on actual runtime per epoch)
-
-**Example output:**
-```
-================================================================================
-POS VARIANT MODELS
-================================================================================
-
-AUSTEN
---------------------------------------------------------------------------------
-Completed: 2/10 seeds
-Final training loss: 1.1103 ± 0.0003 (mean ± std)
-In-progress: 1 seed
-Seed 2: epoch 132/500 (26.4%) | loss: 1.2382 | ETA: 1d 1h 30m
-
---------------------------------------------------------------------------------
-Summary: 16/80 complete, 8 in progress
-Estimated completion: 1d 1h 30m (longest), 1d 0h 45m (average)
-```
-
-**How it works:**
-1. Connects to your GPU server using saved credentials (`.ssh/credentials_{cluster}.json`)
-2. Analyzes all model directories and loss logs
-3. Calculates statistics for completed models
-4. Estimates remaining time based on actual training progress
-5. Reports status for baseline and all variant models
-
-**Prerequisites:** The script uses the same credentials file as `remote_train.sh`. If credentials aren't saved, you'll be prompted to enter them interactively.
-
-### Model Configuration
-
-Each model uses the same architecture and hyperparameters (applies to baseline and all variants):
-- GPT-2 architecture with custom dimensions
-- 128 embedding dimensions
-- 8 transformer layers
-- 8 attention heads
-- 1024 maximum sequence length
-- Training on ~643,041 tokens per author
-- Early stopping at loss ≤ 3.0 (after minimum 500 epochs)
-
-**Note:** All analysis variants use identical training configurations, differing only in input text transformations. This ensures fair comparison across conditions.
+Trains in a detached screen session on the GPU server. See script help for full options.
 
 ## Data
 
````
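The `--resume` semantics spelled out in the removed README text (skip models that already met the training criteria, resume from saved weights with restored random states, otherwise train from scratch even if loss logs exist) reduce to a three-way decision per model. A minimal sketch of that policy, assuming hypothetical file names (`checkpoint.pt`, a `COMPLETE` marker) rather than the repository's actual layout:

```python
# Illustrative sketch of the --resume policy described in the removed
# README text. File names (checkpoint.pt, COMPLETE) and checkpoint keys
# are assumptions, not the repository's actual layout.
import random
from pathlib import Path

import torch

def plan_resume(model_dir: Path) -> str:
    ckpt = model_dir / "checkpoint.pt"  # hypothetical checkpoint file
    if (model_dir / "COMPLETE").exists():
        return "skip"  # training criteria already met
    if ckpt.exists():
        state = torch.load(ckpt)
        random.setstate(state["py_rng"])         # restore random states so the
        torch.set_rng_state(state["torch_rng"])  # continuation stays consistent
        return "resume"
    return "from_scratch"  # loss logs alone do not trigger a resume
```
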

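On the remote side, the removed documentation says `remote_train.sh` SSHes into the server, updates `~/llm-stylometry`, and launches `run_llm_stylometry.sh --train -y` inside a `screen` session named `llm_training`, using the per-cluster credentials file shown in the diff. A rough Python sketch of that launch pattern (illustrative only; the real script is a shell script, and password-based SSH auth would need extra handling):

```python
# Rough sketch of the remote-launch pattern described in the README:
# read .ssh/credentials_{cluster}.json, then start training in a detached
# screen session over SSH. Illustrative, not remote_train.sh itself.
import json
import subprocess
from pathlib import Path

def launch_remote(cluster: str, extra_flags: str = "") -> None:
    creds = json.loads(
        (Path(".ssh") / f"credentials_{cluster}.json").read_text()
    )
    target = f"{creds['username']}@{creds['server']}"
    # screen -dmS starts a named, detached session that keeps training
    # alive after the local machine disconnects
    remote_cmd = (
        "cd ~/llm-stylometry && "
        f"screen -dmS llm_training ./run_llm_stylometry.sh --train -y {extra_flags}"
    )
    subprocess.run(["ssh", target, remote_cmd], check=True)

launch_remote("mycluster")  # pass "-co", "-fo", "-pos", or "-r" as extra flags
```

Results come back the other way with `rsync`, which is what the removed `sync_models.sh` notes describe: verify that each requested condition has all 80 models on the server, back up local copies, then download.
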
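The removed `check_remote_status.sh` notes estimate completion time from the actual runtime per epoch. The extrapolation itself is one line of arithmetic; a sketch (parsing the loss logs, which is the real work, is omitted):

```python
# ETA arithmetic as described in the removed status notes: average the
# observed runtime per epoch, then extrapolate to the epoch target.
def eta_seconds(elapsed_s: float, epochs_done: int, target_epochs: int = 500) -> float:
    per_epoch = elapsed_s / max(epochs_done, 1)
    return per_epoch * (target_epochs - epochs_done)

# e.g. a model at epoch 132/500 after roughly 9.7 hours of runtime
print(f"ETA: {eta_seconds(35_000, 132) / 3600:.1f} h")
```
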
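Finally, the removed Model Configuration section fixes the architecture for every model: GPT-2 with 128 embedding dimensions, 8 layers, 8 attention heads, and a 1024-token context. Expressed with the HuggingFace `transformers` API (an assumption about the training stack, not confirmed by this commit), that is roughly:

```python
# Sketch of the hyperparameters from the removed "Model Configuration"
# section, using the HuggingFace transformers API (assumed, not confirmed).
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_embd=128,        # 128 embedding dimensions
    n_layer=8,         # 8 transformer layers
    n_head=8,          # 8 attention heads
    n_positions=1024,  # 1024 maximum sequence length
)
model = GPT2LMHeadModel(config)  # one such model per author × seed
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```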