Commit f945450

Merge pull request #9 from jeremymanning/main
Simplify remote training and improve documentation
2 parents a8c44f2 + 7ed1584 commit f945450

6 files changed: +210 -67 lines changed

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -23,4 +23,7 @@ htmlcov/
 tests/output_*/
 tests/data/*.csv
 tests/data/*.pkl
-!tests/data/test_model_results.pkl
+!tests/data/test_model_results.pkl
+
+# Temporary test files
+.test_credentials

README.md

Lines changed: 58 additions & 13 deletions
@@ -188,34 +188,79 @@ The training pipeline automatically handles data preparation, model training acr

 ### Remote Training on GPU Server

-For training on a remote GPU server, use the provided `remote_train.sh` script:
+#### Prerequisites: Setting up Git credentials on the server
+
+Before using the remote training script, you need to set up Git credentials on your server once:
+
+1. SSH into your server:
+```bash
+ssh username@server
+```
+
+2. Configure Git with your credentials:
+```bash
+# Set your Git user information (use your GitHub username)
+git config --global user.name "your-github-username"
+git config --global user.email "[email protected]"
+
+# Enable credential storage
+git config --global credential.helper store
+```
+
+3. Clone the repository with your Personal Access Token:
+```bash
+# Replace <username> and <token> with your GitHub username and Personal Access Token
+# Get a token from: https://github.com/settings/tokens (grant 'repo' scope)
+git clone https://<username>:<token>@github.com/ContextLab/llm-stylometry.git
+
+# The credentials will be stored for future use
+cd llm-stylometry
+git pull # This should work without prompting for credentials
+```
+
+#### Using the remote training script
+
+Once Git credentials are configured on your server, run `remote_train.sh` **from your local machine** (not on the GPU server):

 ```bash
-# Start remote training
+# From your local machine, start training on the remote GPU server
 ./remote_train.sh

+# Kill existing training sessions and optionally start new one
+./remote_train.sh --kill # or -k
+
 # You'll be prompted for:
 # - Server address (hostname or IP)
 # - Username
-# - Password (for SSH)
 ```

-This script will:
-1. Connect to your GPU server via SSH
-2. Clone or update the repository in `~/llm-stylometry`
-3. Start training in a `screen` session that persists after disconnection
-4. Allow you to safely disconnect while training continues
+**What this script does:** The `remote_train.sh` script connects to your GPU server via SSH and executes `run_llm_stylometry.sh --train -y` in a `screen` session. This allows you to disconnect your local machine while the GPU server continues training.
+
+The script will:
+1. SSH into your GPU server
+2. Update the repository in `~/llm-stylometry` (or clone if it doesn't exist)
+3. Start `run_llm_stylometry.sh --train -y` in a `screen` session
+4. Exit, allowing your local machine to disconnect while training continues on the server
+
+#### Monitoring training progress
+
+To check on the training status, SSH into the server and reattach to the screen session:

-To monitor training progress:
 ```bash
+# From your local machine
 ssh username@server
-screen -r llm_training # Reattach to training session
-# Press Ctrl+A, then D to detach again
+
+# On the server, reattach to see live training output
+screen -r llm_training
+
+# To detach and leave training running, press Ctrl+A, then D
+# To exit SSH while keeping training running
+exit
 ```

-### Downloading Trained Models
+#### Downloading results after training completes

-After training completes on a remote server, use `sync_models.sh` to download the models:
+Once training is complete, use `sync_models.sh` **from your local machine** to download the trained models and results:

 ```bash
 # Download trained models from server

code/generate_figures.py

Lines changed: 16 additions & 6 deletions
@@ -21,7 +21,7 @@
 from llm_stylometry.cli_utils import safe_print, format_header, is_windows


-def train_models(max_gpus=None):
+def train_models(max_gpus=None, no_confirm=False):
     """Train all models from scratch."""
     safe_print("\n" + "=" * 60)
     safe_print("Training Models from Scratch")
@@ -42,10 +42,14 @@ def train_models(max_gpus=None):
     safe_print(f" Device: {device_info}")
     safe_print(" Training time depends on hardware (hours on GPU, days on CPU)")

-    response = input("\nProceed with training? [y/N]: ")
-    if response.lower() != 'y':
-        safe_print("Training cancelled.")
-        return False
+    if not no_confirm:
+        response = input("\nProceed with training? [y/N]: ")
+        if response.lower() != 'y':
+            safe_print("Training cancelled.")
+            return False
+    else:
+        safe_print("\nSkipping confirmation (--no-confirm flag set)")
+        safe_print("Starting training...")

     # Remove existing models directory to train from scratch
     import shutil
@@ -217,6 +221,12 @@ def main():
         default=None
     )

+    parser.add_argument(
+        '--no-confirm', '-y',
+        action='store_true',
+        help='Skip confirmation prompts (useful for non-interactive mode)'
+    )
+
     args = parser.parse_args()

     if args.list:
@@ -234,7 +244,7 @@ def main():

     # Train models if requested
     if args.train:
-        if not train_models(max_gpus=args.max_gpus):
+        if not train_models(max_gpus=args.max_gpus, no_confirm=args.no_confirm):
             return 1
         # Update data path to use newly generated results
         args.data = 'data/model_results.pkl'
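
Review note: the confirmation gate added to `train_models` can be isolated into a small, testable unit. The following is a minimal sketch, not the code in this commit. The names `confirm`, `build_parser`, and the injectable `ask` parameter are hypothetical and exist only for illustration; the flag spelling (`--no-confirm`, `-y`) and the `!= 'y'` comparison are taken from the diff above.

```python
import argparse


def confirm(no_confirm, ask=input):
    """Return True if training should proceed.

    `ask` is a hypothetical injection point so the prompt can be
    replaced in tests; it defaults to the built-in input().
    """
    if no_confirm:
        # Non-interactive mode: never block on a prompt.
        return True
    response = ask("\nProceed with training? [y/N]: ")
    return response.lower() == "y"


def build_parser():
    """Build a reduced parser with only the two flags relevant here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", "-t", action="store_true")
    parser.add_argument(
        "--no-confirm", "-y",
        action="store_true",
        help="Skip confirmation prompts (useful for non-interactive mode)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--train", "-y"])
    # argparse maps "--no-confirm" to the attribute "no_confirm".
    print(args.train, args.no_confirm)
```

Passing the prompt function as a parameter keeps the interactive path testable without a terminal, which matters because the script is launched inside a detached `screen` session where no one can answer the prompt.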

remote_train.sh

Lines changed: 94 additions & 36 deletions
@@ -21,6 +21,18 @@ echo "=================================================="
 echo " LLM Stylometry Remote Training Setup"
 echo "=================================================="
 echo
+echo "Usage: $0 [options]"
+echo "Options:"
+echo " --kill, -k Kill existing training sessions before starting new one"
+echo
+
+# Check for --kill flag
+if [ "$1" = "--kill" ] || [ "$1" = "-k" ]; then
+    echo "Kill mode: Will terminate existing training sessions"
+    KILL_MODE=true
+else
+    KILL_MODE=false
+fi

 # Get server details
 read -p "Enter GPU server address (hostname or IP): " SERVER_ADDRESS
@@ -35,8 +47,17 @@ if [ -z "$USERNAME" ]; then
     exit 1
 fi

-# Create the remote training script
-REMOTE_SCRIPT='
+print_info "Connecting to $USERNAME@$SERVER_ADDRESS..."
+
+# Test SSH connection first
+if ! ssh -o ConnectTimeout=5 -o BatchMode=yes "$USERNAME@$SERVER_ADDRESS" "echo 'Connection test successful'" 2>/dev/null; then
+    print_warning "Initial connection test failed. Trying with interactive authentication..."
+fi
+
+echo
+
+# Execute the remote script via SSH
+ssh -t "$USERNAME@$SERVER_ADDRESS" "KILL_MODE='$KILL_MODE' bash -s" << 'ENDSSH'
 #!/bin/bash
 set -e

@@ -45,27 +66,40 @@ echo "Setting up LLM Stylometry on remote server"
 echo "=================================================="
 echo

-# Check if repo exists
-if [ -d "$HOME/llm-stylometry" ]; then
-    echo "Repository exists. Updating to latest version..."
-    cd "$HOME/llm-stylometry"
+# Check if we're in kill mode
+if [ "$KILL_MODE" = "true" ]; then
+    echo "Kill mode activated - terminating existing training sessions..."

-    # Stash any local changes
-    if ! git diff --quiet || ! git diff --cached --quiet; then
-        echo "Stashing local changes..."
-        git stash
-    fi
+    # Kill any existing screen sessions
+    screen -ls | grep -o '[0-9]*\.llm_training' | cut -d. -f1 | while read pid; do
+        if [ ! -z "$pid" ]; then
+            echo "Killing screen session with PID: $pid"
+            screen -X -S "$pid.llm_training" quit
+        fi
+    done
+
+    # Also kill any remaining python training processes
+    pkill -f "python.*generate_figures.py.*--train" 2>/dev/null || true
+
+    echo "All training sessions terminated."
+    echo ""
+
+    # In non-interactive mode, always start new training after killing
+    echo "Starting new training session..."
+    echo ""
+fi

-    # Update repository
-    git fetch origin
-    git checkout main
-    git pull origin main
+# Check if repository exists
+if [ -d ~/llm-stylometry ]; then
+    echo "Repository exists. Updating..."
+    cd ~/llm-stylometry
+    git pull
     echo "Repository updated successfully"
 else
-    echo "Cloning repository..."
-    cd "$HOME"
+    echo "Repository not found. Cloning..."
+    cd ~
     git clone https://github.com/ContextLab/llm-stylometry.git
-    cd "$HOME/llm-stylometry"
+    cd ~/llm-stylometry
     echo "Repository cloned successfully"
 fi

@@ -82,8 +116,8 @@ if ! command -v screen &> /dev/null; then
 fi

 # Create log directory
-mkdir -p "$HOME/llm-stylometry/logs"
-LOG_FILE="$HOME/llm-stylometry/logs/training_$(date +%Y%m%d_%H%M%S).log"
+mkdir -p ~/llm-stylometry/logs
+LOG_FILE=~/llm-stylometry/logs/training_$(date +%Y%m%d_%H%M%S).log

 echo ""
 echo "=================================================="
@@ -95,26 +129,57 @@ echo ""
 echo "Useful commands:"
 echo " - Detach from screen: Ctrl+A, then D"
 echo " - Reattach later: screen -r llm_training"
-echo " - View log: tail -f $LOG_FILE"
+echo " - View log: tail -f ~/llm-stylometry/logs/training_*.log"
 echo ""
 echo "Starting training in 5 seconds..."
 sleep 5

 # Kill any existing screen session with the same name
 screen -X -S llm_training quit 2>/dev/null || true

-# Start training in screen
-screen -dmS llm_training bash -c "
-cd $HOME/llm-stylometry
-echo 'Training started at $(date)' | tee -a $LOG_FILE
-./run_llm_stylometry.sh --train 2>&1 | tee -a $LOG_FILE
-echo 'Training completed at $(date)' | tee -a $LOG_FILE
-"
+# Start training in screen (use --no-confirm flag for non-interactive mode)
+# Create a script file first
+cat > /tmp/llm_train.sh << 'TRAINSCRIPT'
+#!/bin/bash
+set -e # Exit on error
+
+# Change to the repository directory
+cd ~/llm-stylometry
+
+# Create log directory and file
+mkdir -p logs
+LOG_FILE=~/llm-stylometry/logs/training_$(date +%Y%m%d_%H%M%S).log
+echo "Training started at $(date)" | tee $LOG_FILE
+
+# Check if the run script exists
+if [ ! -f ./run_llm_stylometry.sh ]; then
+    echo "ERROR: run_llm_stylometry.sh not found in $(pwd)!" | tee -a $LOG_FILE
+    ls -la | tee -a $LOG_FILE
+    exit 1
+fi
+
+# Make sure it's executable
+chmod +x ./run_llm_stylometry.sh
+
+# Run the training script with non-interactive flag
+echo "Starting training with run_llm_stylometry.sh..." | tee -a $LOG_FILE
+./run_llm_stylometry.sh --train -y 2>&1 | tee -a $LOG_FILE
+
+echo "Training completed at $(date)" | tee -a $LOG_FILE
+TRAINSCRIPT
+
+chmod +x /tmp/llm_train.sh
+
+# Start screen session
+screen -dmS llm_training /tmp/llm_train.sh

 # Wait a moment for screen to start
 sleep 2

 # Check if screen session started
+echo "Checking screen sessions:"
+screen -list
+
 if screen -list | grep -q "llm_training"; then
     echo ""
     echo "✓ Training started successfully in screen session!"
@@ -138,14 +203,7 @@ else
     echo "Error: Failed to start screen session"
     exit 1
 fi
-'
-
-# Execute the remote script via SSH
-print_info "Connecting to $USERNAME@$SERVER_ADDRESS..."
-print_info "You may be prompted for your password and/or GitHub credentials."
-echo
-
-ssh -t "$USERNAME@$SERVER_ADDRESS" "$REMOTE_SCRIPT"
+ENDSSH

 RESULT=$?
 if [ $RESULT -eq 0 ]; then
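
Review note: the kill-mode pipeline above (`screen -ls | grep -o '[0-9]*\.llm_training' | cut -d. -f1`) extracts the PID prefix of every session named `llm_training`. The same extraction can be sketched in Python to make the matching rule explicit. This is an illustration only; the function name `training_session_pids` and the sample listing are hypothetical, and the sample text only approximates what `screen -ls` prints.

```python
import re


def training_session_pids(screen_ls_output, name="llm_training"):
    """Return the PIDs (as strings) of screen sessions called `name`.

    Unlike the grep pattern '[0-9]*\\.llm_training', this requires at
    least one digit, so no empty PID can be produced and the shell
    script's `-z "$pid"` guard has no counterpart here.
    """
    pattern = rf"(\d+)\.{re.escape(name)}"
    return re.findall(pattern, screen_ls_output)


# Hypothetical listing, shaped like typical `screen -ls` output.
sample = (
    "There are screens on:\n"
    "\t12345.llm_training\t(Detached)\n"
    "\t67890.other_job\t(Detached)\n"
    "2 Sockets in /run/screen/S-user.\n"
)
print(training_session_pids(sample))  # prints ['12345']
```

As with the shell pattern, a session whose name merely starts with `llm_training` (for example `llm_training_v2`) would also match; anchoring the name more strictly is left out to stay close to the behavior in the diff.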

run_llm_stylometry.sh

Lines changed: 10 additions & 0 deletions
@@ -35,6 +35,7 @@ OPTIONS:
 -h, --help Show this help message
 -f, --figure FIGURE Generate specific figure (1a, 1b, 2a, 2b, 3, 4, 5)
 -t, --train Train models from scratch before generating figures
+-y, --yes, --no-confirm Skip confirmation prompts (non-interactive mode)
 -g, --max-gpus NUM Maximum number of GPUs to use for training (default: all)
 -d, --data PATH Path to model_results.pkl (default: data/model_results.pkl)
 -o, --output DIR Output directory for figures (default: paper/figs/source)
@@ -289,6 +290,7 @@ SKIP_SETUP=false
 FORCE_INSTALL=false
 CLEAN=false
 CLEAN_CACHE=false
+NO_CONFIRM=false

 while [[ $# -gt 0 ]]; do
     case $1 in
@@ -304,6 +306,10 @@ while [[ $# -gt 0 ]]; do
             TRAIN=true
             shift
             ;;
+        -y|--yes|--no-confirm)
+            NO_CONFIRM=true
+            shift
+            ;;
         -g|--max-gpus)
             MAX_GPUS="$2"
             shift 2
@@ -421,6 +427,10 @@ if [ -n "$MAX_GPUS" ]; then
     PYTHON_CMD="$PYTHON_CMD --max-gpus $MAX_GPUS"
 fi

+if [ "$NO_CONFIRM" = true ]; then
+    PYTHON_CMD="$PYTHON_CMD --no-confirm"
+fi
+
 if [ "$DATA_PATH" != "data/model_results.pkl" ]; then
     PYTHON_CMD="$PYTHON_CMD --data $DATA_PATH"
 fi
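
Review note: the wrapper script forwards `-y`/`--yes`/`--no-confirm` to the Python entry point by appending `--no-confirm` to `PYTHON_CMD`. A rough Python equivalent of that forwarding step, limited to the three flags visible in this hunk, might look like the sketch below. The function `build_python_args` and its parameter names are hypothetical, and building a list instead of a string is a substitution made here, not what the script does.

```python
DEFAULT_DATA_PATH = "data/model_results.pkl"


def build_python_args(max_gpus=None, no_confirm=False,
                      data_path=DEFAULT_DATA_PATH):
    """Assemble the extra arguments passed on to generate_figures.py.

    Mirrors the three conditionals shown in the diff: each option is
    forwarded only when it differs from its default.
    """
    args = []
    if max_gpus is not None:
        args += ["--max-gpus", str(max_gpus)]
    if no_confirm:
        args.append("--no-confirm")
    if data_path != DEFAULT_DATA_PATH:
        args += ["--data", data_path]
    return args


print(build_python_args(max_gpus=2, no_confirm=True))
# prints ['--max-gpus', '2', '--no-confirm']
```

Keeping the arguments as a list sidesteps the word-splitting issues that string concatenation can run into when a value such as the data path contains spaces.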
