
Commit a8c44f2

Merge pull request #8 from jeremymanning/main
Add remote training and model sync scripts for distributed GPU training
2 parents f6bfaee + 783a8a6 commit a8c44f2

4 files changed: +440, -3 lines changed

README.md

Lines changed: 58 additions & 0 deletions
@@ -42,6 +42,8 @@ llm-stylometry/
 │   ├── test_*.py              # Test modules
 │   └── check_outputs.py       # Output validation script
 ├── run_llm_stylometry.sh      # Shell wrapper for easy setup
+├── remote_train.sh            # Remote GPU server training script
+├── sync_models.sh             # Download models from remote server
 ├── LICENSE                    # MIT License
 ├── README.md                  # This file
 ├── requirements-dev.txt       # Development dependencies
@@ -167,9 +169,14 @@ fig = generate_all_losses_figure(
 
 **Note**: Training requires a CUDA-enabled GPU and takes significant time (~80 models total).
 
+### Local Training
+
 ```bash
 # Using the CLI (recommended - handles all steps automatically)
 ./run_llm_stylometry.sh --train
+
+# Limit GPU usage if needed
+./run_llm_stylometry.sh --train --max-gpus 4
 ```
 
 This command will:
@@ -179,6 +186,57 @@ This command will:
 
 The training pipeline automatically handles data preparation, model training across available GPUs, and result consolidation. Individual model checkpoints and loss logs are saved in the `models/` directory.
 
+### Remote Training on GPU Server
+
+For training on a remote GPU server, use the provided `remote_train.sh` script:
+
+```bash
+# Start remote training
+./remote_train.sh
+
+# You'll be prompted for:
+# - Server address (hostname or IP)
+# - Username
+# - Password (for SSH)
+```
+
+This script will:
+1. Connect to your GPU server via SSH
+2. Clone or update the repository in `~/llm-stylometry`
+3. Start training in a `screen` session that persists after disconnection
+4. Allow you to safely disconnect while training continues
+
+To monitor training progress:
+```bash
+ssh username@server
+screen -r llm_training  # Reattach to training session
+# Press Ctrl+A, then D to detach again
+```
+
+### Downloading Trained Models
+
+After training completes on a remote server, use `sync_models.sh` to download the models:
+
+```bash
+# Download trained models from server
+./sync_models.sh
+
+# You'll be prompted for:
+# - Server address
+# - Username
+# - Password
+```
+
+This script will:
+1. Verify all 80 models are complete with weights
+2. Create a compressed archive on the server
+3. Download via rsync with progress indication
+4. Extract to your local `~/llm-stylometry/models/` directory
+5. Back up any existing local models
+6. Also sync `model_results.pkl` if available
+
+**Note**: The script will only download if all 80 models are complete. If training is still in progress, it will show which models are missing.
+
 ### Model Configuration
 
 Each model uses:
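
The README changes above refer to `sync_models.sh`, which appears to be the fourth changed file in this commit but is not expanded in this view. As a rough illustration of the kind of transfer it automates (the server address, username, archive name, and weight-file name below are assumptions made for the sketch, not the script's actual implementation):

```bash
# Hypothetical sketch only -- not the contents of sync_models.sh.
SERVER=gpu.example.edu   # placeholder server address
USER=me                  # placeholder username

# Count models on the server that already have saved weights (assumed layout/filename).
ssh "$USER@$SERVER" 'ls ~/llm-stylometry/models/*/pytorch_model.bin 2>/dev/null | wc -l'

# Archive the models remotely, pull the archive down with rsync,
# back up any existing local models, and unpack the download.
ssh "$USER@$SERVER" 'tar -czf ~/models.tar.gz -C ~/llm-stylometry models'
rsync -avz --progress "$USER@$SERVER:models.tar.gz" /tmp/models.tar.gz
mv ~/llm-stylometry/models ~/llm-stylometry/models.bak 2>/dev/null || true
tar -xzf /tmp/models.tar.gz -C ~/llm-stylometry
```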

code/generate_figures.py

Lines changed: 15 additions & 3 deletions
@@ -72,10 +72,22 @@ def train_models(max_gpus=None):
     # Train models
     safe_print("\nTraining models...")
     try:
-        # Set environment to disable tqdm and multiprocessing (which can hang in subprocess)
+        # Set environment variables for training
         env = os.environ.copy()
-        env['DISABLE_TQDM'] = '1'
-        env['NO_MULTIPROCESSING'] = '1'
+        env['DISABLE_TQDM'] = '1'  # Disable progress bars in subprocess
+        # Only disable multiprocessing if we have a single GPU or non-GPU device
+        # With multiple GPUs, we want parallel training
+        if torch.cuda.is_available():
+            gpu_count = torch.cuda.device_count()
+            if gpu_count <= 1:
+                env['NO_MULTIPROCESSING'] = '1'
+                safe_print("Single GPU detected - using sequential mode")
+            else:
+                safe_print(f"Multiple GPUs detected ({gpu_count}) - using parallel training")
+        else:
+            # Non-CUDA device (CPU or MPS)
+            env['NO_MULTIPROCESSING'] = '1'
+            safe_print("Non-CUDA device - using sequential mode")
         # Set PyTorch memory management for better GPU memory usage
         env['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
         # Pass through max GPUs limit if specified
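
The new branch keys sequential vs. parallel mode off `torch.cuda.is_available()` and `torch.cuda.device_count()`. To see which path a given machine will take before starting a long run, a quick check from the shell (illustrative; assumes an NVIDIA driver and a Python environment with torch installed):

```bash
# How many CUDA GPUs will the training subprocess see?
nvidia-smi -L                                                # list visible NVIDIA GPUs
python -c "import torch; print(torch.cuda.device_count())"   # count as seen by PyTorch
```

A count of 0 or 1 corresponds to the sequential path above; 2 or more enables parallel training.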

remote_train.sh

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
+#!/bin/bash
+
+# Remote Training Script for LLM Stylometry
+# This script connects to a GPU server, clones/updates the repository, and starts training
+
+set -e
+
+# Color codes for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+print_info() { echo -e "${BLUE}[INFO]${NC} $1"; }
+print_success() { echo -e "${GREEN}[SUCCESS]${NC} $1"; }
+print_warning() { echo -e "${YELLOW}[WARNING]${NC} $1"; }
+print_error() { echo -e "${RED}[ERROR]${NC} $1"; }
+
+echo "=================================================="
+echo " LLM Stylometry Remote Training Setup"
+echo "=================================================="
+echo
+
+# Get server details
+read -p "Enter GPU server address (hostname or IP): " SERVER_ADDRESS
+if [ -z "$SERVER_ADDRESS" ]; then
+    print_error "Server address cannot be empty"
+    exit 1
+fi
+
+read -p "Enter username for $SERVER_ADDRESS: " USERNAME
+if [ -z "$USERNAME" ]; then
+    print_error "Username cannot be empty"
+    exit 1
+fi
+
+# Create the remote training script
+REMOTE_SCRIPT='
+#!/bin/bash
+set -e
+
+echo "=================================================="
+echo "Setting up LLM Stylometry on remote server"
+echo "=================================================="
+echo
+
+# Check if repo exists
+if [ -d "$HOME/llm-stylometry" ]; then
+    echo "Repository exists. Updating to latest version..."
+    cd "$HOME/llm-stylometry"
+
+    # Stash any local changes
+    if ! git diff --quiet || ! git diff --cached --quiet; then
+        echo "Stashing local changes..."
+        git stash
+    fi
+
+    # Update repository
+    git fetch origin
+    git checkout main
+    git pull origin main
+    echo "Repository updated successfully"
+else
+    echo "Cloning repository..."
+    cd "$HOME"
+    git clone https://github.com/ContextLab/llm-stylometry.git
+    cd "$HOME/llm-stylometry"
+    echo "Repository cloned successfully"
+fi
+
+# Check for screen
+if ! command -v screen &> /dev/null; then
+    echo "Installing screen..."
+    if command -v apt-get &> /dev/null; then
+        sudo apt-get update && sudo apt-get install -y screen
+    elif command -v yum &> /dev/null; then
+        sudo yum install -y screen
+    else
+        echo "Warning: Could not install screen. Please install it manually."
+    fi
+fi
+
+# Create log directory
+mkdir -p "$HOME/llm-stylometry/logs"
+LOG_FILE="$HOME/llm-stylometry/logs/training_$(date +%Y%m%d_%H%M%S).log"
+
+echo ""
+echo "=================================================="
+echo "Starting training in screen session"
+echo "=================================================="
+echo "Training will run in a screen session named: llm_training"
+echo "Log file: $LOG_FILE"
+echo ""
+echo "Useful commands:"
+echo " - Detach from screen: Ctrl+A, then D"
+echo " - Reattach later: screen -r llm_training"
+echo " - View log: tail -f $LOG_FILE"
+echo ""
+echo "Starting training in 5 seconds..."
+sleep 5
+
+# Kill any existing screen session with the same name
+screen -X -S llm_training quit 2>/dev/null || true
+
+# Start training in screen
+screen -dmS llm_training bash -c "
+    cd $HOME/llm-stylometry
+    echo 'Training started at $(date)' | tee -a $LOG_FILE
+    ./run_llm_stylometry.sh --train 2>&1 | tee -a $LOG_FILE
+    echo 'Training completed at $(date)' | tee -a $LOG_FILE
+"
+
+# Wait a moment for screen to start
+sleep 2
+
+# Check if screen session started
+if screen -list | grep -q "llm_training"; then
+    echo ""
+    echo "✓ Training started successfully in screen session!"
+    echo ""
+    echo "The training is now running in the background."
+    echo "You can safely disconnect from SSH."
+    echo ""
+    echo "To monitor progress, reconnect and run:"
+    echo " screen -r llm_training"
+    echo ""
+    echo "Or view the log file:"
+    echo " tail -f $LOG_FILE"
+
+    # Attach to screen session
+    echo ""
+    echo "Attaching to screen session in 3 seconds..."
+    echo "(Press Ctrl+A, then D to detach and leave training running)"
+    sleep 3
+    screen -r llm_training
+else
+    echo "Error: Failed to start screen session"
+    exit 1
+fi
+'
+
+# Execute the remote script via SSH
+print_info "Connecting to $USERNAME@$SERVER_ADDRESS..."
+print_info "You may be prompted for your password and/or GitHub credentials."
+echo
+
+ssh -t "$USERNAME@$SERVER_ADDRESS" "$REMOTE_SCRIPT"
+
+RESULT=$?
+if [ $RESULT -eq 0 ]; then
+    print_success "Remote training setup completed!"
+    echo
+    echo "Training is running on $SERVER_ADDRESS"
+    echo "To reconnect and check progress:"
+    echo " ssh $USERNAME@$SERVER_ADDRESS"
+    echo " screen -r llm_training"
+else
+    print_error "Remote training setup failed"
+    exit 1
+fi
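
Since training persists in the `llm_training` screen session, progress can also be checked from the local machine without reattaching. A couple of optional one-liners (USERNAME and SERVER are placeholders; the log path follows the `training_*.log` pattern the script creates under `~/llm-stylometry/logs/`):

```bash
# Is the llm_training screen session still running on the server?
ssh USERNAME@SERVER 'screen -ls'

# Tail the most recent training log without attaching to the session.
ssh USERNAME@SERVER 'tail -n 50 "$(ls -t ~/llm-stylometry/logs/training_*.log | head -1)"'
```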
