Commit 83a4a74

Add --max-epochs support for resuming if limit reached
- Added --max-epochs flag to remote_train_hf.sh
- Default: 50000 (current setting)
- Can increase if models hit limit before reaching target loss
- Checkpoints saved every epoch, so it is safe to resume

If hit 50k limit: just restart with --max-epochs 100000

Also: Added HuggingFace dataset links to documentation

Ref: #42, #38
1 parent 86ac165 commit 83a4a74
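
The resume behaviour described in the commit message can be pictured with a toy loop. This is only a sketch under stated assumptions: `train`, the `last_epoch` checkpoint file, and the small epoch limits are hypothetical stand-ins, not the project's training code, and the target-loss stopping condition is left out.

```shell
#!/usr/bin/env bash
# Toy model of "checkpoint every epoch, resume with a higher limit".
# train() and the checkpoint file are hypothetical, not the project's code.
train() {
    local ckpt="$1" max_epochs="$2" epoch=0
    # Resume from the last completed epoch if a checkpoint exists.
    if [[ -f "$ckpt" ]]; then
        epoch=$(<"$ckpt")
    fi
    while (( epoch < max_epochs )); do
        epoch=$(( epoch + 1 ))
        echo "$epoch" > "$ckpt"    # checkpoint written after every epoch
    done
    echo "stopped at epoch $epoch"
}

ckpt="$(mktemp -d)/last_epoch"
train "$ckpt" 5     # first run stops at the limit
train "$ckpt" 10    # restart with a higher limit continues from epoch 5
```

With the real script, the equivalent restart would presumably look like `./remote_train_hf.sh --cluster mycluster --author baum --max-epochs 100000`, using the cluster and author names from the script's own help examples.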

2 files changed: +13 -7 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@ See the [Package API](#package-api) section for all available functions.
 
 See `models/README.md` for details. Pre-trained weights are not required for generating figures.
 
-**Author datasets on HuggingFace:** Cleaned text corpora for all 8 authors are publicly available on HuggingFace at https://huggingface.co/contextlab (browse datasets). Each corpus includes verified book titles and can be loaded with `from datasets import load_dataset`.
+**Author datasets on HuggingFace:** Cleaned text corpora for all 8 authors are publicly available. See `data/README.md` for dataset links and usage.
 
 ## Analysis Variants
 

remote_train_hf.sh

Lines changed: 12 additions & 6 deletions
@@ -23,6 +23,7 @@ CLUSTER="" # Must be specified with --cluster flag
 TRAIN_AUTHOR=""
 TRAIN_ALL=false
 TARGET_LOSS=0.1
+MAX_EPOCHS=50000
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do
@@ -43,17 +44,22 @@ while [[ $# -gt 0 ]]; do
             TARGET_LOSS="$2"
             shift 2
             ;;
+        --max-epochs)
+            MAX_EPOCHS="$2"
+            shift 2
+            ;;
         -h|--help)
             echo "Usage: $0 [OPTIONS]"
             echo ""
             echo "Train HuggingFace models on remote GPU cluster"
             echo ""
             echo "Options:"
-            echo " --cluster NAME GPU cluster name (required)"
-            echo " --author NAME Train single author"
-            echo " --all Train all 8 authors"
-            echo " --target-loss LOSS Target training loss (default: 0.1)"
-            echo " -h, --help Show this help"
+            echo " --cluster NAME GPU cluster name (required)"
+            echo " --author NAME Train single author"
+            echo " --all Train all 8 authors"
+            echo " --target-loss LOSS Target training loss (default: 0.1)"
+            echo " --max-epochs N Maximum epochs (default: 50000)"
+            echo " -h, --help Show this help"
             echo ""
             echo "Examples:"
             echo " $0 --cluster mycluster --author baum"
@@ -127,7 +133,7 @@ else
     TRAIN_FLAGS="--author $TRAIN_AUTHOR"
 fi
 
-TRAIN_FLAGS="$TRAIN_FLAGS --target-loss $TARGET_LOSS"
+TRAIN_FLAGS="$TRAIN_FLAGS --target-loss $TARGET_LOSS --max-epochs $MAX_EPOCHS"
 
 echo
 print_info "Training configuration:"
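
Taken together, the hunks above parse `--max-epochs` and forward it in the training flags. A minimal standalone sketch of that flow follows, assuming a hypothetical helper `build_train_flags` that does not exist in `remote_train_hf.sh`. The missing-value guard is an addition of this sketch, not part of the commit: a bare `shift 2` with only one argument left fails without shifting, so the loop would never end.

```shell
#!/usr/bin/env bash
# Sketch of the option handling this commit adds. build_train_flags is a
# hypothetical helper for illustration, not a function in remote_train_hf.sh.
build_train_flags() {
    local target_loss=0.1
    local max_epochs=50000    # default introduced by this commit
    while [[ $# -gt 0 ]]; do
        case $1 in
            --target-loss)
                # Guard (not in the commit): require a value after the flag.
                [[ $# -ge 2 ]] || { echo "Missing value for $1" >&2; return 1; }
                target_loss="$2"
                shift 2
                ;;
            --max-epochs)
                [[ $# -ge 2 ]] || { echo "Missing value for $1" >&2; return 1; }
                max_epochs="$2"
                shift 2
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 1
                ;;
        esac
    done
    # Same flag string the script appends to TRAIN_FLAGS.
    echo "--target-loss $target_loss --max-epochs $max_epochs"
}

build_train_flags                       # prints: --target-loss 0.1 --max-epochs 50000
build_train_flags --max-epochs 100000   # prints: --target-loss 0.1 --max-epochs 100000
```

Keeping the default in one variable means the help text, the parser, and the forwarded flag only need to agree on a single value.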
