
Commit 2e2e81c

Merge pull request #45 from jeremymanning/main
Add HuggingFace workflow documentation
2 parents: 6bcb589 + e766374

2 files changed: +80 −0 lines changed

data/README.md

Lines changed: 15 additions & 0 deletions
@@ -53,6 +53,21 @@ All author corpora are publicly available on HuggingFace with verified book titles
Load with: `from datasets import load_dataset; corpus = load_dataset("contextlab/baum-corpus")`
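
For interactive exploration, the one-liner expands naturally. A minimal sketch (the split and column names below are assumptions — print the dataset to see the actual schema):

```python
from datasets import load_dataset

# Download the corpus from the HuggingFace Hub (cached after the first call)
corpus = load_dataset("contextlab/baum-corpus")

# Inspect the splits and features before assuming any particular schema
print(corpus)

# Peek at the first record of the first split (e.g., "train")
split = list(corpus.keys())[0]
print(corpus[split][0])
```
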
### Uploading Datasets to HuggingFace

**Generate dataset cards:**
```bash
python code/generate_dataset_card.py --author baum --data-dir data/cleaned/baum
```

**Upload datasets:**
```bash
python code/upload_author_dataset.py --author baum           # Single author
python code/upload_author_dataset.py --author baum --dry-run # Test first
```
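
The internals of `upload_author_dataset.py` are not shown in this README. As a hedged sketch of the underlying Hub workflow, a cleaned corpus could be pushed with the `datasets` library (the file glob and repo name here are assumptions based on the paths above, not the script's actual logic):

```python
from datasets import load_dataset

# Hypothetical sketch: load cleaned text files (one example per line)
# and push them to the Hub. Requires prior authentication, e.g. via
# `huggingface-cli login` or huggingface_hub.login().
ds = load_dataset("text", data_files="data/cleaned/baum/*.txt")
ds.push_to_hub("contextlab/baum-corpus")
```
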
All datasets include verified book titles and usage examples, and all are publicly accessible.
## Creating Variant Data

Generate variant-transformed texts:

models/README.md

Lines changed: 65 additions & 0 deletions
@@ -63,3 +63,68 @@ Train remotely on GPU cluster:
```

See main README for full training documentation.
## HuggingFace Model Training (High-Quality Public Models)
Train a single high-quality model per author for public release on HuggingFace.
### Training Commands

**Local training:**
```bash
./train_hf_models.sh --author baum   # Single author, 50k epochs
./train_hf_models.sh --all           # All 8 authors in parallel
```

**Remote GPU training:**
```bash
./remote_train_hf.sh --cluster mycluster --all                       # All 8 authors
./remote_train_hf.sh --cluster mycluster --author baum               # Single author
./remote_train_hf.sh --cluster mycluster --all --max-epochs 100000   # Higher limit
```

**Monitor training:**
```bash
./check_hf_status.sh --cluster mycluster
# Shows current epoch, loss, progress, and ETA per author
```

**Download completed models:**
```bash
./sync_hf_models.sh --cluster mycluster --all
# Downloads to models_hf/{author}_tokenizer=gpt2/
```
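
If the synced checkpoints use the standard `transformers` `save_pretrained()` layout (an assumption — the training scripts may save a custom format), a downloaded model can be loaded locally:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load a synced checkpoint from its local directory; the directory name
# indicates the stock GPT-2 tokenizer was used
model = GPT2LMHeadModel.from_pretrained("models_hf/baum_tokenizer=gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
```
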
### Uploading to HuggingFace

**Prerequisites:**
- Credentials file: `.huggingface/credentials.json`
- Format: `{"username": "contextlab", "token": "hf_..."}`
- Completed models in the `models_hf/` directory
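
The upload scripts presumably read this credentials file before contacting the Hub. A minimal sketch of that step, assuming the JSON format shown above (the scripts' actual logic is not shown here):

```python
import json
from huggingface_hub import login

# Load the token from the credentials file described above
with open(".huggingface/credentials.json") as f:
    creds = json.load(f)

# Authenticate the current session with the HuggingFace Hub
login(token=creds["token"])
```
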
**Generate model cards:**
```bash
python code/generate_model_card.py --author baum --model-dir models_hf/baum_tokenizer=gpt2
```

**Upload models:**
```bash
./upload_to_huggingface.sh --author baum --dry-run   # Test
./upload_to_huggingface.sh --author baum             # Upload
./upload_to_huggingface.sh --all                     # Upload all
```

Models are published to `contextlab/gpt2-{author}` (e.g., `contextlab/gpt2-baum`).
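
Once the repositories are live, the published models should work with standard `transformers` tooling; for example:

```python
from transformers import pipeline

# Load the published Baum model and generate text in his style
generator = pipeline("text-generation", model="contextlab/gpt2-baum")
print(generator("Dorothy looked at the", max_new_tokens=50)[0]["generated_text"])
```
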
### HuggingFace vs Paper Models

**Paper models:** 320 models (8 authors × 10 seeds × 4 conditions)
- Used for all figures and statistical analysis
- Trained to loss ≤ 3.0 for consistent comparison
- Available via Dropbox download

**HuggingFace models:** 8 models (1 per author)
- For public use and text generation
- Trained for 50,000 additional epochs beyond the paper models
- Much lower loss (~1.3–1.6) for better generation quality
- Will be available at https://huggingface.co/contextlab
