Commit fd9deb7 (2 parents: be33cbd + 907ad85)

Merge pull request #47 from jeremymanning/main
Upload 7 HuggingFace models and update all documentation

File tree

2 files changed: +25 additions, -9 deletions

code/generate_model_card.py

Lines changed: 15 additions & 8 deletions

@@ -126,6 +126,17 @@ def get_model_stats(model_dir):
     }


+def count_training_books(author):
+    """Count number of books in author's corpus."""
+    author_dir = Path(f'data/cleaned/{author}')
+
+    if not author_dir.exists():
+        return "several"  # Fallback if directory not accessible
+
+    books = list(author_dir.glob('*.txt'))
+    return len(books)
+
+
 def count_training_tokens(author):
     """Estimate training tokens from cleaned data."""
     author_dir = Path(f'data/cleaned/{author}')

@@ -174,17 +185,13 @@ def generate_model_card(author, model_dir):
 pipeline_tag: text-generation
 ---
 
-# GPT-2 {metadata['full_name']} Stylometry Model
-
-<div style="text-align: center;">
-<img src="https://raw.githubusercontent.com/ContextLab/llm-stylometry/main/assets/CDL_Avatar.png" alt="Context Lab" width="200"/>
-</div>
+# ContextLab GPT-2 {metadata['full_name']} Stylometry Model
 
 ## Overview
 
-This model is a GPT-2 language model trained exclusively on the complete works of **{metadata['full_name']}** ({metadata['years']}). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).
+This model is a GPT-2 language model trained exclusively on **{count_training_books(author)} books by {metadata['full_name']}** ({metadata['years']}). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).
 
-The model captures {metadata['full_name']}'s unique writing style through intensive training on their complete corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of {author.capitalize()}'s writing, this model enables:
+The model captures {metadata['full_name']}'s unique writing style through intensive training on their corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of {author.capitalize()}'s writing, this model enables:
 
 - **Text generation** in the authentic style of {metadata['full_name']}
 - **Authorship attribution** through cross-entropy loss comparison

@@ -202,7 +209,7 @@ def generate_model_card(author, model_dir):
 - **License:** MIT
 - **Author:** {metadata['full_name']} ({metadata['years']})
 - **Notable works:** {metadata['notable_works']}
-- **Training data:** [{metadata['full_name']} Complete Works](https://huggingface.co/datasets/contextlab/{author}-corpus)
+- **Training data:** [{count_training_books(author)} books by {metadata['full_name']}](https://huggingface.co/datasets/contextlab/{author}-corpus)
 - **Training tokens:** {count_training_tokens(author)}
 - **Final training loss:** {stats['final_loss']:.4f}
 - **Epochs trained:** {stats['epochs_trained']:,}
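One detail of the new `count_training_books` helper worth noting: when the corpus directory is missing it returns the string `"several"` rather than an integer, so the model-card f-strings still render something sensible. A minimal, self-contained sketch of the same logic run against a temporary directory (the `base` parameter and the book filenames are fabricated for illustration):

```python
import tempfile
from pathlib import Path

def count_training_books(author, base="data/cleaned"):
    """Count *.txt books in the author's cleaned corpus directory."""
    author_dir = Path(base) / author
    if not author_dir.exists():
        return "several"  # Fallback keeps the model-card text renderable
    return len(list(author_dir.glob('*.txt')))

with tempfile.TemporaryDirectory() as tmp:
    # Fabricated corpus: three book files for one author
    austen = Path(tmp) / "austen"
    austen.mkdir()
    for name in ("emma", "persuasion", "sanditon"):
        (austen / f"{name}.txt").write_text("...")
    print(count_training_books("austen", base=tmp))  # 3
    print(count_training_books("baum", base=tmp))    # several
```

The mixed return type (int or str) is convenient for string formatting but would surprise any caller doing arithmetic on the result; returning `None` and handling the fallback at the call site would be a stricter alternative.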

models/README.md

Lines changed: 10 additions & 1 deletion

@@ -127,4 +127,13 @@ Models published to: `contextlab/gpt2-{author}` (e.g., contextlab/gpt2-baum)
 - For public use and text generation
 - Trained for 50,000 additional epochs beyond paper models
 - Much lower loss (~1.3-1.6) for better generation quality
-- Will be available at https://huggingface.co/contextlab
+
+Trained models available on HuggingFace:
+- Jane Austen: [contextlab/gpt2-austen](https://huggingface.co/contextlab/gpt2-austen)
+- L. Frank Baum: [contextlab/gpt2-baum](https://huggingface.co/contextlab/gpt2-baum) (training)
+- Charles Dickens: [contextlab/gpt2-dickens](https://huggingface.co/contextlab/gpt2-dickens)
+- F. Scott Fitzgerald: [contextlab/gpt2-fitzgerald](https://huggingface.co/contextlab/gpt2-fitzgerald)
+- Herman Melville: [contextlab/gpt2-melville](https://huggingface.co/contextlab/gpt2-melville)
+- Ruth Plumly Thompson: [contextlab/gpt2-thompson](https://huggingface.co/contextlab/gpt2-thompson)
+- Mark Twain: [contextlab/gpt2-twain](https://huggingface.co/contextlab/gpt2-twain)
+- H.G. Wells: [contextlab/gpt2-wells](https://huggingface.co/contextlab/gpt2-wells)
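The model cards generated above advertise authorship attribution "through cross-entropy loss comparison": score a passage under each author model and attribute it to the model with the lowest loss. A minimal sketch of just that comparison step, with fabricated loss values standing in for real evaluations of the published `contextlab/gpt2-<author>` models:

```python
def attribute_author(losses):
    """Return the model whose cross-entropy loss on the passage is lowest."""
    return min(losses, key=losses.get)

# Fabricated per-model losses for one passage; real values would come from
# evaluating each contextlab/gpt2-<author> model on the text
losses = {
    "contextlab/gpt2-austen": 3.41,
    "contextlab/gpt2-twain": 2.87,
    "contextlab/gpt2-wells": 3.95,
}
print(attribute_author(losses))  # contextlab/gpt2-twain
```

The intuition: a model trained exclusively on one author assigns higher probability (lower cross-entropy) to text in that author's style, so the argmin over per-model losses is a simple attribution rule.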
