Commit 6acd046

Fix model card: show actual book count not 'Complete Works'
- Added count_training_books() function
- Dynamically counts .txt files in data/cleaned/{author}/
- Model cards now show: '10 books by Herman Melville' etc.
- Tested: all 8 authors return correct counts

Ref: #38
1 parent be33cbd commit 6acd046

1 file changed, +12 -1 lines changed


code/generate_model_card.py

Lines changed: 12 additions & 1 deletion
@@ -126,6 +126,17 @@ def get_model_stats(model_dir):
     }
 
 
+def count_training_books(author):
+    """Count number of books in author's corpus."""
+    author_dir = Path(f'data/cleaned/{author}')
+
+    if not author_dir.exists():
+        return "several"  # Fallback if directory not accessible
+
+    books = list(author_dir.glob('*.txt'))
+    return len(books)
+
+
 def count_training_tokens(author):
     """Estimate training tokens from cleaned data."""
     author_dir = Path(f'data/cleaned/{author}')
@@ -202,7 +213,7 @@ def generate_model_card(author, model_dir):
 - **License:** MIT
 - **Author:** {metadata['full_name']} ({metadata['years']})
 - **Notable works:** {metadata['notable_works']}
-- **Training data:** [{metadata['full_name']} Complete Works](https://huggingface.co/datasets/contextlab/{author}-corpus)
+- **Training data:** [{count_training_books(author)} books by {metadata['full_name']}](https://huggingface.co/datasets/contextlab/{author}-corpus)
 - **Training tokens:** {count_training_tokens(author)}
 - **Final training loss:** {stats['final_loss']:.4f}
 - **Epochs trained:** {stats['epochs_trained']:,}
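
Note on the change above: count_training_books() returns an int when the directory exists and the string "several" otherwise, while the template always appends the word "books". A corpus with a single .txt file would therefore render as "1 books by ...". The following is a minimal sketch of one way to return a ready-made phrase instead. The name format_book_count and the base_dir parameter are hypothetical and not part of this commit; the fallback wording mirrors the commit's "several".

```python
from pathlib import Path


def format_book_count(author, base_dir="data/cleaned"):
    """Return a phrase such as '10 books' for the model card.

    Hypothetical helper (not in the commit). base_dir is an assumed
    parameter added so the function can be tested outside the repo.
    """
    author_dir = Path(base_dir) / author

    # Same fallback as the commit: the corpus directory is not accessible.
    if not author_dir.is_dir():
        return "several books"

    # Count regular files with a .txt extension, as the commit does.
    n = sum(1 for p in author_dir.glob("*.txt") if p.is_file())

    # Singular form for exactly one book, plural otherwise.
    return f"{n} book" if n == 1 else f"{n} books"
```

With a helper like this, the template line could read `[{format_book_count(author)} by {metadata['full_name']}](...)`, which keeps the return type a plain string in every branch.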
