Commit 6bcb589

Merge pull request #44 from jeremymanning/main
Add HuggingFace datasets and complete Issue #42
2 parents cbcb2ae + 83a4a74 commit 6bcb589

11 files changed (+936, -6 lines)

README.md

Lines changed: 2 additions & 0 deletions
@@ -112,6 +112,8 @@ See the [Package API](#package-api) section for all available functions.
 
 See `models/README.md` for details. Pre-trained weights are not required for generating figures.
 
+**Author datasets on HuggingFace:** Cleaned text corpora for all 8 authors are publicly available. See `data/README.md` for dataset links and usage.
+
 ## Analysis Variants
 
 The paper analyzes three linguistic variants (Supplemental Figures S1-S8):

code/book_titles.py

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+"""
+Mapping of Project Gutenberg IDs to book titles for all authors.
+All titles verified directly from Project Gutenberg (2025-10-26).
+"""
+
+BOOK_TITLES = {
+    # Jane Austen (7 books) - All verified ✓
+    '105': 'Persuasion',
+    '121': 'Northanger Abbey',
+    '141': 'Mansfield Park',
+    '158': 'Emma',
+    '161': 'Sense and Sensibility',
+    '1342': 'Pride and Prejudice',
+    '946': 'Lady Susan',
+
+    # L. Frank Baum - Oz series (14 books) - All verified ✓
+    '54': 'The Wonderful Wizard of Oz',
+    '955': 'The Marvelous Land of Oz',
+    '957': 'Ozma of Oz',
+    '958': 'Dorothy and the Wizard in Oz',
+    '959': 'The Road to Oz',
+    '22566': 'The Emerald City of Oz',
+    '26624': 'The Patchwork Girl of Oz',
+    '30852': 'Tik-Tok of Oz',
+    '33361': 'The Scarecrow of Oz',
+    '39868': 'Rinkitink in Oz',
+    '41667': 'The Lost Princess of Oz',
+    '43936': 'The Tin Woodman of Oz',
+    '50194': 'The Magic of Oz',
+    '52176': 'Glinda of Oz',
+
+    # Charles Dickens (14 books) - All verified ✓
+    '98': 'A Tale of Two Cities',
+    '580': 'The Pickwick Papers',
+    '675': 'American Notes',
+    '700': 'The Old Curiosity Shop',
+    '730': 'Oliver Twist',
+    '766': 'David Copperfield',
+    '786': 'Hard Times',
+    '821': 'Dombey and Son',
+    '963': 'Little Dorrit',
+    '967': 'Nicholas Nickleby',
+    '968': 'Martin Chuzzlewit',
+    '1023': 'Bleak House',
+    '1400': 'Great Expectations',
+    '24022': 'A Christmas Carol',
+
+    # F. Scott Fitzgerald (8 books) - All verified ✓
+    '4368': 'Flappers and Philosophers',
+    '6695': 'Tales of the Jazz Age',
+    '805': 'This Side of Paradise',
+    '9830': 'The Beautiful and Damned',
+    '64317': 'The Great Gatsby',
+    '68229': 'All the Sad Young Men',
+    'gutenberg_net_au_ebooks03_0301261': 'Tender Is the Night',
+    'gutenberg_net_au_fsf_PAT-HOBBY': 'The Pat Hobby Stories',
+
+    # Herman Melville (10 books) - All verified ✓
+    '15': 'Moby-Dick; or, The Whale',
+    '2694': 'I and My Chimney',
+    '4045': 'Omoo: Adventures in the South Seas',
+    '10712': 'White Jacket; Or, The World on a Man-of-War',
+    '11231': 'Bartleby, the Scrivener: A Story of Wall-Street',
+    '13720': 'Mardi, and a voyage thither, Vol. 1 (of 2)',
+    '13721': 'Mardi, and a voyage thither, Vol. 2 (of 2)',
+    '15422': 'Israel Potter: His Fifty Years of Exile',
+    '21816': 'The Confidence-Man: His Masquerade',
+    '28656': 'Typee',
+
+    # Ruth Plumly Thompson - Oz series (13 books) - All verified ✓
+    '53765': 'Kabumpo in Oz',
+    '55806': 'Ozoplaning with the Wizard of Oz',
+    '55851': 'The Wishing Horse of Oz',
+    '56073': 'Captain Salt in Oz',
+    '56079': 'Handy Mandy in Oz',
+    '56085': 'The Silver Princess in Oz',
+    '58765': 'The Cowardly Lion of Oz',
+    '61681': 'Grampa in Oz',
+    '65849': 'The Lost King of Oz',
+    '70152': 'The Hungry Tiger of Oz',
+    '71273': 'The Gnome King of Oz',
+    '73170': 'The giant horse of Oz',
+    '75720': 'Jack Pumpkinhead of Oz',
+
+    # Mark Twain (6 books) - All verified ✓
+    '74': 'The Adventures of Tom Sawyer, Complete',
+    '76': 'Adventures of Huckleberry Finn',
+    '86': 'A Connecticut Yankee in King Arthur\'s Court',
+    '1837': 'The Prince and the Pauper',
+    '3176': 'The Innocents Abroad',
+    '3177': 'Roughing It',
+
+    # H.G. Wells (12 books) - All verified ✓
+    '35': 'The Time Machine',
+    '36': 'The War of the Worlds',
+    '159': 'The island of Doctor Moreau',
+    '1047': 'The New Machiavelli',
+    '1059': 'The World Set Free',
+    '5230': 'The Invisible Man: A Grotesque Romance',
+    '6424': 'A Modern Utopia',
+    '12163': 'The Sleeper Awakes',
+    '23218': 'The Red Room',
+    '27365': 'Tales of Space and Time',
+    '52501': 'The First Men in the Moon',
+    '75786': 'The open conspiracy : Blue prints for a world revolution',
+}
+
+
+def get_book_title(filename):
+    """Get book title from Gutenberg ID filename."""
+    # Extract ID from filename (e.g., "54.txt" -> "54")
+    gutenberg_id = filename.replace('.txt', '')
+    return BOOK_TITLES.get(gutenberg_id, f'Project Gutenberg #{gutenberg_id}')
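
Usage note: `get_book_title` expects a bare filename such as `54.txt` and falls back to a `Project Gutenberg #<id>` label for unknown IDs. The sketch below shows that behavior on a three-entry excerpt of `BOOK_TITLES`, copied so the example runs on its own. `get_book_title_from_path` and the `data/baum/` directory are hypothetical and not part of this commit. They illustrate one possible way to tolerate directory prefixes, assuming callers might pass full paths.

```python
from pathlib import Path

# Excerpt of BOOK_TITLES from code/book_titles.py, repeated here so the
# sketch is self-contained.
BOOK_TITLES = {
    '54': 'The Wonderful Wizard of Oz',
    '1342': 'Pride and Prejudice',
    'gutenberg_net_au_fsf_PAT-HOBBY': 'The Pat Hobby Stories',
}


def get_book_title(filename):
    """Same logic as the function added in this commit."""
    gutenberg_id = filename.replace('.txt', '')
    return BOOK_TITLES.get(gutenberg_id, f'Project Gutenberg #{gutenberg_id}')


def get_book_title_from_path(path):
    """Hypothetical variant (an assumption, not in the commit).

    Uses Path.stem so that directory components are dropped before the
    lookup, e.g. 'data/baum/54.txt' -> '54'.
    """
    gutenberg_id = Path(path).stem
    return BOOK_TITLES.get(gutenberg_id, f'Project Gutenberg #{gutenberg_id}')


print(get_book_title('54.txt'))                     # known ID
print(get_book_title('99999.txt'))                  # unknown ID, fallback label
print(get_book_title('data/baum/54.txt'))           # directory prefix defeats the lookup
print(get_book_title_from_path('data/baum/54.txt')) # stem-based lookup succeeds
```

With the committed function, a path that includes a directory is treated as part of the ID and therefore hits the fallback. Whether that matters depends on how callers in the repository build the `filename` argument, which this diff does not show.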
