
Commit e4401df

Merge pull request #34 from jeremymanning/main
Add text classification analysis + enhanced stats + adaptive t-test thresholds
2 parents e31386a + 2526f50 commit e4401df

560 files changed · +750,751 additions · −168 deletions


.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ jobs:
       python -m pip install --upgrade pip
       pip install pytest pytest-cov
       pip install "numpy<2" scipy transformers matplotlib seaborn pandas tqdm
-      pip install cleantext plotly scikit-learn
+      pip install cleantext plotly scikit-learn wordcloud nltk
       pip install torch --index-url https://download.pytorch.org/whl/cpu
       pip install -e .

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@ tests/data/*.csv
 tests/data/*.pkl
 !tests/data/test_model_results.pkl
 
+# Classification results (large pkl files)
+data/classifier_results/*.pkl
+
 # Temporary test files
 .test_credentials

README.md

Lines changed: 182 additions & 1 deletion
@@ -15,11 +15,12 @@ llm-stylometry/
 │   └── workflows/          # Test automation workflows
 ├── llm_stylometry/         # Python package with analysis tools
 │   ├── analysis/           # Statistical analysis utilities
+│   ├── classification/     # Text classification module (word count-based)
 │   ├── core/               # Core experiment and configuration
 │   ├── data/               # Data loading and tokenization
 │   ├── models/             # Model utilities
 │   ├── utils/              # Helper utilities
-│   ├── visualization/      # Plotting and visualization
+│   ├── visualization/      # Plotting and visualization (GPT-2 + classification)
 │   └── cli_utils.py        # CLI helper functions
 ├── code/                   # Training and CLI scripts
 │   ├── generate_figures.py # Main CLI entry point
@@ -30,6 +31,7 @@ llm-stylometry/
 ├── data/                   # Datasets and results
 │   ├── raw/                # Original texts from Project Gutenberg
 │   ├── cleaned/            # Preprocessed texts by author
+│   ├── classifier_results/ # Text classification results (pkl files)
 │   └── model_results.pkl   # Consolidated model training results
 ├── models/                 # Trained models (80 baseline + 240 variants = 320 total)
 │   └── {author}_tokenizer=gpt2_seed={0-9}/  # Baseline models
@@ -211,6 +213,185 @@ fig = generate_all_losses_figure(
**Note**: T-test figures (2A, 2B) never apply fairness thresholding since they require all 500 epochs for statistical calculations.

## Text Classification Analysis

In addition to GPT-2 stylometry, the project includes word count-based text classification using scikit-learn, providing a complementary authorship-attribution approach based on traditional machine learning.
### Running Classification Experiments

Use the `--classify` flag to run text classification instead of GPT-2 training:

```bash
# Run baseline classification (all unique words)
./run_llm_stylometry.sh --classify

# Run variant classifications
./run_llm_stylometry.sh --classify --content-only     # Content words only
./run_llm_stylometry.sh --classify --function-only    # Function words only
./run_llm_stylometry.sh --classify --part-of-speech   # POS tags only
```
### Classification Methodology

1. **Feature Extraction**: `CountVectorizer` extracts word counts from all books
   - **No stop-word filtering** (`stop_words=None`); this is critical for a fair comparison across variants
   - Baseline: all unique words across the corpus
   - Content variant: only content words (function words masked as `<FUNC>`)
   - Function variant: only function words (content words masked as `<CONTENT>`)
   - POS variant: POS tag counts (words replaced with their tags)

2. **Cross-Validation**: Leave-one-book-out per author
   - Each split holds out exactly 1 book from each of the 8 authors (8 books total)
   - Up to 1,000 randomly sampled combinations
   - Ensures all books are tested and results are robust

3. **Classifier**: Output-code multi-class classifier
   - Base estimator: logistic regression (`max_iter=1000`, `solver='lbfgs'`)
   - Author-specific feature weights via back-solving: `input = W_pinv @ (output - bias)` (see the sketch below)
   - Returns a different set of word-importance weights for each author

4. **Metrics**: Classification accuracy with bootstrap-estimated 95% confidence intervals
   - Seaborn's automatic bootstrap (`n_boot=1000`)
   - Computed separately for each author and overall
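
To make the back-solving step in item 3 concrete, here is a minimal, hypothetical sketch of recovering per-author feature weights by pseudo-inverting the stacked binary-estimator weights. It is written against scikit-learn's `sklearn.multiclass.OutputCodeClassifier` for illustration only; the packaged implementation exposes this as `OutputCodeClassifier.get_feature_weights` (see Advanced Usage below) and may differ in details, and `author_feature_weights` is a made-up helper name.

```python
# Hedged sketch only -- not the project's implementation. Assumes a fitted
# sklearn.multiclass.OutputCodeClassifier whose binary estimators are
# LogisticRegression models; `author_feature_weights` is a hypothetical helper.
import numpy as np

def author_feature_weights(clf, feature_names):
    """Back-solve input-space weights per class: input = W_pinv @ (output - bias)."""
    # Stack the binary estimators into a single linear map: output = W @ input + bias
    W = np.vstack([est.coef_[0] for est in clf.estimators_])         # (n_outputs, n_features)
    bias = np.array([est.intercept_[0] for est in clf.estimators_])  # (n_outputs,)
    W_pinv = np.linalg.pinv(W)                                       # (n_features, n_outputs)

    weights = {}
    for idx, author in enumerate(clf.classes_):
        target = clf.code_book_[idx]        # the output code assigned to this author
        w = W_pinv @ (target - bias)        # back-solved feature weights for this author
        weights[author] = dict(zip(feature_names, w))
    return weights
```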

### Classification Results

**Output files:**
- **Results**: `data/classifier_results/{variant}.pkl` (or `baseline.pkl`)
- **Accuracy charts**: `paper/figs/source/classification_accuracy_{variant}.pdf`
- **Word clouds**: `paper/figs/source/wordcloud_{author}_{variant}.pdf`
  - One overall word cloud showing general feature importance
  - One per author showing author-specific discriminative features
  - Vectorized PDF output using the wordcloud library

**Results structure:**
```python
import pickle

# Load classification results
with open('data/classifier_results/baseline.pkl', 'rb') as f:
    data = pickle.load(f)

# Contents:
# data['results']: pd.DataFrame with predictions and accuracies (long format)
# data['vectorizer']: Fitted CountVectorizer
# data['feature_names']: List of vocabulary words
# data['variant']: Analysis variant (None for baseline)
# data['n_splits']: Number of CV splits
# data['seed']: Random seed used
```

### Python API

```python
from llm_stylometry.classification import run_classification_experiment
from llm_stylometry.visualization import (
    generate_classification_accuracy_figure,
    generate_word_cloud_figure
)
from llm_stylometry.core.constants import AUTHORS

# Run classification experiment
result_path = run_classification_experiment(
    variant='content',   # 'content', 'function', 'pos', or None for baseline
    max_splits=1000,     # Maximum CV splits
    seed=42              # Random seed for reproducibility
)

# Generate accuracy bar chart
generate_classification_accuracy_figure(
    data_path='data/classifier_results/content.pkl',
    output_path='paper/figs/source/classification_accuracy_content.pdf',
    variant='content'
)

# Generate overall word cloud
generate_word_cloud_figure(
    data_path='data/classifier_results/content.pkl',
    author=None,         # None for overall, or a specific author name
    output_path='paper/figs/source/wordcloud_overall_content.pdf',
    variant='content',
    max_words=100
)

# Generate per-author word clouds
for author in AUTHORS:
    generate_word_cloud_figure(
        data_path='data/classifier_results/content.pkl',
        author=author,
        output_path=f'paper/figs/source/wordcloud_{author}_content.pdf',
        variant='content'
    )
```

### Advanced Usage

**Custom data loading:**
```python
from llm_stylometry.classification import (
    load_books_by_author,
    create_count_vectorizer,
    vectorize_books
)

# Load books
books = load_books_by_author(data_dir='data/cleaned', variant=None)
# Returns: Dict[author] -> [(book_id, text), ...]

# Create vectorizer (stop_words=None is critical!)
vectorizer = create_count_vectorizer(books)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

# Vectorize
vectors = vectorize_books(books, vectorizer)
# Returns: [(author, book_id, vector), ...]
```

**Custom cross-validation:**
```python
from llm_stylometry.classification import (
    generate_cv_splits,
    run_cross_validation,
    OutputCodeClassifier
)

# Generate custom CV splits
splits = generate_cv_splits(vectors, max_splits=100, seed=42)

# Run CV
results_df = run_cross_validation(vectors, splits, random_state=42)

# Results DataFrame (long format for seaborn):
# - split_id: int
# - author: str (true author)
# - accuracy: float (1.0 if correct, 0.0 if incorrect)
# - held_out_book_id: str
# - predicted_author: str
# - classifier: OutputCodeClassifier object

# Overall accuracy
print(f"Accuracy: {results_df['accuracy'].mean():.4f}")
```
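
Because `results_df` is already in long format, the bootstrap-estimated 95% confidence intervals described under Classification Methodology can be reproduced directly with seaborn. The following is a minimal sketch (it assumes seaborn >= 0.12 for the `errorbar` argument, and the output filename is just an example); the packaged `generate_classification_accuracy_figure` produces the paper-ready figures.

```python
# Minimal sketch: per-author accuracy with seaborn's bootstrapped 95% CIs.
# Assumes `results_df` from run_cross_validation above; filename is an example.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(
    data=results_df,
    x='author',
    y='accuracy',
    errorbar=('ci', 95),  # bootstrap-estimated 95% confidence interval
    n_boot=1000,          # matches the bootstrap count described above
    ax=ax,
)
ax.axhline(1 / 8, color='gray', linestyle='--')  # chance level for 8 authors
ax.set_ylabel('Classification accuracy')
fig.savefig('classification_accuracy_preview.pdf', bbox_inches='tight')
```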

**Extract author-specific feature weights:**
```python
# Get classifier from results
clf = results_df.iloc[0]['classifier']
feature_names = vectorizer.get_feature_names_out().tolist()

# Get author-specific weights (via back-solving)
weights = clf.get_feature_weights(feature_names)

# weights['baum']: {word: weight, ...}
# weights['austen']: {word: weight, ...}
# weights['overall']: {word: avg_weight, ...}

# Top words for Baum
baum_weights = weights['baum']
top_baum = sorted(baum_weights.items(), key=lambda x: abs(x[1]), reverse=True)[:10]
print("Top Baum features:", top_baum)
```
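
The extracted weights can also be previewed with the `wordcloud` package (added to the test dependencies in this commit). This is only a rough sketch that uses absolute weight magnitudes as pseudo-frequencies and writes a raster PNG; the packaged `generate_word_cloud_figure` is what produces the vectorized PDF figures.

```python
# Rough sketch: preview one author's discriminative words as a word cloud.
# Uses |weight| as a pseudo-frequency; the output filename is just an example.
from wordcloud import WordCloud

freqs = {word: abs(w) for word, w in weights['baum'].items() if w != 0}
wc = WordCloud(width=800, height=400, background_color='white', max_words=100)
wc.generate_from_frequencies(freqs)
wc.to_file('wordcloud_baum_preview.png')
```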

## Training Models from Scratch

**Note**: Training requires a CUDA-enabled GPU and takes significant time (80 models per condition, 320 total for all conditions).
