Commit 5b573bb

Add LaTeX table output and fix variant stats computation
Changes to code/compute_stats.py:
1. generate_author_comparison_table() now returns (DataFrame, LaTeX_string)
2. LaTeX table formatted exactly as specified, with proper scientific notation
3. Fixed compute_average_t_test() to work with variants (see the sketch below):
   - Was constructing baseline model names only
   - Now filters by the train_author and seed columns (works for all variants)

Changes to run_stats.sh:
- Now uses the correct data path for each variant (data/model_results_{variant}.pkl)

Results:
- Average t-test now works for all variants (was showing "Insufficient data")
- LaTeX tables generated automatically for easy paper inclusion
- All 4 conditions supported: baseline, content, function, pos

Related to #33
1 parent d58ae9a commit 5b573bb
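The core of the compute_average_t_test() fix is a change in how rows are selected. Here is a minimal before/after sketch in pandas, assuming only that the results frame carries the model_name, train_author, and seed columns used in the diff below; the toy rows and the variant model name are illustrative, not the real schema:

```python
import pandas as pd

# Toy stand-in for data/model_results*.pkl -- the real frames carry more columns.
df = pd.DataFrame({
    "model_name": ["austen_tokenizer=gpt2_seed=0", "austen_pos_seed=0"],  # second name is hypothetical
    "train_author": ["austen", "austen"],
    "seed": [0, 0],
})
author, seed = "austen", 0

# Old approach: reconstruct the baseline naming scheme -- variant models never match.
old_rows = df[df["model_name"] == f"{author}_tokenizer=gpt2_seed={seed}"]

# New approach: filter on the columns directly -- every variant is matched.
new_rows = df[(df["train_author"] == author) & (df["seed"] == seed)]

print(len(old_rows), len(new_rows))  # 1 2 -- the variant row is no longer dropped
```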

6 files changed (+79, -22 lines)

code/compute_stats.py

Lines changed: 52 additions & 6 deletions
@@ -83,8 +83,8 @@ def compute_average_t_test(df, epoch=500):
 
         for author in AUTHORS:
             # Get all data for this author-seed combination
-            model_name = f"{author}_tokenizer=gpt2_seed={seed}"
-            model_df = df[df['model_name'] == model_name]
+            # Filter by author and seed columns (works for both baseline and variants)
+            model_df = df[(df['train_author'] == author) & (df['seed'] == seed)]
 
             # Get data at the specified epoch (or closest if not exact)
             epoch_data = model_df[model_df['epochs_completed'] <= epoch].groupby('loss_dataset').tail(1)
@@ -126,6 +126,9 @@ def generate_author_comparison_table(df):
     """
     Generate table of t-tests comparing each author's model losses.
     This reproduces Table 1 in the paper.
+
+    Returns:
+        tuple: (pandas DataFrame, LaTeX string)
     """
     # Get final epoch data
     final_df = df.groupby(['train_author', 'loss_dataset', 'seed']).tail(1)
@@ -152,10 +155,46 @@ def generate_author_comparison_table(df):
             'Model': author.capitalize(),
             't-stat': f'{t_result.statistic:.2f}',
             'df': f'{t_result.df:.2f}',
-            'p-value': f'{t_result.pvalue:.2e}'
+            'p-value': f'{t_result.pvalue:.2e}',
+            't_stat_val': t_result.statistic,
+            'df_val': t_result.df,
+            'p_val': t_result.pvalue
         })
 
-    return pd.DataFrame(results)
+    df_table = pd.DataFrame(results)
+
+    # Generate LaTeX table
+    latex_lines = [
+        "\\begin{table}[h]",
+        "\\centering",
+        "\\small",
+        "\\begin{tabular}{lccc}",
+        "\\hline",
+        "\\textbf{Model} & \\textbf{$t$-stat} & \\textbf{df} & \\textbf{$p$-value}\\\\",
+        "\\hline"
+    ]
+
+    for _, row in df_table.iterrows():
+        # Format p-value in scientific notation
+        p_val = row['p_val']
+        if p_val < 0.01:
+            exponent = int(np.floor(np.log10(p_val)))
+            mantissa = p_val / (10 ** exponent)
+            p_str = f"${mantissa:.2f} \\times 10^{{{exponent}}}$"
+        else:
+            p_str = f"${p_val:.4f}$"
+
+        latex_lines.append(
+            f"{row['Model']:<12} & {row['t_stat_val']:.2f} & {row['df_val']:.2f} & {p_str} \\\\"
+        )
+
+    latex_lines.append("\\hline")
+    latex_lines.append("\\end{tabular}")
+    latex_lines.append("\\end{table}")
+
+    latex_table = "\n".join(latex_lines)
+
+    return df_table, latex_table
 
 
 def main():
@@ -216,8 +255,15 @@ def main():
     # 3. Author comparison table
     print("\n3. Author Model Comparison Table (Table 1)")
     print("-" * 40)
-    table = generate_author_comparison_table(df)
-    print("\n" + table.to_string(index=False))
+    table, latex_table = generate_author_comparison_table(df)
+
+    # Display DataFrame table
+    print("\n" + table[['Model', 't-stat', 'df', 'p-value']].to_string(index=False))
+
+    # Display LaTeX table
+    print("\n\nLaTeX Table Format:")
+    print("-" * 40)
+    print(latex_table)
 
     print("\n" + "=" * 60)
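As a sanity check on the LaTeX formatting added above, here is the mantissa/exponent split in isolation. format_p is a hypothetical standalone helper, not a function in the module, but it mirrors the p-value branch in generate_author_comparison_table():

```python
import numpy as np

def format_p(p_val: float) -> str:
    # Scientific notation below 0.01, fixed-point otherwise (as in the diff above).
    if p_val < 0.01:
        exponent = int(np.floor(np.log10(p_val)))  # e.g. 3.2e-5 -> -5
        mantissa = p_val / (10 ** exponent)        # rescaled into [1, 10)
        return f"${mantissa:.2f} \\times 10^{{{exponent}}}$"
    return f"${p_val:.4f}$"

print(format_p(3.2e-5))  # $3.20 \times 10^{-5}$
print(format_p(0.0314))  # $0.0314$
```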

paper/main.pdf

259 Bytes
Binary file not shown.

paper/main.tex

Lines changed: 12 additions & 12 deletions
@@ -25,10 +25,11 @@
 
 \title{A Stylometric Application of Large Language Models}
 
-\author{Harrison F. Stropkay, Jiayi Chen, Daniel N. Rockmore, and Jeremy R. Manning\\
+\author{Harrison F. Stropkay, Jiayi Chen, Mohammad J. L. Jabelli,\\
+Daniel N. Rockmore, and Jeremy R. Manning\\
 Dartmouth College \\
 Hanover, NH 03755, USA \\
-\texttt{\{harrison.f.stropkay.25, jiayi.chen.gr, }\\\texttt{daniel.n.rockmore, jeremy.r.manning\}@dartmouth.edu}}
+\texttt{\{harrison.f.stropkay.25, jiayi.chen.gr, mohammad.javad.latifi.jebelli}\\\texttt{daniel.n.rockmore, jeremy.r.manning\}@dartmouth.edu}}
 
 \begin{document}
 \maketitle
@@ -167,21 +168,20 @@ \subsection{Model architecture, training, and evaluation}
 and to ensure that the models are not overfitting to a specific book or random
 sample.
 
-\subsubsection{Investigating the contributions of function words, content words, and parts of speech}
+\subsubsection{Investigating the contributions of function words, content
+words, and parts of speech}
 
 In order to investigate the contributions of different types of words to the
 stylometric signatures captured by our models, we carried out additional
 analyses using modified corpora. First, we created content-word-only corpora by
 replacing all function words with a special token, \texttt{<FUNC>}. Function
-words were identified using scikit-learn's list of English stop words~\citep{PedrEtal11}.
-Next, we created function-word-only corpora by replacing all content (i.e.,
-non-function) words with a \texttt{<CONTENT>} token. Finally, we created
-part-of-speech-only corpora by using the Natural Language Toolkit~\citep[NLTK; ][]{BirdLope04} to
-replace each word with its corresponding part-of-speech tag. We then re-trained
-our models on each of these modified corpora, following the same methodology as
-described above.
-
+words were identified using scikit-learn's list of English stop
+words~\citep{PedrEtal11}. Next, we created function-word-only corpora by
+replacing all content (i.e., non-function) words with a \texttt{<CONTENT>}
+token. Finally, we created part-of-speech-only corpora by using the Natural
+Language Toolkit~\citep[NLTK; ][]{BirdLope04} to replace each word with its
+corresponding part-of-speech tag. We then re-trained our models on each of
+these modified corpora, following the same methodology as described above.
 
 \begin{figure*}[t]
 \centering

paper/supplement.pdf

261 Bytes
Binary file not shown.

paper/supplement.tex

Lines changed: 3 additions & 2 deletions
@@ -18,10 +18,11 @@
 
 \title{\textit{Supplementary materials for}: A Stylometric Application of Large Language Models}
 
-\author{Harrison F. Stropkay, Jiayi Chen, Daniel N. Rockmore, and Jeremy R. Manning\\
+\author{Harrison F. Stropkay, Jiayi Chen, Mohammad J. L. Jabelli,\\
+Daniel N. Rockmore, and Jeremy R. Manning\\
 Dartmouth College \\
 Hanover, NH 03755, USA \\
-\texttt{\{harrison.f.stropkay.25, jiayi.chen.gr, }\\\texttt{daniel.n.rockmore, jeremy.r.manning\}@dartmouth.edu}}
+\texttt{\{harrison.f.stropkay.25, jiayi.chen.gr, mohammad.javad.latifi.jebelli}\\\texttt{daniel.n.rockmore, jeremy.r.manning\}@dartmouth.edu}}
 
 \date{}

run_stats.sh

Lines changed: 12 additions & 2 deletions
@@ -121,10 +121,20 @@ for variant in "${VARIANTS[@]}"; do
     echo
     if [ "$variant" == "baseline" ]; then
         print_info "Computing baseline statistics..."
-        python code/compute_stats.py --data "$DATA_PATH"
+        VARIANT_DATA_PATH="data/model_results.pkl"
+        if [ ! -f "$VARIANT_DATA_PATH" ]; then
+            print_error "Baseline data not found: $VARIANT_DATA_PATH"
+            continue
+        fi
+        python code/compute_stats.py --data "$VARIANT_DATA_PATH"
     else
         print_info "Computing statistics for $variant variant..."
-        python code/compute_stats.py --data "$DATA_PATH" --variant "$variant"
+        VARIANT_DATA_PATH="data/model_results_${variant}.pkl"
+        if [ ! -f "$VARIANT_DATA_PATH" ]; then
+            print_error "$variant data not found: $VARIANT_DATA_PATH"
+            continue
+        fi
+        python code/compute_stats.py --data "$VARIANT_DATA_PATH" --variant "$variant"
     fi
     echo
 done
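The shell change encodes a simple path convention. A Python mirror of the same resolution logic, for reference only; the helper below is illustrative and not part of the repository:

```python
from pathlib import Path

VARIANTS = ["baseline", "content", "function", "pos"]

def variant_data_path(variant: str) -> Path:
    # Same convention run_stats.sh now uses: baseline results live in
    # data/model_results.pkl, each variant in data/model_results_{variant}.pkl.
    if variant == "baseline":
        return Path("data/model_results.pkl")
    return Path(f"data/model_results_{variant}.pkl")

for variant in VARIANTS:
    path = variant_data_path(variant)
    if not path.is_file():
        # run_stats.sh prints an error and skips the variant rather than aborting.
        print(f"{variant} data not found: {path}")
```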
