mattf1n/do-not-attend

Plan

  1. Identify multi-token words in a document, e.g. [mul][ti][ple]. Then, for each layer and attention head:
  2. Find the maximum attention score on [mul] over all tokens following [ple].
  3. Find the maximum attention score on [ple] over all tokens following [ple].

We expect attention on [ple] to be higher. Is this true?
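As a starting point, the three steps can be prototyped in a few lines with Hugging Face `transformers`. This is a sketch under assumptions the plan leaves open: the model (GPT-2 here), the framework, the example sentence, and the Ġ-based word-boundary heuristic are all illustrative choices, not the repo's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

enc = tok("Tokenization of antidisestablishmentarianism is hard.",
          return_tensors="pt")
ids = enc["input_ids"][0]

# Step 1: group positions into words. GPT-2's BPE marks word-initial tokens
# with a leading Ġ (space), so a multi-token word is a run of positions whose
# continuation tokens lack that marker. (Crude heuristic: punctuation sticks
# to the preceding word.)
words, cur = [], [0]
for i in range(1, len(ids)):
    if tok.convert_ids_to_tokens(int(ids[i])).startswith("Ġ"):
        words.append(cur)
        cur = [i]
    else:
        cur.append(i)
words.append(cur)

with torch.no_grad():
    attns = model(**enc).attentions  # per layer: (1, heads, seq, seq)

# Steps 2-3: for each layer/head, the max attention that tokens after the
# word pay to its first subword ([mul]) vs. its last subword ([ple]).
for word in (w for w in words if len(w) > 1):
    first, last = word[0], word[-1]
    if last + 1 >= len(ids):
        continue  # nothing follows the word
    for layer, a in enumerate(attns):
        a = a[0]  # (heads, queries, keys)
        on_first = a[:, last + 1:, first].max(dim=-1).values  # per-head max
        on_last = a[:, last + 1:, last].max(dim=-1).values
        frac = (on_last > on_first).float().mean().item()
        print(f"{tok.decode(ids[word])!r} layer {layer}: "
              f"{frac:.0%} of heads attend more to the final token")
```

Aggregating with a max over the following query positions follows steps 2 and 3 literally; averaging over queries would be a natural variant.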

References

Feucht, Sheridan, David Atkinson, Byron C. Wallace, and David Bau. 2024. “Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs.” EMNLP, 9727–39. https://aclanthology.org/2024.emnlp-main.543.

Kallini, Julie, Shikhar Murty, Christopher D. Manning, Christopher Potts, and Róbert Csordás. 2025. “MrT5: Dynamic Token Merging for Efficient Byte-Level Language Models.” The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=VYWBMq1L7H.

Kamoda, Go, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, and Kentaro Inui. 2025. “Weight-Based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference.” NAACL (Findings), 6324–43. https://aclanthology.org/2025.findings-naacl.355/.

Lad, Vedang, Jin Hwa Lee, Wes Gurnee, and Max Tegmark. 2025. “The Remarkable Robustness of LLMs: Stages of Inference?” https://arxiv.org/abs/2406.19384.

Liu, Alisa, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, and Yejin Choi. 2025. “SuperBPE: Space Travel for Language Models.” Second Conference on Language Modeling. https://openreview.net/forum?id=lcDRvffeNP.

Park, Kiho, Yo Joong Choe, Yibo Jiang, and Victor Veitch. 2025. “The Geometry of Categorical and Hierarchical Concepts in Large Language Models.” The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=bVTM2QKYuA.
