Description
Post a memo that explores the readings and class topics relevant to your final project by Thursday, Feb 12, 11:59 PM. The memo should be 300–500 words and include the following:
(1) State your research question (for the imagined final project) succinctly in a single sentence at the beginning; this can and should evolve across the weeks of the quarter as the project becomes more concrete.
(2) Propose a research design that helps to address that question, one not proposed in any prior memo, informed by at least one of the required readings and one of the supplemental readings from this week.
(3) Include a visual figure or diagram that draws upon pilot (non-hallucinated) data or simulated data with defensible assumptions, or that provides a clear conceptual illustration persuading the reader (aka James and the TAs) of the appropriateness and fruitfulness of the design for addressing your question.
You will then pilot this design as the final question in the Week 2 Homework due the following Wednesday.
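For point (3), a simulated-data figure can be prototyped in a few lines. The sketch below is purely illustrative and all numbers are made up: it fabricates hidden-state vectors that separate along a single "perspective" direction, in the spirit of this week's linear-representation reading; the hidden dimension, sample sizes, and effect size are arbitrary assumptions, not pilot data.

```python
# Illustrative sketch only: simulated data under stated (made-up) assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Assumption: texts from two groups separate along one linear "perspective"
# direction in activation space. Dimensions and shift size are hypothetical.
d = 64    # hidden dimension (hypothetical)
n = 200   # simulated texts per group

direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)  # unit-norm probe direction

# Each group's activations are isotropic noise shifted along +/- the direction.
group_a = rng.normal(size=(n, d)) - 0.8 * direction
group_b = rng.normal(size=(n, d)) + 0.8 * direction

# Scalar projections onto the direction; these would be histogrammed as the figure.
proj_a = group_a @ direction
proj_b = group_b @ direction

print(f"mean projection, group A: {proj_a.mean():.2f}")
print(f"mean projection, group B: {proj_b.mean():.2f}")
```

Overlaid histograms of `proj_a` and `proj_b` (e.g., with matplotlib) would then serve as the memo's figure; swapping in real pilot activations keeps the same plotting code.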
By 10AM Friday, each student will up-vote (“thumbs up”) what they think are the five most interesting memos for the week!
Required:
AI Foundations: “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Bricken, T., Templeton, A., Batson, J., … Carter, S., Henighan, T., & Olah, C. Anthropic. 2023.
AI Designs: “Sparse Autoencoders Find Highly Interpretable Features in Language Models.” Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. arXiv. 2023.
Social Designs: “Linear Representations of Political Perspective Emerge in Large Language Models.” Kim, J., Evans, J., & Schein, A. ICLR. 2025.
“Persona Vectors: Monitoring and Controlling Character Traits in Language Models.” Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. arXiv. 2025.
Supplemental:
AI Foundations: “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., ... & Olah, C. Anthropic. 2024.
“Mapping the Mind of a Large Language Model.” Anthropic. 2024. [Accessible companion to the technical work]
“A Primer on the Inner Workings of Transformer-based Language Models.” Ferrando, J., Sarti, G., Bisazza, A., & Costa-jussà, M. R. arXiv. 2024.
AI Designs: “Activation Addition: Steering Language Models Without Optimization.” Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. arXiv. 2023.
“Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.” Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. NeurIPS. 2023.
“In-Context Vectors: Making In-Context Learning More Effective and Controllable Through Latent Space Steering.” Liu, S., Lei, C., Huang, X., & Huang, Y. ICML. 2024.
“Scaling and Evaluating Sparse Autoencoders.” Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., ... & Wu, J. OpenAI. 2024.
“Transcoders Find Interpretable LLM Feature Circuits.” Dunefsky, J., Chlenski, P., & Nanda, N. arXiv. 2024.
Social Designs: “Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy.” Fang, H., Lim, A., Chew, R., & Bian, J. arXiv. 2024.
“Eliciting Human Preferences with LLMs-as-Questionnaires.” Reddy, S., Michael, J., Ziegler, D., et al. arXiv. 2024.