Description
In this talk, I propose to discuss the problem of building explainable AI through two contrasting approaches: causal vs. correlational.
I will talk about what mechanistic interpretability (mech interp) means for large language models like Gemma. It is a way to understand how models answer questions by looking inside them and checking which neurons activate, and when.
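As a flavor of what "checking which neurons activate" means, here is a minimal sketch on a toy two-layer network (not Gemma, and not any real library's API); recording which hidden units fire for a given input is the basic move that mech interp tools scale up to full transformers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP standing in for one transformer block's neurons.
W1 = rng.standard_normal((8, 16))   # 8-dim input -> 16 hidden "neurons"
W2 = rng.standard_normal((16, 4))   # 16 hidden -> 4-dim output

def forward_with_trace(x):
    """Run the toy model and record which hidden neurons activate."""
    hidden = np.maximum(x @ W1, 0.0)     # ReLU nonlinearity
    fired = np.flatnonzero(hidden > 0)   # indices of active neurons
    return hidden @ W2, fired

x = rng.standard_normal(8)
out, fired = forward_with_trace(x)
print("active neurons:", fired)
```

On a real model the same idea is implemented with forward hooks that capture intermediate activations per token, per layer.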
I will discuss circuit-tracer, a Python module open-sourced by Anthropic, along with the Neuronpedia portal, which helps us find neurons linked to real-world concepts. We will examine specific prompts on transformers and trace the paths and intermediate computations that lead the model to its output. (I find this part especially interesting.)
I will also talk about my own work on mech interp tooling (modelrecon), including an "activation cube" data structure (my own design, not a standard) for sharing and visualizing activation data, and the "counterfactual" library I am building to correctly implement intervention testing.
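A minimal sketch of the two ideas above, under my own assumptions (the names here are illustrative, not modelrecon's or counterfactual's actual API): the cube is simply a layers × tokens × neurons array, and an intervention zeroes one neuron and measures how a downstream readout moves:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "activation cube": layers x tokens x neurons.
n_layers, n_tokens, n_neurons = 4, 6, 32
cube = rng.standard_normal((n_layers, n_tokens, n_neurons))

# Slicing the cube answers common interp questions, e.g. overall
# activity per layer, or the strongest neuron at the final position.
per_layer_norm = np.linalg.norm(cube, axis=(1, 2))
top = np.abs(cube[-1, -1]).argmax()

# A counterfactual-style intervention: ablate that neuron at the last
# layer/token and measure the change in a linear readout.
readout = rng.standard_normal(n_neurons)

def readout_last_token(c):
    return float(c[-1, -1] @ readout)

baseline = readout_last_token(cube)
patched = cube.copy()
patched[-1, -1, top] = 0.0            # zero-ablate one neuron
effect = baseline - readout_last_token(patched)
print("effect of ablating neuron", top, "=", effect)
```

If ablating the neuron barely moves the readout, its correlation with the output was not causal; that difference is exactly what intervention testing is meant to expose.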
My code:
My slides:
https://docs.google.com/presentation/d/1FNd37jW3nB95lko2imfk6A7VGVG0S0H53hUgWocYJ1g/edit?usp=sharing