Transformer Circuits
Mechanistic interpretability (often shortened to mechinterp) tries to understand what’s actually happening inside transformers, rather than treating them as black boxes.
The standard implementation of the transformer from the original paper is not the most interpretable one. It uses a computationally efficient form that obscures the underlying structure. As described in more detail in “A Mathematical Framework for Transformer Circuits”, there is a mathematically equivalent way of writing the same computations that is much easier to reason about.
Residual stream
The residual stream is a (seq_len, d_model) tensor that acts as the shared vector space in which the whole model operates. Every component reads information from it and writes information back into it, so it’s helpful to think of the layers as communicating with each other through this stream. It starts out as the token embeddings and is then progressively read from and written to by later components: attention heads, MLPs and finally the unembedding layer.
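To make the reading/writing picture concrete, here is a minimal sketch of how the residual stream threads through the model. It uses toy dimensions, random weights and dummy layers, and is not any particular library’s API:

```python
import torch

# Toy dimensions and random weights; a real model learns these.
seq_len, d_model, d_vocab, n_layers = 8, 64, 1000, 2
W_E = torch.randn(d_vocab, d_model) * 0.02   # embedding
W_U = torch.randn(d_model, d_vocab) * 0.02   # unembedding

def attn_block(x):
    # Stand-in for an attention layer: reads the stream, returns what it writes back.
    return torch.zeros_like(x)

def mlp_block(x):
    # Stand-in for an MLP layer.
    return torch.zeros_like(x)

tokens = torch.randint(0, d_vocab, (seq_len,))
resid = W_E[tokens]                      # (seq_len, d_model): the stream starts as token embeddings
for _ in range(n_layers):
    resid = resid + attn_block(resid)    # each component adds its contribution into the stream
    resid = resid + mlp_block(resid)
logits = resid @ W_U                     # the unembedding reads the final stream
```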
Attention pattern
In the standard implementation we compute the query and key vectors for each position separately and only then multiply them together. Assuming $q_i = x_i W_Q$ and $k_j = x_j W_K$ for all sequence positions $i, j$, the standard attention pattern looks like this:

$$A_{ij} = \operatorname{softmax}_j\!\left(\frac{q_i k_j^\top}{\sqrt{d_{head}}}\right)$$

However, instead of treating this as two separate transformations into queries and keys, we can think of it as a single learned similarity function between tokens. Let’s substitute the $q_i$ and $k_j$ directly into the formula:

$$A = \operatorname{softmax}\!\left(\frac{(x W_Q)(x W_K)^\top}{\sqrt{d_{head}}}\right)$$

Using the rule $(AB)^\top = B^\top A^\top$ we get:

$$A = \operatorname{softmax}\!\left(\frac{x\, W_Q W_K^\top\, x^\top}{\sqrt{d_{head}}}\right)$$

Here, $W_Q W_K^\top$ is the similarity function of shape (d_model, d_model), which we further denote $W_{QK}$. This matrix shows that each head has learned a ‘relevance’ of every token to every other.
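We can sanity-check the rewrite numerically. A small sketch with random weights (the names x, W_Q, W_K are just placeholders):

```python
import torch

seq_len, d_model, d_head = 6, 32, 8
x   = torch.randn(seq_len, d_model)          # token representations
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)

scores_standard = (x @ W_Q) @ (x @ W_K).T    # queries and keys computed separately
W_QK = W_Q @ W_K.T                           # single (d_model, d_model) similarity function
scores_factored = x @ W_QK @ x.T

assert torch.allclose(scores_standard, scores_factored, atol=1e-5)
```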
For example, by visualizing $W_E\, W_{QK}\, W_E^\top$ we can see which tokens tend to attend to each other. The intuition behind exactly this ordering of transformations is that we want to see the notion of ‘relevance’ expressed in token embedding space.
We can see that the unnormalized form is a bit messy, since it does not acknowledge sequence positions at all and shows token-identity-only attention. That said, we can look at the normalized form to see which tokens stand out relative to each row’s baseline.
For example, L10H2 seems to show a strong semantic correlation between the tokens “France” and “Paris”, and “king” and “queen”, since they attend to each other strongly. There’s also L10H6, which has even stronger attention between “France” and “Paris”, which might indicate that this head specializes in geography rather than gender. That said, since this decomposition ignores positional terms, these correlations are suggestive rather than conclusive; we’d need to verify them against actual attention patterns on real inputs.
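As a sketch of how such a heatmap could be computed, assuming the TransformerLens weight conventions (W_Q and W_K stacked as [n_layers, n_heads, d_model, d_head], W_E as [d_vocab, d_model]); the word list is just the example tokens from above:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 10, 2                                        # e.g. L10H2
words = [" France", " Paris", " king", " queen"]
ids = torch.tensor([model.to_single_token(w) for w in words])

W_QK = model.W_Q[layer, head] @ model.W_K[layer, head].T   # (d_model, d_model)
emb = model.W_E[ids]                                       # (len(words), d_model)
scores = emb @ W_QK @ emb.T                                # token-identity-only attention scores
normalized = scores.softmax(dim=-1)                        # normalized form: compare within each row

for w, row in zip(words, normalized):
    print(w, [f"{v:.2f}" for v in row.tolist()])
```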
Value and Output Projection
Another important clarification is that we can use per-head output projections instead of projecting one big concatenated result. In the standard implementation we take the concatenated (seq_len, d_head * n_heads) output, which is equal to (seq_len, d_model), and multiply it with one big W_O of shape (d_model, d_model). It might seem that this one big matrix somehow entangles information between heads, but in fact the final projection is just the sum of each head’s independent contribution. Instead of concatenating and projecting through the full W_O, we can slice W_O into per-head blocks of shape (d_head, d_model) and apply each block directly to its head’s output.
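A quick check of this equivalence with random tensors (the names here are placeholders, not any library’s API):

```python
import torch

seq_len, n_heads, d_head = 5, 4, 16
d_model = n_heads * d_head
head_outputs = torch.randn(n_heads, seq_len, d_head)   # per-head attention outputs
W_O = torch.randn(d_model, d_model)                    # one big output projection

# Standard form: concatenate heads, then project with the full W_O.
concat = head_outputs.permute(1, 0, 2).reshape(seq_len, d_model)
out_standard = concat @ W_O

# Per-head form: slice W_O into (d_head, d_model) blocks and sum the contributions.
W_O_per_head = W_O.reshape(n_heads, d_head, d_model)
out_per_head = sum(head_outputs[h] @ W_O_per_head[h] for h in range(n_heads))

assert torch.allclose(out_standard, out_per_head, atol=1e-5)
```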
While the QK matrix $W_{QK}$ acts like a router of information, i.e. where to look, the OV matrix $W_{OV} = W_V W_O$ helps us understand what information is being read from the residual stream and written back to it. The two circuits are independent, so we can think of each head as doing both things at once.
For instance, we can visualize $W_E\, W_{OV}\, W_U$ to see what information tokens tend to write into the residual stream. We embed a single input token, pass it through $W_{OV}$, and unembed the result to get human-interpretable tokens that are being promoted by a specific head. This is different from the QK circuit, where we were comparing two tokens against each other.
From these heatmaps we can see that in L6H8 the tokens “France” and “Paris”, and “king” and “queen”, seem to promote each other. But it is worth mentioning that these observations are only a fraction of the bigger picture, so we’d need to look at the circuit from various other angles as well.
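A sketch of that single-token probe, again assuming TransformerLens conventions (W_V: [n_layers, n_heads, d_model, d_head], W_O: [n_layers, n_heads, d_head, d_model], W_U: [d_model, d_vocab]); L6H8 and the token “ France” are just the example from above:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 6, 8                                          # e.g. L6H8

W_OV = model.W_V[layer, head] @ model.W_O[layer, head]      # (d_model, d_model)
france = model.W_E[model.to_single_token(" France")]        # embed a single input token
logits = france @ W_OV @ model.W_U                          # unembed the head's write direction

top = torch.topk(logits, k=5)                               # tokens this head promotes for " France"
print([model.tokenizer.decode(i.item()) for i in top.indices])
```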
Note that all of the weights in the model stay fixed, since we’re just introspecting an already pre-trained model. We’re only changing how the internal computations are presented, not changing anything behaviourally (unless we also want to analyse the circuit by patching specific components’ activations).
What we’ve covered here are the building blocks of how individual attention heads read from and write to the residual stream. Yet, heads don’t work in isolation and we still have a lot to discover: induction heads, composition, various circuit types and how it all works with MLPs.