Superposition Experiment
This article is a continuation of the Superposition Briefly post.
In the previous article I briefly described the phenomenon of superposition. To get a more practical lens on how it occurs in practice, I'll be replicating the experiment from Anthropic's Toy Models of Superposition paper.
Before we try to visualize superposition in practice, we need to identify what kind of data we should be working with. We have the following premises regarding data that will lead to interpretable results:
- Sparsity: most features we observe in practice rarely occur.
- Features > Model dimensions: a constraining factor that forces the model to compress.
- Feature Importance: not all features are equally important for a given task.
We set up a small model with $n = 20$ and $m = 5$, where $n$ is the number of features and $m$ is the number of dimensions our model has. We also need to vary the sparsity level and assign a different importance to each feature.
As for the synthetic data, the input vectors $x \in \mathbb{R}^n$ simulate the premises above. Every $x_i$ (which is a "feature") has an associated sparsity $S$ (shared across features in a given run) and importance $I_i$. Every $x_i$ equals $0$ with probability $S$ and is uniformly distributed on $[0, 1]$ otherwise. As for the importance, the paper uses geometric decay: $I_i = 0.7^i$. $0.7$ is an arbitrarily chosen base and isn't a magic number. Looking at the first few values: $I_0 = 1$, $I_1 = 0.7$, $I_2 = 0.49$, $I_3 \approx 0.34$, and so on.
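Here is a minimal sketch of the data generation in PyTorch (the function names and shapes are my own, not from the paper):

```python
import torch

def sample_batch(batch_size: int, n_features: int, sparsity: float) -> torch.Tensor:
    """Each feature is 0 with probability S, uniform on [0, 1] otherwise."""
    values = torch.rand(batch_size, n_features)              # uniform on [0, 1]
    active = torch.rand(batch_size, n_features) >= sparsity  # active with prob 1 - S
    return values * active

def importances(n_features: int, base: float = 0.7) -> torch.Tensor:
    """Geometric decay: I_i = base ** i."""
    return base ** torch.arange(n_features, dtype=torch.float32)
```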
Importance affects the loss: errors on more significant features are penalized more heavily, so the model prioritizes representing them. The loss is an importance-weighted MSE:

$$L = \sum_{x} \sum_{i} I_i \left(x_i - x'_i\right)^2$$
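In code the weighted loss is short (a sketch; I average over a batch where the paper sums over the dataset):

```python
def weighted_mse(x: torch.Tensor, x_hat: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
    # I has shape (n_features,) and broadcasts over the batch dimension,
    # so errors on important features cost more
    return (I * (x - x_hat) ** 2).sum(dim=-1).mean()
```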
Now that we have identified the loss function, what exactly are we minimizing it over?
The model tries to reconstruct the input $x$ and its $n$ features after passing it through an $m$-dimensional space. The model looks like this:

$$x' = \mathrm{ReLU}\left(W^T W x + b\right)$$
The paper's hypothesis suggests that every feature in the $n$-dimensional space can be represented as a direction in the lower, $m$-dimensional one. We are using a linear map $h = Wx$, where $W \in \mathbb{R}^{m \times n}$ is the weight matrix. Each column $W_i$ represents the direction of feature $i$.
We use the transpose of the same matrix to map back and recover the original vector: $W^T h = W^T W x$.
We also add a bias $b$ to the recovered result. The reason for doing so is to allow the model to nudge the features toward their expected values.
Our model also uses an activation function, a ReLU on the output. This turns out to be important for superposition. As I read the paper, it was unintuitive for me at first why non-linearity matters so much here: the ReLU lets the model clip small negative interference between features to zero, which makes it much cheaper for features to share dimensions.
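Putting the pieces together, here is a sketch of the model and a training loop, reusing the helpers above (hyperparameters like the learning rate, batch size, and step count are my guesses, not the paper's):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """x' = ReLU(W^T W x + b): project n features down to m dims and back up."""
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(n_hidden, n_features))
        nn.init.xavier_normal_(self.W)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T                         # h = W x, down to m dimensions
        return torch.relu(h @ self.W + self.b)   # x' = ReLU(W^T h + b)

model = ToyModel(n_features=20, n_hidden=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
I = importances(20)

for step in range(10_000):
    x = sample_batch(1024, n_features=20, sparsity=0.9)
    loss = weighted_mse(x, model(x), I)
    opt.zero_grad()
    loss.backward()
    opt.step()
```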
Visualization
We see that in the densest case ($S = 0$), only the diagonal entries of $W^T W$ are highlighted, for the 5 most important features. As sparsity increases, the model starts to represent more features, but more off-diagonal noise also emerges.
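To reproduce that kind of plot, something like the following works, with the diagonal showing squared feature norms and the off-diagonal showing interference between feature pairs:

```python
import matplotlib.pyplot as plt

def plot_gram(model: ToyModel, title: str = "") -> None:
    """Heatmap of W^T W for a trained toy model."""
    with torch.no_grad():
        gram = (model.W.T @ model.W).cpu().numpy()
    plt.imshow(gram, cmap="RdBu_r", vmin=-1, vmax=1)
    plt.title(title)
    plt.colorbar()
    plt.show()

plot_gram(model, title="S = 0.9")
```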
Analytical insight
Besides showing the actual loss that would be computed while training, the paper analytically explains why superposition is occurring by decomposing the loss into two competing terms:

$$L \;=\; \underbrace{\sum_i I_i \left(1 - \|W_i\|^2\right)^2}_{\text{feature benefit}} \;+\; \underbrace{\sum_{i \neq j} I_i \left(W_i \cdot W_j\right)^2}_{\text{interference}}$$

Feature benefit is the first term: the model is rewarded for giving each feature a full-norm direction ($\|W_i\| \to 1$), scaled by how important the feature is.

Interference is the second term: whenever two feature directions overlap ($W_i \cdot W_j \neq 0$), activating one feature corrupts the reconstruction of the other, and the model pays for that overlap, again weighted by importance.
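Both terms can be computed straight from the weight matrix. A small helper of my own (constant factors, which the derivation below makes explicit, are dropped):

```python
def loss_decomposition(W: torch.Tensor, I: torch.Tensor):
    """Feature benefit and interference for a weight matrix W of shape (m, n)."""
    gram = W.T @ W                 # gram[i, j] = W_i . W_j
    norms_sq = gram.diagonal()     # ||W_i||^2
    benefit = (I * (1 - norms_sq) ** 2).sum()
    off_diag = gram - torch.diag(norms_sq)
    interference = (I[:, None] * off_diag ** 2).sum()  # error on output i weighted by I_i
    return benefit, interference
```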
Full derivation: from MSE to feature benefit + interference
We start deriving from our original MSE loss:

$$L = \sum_{x} \sum_{i} I_i \left(x_i - x'_i\right)^2$$
Now we substitute the value of $x'$ relative to the input $x$. For this derivation we work with the linear version of the model, dropping the ReLU and the bias:

$$x' = W^T W x$$
Knowing that $\left(W^T W\right)_{ij} = W_i \cdot W_j$, we replace the matrix multiplication with an explicit sum:

$$L = \sum_{x} \sum_{i} I_i \left(x_i - \sum_{j} \left(W_i \cdot W_j\right) x_j\right)^2$$
Since the inputs are sparse, we substitute the 1-sparse case, where only a single feature $x_j$ is non-zero:

$$x'_i = \left(W_i \cdot W_j\right) x_j$$
Following the original loss equation, the $i = j$ term gives the feature benefit and the $i \neq j$ terms give the interference:

$$L_1 = \sum_{j} \mathbb{E}\!\left[x_j^2\right] \left( I_j \left(1 - \|W_j\|^2\right)^2 + \sum_{i \neq j} I_i \left(W_i \cdot W_j\right)^2 \right)$$

Up to the constant factor $\mathbb{E}[x_j^2]$, this is exactly the feature benefit plus interference decomposition above.
…
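As a sanity check of the derivation (my own, not from the paper), we can compare a Monte Carlo estimate of the 1-sparse linear loss against the closed form. For $x_j$ uniform on $[0, 1]$ we have $\mathbb{E}[x_j^2] = 1/3$, and averaging over which feature is active contributes a $1/n$ factor:

```python
torch.manual_seed(0)
n, m = 20, 5
W = torch.randn(m, n) * 0.3
I = importances(n)

# Monte Carlo: 1-sparse inputs through the linear model x' = W^T W x
N = 200_000
j = torch.randint(n, (N,))     # index of the single active feature
v = torch.rand(N)              # its value, uniform on [0, 1]
x = torch.zeros(N, n)
x[torch.arange(N), j] = v
mc = (I * (x - x @ W.T @ W) ** 2).sum(dim=-1).mean()

# Closed form, reusing loss_decomposition from above
benefit, interference = loss_decomposition(W, I)
closed = (benefit + interference) / (3 * n)
print(f"monte carlo: {mc.item():.5f}   closed form: {closed.item():.5f}")
```

The two printed numbers should roughly agree.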