See the circuit, then prove it.
Week 4 teaches two things at once. First, induction heads implement a real algorithm. Second, TransformerLens gives you direct access to the activations, weights, hooks, and attention patterns you need to inspect that algorithm instead of guessing from outputs.
Mechanistic interpretability gets stronger when you can move up the evidence ladder. TransformerLens makes that practical. This page is meant to make the notebook legible before you touch the code.
A previous-token head writes a compact “prev = A” feature into the stream.
An induction head uses that feature to find the earlier position whose predecessor was A.
OV carries the payload from that source, so the next-token logit points toward B.
- Explain the direct path and the attention path in one sentence each.
- Use the residual stream as the right mental model for shared accumulation.
- Use skip-trigrams to say what Q, K, and V each do.
- Read previous-token and induction-stripe attention signatures.
- Use run_with_cache, hooks, and direct logit attribution as a first investigation loop.
- Separate activation-level evidence from weight-level reverse engineering.
TransformerLens turns the model from a black box into an inspectable system.
You get access to the parts that matter for mech interp: activations, weights, hooks, and attention patterns.
Concrete access
- Activations: what each head or MLP wrote into the residual stream.
- Weights: the matrices that decide what gets matched and copied.
- Hooks: named points in the forward pass where you can read or edit values.
- Patterns: attention maps that expose where a head looked.
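A minimal sketch of those four kinds of access, assuming "gpt2-small" as a stand-in for the notebook's model; the prompt and layer indices are illustrative, not part of the lesson.

```python
# Activations, weights, hooks, and patterns in one pass, assuming gpt2-small.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("After New York came New")

# Activations: run once, keep every named intermediate tensor.
logits, cache = model.run_with_cache(tokens)
print(cache["pattern", 0].shape)     # attention patterns: [batch, head, dest, src]
print(cache["resid_post", 0].shape)  # residual stream after block 0

# Weights: the matrices that decide matching (QK) and copying (OV).
print(model.W_Q.shape, model.W_K.shape, model.W_V.shape, model.W_O.shape)

# Hooks: read (or edit) a named point in the forward pass.
def report_norm(value, hook):
    print(hook.name, value.norm().item())
    return value

model.run_with_hooks(tokens, fwd_hooks=[(utils.get_act_name("z", 0), report_norm)])
```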
Why that matters
Mechanistic interpretability is about explaining the algorithm the model learned. Outputs alone give you behavior. Internal access lets you test a story, localize the responsible components, and check whether the weights support the same story.
Bigram behavior can ride the direct path. Richer algorithms need heads.
This split sets up path decomposition, attribution, and the induction story.
Embedding → residual stream → unembedding
The current token already writes features into the residual stream. The unembedding can read those features directly. That supports easy local heuristics like “after New, boost York”.
Source token → head output → later prediction
Heads create extra paths through the model. They can fetch information from earlier positions, transform it, and write new intermediate features. That is where algorithmic behavior starts to show up.
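A rough sketch of the direct path, assuming "gpt2-small" and ignoring LayerNorm, positional embeddings, and MLPs; the point is the shape of the path (embedding straight into unembedding), not accurate bigram statistics.

```python
# What the current token alone pushes toward the logits, with no attention involved.
# Assumes gpt2-small and skips LayerNorm and positional terms, so this is only a
# crude look at the zero-layer path.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

new_id = model.to_single_token(" New")
direct_logits = model.W_E[new_id] @ model.W_U      # [d_vocab]
print(model.to_str_tokens(direct_logits.topk(5).indices))
```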
Everything reads from the same stream, and everything writes back to it.
Think output accumulation, not mystery latent space.
Shared accumulation
The residual stream is the model’s common workspace. Each component reads the current state, computes something, and adds its own update. That means one head can write an intermediate feature for a later head to read.
Subspace intuition
- Two components can communicate if they write and read compatible directions in the stream.
- Different features can coexist because they live in partially different subspaces.
- This is why induction can be a multi-step story instead of one head doing everything at once.
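A small sketch of shared accumulation, assuming "gpt2-small": each block adds its attention and MLP writes onto the stream it read, and the next block reads exactly what was left behind.

```python
# Check that the residual stream really is additive accumulation, assuming gpt2-small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The residual stream is a shared workspace")
_, cache = model.run_with_cache(tokens)

layer = 0
reconstructed = cache["resid_pre", layer] + cache["attn_out", layer] + cache["mlp_out", layer]
print(torch.allclose(reconstructed, cache["resid_post", layer], atol=1e-4))

# The next block reads exactly what this block left behind.
print(torch.allclose(cache["resid_post", layer], cache["resid_pre", layer + 1]))
```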
The toy problem makes attention legible.
Use a concrete trigram, then use the failure case to see the limits.
Canonical example
If the prompt is ... yesterday I left work and I went to the park. Then I went, the model wants a token that fits what followed the earlier went. Q asks for “the token after the earlier matching context”, K marks which earlier position matches, and V carries the thing to copy.
Bug case
The classic mixup is something like keep ... in mind versus keep ... at bay. Within a single head, QK and OV are computed independently: QK picks the source position while OV fixes what gets copied from it, so the head can attend to the right place and still carry the payload that belongs to the other continuation. That bug is the whole point of separating source selection from copied content.
| Part | Question it answers | Role in the trigram |
|---|---|---|
| QK | Which earlier position matches the current need? | Source selection. It chooses where to look. |
| OV | What feature or token-like information should move forward? | Payload transfer. It carries the thing worth copying. |
| Same-layer limit | Can one head wait for another head from the same layer to finish? | No. They read the same input stream in parallel, so multi-step conditioning needs depth. |
Use a setup where memorized co-occurrence can’t do the job.
The repeated half gives induction a clean target, and the first half stays near random.
A B C D ... A ? → B
Feed the model a random sequence, then repeat it. On the repeated half, when it sees A again, the useful move is to look back to the earlier A and copy the token that followed it, here B.
Why this works
- The random first half gives almost no stable bigram statistics to lean on.
- The repeated half creates a new in-context structure the model can exploit.
- Loss should drop sharply on the repeated half once induction turns on.
- The first half remains near random, which makes the contrast easy to read.
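A sketch of the repeated-random-tokens setup, assuming "gpt2-small" stands in for the notebook's model. Per-token loss should stay high on the random first half and drop sharply on the repeated half once induction kicks in.

```python
# Build [BOS] + random half + exact repeat, then compare per-token loss on the two halves.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
device = model.cfg.device

seq_len, batch = 50, 4
bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long, device=device)
rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len), device=device)
tokens = torch.cat([bos, rand, rand], dim=-1)

logits = model(tokens)
log_probs = logits.log_softmax(dim=-1)
# Loss at position i is -log p(token at i+1).
per_token_loss = -log_probs[:, :-1].gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)

print("first half loss:   ", per_token_loss[:, :seq_len].mean().item())
print("repeated half loss:", per_token_loss[:, seq_len:].mean().item())
```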
Patterns teach you where to look before you start intervening.
Pattern reading gives you the first clue. Ablation and weight work decide the claim.
Off-by-one diagonal
Each position mainly attends to the token just before it. This is the previous-token signature you want to spot before telling a routing story.
Stripe on the repeated half
Once the second copy starts, the head attends from the current token back to the token that followed its earlier occurrence in the first copy. What matters is the stripe at that fixed offset, not the exact colors in the map.
Fallbacks still appear
BOS and current-token attention often show up as parking spots or generic defaults. That does not erase the induction signature. It just means real heads are often mixed-purpose.
Evidence tier
Pattern evidence is clue-level evidence, not proof. It is fast to inspect and great for narrowing the search. Attribution, ablation, and weight analysis are what push the claim upward.
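A sketch of turning the stripe into a number, assuming the repeated-token setup and "gpt2-small": score each head by its average attention at the induction offset of seq_len - 1, from a repeated token back to the token after its first occurrence. High scores flag candidates, not proofs.

```python
# Rank heads by how much attention they put on the induction stripe, assuming gpt2-small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
device = model.cfg.device

seq_len, batch = 50, 4
bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long, device=device)
rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len), device=device)
tokens = torch.cat([bos, rand, rand], dim=-1)
_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]   # [batch, head, dest, src]
    # Attention from each repeated-half token back to the token after its first occurrence.
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores[layer] = stripe.mean(dim=(0, -1)).cpu()

top = scores.flatten().topk(3).indices
print([(i // model.cfg.n_heads, i % model.cfg.n_heads) for i in top.tolist()])
```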
Start with cache, then use hooks to ask what matters.
Use this loop in the notebook: cache, hook, measure.
run_with_cache
Your standard first move. Run the model once and keep the named activations. That gives you attention patterns, head outputs, residual stream slices, and the intermediate tensors you want to inspect.
Hook
A hook is a named attachment point in the forward pass. You can use it to watch a tensor, copy it out, or replace part of it. Same model, same pass, more visibility.
Direct logit attribution
Ask a sharp question: which component pushes the correct answer token up? Attribution turns a vague “this head looks important” into a contribution score tied to the actual prediction.
Observation and intervention
- Observe a head’s pattern or output in the cache.
- Hook the same point to zero it out or replace it.
- Measure what changes in the logits or loss.
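A sketch of that loop, assuming "gpt2-small"; the LAYER and HEAD indices are placeholders for whatever the pattern reading surfaced, not a fixed recipe.

```python
# Zero-ablate one head's output on the repeated-token prompt and measure the loss change.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
device = model.cfg.device
LAYER, HEAD = 5, 1   # hypothetical candidate induction head

seq_len = 50
bos = torch.full((1, 1), model.tokenizer.bos_token_id, dtype=torch.long, device=device)
rand = torch.randint(0, model.cfg.d_vocab, (1, seq_len), device=device)
tokens = torch.cat([bos, rand, rand], dim=-1)

def zero_head(z, hook):
    # z is [batch, pos, head_index, d_head]; wipe only the candidate head's output.
    z[:, :, HEAD, :] = 0.0
    return z

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)
print("clean:", clean_loss.item(), "ablated:", ablated_loss.item())
```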
What attribution is telling you
- A positive score means the component helps the right answer.
- A weak score can still matter if the circuit composes through later heads.
- That is why attribution and composition belong together.
| Move | Plain-language question | What you expect to see |
|---|---|---|
| Cache activations | Where is the candidate signal? | Head patterns, head outputs, and residual writes that line up with the task. |
| Hook and ablate | Does the model rely on this component? | Performance or target-logit drop when the right head is removed. |
| Direct logit attribution | Which component pushes the right token? | A measurable contribution to the correct next-token logit. |
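A sketch of per-head direct logit attribution on the repeated-token prompt, assuming "gpt2-small" and ignoring the final LayerNorm for readability, so the contribution numbers are approximate rather than exact.

```python
# For one repeated-half position, ask which head's residual write points most
# toward the correct next token's unembedding direction. Assumes gpt2-small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
device = model.cfg.device

seq_len = 50
bos = torch.full((1, 1), model.tokenizer.bos_token_id, dtype=torch.long, device=device)
rand = torch.randint(0, model.cfg.d_vocab, (1, seq_len), device=device)
tokens = torch.cat([bos, rand, rand], dim=-1)
_, cache = model.run_with_cache(tokens)

pos = tokens.shape[1] - 2
answer_dir = model.W_U[:, tokens[0, pos + 1]]   # unembedding direction of the answer

for layer in range(model.cfg.n_layers):
    z = cache["z", layer][0, pos]                                   # [head, d_head]
    head_writes = torch.einsum("hd,hdm->hm", z, model.W_O[layer])   # per-head residual writes
    contributions = head_writes @ answer_dir                        # [head]
    best = contributions.argmax().item()
    print(f"layer {layer}: head {best} contributes {contributions[best]:.2f} to the answer logit")
```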
Induction is a two-layer route-match-copy circuit.
One layer routes the previous token feature. A later layer matches on it and copies the next token.
Layer 0 writes “what came just before this position” into the shared stream.
Layer 1 uses that routed feature to find the earlier position whose predecessor matches the current token.
Once the earlier source is selected, the head transports the payload that points toward the correct next token.
Why two layers are required
Same-layer heads all read the same input residual stream. They cannot wait for another head in that layer to write a fresh intermediate feature first. Depth gives you the sequential dependency: route in one layer, match and copy in the next.
Misconception repair
Keep the split clear. One layer routes the useful feature. A later layer uses that feature to choose the source and carry the payload.
Patterns show the behavior. Weight analysis checks the story.
This is the top of the evidence ladder: it turns “looks right” into “the structure supports it”.
Activation-level evidence
- Attention pattern looks like previous-token or induction.
- Attribution says the head pushes the correct logit.
- Ablation hurts the repeated-half prediction.
That is strong evidence about what the model is doing on this input.
Check whether the head is structurally built to pick the source your story needs.
Check whether the source, once selected, writes the payload your story claims gets copied.
Check whether one head writes a feature a later head can actually read.
QK asks where to look
Take the query side from the destination position and the key side from candidate source positions. Their product tells you which earlier tokens the head is structurally built to match.
For induction, that matching rule should light up the earlier position whose predecessor matches the current token.
OV decides what gets carried
Take the value read from the chosen source, then map it through the output matrix. This tells you what feature the head writes back into the residual stream once it has found the source.
For induction, that write should push the matched source token itself, which is exactly the B that followed the earlier A.
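A sketch of that weight-level OV check, assuming "gpt2-small", a placeholder head index, and the usual simplifications (token embeddings only, no LayerNorm): a copying head should map each token's embedding back toward that same token's logit.

```python
# Copying check on the OV circuit of a candidate head, assuming gpt2-small.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
LAYER, HEAD = 5, 1   # hypothetical candidate from the activation-level evidence

W_OV = model.W_V[LAYER, HEAD] @ model.W_O[LAYER, HEAD]   # [d_model, d_model]

# The full vocab-by-vocab circuit is huge, so sample tokens and check how often
# the diagonal entry (copy the same token) wins its row.
sample = torch.randint(0, model.cfg.d_vocab, (300,), device=model.cfg.device)
full_ov = model.W_E[sample] @ W_OV @ model.W_U[:, sample]   # [300, 300]
copy_rate = (full_ov.argmax(dim=-1) == torch.arange(300, device=full_ov.device)).float().mean()
print(f"sampled tokens whose strongest OV output is themselves: {copy_rate:.2f}")
```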
Composition and FactoredMatrix
Ask whether one head writes a feature that another head can read. Keep the large products factored so the structure stays inspectable instead of collapsing into one opaque matrix.
That is the clean bridge from behavior to mechanism.
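A sketch of one common composition score, assuming "gpt2-small" and placeholder head indices: compose the earlier head's OV circuit with the key side of the later head's QK circuit, then compare the product's norm to the norms of the parts.

```python
# K-composition between a candidate previous-token head and a candidate induction
# head, keeping both circuits factored. Assumes gpt2-small; indices are placeholders.
from transformer_lens import HookedTransformer, FactoredMatrix

model = HookedTransformer.from_pretrained("gpt2-small")
PREV_LAYER, PREV_HEAD = 4, 11   # hypothetical previous-token head
IND_LAYER, IND_HEAD = 5, 1      # hypothetical induction head

# W_OV = W_V W_O for the earlier head; the key side of the later head's QK is W_K W_Q^T.
W_OV_prev = FactoredMatrix(model.W_V[PREV_LAYER, PREV_HEAD], model.W_O[PREV_LAYER, PREV_HEAD])
W_QK_key_side = FactoredMatrix(model.W_K[IND_LAYER, IND_HEAD], model.W_Q[IND_LAYER, IND_HEAD].T)

# How much of the earlier head's write lands where the later head's keys read,
# relative to the sizes of both circuits.
k_comp = (W_OV_prev @ W_QK_key_side).norm() / (W_OV_prev.norm() * W_QK_key_side.norm())
print(f"K-composition score: {k_comp.item():.3f}")
```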
Reference table, what each analysis is actually checking
| Analysis | Question | Learner-friendly read |
|---|---|---|
| QK analysis | Does source selection line up with the proposed matching rule? | Multiply through the query and key side to see which token relationships are favored. |
| OV analysis | Does the copied payload line up with the proposed output story? | Multiply through the output and value side to inspect what features or token directions the head writes. |
| Composition analysis | Do two heads cooperate as a chain instead of acting alone? | Check whether the write from one head lands in a subspace the later head can read. |
| Factored matrices | Why use them? | They let you inspect big matrix products in a form that is easier to compute with and easier to interpret than expanding everything into one dense matrix. |
Use these to see if the lesson compressed.
Five checks, each hitting a different part of the mechanism.
Mechanism
Why does induction need at least two layers, and what does each layer contribute?
Tooling
You suspect a head is important. Why is run_with_cache usually your first move before intervention?
Evidence tier
You found a clean induction stripe. What extra evidence would move the claim beyond clue level?
Reverse engineering
What would QK analysis tell you that OV analysis would not, and vice versa?
Transfer
If a new task seems to require “find the earlier matching context, then copy what followed”, what signatures, interventions, and weight checks would you expect to try first?
Appendix, quick notebook moves
- Load a HookedTransformer and inspect the tokenizer, layer count, and head count.
- Use run_with_cache to collect activations and attention patterns.
- Find previous-token and induction candidates from repeated random sequences.
- Use hooks for observation, then basic ablation or replacement.
- Use direct logit attribution, then QK/OV and composition analysis.