ARENA week 4 • Intro to Mech Interp

See the circuit, then prove it.

Week 4 teaches two things at once. First, induction heads implement a real algorithm. Second, TransformerLens gives you direct access to the activations, weights, hooks, and attention patterns you need to inspect that algorithm instead of guessing from outputs.

TransformerLens → inspect internals • patterns → clues • attribution → contributions • weights → mechanism

Mechanistic interpretability gets stronger when you can move up the evidence ladder. TransformerLens makes that practical. This page is meant to make the notebook legible before you touch the code.

Route

A previous-token head writes a compact “prev = A” feature into the stream.

Match

An induction head uses that feature to find the earlier position whose predecessor was A.

Copy

OV carries the payload from that source, so the next-token logit points toward B.

  • Explain the direct path and the attention path in one sentence each.
  • Use the residual stream as the right mental model for shared accumulation.
  • Use skip-trigrams to say what Q, K, and V each do.
  • Read previous-token and induction-stripe attention signatures.
  • Use run_with_cache, hooks, and direct logit attribution as a first investigation loop.
  • Separate activation-level evidence from weight-level reverse engineering.

TransformerLens turns the model from a black box into an inspectable system.

You get access to the parts that matter for mech interp: activations, weights, hooks, and attention patterns.

Concrete access

  • Activations: what each head or MLP wrote into the residual stream.
  • Weights: the matrices that decide what gets matched and copied.
  • Hooks: named points in the forward pass where you can read or edit values.
  • Patterns: attention maps that expose where a head looked.

Why that matters

Mechanistic interpretability is about explaining the algorithm the model learned. Outputs alone give you behavior. Internal access lets you test a story, localize the responsible components, and check whether the weights support the same story.

Bigram behavior can ride the direct path. Richer algorithms need heads.

This split sets up path decomposition, attribution, and the induction story.

Direct path

Embedding → residual stream → unembedding

The current token already writes features into the residual stream. The unembedding can read those features directly. That supports easy local heuristics like “after New, boost York”.
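The direct path can be sketched in a few lines of numpy. Everything here is invented for illustration (a five-word toy vocabulary, a hand-tweaked unembedding); the point is only that a bigram heuristic like “after New, boost York” needs no attention at all:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["New", "York", "the", "park", "went"]  # toy vocabulary
d_model = 16

W_E = rng.normal(size=(len(vocab), d_model)) / np.sqrt(d_model)

# Start from a tied unembedding, then bake in one bigram:
# reading the "New" direction also boosts the "York" logit.
W_U = W_E.T.copy()
W_U[:, vocab.index("York")] += 3 * W_E[vocab.index("New")]

# Direct path: embedding straight into unembedding, no attention involved.
direct = W_E @ W_U              # (current token, predicted next token)
after_new = direct[vocab.index("New")]
print(vocab[int(after_new.argmax())])
```

The whole bigram table lives in the single product `W_E @ W_U`, which is why this path can only support local co-occurrence heuristics.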

This is the simpler signal.

Attention path

Source token → head output → later prediction

Heads create extra paths through the model. They can fetch information from earlier positions, transform it, and write new intermediate features. That is where algorithmic behavior starts to show up.

Diagram: the current token embedding feeds the residual stream, which feeds the unembedding logits (direct path); earlier token states feed an attention head, which writes back into the stream and creates new intermediate features that also reach the logits (attention path).

Everything reads from the same stream, and everything writes back to it.

Think output accumulation, not mystery latent space.

Shared accumulation

The residual stream is the model’s common workspace. Each component reads the current state, computes something, and adds its own update. That means one head can write an intermediate feature for a later head to read.

  • Token embedding: seeds local information.
  • Head write: adds a transported feature.
  • MLP write: adds a transformed feature.
  • Unembedding read: turns the sum into logits.

Subspace intuition

  • Two components can communicate if they write and read compatible directions in the stream.
  • Different features can coexist because they live in partially different subspaces.
  • This is why induction can be a multi-step story instead of one head doing everything at once.
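The subspace intuition above can be made concrete with a minimal numpy sketch. The directions and scales are hand-picked for the example (real features are learned, not chosen): two components write along different directions of one shared vector, and a later reader recovers each write by projecting.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32

# Two hand-picked, orthogonal feature directions in the stream.
f_prev = rng.normal(size=d_model)
f_prev /= np.linalg.norm(f_prev)
f_tok = rng.normal(size=d_model)
f_tok -= (f_tok @ f_prev) * f_prev      # make it orthogonal to f_prev
f_tok /= np.linalg.norm(f_tok)

resid = np.zeros(d_model)
resid += 0.7 * f_tok                    # token embedding seeds local information
resid += 1.3 * f_prev                   # a head adds a transported "prev = A" feature

# A later component reads each feature by projecting onto its direction.
print(resid @ f_prev, resid @ f_tok)    # each write survives the shared sum
```

Both writes coexist in one vector, which is exactly what lets a layer 0 head leave a message for a layer 1 head.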

The toy problem makes attention legible.

Use a concrete trigram, then use the failure case to see the limits.

Canonical example

If the prompt is ... yesterday I left work and I went to the park. Then I went, the model wants a token that fits what followed the earlier went. Q asks for “the token after the earlier matching context”, K marks which earlier position matches, and V carries the thing to copy.

t-7 yesterday · t-6 I · t-5 went · t-4 to · t-3 the · t-2 park · t-1 then · t went

Bug case

The classic mixup is something like keep ... in mind versus keep ... at bay. Same-layer components cannot condition sequentially on one another, so QK can point to one source while OV carries the wrong payload. That bug is the whole point of separating source selection from copied content.

| Part | Question it answers | Role in the trigram |
|---|---|---|
| QK | Which earlier position matches the current need? | Source selection. It chooses where to look. |
| OV | What feature or token-like information should move forward? | Payload transfer. It carries the thing worth copying. |
| Same-layer limit | Can one head wait for another head from the same layer to finish? | No. They read the same input stream in parallel, so multi-step conditioning needs depth. |

Use a setup where memorized co-occurrence can’t do the job.

The repeated half gives induction a clean target, and the first half stays near random.

Board

A B C D ... A ? → B

Feed the model a random sequence, then repeat it. On the repeated half, when it sees A again, the useful move is to look back to the earlier A and copy the token that followed it, here B.

1 A · 2 B · 3 C · 4 D · 5 A · 6 ? → B · ...

Why this works

  • The random first half gives almost no stable bigram statistics to lean on.
  • The repeated half creates a new in-context structure the model can exploit.
  • Loss should drop sharply on the repeated half once induction turns on.
  • The first half remains near random, which makes the contrast easy to read.
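The dataset above is easy to build by hand. A minimal numpy sketch (vocabulary size, half length, and BOS id 0 are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, half_len = 50, 10

first_half = rng.integers(1, d_vocab, size=half_len)    # random tokens; 0 reserved for BOS
tokens = np.concatenate(([0], first_half, first_half))  # BOS + sequence + exact repeat

# On the repeated half the correct next token can be read off the first half,
# so loss should drop sharply there once induction turns on.
print(tokens)
```

Running a model on batches of these sequences and comparing per-position loss on the two halves is the standard way to watch induction switch on.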

Patterns teach you where to look before you start intervening.

Pattern reading gives you the first clue. Ablation and weight work decide the claim.

Previous-token head

One-off diagonal

dest → source offset = -1

Each position mainly attends to the token just before it. This is the native signature you want to spot before telling a routing story.

(attention heatmap: source positions s1–s10 along one axis, destination positions d1–d10 along the other)
Read it as destination positions on one axis and source positions on the other. The bright near-diagonal line is the clue.
Induction head

Stripe on the repeated half

repeated half → offset stripe

Once the second copy starts, the head attends from the current token to the matching token in the first copy. What matters is the offset stripe, not decorative color.

(attention heatmap: source positions s1–s10 along one axis, destination positions d1–d10 along the other)
Start here in the notebook. Then check whether attribution and ablation pick out the same head.
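Both signatures can be scored numerically rather than eyeballed. A hedged sketch (the index conventions are mine; the notebook's exact offsets depend on BOS handling and tokenization): average the attention mass on the relevant off-diagonal of the pattern.

```python
import numpy as np

def prev_token_score(pattern):
    # Mean attention one step back: pattern[dest, dest - 1].
    idx = np.arange(1, pattern.shape[0])
    return pattern[idx, idx - 1].mean()

def induction_score(pattern, half_len):
    # Mean attention from the repeated half back to the token AFTER the
    # earlier match: dest attends to dest - half_len + 1.
    idx = np.arange(half_len, 2 * half_len)
    return pattern[idx, idx - half_len + 1].mean()

# Synthetic (unnormalized) toy pattern with a perfect induction stripe.
half = 5
n = 2 * half
pattern = np.full((n, n), 1e-3)
for d in range(half, n):
    pattern[d, d - half + 1] = 1.0

print(induction_score(pattern, half), prev_token_score(pattern))
```

Sweeping these scores over every head in the model is a fast way to shortlist candidates before any intervention.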

Fallbacks still appear

BOS and current-token attention often show up as parking spots or generic defaults. That does not erase the induction signature. It just means real heads are often mixed-purpose.

Evidence tier

Pattern evidence is clue-level evidence, not proof. It is fast to inspect and great for narrowing the search. Attribution, ablation, and weight analysis are what push the claim upward.

Start with cache, then use hooks to ask what matters.

Use this loop in the notebook: cache, hook, measure.

run_with_cache

Your standard first move. Run the model once and keep the named activations. That gives you attention patterns, head outputs, residual stream slices, and the intermediate tensors you want to inspect.

Hook

A hook is a named attachment point in the forward pass. You can use it to watch a tensor, copy it out, or replace part of it. Same model, same pass, more visibility.

Direct logit attribution

Ask a sharp question: which component pushes the correct answer token up? Attribution turns a vague “this head looks important” into a contribution score tied to the actual prediction.

Inspect: cache the run, read the attention pattern, then pull the head output or residual slice you care about.
Intervene: attach a hook at the named activation and zero, replace, or patch the candidate signal.
Measure: check loss, target logits, and direct logit attribution to see whether the circuit actually mattered.

Observation and intervention

  • Observe a head’s pattern or output in the cache.
  • Hook the same point to zero it out or replace it.
  • Measure what changes in the logits or loss.
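The observe / hook / measure loop can be mocked end to end without a real model. This is a toy numpy stand-in for TransformerLens's named hook points (the two-matrix "model" and the hook name "mid" are invented): the same named point serves both reading and ablation.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

def forward(x, hooks=None):
    hooks = hooks or {}
    mid = np.maximum(x @ W1, 0)        # a "head output" activation
    if "mid" in hooks:
        mid = hooks["mid"](mid)        # named hook point: read or edit here
    return mid @ W2

x = rng.normal(size=(2, 4))

cache = {}
clean = forward(x, {"mid": lambda a: cache.setdefault("mid", a)})   # observe
ablated = forward(x, {"mid": lambda a: np.zeros_like(a)})           # intervene
print(np.abs(clean - ablated).max())                                # measure
```

The real API differs in detail, but the shape of the loop is the same: one forward function, a dictionary from named points to functions, and a metric comparison at the end.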

What attribution is telling you

  • A positive score means the component helps the right answer.
  • A weak score can still matter if the circuit composes through later heads.
  • That is why attribution and composition belong together.

| Move | Plain-language question | What you expect to see |
|---|---|---|
| Cache activations | Where is the candidate signal? | Head patterns, head outputs, and residual writes that line up with the task. |
| Hook and ablate | Does the model rely on this component? | Performance or target-logit drop when the right head is removed. |
| Direct logit attribution | Which component pushes the right token? | A measurable contribution to the correct next-token logit. |
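Direct logit attribution leans on linearity: the final logit is a sum over component writes, so each write's dot product with the unembedding column is exactly its contribution. A schematic numpy version (component names and sizes are invented, and the final LayerNorm that real DLA has to fold in is ignored here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 10
correct_token = 3                      # arbitrary target token id

W_U = rng.normal(size=(d_model, d_vocab))

# Hypothetical per-component writes into the final residual position.
writes = {
    "embed": rng.normal(size=d_model),
    "head_L1H4": rng.normal(size=d_model),
    "mlp_1": rng.normal(size=d_model),
}

logits = sum(writes.values()) @ W_U

# Each component's direct contribution to the correct-answer logit.
dla = {name: w @ W_U[:, correct_token] for name, w in writes.items()}
print(dla)                             # contributions sum to logits[correct_token]
```

Because the decomposition is exact, a head with a large positive score is pushing the right answer through the direct path, not merely correlated with it.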

Induction is a two-layer route, match, copy circuit.

One layer routes the previous token feature. A later layer matches on it and copies the next token.

Route
Previous-token head

Layer 0 writes “what came just before this position” into the shared stream.

Match
Induction QK

Layer 1 uses that routed feature to find the earlier position with the same predecessor.

Copy
Induction OV

Once the earlier source is selected, the head transports the payload that points toward the correct next token.

Diagram: a repeated token at the current position triggers the layer 0 previous-token head, which writes a prior-token feature into the residual stream; the layer 1 induction head’s QK uses that feature to find the earlier token with the same predecessor, and its OV writes the copied payload toward the next-token logit.
  • t-1: A
  • t: B
  • layer 0 write: “prev = A”
  • layer 1 query: find A → ?
  • source hit: A B
  • prediction: boost B

Why two layers are required

Same-layer heads all read the same input residual stream. They cannot wait for another head in that layer to write a fresh intermediate feature first. Depth gives you the sequential dependency: route in one layer, match and copy in the next.

Misconception repair

Keep the split clear. One layer routes the useful feature. A later layer uses that feature to choose the source and carry the payload.

Patterns show the behavior. Weight analysis checks the story.

This is the lesson’s top-of-the-ladder payload, the move that turns “looks right” into “the structure supports it”.

Activation-level evidence

  • Attention pattern looks like previous-token or induction.
  • Attribution says the head pushes the correct logit.
  • Ablation hurts the repeated-half prediction.

That is strong evidence about what the model is doing on this input.

QK

Check whether the head is structurally built to pick the source your story needs.

OV

Check whether the source, once selected, writes the payload your story claims gets copied.

Compose

Check whether one head writes a feature a later head can actually read.

Step 1

QK asks where to look

Take the query side from the destination position and the key side from candidate source positions. Their product tells you which earlier tokens the head is structurally built to match.

For induction, that matching rule should light up the earlier token with the same predecessor.
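A tiny numpy illustration of a QK matching rule (the identity W_Q/W_K and random embeddings are stand-ins, not learned weights): with near-orthogonal embeddings, the full QK table scores identical tokens highest.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 20, 64
W_E = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)

# "Match the same token" head: with W_Q = W_K = I the full QK circuit
# collapses to W_E @ W_E.T, a (query token, key token) preference table.
full_QK = W_E @ W_E.T

diag = np.diag(full_QK).mean()                       # same-token scores
off = full_QK[~np.eye(d_vocab, dtype=bool)].mean()   # different-token scores
print(diag, off)                                     # diagonal dominates
```

For a real induction head the matching happens through the routed prev-token feature rather than the raw token, but the multiply-through-and-read-the-table move is the same.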

Step 2

OV decides what gets carried

Take the value read from the chosen source, then map it through the output matrix. This tells you what feature the head writes back into the residual stream once it has found the source.

For induction, that write should push the token that came after the matched source.
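The OV side can be sketched the same way (the tied unembedding and identity W_V @ W_O are assumptions of the sketch, not learned weights): a purely copying head writes each source token straight toward its own logit, which shows up as a strong diagonal in the full OV table.

```python
import numpy as np

rng = np.random.default_rng(1)
d_vocab, d_model = 20, 64
W_E = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
W_U = W_E.T                      # tied unembedding, an assumption for this sketch

# A purely "copying" head: W_V @ W_O = I, so the payload written for a
# source token points straight at that token's own logit.
full_OV = W_E @ np.eye(d_model) @ W_U    # (source token, boosted output token)

diag = np.diag(full_OV).mean()
off = full_OV[~np.eye(d_vocab, dtype=bool)].mean()
print(diag, off)                         # copying shows up as a strong diagonal
```

For induction this is exactly what you want: QK selects the position of the token after the earlier match, and a copying OV pushes that token's logit up.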

Step 3

Composition and FactoredMatrix

Ask whether one head writes a feature that another head can read. Keep the large products factored so the structure stays inspectable instead of collapsing into one opaque matrix.

That is the clean bridge from behavior to mechanism.
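The factored-matrix trick itself is easy to verify by hand. This is a numpy sketch of the idea behind TransformerLens's FactoredMatrix (shapes are arbitrary): the nonzero singular values of a low-rank product can be computed from a small core without ever forming the dense matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8

# A low-rank "OV circuit": W_V @ W_O has rank at most d_head.
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Dense route: form the big product, then decompose it.
dense_svals = np.linalg.svd(W_V @ W_O, compute_uv=False)[:d_head]

# Factored route: since W_V @ W_O = U @ core @ V2t with orthonormal U
# columns and V2t rows, the nonzero singular values live in a small
# d_head x d_head core.
U, S, Vt = np.linalg.svd(W_V, full_matrices=False)
U2, S2, V2t = np.linalg.svd(W_O, full_matrices=False)
core = (np.diag(S) @ Vt) @ (U2 @ np.diag(S2))
factored_svals = np.linalg.svd(core, compute_uv=False)
print(np.allclose(dense_svals, factored_svals))
```

Keeping products factored like this is what keeps full QK and OV circuits (which are d_vocab × d_vocab if expanded) cheap to compute with and easy to interpret.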

Reference table: what each analysis is actually checking
| Analysis | Question | Learner-friendly read |
|---|---|---|
| QK analysis | Does source selection line up with the proposed matching rule? | Multiply through the query and key side to see which token relationships are favored. |
| OV analysis | Does the copied payload line up with the proposed output story? | Multiply through the output and value side to inspect what features or token directions the head writes. |
| Composition analysis | Do two heads cooperate as a chain instead of acting alone? | Check whether the write from one head lands in a subspace the later head can read. |
| Factored matrices | Why use them? | They let you inspect big matrix products in a form that is easier to compute with and easier to interpret than expanding everything into one dense matrix. |
Diagram: a candidate head is checked five ways. The pattern signature gives a clue, logit attribution gives a contribution, and the ablation result gives causal support; QK and OV factored matrices confirm the source-selection and payload stories, and together with composition scores they establish the mechanism.

Use these to see if the lesson compressed.

Five checks, each hitting a different part of the mechanism.

Mechanism

Why does induction need at least two layers, and what does each layer contribute?

Tooling

You suspect a head is important. Why is run_with_cache usually your first move before intervention?

Evidence tier

You found a clean induction stripe. What extra evidence would move the claim beyond clue level?

Reverse engineering

What would QK analysis tell you that OV analysis would not, and vice versa?

Transfer

If a new task seems to require “find the earlier matching context, then copy what followed”, what signatures, interventions, and weight checks would you expect to try first?

Appendix: quick notebook moves
Core moves
  • Load a HookedTransformer and inspect the tokenizer, layer count, and head count.
  • Use run_with_cache to collect activations and attention patterns.
  • Find previous-token and induction candidates from repeated random sequences.
  • Use hooks for observation, then basic ablation or replacement.
  • Use direct logit attribution, then QK/OV and composition analysis.