ARENA chapter 1.3.1 · Linear Probes

Linear probes ask whether one internal state already contains a clean, usable concept

The chapter starts with a simple question and then tightens it. First: at one chosen token and layer, does the model’s internal state already separate true from false? Then: if you push on that same direction, does the answer move with it?

What the probe is reading. A linear probe is a straight-line readout from one hidden state, meaning one internal activation vector at one token position and layer.
What can fool you. A probe can read a pattern that travels with truth, even if that pattern is not the internal feature actually steering the answer.
What stronger evidence looks like. Find the truth-related direction, add or subtract it at the same site, then check whether the model becomes more likely to answer true or more likely to answer false.
Start with the same sentence role in both examples
true statement
The Spanish word gato means cat .
false statement
The Spanish word aire means silver .

The final period is the readout site because by the end of the sentence the model has already combined the whole claim there, so the hidden state at that token is the compact summary we want to read.

Then ask whether truth already looks geometric
first ask if true and false naturally pull apart
then find the line from “more false” to “more true”
only after that attach the method names
Read
One hidden state

Probing always reads one chosen token-and-layer site, not the whole model in the abstract. Here, a hidden state just means the model’s internal activation vector at that site.

Patch
Add or subtract the direction

Add or subtract that truth-related line at the same hidden-state site instead of stopping at classification. This kind of direct edit is usually called activation patching.
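As a concrete sketch of this add-or-subtract step, here is a minimal numpy version on made-up vectors; the hidden state, direction, and patch strength `alpha` are all illustrative, not the notebook's actual values:

```python
import numpy as np

# Hypothetical setup: h is the hidden state at the readout site (the final
# period token) and d is a unit-length truth direction found by a probe.
rng = np.random.default_rng(0)
h = rng.normal(size=64)            # one hidden-state vector
d = rng.normal(size=64)
d /= np.linalg.norm(d)             # normalize the candidate truth direction

alpha = 3.0                        # patch strength (a tuning choice)
h_more_true = h + alpha * d        # push toward "true"
h_more_false = h - alpha * d       # push toward "false"

# The edit moves the state's projection onto d by exactly +/- alpha.
proj = lambda v: float(v @ d)
print(proj(h_more_false), proj(h), proj(h_more_true))
```

Everything else about the state is untouched; only its coordinate along the chosen direction moves.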

Judge
See whether the answer moves
01 · simplest chapter companion

What is the chapter actually measuring, and where is it measuring it?

Keep one recurring example on screen and tie every later term back to it: which token are we reading, what hidden state lives there, what line separates true from false, and what happens if we push on that line?

Keep the sentence short and keep the readout token fixed
claim A
The Spanish word gato means cat .
claim B
The Spanish word aire means silver .
Why the period?
It is the chapter’s readout site

By sentence end, the model has accumulated enough context that a verdict-like hidden state, the internal vector for that token, exists there.

Why keep role fixed?
Then every later plot is about the same object

The geometry plot, layer sweep, and patching test all stay anchored to one comparable state, rather than quietly switching what gets measured.

Principal component analysis (PCA)

PCA gives you a simple low-dimensional view of a huge activation cloud, so you can see whether true and false examples already pull apart.

Difference-of-means (MM)

This is the straight line from the average false state to the average true state. It is the most literal geometric summary of the split.
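The MM line can be sketched in a few lines of numpy on toy clusters; the data, dimensions, and threshold choice are made up for illustration.

```python
import numpy as np

# Difference-of-means (MM) direction on toy activations: average true state
# minus average false state, then normalize.
rng = np.random.default_rng(1)
true_acts = rng.normal(size=(200, 32)) + 1.5
false_acts = rng.normal(size=(200, 32)) - 1.5

mm = true_acts.mean(axis=0) - false_acts.mean(axis=0)
mm /= np.linalg.norm(mm)

# Score each example by its projection; threshold at the midpoint.
mid = 0.5 * (true_acts.mean(axis=0) + false_acts.mean(axis=0)) @ mm
acc = ((true_acts @ mm > mid).mean() + (false_acts @ mm < mid).mean()) / 2
print(acc)   # near 1.0 on this easy toy split
```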

Logistic regression (LR)

This learns a straight-line classifier that separates true from false as well as it can. It often scores well, but that alone doesn't make it the most interpretable direction.
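A bare-bones version of that classifier, trained with plain gradient descent on toy data; the notebook itself may use sklearn or torch, and the learning rate, step count, and data here are all illustrative.

```python
import numpy as np

# Minimal logistic-regression probe: 1 = true, 0 = false.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(150, 16)) + 1.0,
               rng.normal(size=(150, 16)) - 1.0])
y = np.array([1.0] * 150 + [0.0] * 150)

w = np.zeros(16)
b = 0.0
for _ in range(200):
    z = np.clip(X @ w + b, -30, 30)          # clip for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))             # predicted P(true)
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.5 * (p - y).mean()                # gradient step on bias

acc = (((X @ w + b) > 0) == (y == 1)).mean()
print(acc)   # the LR direction is w / ||w||
```

High accuracy here says the line separates labels well, which, as the text notes, is a different claim from it being the model's own mechanism.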

The probe is not reading “truth in the whole model.” It is reading one chosen hidden state at one chosen token position, then asking what a simple straight-line rule can recover from that one site.
Readout site

Token position is part of the mechanism, not bookkeeping.

Geometry

A visible split says the feature is present enough to read out with one straight-line rule, rather than needing a more complicated nonlinear model.

Stronger claim

You still need a reason to think the model actually uses that direction while answering.

02 · dominant mechanism board

How does a readable direction turn into a stronger mechanism claim?

Follow one continuous chain: choose the hidden state, find the truth-related direction inside it, then push on that same direction and see whether the answer moves.

Find the direction inside the activation geometry
1 · choose state

Read the final-token activation from the true and false sentence pair at a specific layer. An activation is just the hidden-state vector the model produced there.

2 · reveal split

Principal component analysis, or PCA, gives a simple view of the activation cloud and shows whether true and false activations separate instead of fully overlapping.

3 · name the line

One line comes from class averages, difference-of-means or MM. Another comes from a trained linear classifier, logistic regression or LR. The chapter cares about whether those lines tell a believable mechanism story, not only whether they score well on labels.
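On one shared toy dataset, the two lines can be computed side by side and compared; they usually point in similar but not identical directions. Everything below is synthetic and only illustrates the comparison.

```python
import numpy as np

# Fit both candidate lines on the same toy activations, then compare them.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(size=(200, 20)) + 1.0,
               rng.normal(size=(200, 20)) - 1.0])
y = np.array([1.0] * 200 + [0.0] * 200)

mm = X[:200].mean(0) - X[200:].mean(0)       # difference-of-means line

w = np.zeros(20)                             # bare-bones logistic regression
for _ in range(300):
    p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
    w -= 0.3 * (X.T @ (p - y)) / len(y)

cos = (mm @ w) / (np.linalg.norm(mm) * np.linalg.norm(w))
print(round(float(cos), 3))   # high cosine: similar lines, different criteria
```

A high but imperfect cosine similarity is the quantitative version of "both are straight lines, but they were chosen for different reasons."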

Push on that same direction and inspect the answer
patch window

The intervention edits that truth-related direction in the few-shot prompt at the intended layers and positions. This is what activation patching means here.

answer shift

If P(TRUE) rises when you add the truth direction and falls when you subtract it, that is stronger evidence that the model is relying on this direction while answering, not that the probe merely found a handy correlation.
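A toy version of this check, with a made-up two-row readout standing in for the model's real answer head; the direction, noise level, and patch strength are all invented for illustration.

```python
import numpy as np

# Project a patched hidden state through a toy readout to get TRUE/FALSE
# logits, then compare P(TRUE) before and after the edit.
rng = np.random.default_rng(3)
h = rng.normal(size=48)                     # hidden state at the answer site
d = rng.normal(size=48); d /= np.linalg.norm(d)
W = np.vstack([d + 0.1 * rng.normal(size=48),    # TRUE row ~ aligned with d
               -d + 0.1 * rng.normal(size=48)])  # FALSE row ~ anti-aligned

def p_true(state):
    logits = W @ state
    e = np.exp(logits - logits.max())       # stable softmax
    return e[0] / e.sum()

base = p_true(h)
up = p_true(h + 4.0 * d)       # add the truth direction
down = p_true(h - 4.0 * d)     # subtract it
print(down, base, up)           # expect down < base < up
```

The monotone pattern, lower when subtracted and higher when added, is the answer-shift signature the chapter treats as stronger evidence.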

Wrong intuition

The best probe is whichever line scores highest on held-out labels.

Better intuition

The more convincing probe is the one that fits the full story: right state, clean geometry, some transfer, and answer-moving intervention.

Carry this into the notebook

Layer choice, token choice, and evidence standard are all part of the interpretability claim.

03 · evidence progression

What makes the case more convincing, step by step?

Read the evidence in order of strength: first you see the split, then you locate it, then you test transfer, then you test intervention.

See a truth split in a simple projection

The notebook’s first striking picture is already geometric: true and false states separate before the strongest causal language arrives. PCA, principal component analysis, is the usual first view because it turns a huge vector cloud into a small picture you can actually read.

Localize where the signal is strongest

Layer sweeps show that truth is not equally readable everywhere. Early-to-mid layers often carry the clearest line.
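A toy layer sweep showing why accuracy can peak at some layers: each fake "layer" carries the same truth signal under a different noise level. The noise schedule is invented purely to illustrate the shape of the sweep.

```python
import numpy as np

# One truth signal, several noise levels standing in for layers.
rng = np.random.default_rng(4)
signal = np.array([1.0] * 100 + [-1.0] * 100)       # +1 true, -1 false
noise_per_layer = [3.0, 1.0, 0.3, 0.3, 1.5, 4.0]    # mid "layers" cleanest

accs = []
for sigma in noise_per_layer:
    acts = signal[:, None] + sigma * rng.normal(size=(200, 8))
    mm = acts[:100].mean(0) - acts[100:].mean(0)     # difference-of-means probe
    mid = 0.5 * (acts[:100].mean(0) + acts[100:].mean(0)) @ mm
    acc = (((acts @ mm > mid) * 2 - 1) == signal).mean()
    accs.append(round(float(acc), 2))

print(accs)   # accuracy peaks where signal-to-noise peaks
```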

Check whether the direction transfers

If the same line still works on a new truth dataset, it looks more like a real truth feature and less like a trick that only fit one dataset.
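A minimal transfer check on synthetic data: fit the direction on dataset A, then evaluate it unchanged on a differently distributed dataset B. Both datasets and the shared "truth axis" are made up.

```python
import numpy as np

# A shared underlying truth axis, two datasets with different spreads.
rng = np.random.default_rng(5)
axis = rng.normal(size=24); axis /= np.linalg.norm(axis)

def make(n, spread):
    base = spread * rng.normal(size=(2 * n, 24))
    base[:n] += 2.0 * axis      # true examples shifted along the truth axis
    base[n:] -= 2.0 * axis      # false examples shifted the other way
    return base

A, B = make(200, 0.5), make(200, 1.0)
mm = A[:200].mean(0) - A[200:].mean(0)   # direction fit on A only
mm /= np.linalg.norm(mm)

acc_B = ((B[:200] @ mm > 0).mean() + (B[200:] @ mm < 0).mean()) / 2
print(acc_B)   # high accuracy on B suggests the direction generalizes
```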

Patch the direction and watch the answer shift

Intervention is the step from “we can see it” to “the model may be using it.” Patching asks what happens when you directly edit the activation, not just read it.

Claim: Truth is readable in the chosen final-token state.
What you should expect: True and false examples pull apart in a simple view, often PCA, before the more technical probe comparison starts.
Why it matters: The feature is present in the activations, not invented by the probe.

Claim: The signal has a location profile.
What you should expect: Some layers, especially early-to-mid ones, look cleaner than others in the sweep.
Why it matters: The chapter is locating a representation, not just announcing that one exists somewhere.

Claim: Classifier quality and mechanism quality can diverge.
What you should expect: LR can score a bit better, while MM can still patch more cleanly.
Why it matters: Best separator is not automatically best mechanism handle.

Claim: Editing the line changes the answer balance.
What you should expect: P(TRUE) and P(FALSE) move after the hidden-state edit used in activation patching.
Why it matters: That is the strongest visible step toward causal involvement.
Think “geometry → location → transfer → intervention.”

If the notebook starts to feel like separate demos, this sequence is the thread that stitches them back into one claim.

evidence tier 1
Geometry

PCA separation says truth is prominently readable in the chosen state.

evidence tier 2
Location

Layer sweeps say the feature is not equally readable everywhere. It has a location profile.

evidence tier 3
Transfer

Cross-dataset success says the line looks more like a truth feature than one dataset trick.

evidence tier 4
Intervention

Patching is the stronger test because it asks whether the answer actually follows the direction.

04 · contrast case

Why compare MM and LR if both are just straight lines?

Because they optimize for different things. One is easier to interpret geometrically, while the other is trained to separate labels as well as it can.

Average true state minus average false state

This is the most geometric version of the probe. It asks what direction points from false activations toward true ones, so its meaning is comparatively legible.

The chapter is not just asking which line classifies best.

It is asking which line still tells the most believable story when you push on the model.

A trained linear classifier direction

This line is optimized for label separation. That can improve raw accuracy, but it can also encourage overclaiming if you treat classification quality as mechanism proof.

The public summary is simple: LR may win on labels, while MM may better expose the direction the model actually seems to use.

05 · same recipe, new targets

What changes in the later sections, and what stays the same?

The recipe stays the same: choose the right representation, read a direction, then test it. What changes is where the relevant signal lives, in one final token, across an assistant response, or spread across a whole request.

target
Truth

A property of the statement itself, so the cleanest readout site is the sentence-ending verdict token.

target
Deception

A property of the model’s stance while answering, so the assistant span matters more than the user prompt alone.

target
High-stakes

A property of the user request, so the signal can be spread across the prompt instead of sitting in one last token.

big lesson
Pooling is a hypothesis

Last-token, assistant-span, and attention pooling are different bets about where the property actually lives in the activations.

Truth

Read the sentence-ending token because the property is concentrated in the full-statement verdict.

Deception

Mask to the assistant response because the label is about the model’s stance while it answers, not about the user prompt by itself.

High-stakes requests

Pool across the full request because the risk signal can be distributed across the prompt. This is where an attention probe can help, meaning a probe that learns which tokens should count more heavily.

AUROC is just “how well does the detector rank risky examples above safe ones?” across many thresholds.

The chapter uses this in the high-stakes section because a detector should rank riskier requests above safer ones instead of only looking good at one arbitrary cutoff. AUROC stands for area under the ROC (receiver operating characteristic) curve.
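AUROC has a direct pairwise reading, which a few lines of numpy make concrete; the scores below are hand-picked toy values.

```python
import numpy as np

# AUROC from first principles: the probability that a randomly chosen
# positive (risky) example outscores a randomly chosen negative (safe) one.
risky_scores = np.array([0.9, 0.8, 0.6, 0.4])
safe_scores = np.array([0.7, 0.3, 0.2, 0.1])

wins = (risky_scores[:, None] > safe_scores[None, :]).sum()
ties = (risky_scores[:, None] == safe_scores[None, :]).sum()
auroc = (wins + 0.5 * ties) / (len(risky_scores) * len(safe_scores))
print(auroc)   # → 0.875
```

14 of the 16 risky-vs-safe pairs are ranked correctly, so AUROC is 14/16 = 0.875, regardless of any single threshold.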

An attention probe is the matching readout idea: instead of treating every token equally, it learns where to look in the request before making the prediction.
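A minimal sketch of that readout, with untrained, randomly initialized parameters; a real attention probe would learn the scoring vector `q` and the head `w_head` from labels.

```python
import numpy as np

# Score every token, softmax the scores into weights, pool the token states
# with those weights, then apply a linear head to the pooled vector.
rng = np.random.default_rng(6)
tokens = rng.normal(size=(10, 32))        # 10 token activations, 32-d each
q = rng.normal(size=32)                   # scoring vector (untrained here)
w_head = rng.normal(size=32)              # linear head (untrained here)

scores = tokens @ q
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over token positions
pooled = weights @ tokens                 # weighted average of token states
logit = pooled @ w_head                   # final prediction score

print(weights.round(2), float(logit))
```

Setting all weights to 1/n recovers mean pooling, and a one-hot weight on the last position recovers last-token reading, which is why pooling is best thought of as a hypothesis about where the property lives.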

06 · notebook bridge

How should you read the notebook once you open it?

Treat each figure, table, and hook as one check in a larger argument. Ask what claim it is trying to establish before you get lost in the code.

Projection plot

Ask whether true and false states visually separate before a trained classifier gets involved. That first projection is usually PCA.

Layer curve

Ask where the signal actually peaks instead of assuming every layer is equivalent.

Transfer table

Ask whether the direction still works like a general truth feature, or whether it only fit one narrow dataset.

Token indexing

Check that the notebook really reads the final real token rather than padding or a stray position.
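A quick sketch of that check, assuming the usual HF-style attention mask convention (1 = real token, 0 = padding); the tensors here are made up.

```python
import numpy as np

# With right-padding, position -1 is padding, not the final real token.
attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1]])

last_real = attention_mask.sum(axis=1) - 1   # index of final real token
print(last_real)                              # → [3 5]

# Read hidden states at that index instead of blindly at -1:
hidden = np.arange(2 * 6 * 4).reshape(2, 6, 4)   # (batch, seq, d_model)
readout = hidden[np.arange(2), last_real]         # shape (2, 4)
```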

Probe construction

Check whether the line comes from class averages, MM, or from a trained separating classifier, LR.

Intervention hook

Check that the edit lands at the intended layers and prompt positions. That is the concrete form of activation patching.
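A toy stand-in for that hook, showing layer-and-position targeting without any real model; in practice PyTorch forward hooks or TransformerLens hook points play this role, and every name below is invented.

```python
import numpy as np

# A fake two-stage forward pass with a hook point in between.
rng = np.random.default_rng(7)
d = rng.normal(size=16); d /= np.linalg.norm(d)   # candidate truth direction

def forward(x, hook=None):
    h = np.tanh(x)              # stand-in for the layers before the hook
    if hook is not None:
        h = hook(h)             # edit the intermediate activation here
    return h.sum(axis=-1)       # stand-in for the rest of the model

def add_direction(h, pos=2, alpha=5.0):
    h = h.copy()                # never mutate the original activation
    h[pos] += alpha * d         # patch only the intended position
    return h

x = rng.normal(size=(4, 16))    # 4 "token positions"
base = forward(x)
patched = forward(x, hook=add_direction)

# Only the targeted position's downstream output should change.
print(base[2] != patched[2],
      np.allclose(np.delete(base, 2), np.delete(patched, 2)))   # → True True
```

The check to carry into the notebook is exactly this: the edit changes downstream values at the targeted layer and position, and nothing else.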

07 · carry-away mental model

What should you be able to explain before opening the code?

Readable is not the same as used

A clean classifier can still be reading a correlated feature rather than the steering wheel of the model’s behavior.

Truth is not the same as likelihood

A direction that just tracks what sounds plausible can overperform on one dataset while missing the real concept.

Deception probes can pick up confounds

The source warns that morality, tone, or partial-honesty patterns can leak into what looks like deception detection.

Aggregation can blur the signal

If only part of a response is deceptive, averaging across too much text can hide the important local evidence.

Read the right hidden state, extract the truth-related direction, localize where it is strongest, and patch it back into the model to see whether the answer shifts.

That is the jump from a readable feature to a more plausibly used feature, and it is the real payoff of the chapter. A hidden state is just the model's internal vector, and patching asks what happens when you edit it.

Main idea

A probe is strongest when geometry, transfer, and intervention all line up around one chosen readout site, meaning one carefully defined token or span inside the model.

Wrong intuition

High classification accuracy alone proves the model is using that exact feature.

Better intuition

Readable geometry is the start. Answer-moving intervention is the stronger test.