Linear probes ask whether one internal state already contains a clean, usable concept
The chapter starts with a simple question and then tightens it. First: at one chosen token and layer, does the model’s internal state already separate true from false? Then: if you push on the direction that does the separating, does the answer move with it?
The final period is the readout site because by the end of the sentence the model has already combined the whole claim there, so the hidden state at that token is the compact summary we want to read.
Probing always reads one chosen token-and-layer site, not the whole model in the abstract. Here, a hidden state just means the model’s internal activation vector at that site.
Instead of stopping at classification, add or subtract that truth-related line at the same hidden-state site and watch whether the answer moves. This kind of direct edit is what the chapter calls activation patching.
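A minimal sketch of that readout, assuming a HuggingFace causal LM; the model name, layer index, and sentence pair are illustrative placeholders, not the chapter's exact setup.

```python
# Minimal sketch: read the hidden state at the final token of one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the chapter's model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def final_token_state(text: str, layer: int) -> torch.Tensor:
    """Hidden-state vector at the last token of `text` at `layer`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; layer L is hidden_states[L]
    return out.hidden_states[layer][0, -1]  # shape: (d_model,)

h_true = final_token_state("The city of Paris is in France.", layer=6)
h_false = final_token_state("The city of Paris is in Egypt.", layer=6)
```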
What is the chapter actually measuring, and where is it measuring it?
Keep one recurring example on screen and tie every later term back to it: which token are we reading, what hidden state lives there, what line separates true from false, and what happens if we push on that line?
By sentence end, the model has accumulated enough context that a verdict-like hidden state, its internal vector for that final token, exists there.
The geometry plot, layer sweep, and patching test all stay anchored to one comparable state, rather than quietly switching what gets measured.
PCA gives you a simple low-dimensional view of a huge activation cloud, so you can see whether true and false examples already pull apart.
The difference-of-means direction (MM) is the straight line from the average false state to the average true state. It is the most literal geometric summary of the split.
Logistic regression (LR) learns a straight-line classifier that separates true from false as well as it can. It often scores well, but that alone doesn't make it the most interpretable direction.
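A sketch of all three views, assuming `X_true` and `X_false` are (n, d_model) arrays of final-token states collected as above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = np.vstack([X_true, X_false])
y = np.array([1] * len(X_true) + [0] * len(X_false))

# PCA: a 2-D picture of the activation cloud, read before any classifier.
proj = PCA(n_components=2).fit_transform(X)

# Difference of means (MM): the line from the false mean to the true mean.
mm_direction = X_true.mean(axis=0) - X_false.mean(axis=0)
mm_direction /= np.linalg.norm(mm_direction)

# Logistic regression (LR): the best straight-line label separator it can fit.
lr = LogisticRegression(max_iter=1000).fit(X, y)
lr_direction = lr.coef_[0] / np.linalg.norm(lr.coef_[0])
```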
Token position is part of the mechanism, not bookkeeping.
A visible split says the feature is present enough to read out with one straight-line rule, rather than needing a more complicated nonlinear model.
You still need a reason to think the model actually uses that direction while answering.
How does a readable direction turn into a stronger mechanism claim?
Follow one continuous chain: choose the hidden state, find the truth-related direction inside it, then push on that same direction and see whether the answer moves.
Read the final-token activation from the true and false sentence pair at a specific layer. An activation is just the hidden-state vector the model produced there.
Principal component analysis, or PCA, gives a simple view of the activation cloud and shows whether true and false activations separate instead of fully overlapping.
One line comes from class averages (difference of means, MM). Another comes from a trained linear classifier (logistic regression, LR). The chapter cares about whether those lines tell a believable mechanism story, not only whether they score well on labels.
The intervention edits the hidden state along that truth-related direction in the few-shot prompt, at the intended layers and positions. This is what activation patching means here.
If P(TRUE) rises when you add the truth direction and falls when you subtract it, that is stronger evidence that the model is relying on this direction while answering, not that the probe merely found a handy correlation.
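A hedged sketch of that push, continuing the gpt2-style setup above. The hook shifts every position along the direction for simplicity, where the chapter targets specific positions; the prompt, answer tokens, and alpha values are all illustrative.

```python
direction = torch.tensor(mm_direction, dtype=torch.float32)

def make_hook(alpha: float):
    # Forward hook that shifts the block's hidden states along the direction.
    def hook(module, inputs, output):
        return (output[0] + alpha * direction,) + output[1:]
    return hook

def p_true(prompt: str, alpha: float, layer: int = 6) -> float:
    handle = model.transformer.h[layer].register_forward_hook(make_hook(alpha))
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]
    finally:
        handle.remove()
    # First sub-token of each answer word stands in for the full word.
    true_id = tokenizer(" TRUE").input_ids[0]
    false_id = tokenizer(" FALSE").input_ids[0]
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()

prompt = "The city of Paris is in Egypt. TRUE or FALSE? Answer:"
base = p_true(prompt, alpha=0.0)
boosted = p_true(prompt, alpha=4.0)      # expect P(TRUE) to rise
suppressed = p_true(prompt, alpha=-4.0)  # expect P(TRUE) to fall
```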
By the accuracy standard, the best probe is whichever line scores highest on held-out labels. By the mechanism standard, the more convincing probe is the one that fits the full story: right state, clean geometry, some transfer, and an answer-moving intervention.
Layer choice, token choice, and evidence standard are all part of the interpretability claim.
What makes the case more convincing, step by step?
Read the evidence in order of strength: first you see the split, then you locate it, then you test transfer, then you test intervention.
The notebook’s first striking picture is already geometric: true and false states separate before any causal claim is even on the table. PCA, principal component analysis, is the usual first view because it turns a huge vector cloud into a small picture you can actually read.
Layer sweeps show that truth is not equally readable everywhere. Early-to-mid layers often carry the clearest line.
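A layer-sweep sketch, reusing `final_token_state` from the earlier block; `true_texts` and `false_texts` are assumed lists of labeled sentences.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def states_at(texts, layer):
    return np.stack([final_token_state(t, layer).numpy() for t in texts])

layer_scores = []
for layer in range(1, model.config.n_layer + 1):
    X = np.vstack([states_at(true_texts, layer), states_at(false_texts, layer)])
    y = np.array([1] * len(true_texts) + [0] * len(false_texts))
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    layer_scores.append((layer, acc))
# Expect a profile rather than a flat line, often clearest early-to-mid.
```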
If the same line still works on a new truth dataset, it looks more like a real truth feature and less like a trick that only fit one dataset.
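The transfer check is small once the activations exist; `X_a, y_a` and `X_b, y_b` are assumed activation/label arrays from two different truth datasets.

```python
from sklearn.linear_model import LogisticRegression

probe = LogisticRegression(max_iter=1000).fit(X_a, y_a)
in_dataset_acc = probe.score(X_a, y_a)  # training-set accuracy, an optimistic baseline
transfer_acc = probe.score(X_b, y_b)    # staying close to in-dataset accuracy is the good sign
```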
Intervention is the step from “we can see it” to “the model may be using it.” Patching asks what happens when you directly edit the activation, not just read it.
| Claim | What you should expect | Why it matters |
|---|---|---|
| Truth is readable in the chosen final-token state. | True and false examples pull apart in a simple view, often PCA, before the more technical probe comparison starts. | The feature is present in the activations, not invented by the probe. |
| The signal has a location profile. | Some layers, especially early-to-mid ones, look cleaner than others in the sweep. | The chapter is locating a representation, not just announcing that one exists somewhere. |
| Classifier quality and mechanism quality can diverge. | LR can score a bit better, while MM can still patch more cleanly. | Best separator is not automatically best mechanism handle. |
| Editing the line changes the answer balance. | P(TRUE) and P(FALSE) move after the hidden-state edit used in activation patching. | That is the strongest visible step toward causal involvement. |
If the notebook starts to feel like separate demos, this sequence is the thread that stitches them back into one claim.
PCA separation says truth is already readable in the chosen state.
Layer sweeps say the feature is not equally present everywhere; it has a location profile.
Cross-dataset success says the line looks more like a genuine truth feature than a one-dataset trick.
Patching is the stronger test because it asks whether the answer actually follows the direction.
Why compare MM and LR if both are just straight lines?
Because they optimize for different things. One is easier to interpret geometrically, while the other is trained to separate labels as well as it can.
Difference of means (MM) is the most geometric version of the probe. It asks what direction points from false activations toward true ones, so its meaning is comparatively legible.
Logistic regression (LR) is optimized for label separation. That can improve raw accuracy, but it can also encourage overclaiming if you treat classification quality as mechanism proof.
The comparison is really asking which line still tells the most believable story when you push on the model.
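Before any pushing, one quick geometric check, assuming the unit-norm `mm_direction` and `lr_direction` from the earlier sketch, is whether the two lines even point the same way:

```python
import numpy as np

# Both directions are unit-norm, so the dot product is their cosine similarity.
cosine = float(np.dot(mm_direction, lr_direction))
# Near 1: the best separator and the mean-difference line roughly agree.
# Much lower: LR is using extra label-separating structure that the simple
# geometric story does not, which makes the patching test more important.
```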
What changes in the later sections, and what stays the same?
The recipe stays the same: choose the right representation, read a direction, then test it. What changes is where the relevant signal lives, in one final token, across an assistant response, or spread across a whole request.
Truth is a property of the statement itself, so the cleanest readout site is the sentence-ending verdict token.
Deception is a property of the model’s stance while answering, so the assistant span matters more than the user prompt alone.
High-stakes risk is a property of the user request, so the signal can be spread across the prompt instead of sitting in one last token.
Last-token, assistant-span, and attention pooling are different bets about where the property actually lives in the activations.
Read the sentence-ending token because the property is concentrated in the full-statement verdict.
Mask to the assistant response because the label is about the model’s stance while it answers, not about the user prompt by itself.
Pool across the full request because the risk signal can be distributed across the prompt. This is where an attention probe can help, meaning a probe that learns which tokens should count more heavily.
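A sketch of the three bets, assuming `H` is a (seq_len, d_model) array of one prompt's activations and `assistant_mask` is a boolean array marking the assistant's reply tokens:

```python
import numpy as np

def last_token_readout(H):
    """One verdict-like state at the sentence-ending token."""
    return H[-1]

def assistant_span_readout(H, assistant_mask):
    """Average only over the model's own reply tokens."""
    return H[assistant_mask].mean(axis=0)

def full_request_readout(H):
    """Uniform pooling when the signal may be spread across the prompt."""
    return H.mean(axis=0)
```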
The chapter uses AUROC in the high-stakes section because a detector should rank riskier requests above safer ones instead of only looking good at one arbitrary cutoff. AUROC stands for area under the ROC curve.
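The metric itself is one call, assuming `y_risk` holds held-out labels and `probe_scores` the probe's outputs:

```python
from sklearn.metrics import roc_auc_score

# AUROC is a ranking score, not a single-threshold accuracy.
auroc = roc_auc_score(y_risk, probe_scores)  # 0.5 is chance, 1.0 is perfect ranking
```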
An attention probe is the matching readout idea: instead of treating every token equally, it learns where to look in the request before making the prediction.
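A minimal attention-probe sketch in PyTorch; the single-query design and shapes are assumptions, not the chapter's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)     # how much each token should count
        self.classify = nn.Linear(d_model, 1)  # verdict on the pooled state

    def forward(self, H: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq, d_model); mask: (batch, seq), True at real tokens.
        w = self.score(H).squeeze(-1).masked_fill(~mask, float("-inf"))
        attn = torch.softmax(w, dim=-1)             # learned token weights
        pooled = (attn.unsqueeze(-1) * H).sum(dim=1)
        return self.classify(pooled).squeeze(-1)    # one risk logit per example
```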
How should you read the notebook once you open it?
Treat each figure, table, and hook as one check in a larger argument. Ask what claim it is trying to establish before you get lost in the code.
Ask whether true and false states visually separate before a trained classifier gets involved. That first projection is usually PCA.
Ask where the signal actually peaks instead of assuming every layer is equivalent.
Ask whether the direction still works like a general truth feature, or whether it only fit one narrow dataset.
Check that the notebook really reads the final real token rather than padding or a stray position; a sketch of this check follows the list.
Check whether the line comes from class averages (MM) or from a trained separating classifier (LR).
Check that the edit lands at the intended layers and prompt positions. That is the concrete form of activation patching.
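A sketch of the final-real-token check under right padding, continuing the gpt2-style setup above; `texts` and `layer` are placeholders.

```python
import torch

tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
batch = tokenizer(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
# With right padding, the last real token sits at (number of real tokens - 1),
# not at index -1, which may be padding.
last_real = batch["attention_mask"].sum(dim=1) - 1
H = out.hidden_states[layer]                      # (batch, seq, d_model)
states = H[torch.arange(H.size(0)), last_real]    # final real token per example
```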
What should you be able to explain before opening the code?
A clean classifier can still be reading a correlated feature rather than the steering wheel of the model’s behavior.
A direction that just tracks what sounds plausible can overperform on one dataset while missing the real concept.
The source warns that morality, tone, or partial-honesty patterns can leak into what looks like deception detection.
If only part of a response is deceptive, averaging across too much text can hide the important local evidence.
That is the jump from a readable feature to a more plausibly used feature, and it is the real payoff of the chapter. A hidden state is just the model’s internal vector, and patching asks what happens when you edit it.
A probe is strongest when geometry, transfer, and intervention all line up around one chosen readout site, meaning one carefully defined token or span inside the model.
High classification accuracy alone does not prove the model is using that exact feature.
Readable geometry is the start. Answer-moving intervention is the stronger test.