ARENA Fundamentals • Vision models • CNNs & ResNets

From flat pixels to trainable feature hierarchies

Why image models need convolution, why deep stacks need residual paths, and how those ideas become a faithful PyTorch ResNet34 you can actually reuse.

Core move
Replace a flattened image classifier with shared local detectors, then stack those detectors inside residual blocks that stay trainable at depth.
Core question
What architectural constraints make image models both data-efficient and optimizable?
Core mechanism
Convolution supplies locality and parameter sharing, while residual addition and batch norm keep deeper feature hierarchies trainable.
Core payoff
You can assemble ResNet34 from custom PyTorch modules, copy torchvision weights, and repurpose the backbone for CIFAR-10 feature extraction.
Locality > flattening · Shared filters · Residual gradient path · BatchNorm train/eval split · Pretrained transfer
Explain the vision mismatch
I can explain why flattening a shifted image makes a plain MLP relearn the same pattern in many positions.
Reason about conv layers
I can describe what a Conv2d filter detects, how stride and padding change output shape, and why shared weights matter.
Track the training object
I can explain why the model returns logits, how cross_entropy uses them, and why softmax is not baked into the classifier.
Explain residual optimization
I can explain how a residual branch plus identity path helps deep networks keep gradients usable.
Reconstruct ResNet34
I can sketch the stem, block groups, [3,4,6,3] layout, projection branch rule, and classifier head.
Explain transfer and fidelity
I can explain why BatchNorm2d must respect train/eval mode, why parameter order matters for state_dict copy, and why only the new head should remain trainable in feature extraction.
Pass when 6/8 diagnostic items are correct. Must-pass: residual-path intuition, BatchNorm train/eval behavior, ResNet34 group structure, pretrained-weight-copy constraint.
01 · thesis + persistent toy problem

The same digit, shifted a few pixels

Start with a handwritten digit that moves slightly inside the image frame. To a flattened MLP, that shift looks like a very different input vector. To a convolutional model, it is still the same local stroke pattern appearing in a new position. The whole lesson builds from that one contrast: first local detection, then feature hierarchies, then stable deep training, then transfer.

One visual thesis
Read this as one mechanism surface, not four unrelated cards: flattening destroys spatial bias, convolution restores reusable local detection, and residualized depth turns those local detectors into transferable features.
Flattened view
The digit is turned into one long vector. Nearby pixels in the image no longer have a privileged relationship in the representation.
digit A
digit B
Shifted image, scrambled coordinates
A small translation changes many vector positions at once, so a dense layer does not automatically reuse what it learned about this pattern elsewhere.
pixel 19 · pixel 20 · pixel 21 · pixel 22 → pixel 143 · pixel 144 · pixel 145 · pixel 146
same stroke, different vector slots
Shared local detector
One filter can slide over the image and fire on the same stroke pattern wherever it appears.
3x3 filter: 1 0 1 / 0 1 0 / 0 1 0
fires at position A · fires at position B
Hierarchy + stable depth
Early local detectors combine into larger features, then residual paths and batch norm keep the deeper stack trainable enough to reach a full ResNet.
edge → stroke → part → digit → logits
feature hierarchy gets wider and deeper
Transition 1
Flattening loses spatial bias.
Transition 2
Convolution reuses the same detector across positions.
Transition 3
Deep composition needs optimization scaffolding, not just more layers.
Claim
The lesson is not just “CNNs classify images.” It is “the right inductive bias plus the right optimization scaffold turns local visual patterns into reusable deep features.”
Unlock
The toy problem keeps the entire page grounded. It explains MLP mismatch, conv reuse, pooling compression, residual depth, and transfer in one line.
Contrast
MLP: same digit, many different coordinates. CNN: same filter, same local pattern, new location.
Mechanism
SimpleMLP → Conv2d → pooling → residual block → ResNet34 → frozen backbone + linear head
Warning
Stride internals are useful, but they are appendix material. The core path is about locality, stable depth, and faithful transfer.
02 · prerequisite and concept graph

Build the CNN story as one dependency chain

The page starts with trainable modules and logits, then sharpens the one question that matters for vision: what structure does the model preserve when the image shifts?

flowchart TD
  A[nn.Module + nn.Parameter]
  B[Logits + cross-entropy]
  C[SimpleMLP baseline]
  D[Image locality mismatch]
  E[Conv2d: local shared filters]
  F[Pooling + shape compression]
  G[Deep stacks degrade]
  H[Residual path + BatchNorm]
  I[ResNet34 assembly]
  J[Pretrained parity → feature extraction]

  A --> C
  B --> C
  C --> D
  D --> E
  E --> F
  F --> G
  G --> H
  H --> I
  I --> J
              
Roots
nn.Module + nn.Parameter, logits + cross_entropy
Unlock path
SimpleMLP baseline → mismatch → conv → residualized deep stack
Evidence path
residual intuition → train/eval behavior → weight-copy fidelity → feature extraction payoff
Payoff
Custom-build understanding becomes pretrained-model reuse.
Bigger MLPs solve the same problem anyway.
Wrong because images have translation structure that dense weights do not encode.
Residual blocks are just extra depth.
Wrong because the identity path is the optimization story.
BatchNorm is a tiny implementation detail.
Wrong because train/eval behavior changes outputs and matters for transfer.
A visually similar ResNet is close enough.
Wrong because parameter order and branch declarations affect weight-copy compatibility.
03 · core mechanism model

Follow the chapter’s actual build path, not a generic CNN slogan

The source chapter climbs in a specific order: build a trainable MLP, notice why flattened pixels are the wrong bias, swap in local shared filters and pooling, then add residual blocks and BatchNorm so a faithful ResNet34 can inherit pretrained weights and become a feature extractor.

same digit
same local stroke
same reusable feature path
1 · mismatch

Flattening breaks coordinate reuse

A small shift changes lots of vector slots, so the MLP has to relearn the same stroke in many places.
position A
position B
SimpleMLP can optimize logits while still using the wrong image bias
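A minimal sketch makes the scramble concrete, assuming a synthetic stroke rather than a real MNIST digit:

```python
import torch

# Illustrative tensor, not MNIST: a short vertical stroke in a 28x28 frame.
img = torch.zeros(28, 28)
img[10:18, 5:7] = 1.0

# Same stroke, shifted two pixels to the right.
shifted = torch.roll(img, shifts=2, dims=1)

flat, flat_shifted = img.flatten(), shifted.flatten()
changed = (flat != flat_shifted).sum().item()
print(f"{changed} of {flat.numel()} flattened slots changed")  # 32 of 784
```

A dense layer treats those changed slots as unrelated coordinates; nothing in its weights ties the two positions together.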
2 · local reuse

One shared filter follows the pattern

Convolution asks the same question at every location, then max pooling keeps the strongest local evidence.
3x3 filter: 1 0 1 / 0 1 0 / 0 1 0
fires at position A · fires at position B
Conv2d
Same detector, new position.
Feature map
Records where the stroke appears.
MaxPool2d
Compresses space, keeps the strongest hit.
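The reuse claim can be checked directly with a hand-built kernel; the sketch below uses F.conv2d with an illustrative vertical-stroke detector, not a learned filter.

```python
import torch
import torch.nn.functional as F

# Illustrative 3x3 vertical-stroke detector (ones down the center column).
kernel = torch.tensor([[0., 1., 0.],
                       [0., 1., 0.],
                       [0., 1., 0.]]).reshape(1, 1, 3, 3)

img = torch.zeros(1, 1, 9, 9)
img[0, 0, 3:6, 2] = 1.0                      # stroke at column 2
shifted = torch.roll(img, shifts=4, dims=3)  # same stroke at column 6

resp_a, resp_b = F.conv2d(img, kernel), F.conv2d(shifted, kernel)
print(resp_a.max().item(), resp_b.max().item())        # 3.0 3.0: same strength
print(resp_a.argmax().item(), resp_b.argmax().item())  # peak moves with the stroke
```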
3 · stable depth

Residual blocks turn depth into usable depth

The block learns a correction on the left while the right branch preserves an additive route for signal and gradients.
Left branch
conv → bn → relu → conv → bn
learn the update
Right branch
identity
or 1x1 conv + bn
add, then ReLU
Why it matters
Residual addition keeps the optimization path open when the stack gets deep.
BatchNorm rule
Train mode uses batch stats, eval mode uses running stats.
4 · faithful payoff

ResNet34 becomes transferable only if the build is exact

The chapter payoff is not just a deeper CNN. It is a faithful architecture that can inherit weights and freeze its backbone.
Stem
7x7 conv → bn → relu → maxpool
Body
[3,4,6,3] groups; the first block of each group after the first downsamples with stride 2.
Head
avgpool → linear, then swap the classifier for transfer.
Fidelity rule
Module, parameter, and buffer order must match for state_dict parity.
Transfer rule
Freeze the backbone, train only the replacement head.
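One way to pin this layout down is to inspect torchvision's reference model directly; the sketch below assumes a recent torchvision and uses only its public layer attributes.

```python
from torchvision import models

# Reference check against torchvision's ResNet34 (no weights needed here).
resnet = models.resnet34(weights=None)

# Body layout: [3, 4, 6, 3] blocks per group.
print([len(g) for g in (resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)])

# Projection rule: only the first block of groups 2-4 changes shape, so only
# it carries a 1x1 conv + bn on its skip path; layer1 blocks keep identity skips.
print(resnet.layer1[0].downsample)  # None -> identity skip
print(resnet.layer2[0].downsample)  # Sequential(1x1 Conv2d stride 2, BatchNorm2d)
```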
Build checkpoints

The chapter only works if these four moves happen in order

Baseline
SimpleMLP plus logits and F.cross_entropy proves the training scaffold first.
Bias fix
Conv2d restores locality and shared reuse for translated images.
Optimization fix
ResidualBlock plus BatchNorm2d keeps the hierarchy trainable.
Do not blur these constraints
Projection rule
The skip path stays identity only when stride and channel count preserve shape.
Average pool role
Late global compression is not the same operation as early max pooling.
Weight copy rule
Declaration order is part of the mechanism, because it controls pretrained compatibility.
Wrong intuition
ResNet34 is just a longer CNN.
Correction
ResNet34 is a residualized CNN whose identity paths, normalization, staged downsampling, and exact implementation order are what make the extra depth trainable and the pretrained copy faithful.
04 · claim → evidence matrix

What each central claim needs before you trust it

This matrix now follows the chapter’s real claims and build checks: logits and cross-entropy, translation mismatch, the different roles of max pooling and average pooling, residual optimization, BatchNorm mode behavior, weight-copy parity, and frozen-backbone transfer.

Scale the evidence to the kind of claim

This larger board makes the middle of the page feel less like stacked admin cards and more like one decision: what kind of evidence is actually strong enough for this claim?

T1 observational · Visible architecture effects
Shifted digits, shared-filter reuse, max-pool compression, and the late average-pool collapse are things the learner can inspect directly.
T2 interventional · Change the contract
Residual optimization and BatchNorm2d behavior only become clear when you reason about what changes after an intervention.
T3 structural · Exact fidelity or it fails
Weight-copy parity is structural evidence, because approximate resemblance does not make state_dict loading succeed.
Use T1 for
logits in the baseline build, locality, shared filters, max pooling, and average pooling.
Use T2 for
residual optimization, BatchNorm train/eval behavior, and frozen-backbone transfer.
Use T3 for
parameter order, buffer order, successful state_dict load, and prediction parity.
Claim · Minimum tier · Method · Expected signature
Classification training should use logits, not probabilities baked into the model. · T1 observational · Inspect the baseline training loop where SimpleMLP returns logits and F.cross_entropy consumes them directly. · The classifier emits raw scores for 10 classes, and softmax appears only when probabilities are needed for interpretation.
Flattened MLPs are poorly matched to translated images. · T1 observational · Contrast two shifted digit inputs and show how flattening scrambles coordinate reuse. · Same object, different vector positions, no built-in locality.
Convolutions encode locality and parameter sharing. · T1 observational · Shared-filter storyboard over two shifted images. · The same filter fires on the same stroke pattern in both positions.
Max pooling compresses space while preserving the strongest local evidence. · T1 observational · Before/after feature-map patch panel. · Reduced spatial map with the strongest activation retained.
Average pooling plays a different late-stage role from max pooling. · T1 observational · Compare early MaxPool2d compression to the final AveragePool step before the linear classifier. · Max pooling preserves the strongest local evidence in windows, while average pooling collapses each final channel map to one global feature value.
Residual paths solve an optimization problem, not just an architecture-aesthetics problem. · T2 interventional · Contrast plain deep-stack intuition with the residual identity-path explanation from the lesson arc. · Gradients have an additive route around the nonlinear branch.
BatchNorm2d changes behavior between train and eval mode and is part of stable deep training. · T2 interventional · Show the batch-stat vs running-stat branch logic. · Train uses current batch stats, eval uses running stats, and outputs depend on mode.
Faithful ResNet34 replication requires exact module and parameter order. · T3 structural · Use the chapter's weight-copy procedure plus the left-branch-first declaration note in ResidualBlock. · Successful state_dict transfer and prediction parity depend on matching parameter and buffer ordering.
Pretrained visual features transfer to new tasks when the backbone is frozen and the head is replaced. · T2 interventional · Feature-extraction procedure on CIFAR-10 with requires_grad_(False), a replaced final linear layer, and an optimizer targeting only that head. · Only the new classifier remains trainable, while the copied backbone stays fixed and accuracy rises above chance quickly.
05 · primitive SOP cards

The operations you should already be able to name before coding

Each card captures what the primitive is for, what success looks like, and what failure usually means.

Subclass nn.Module

Purpose
Define trainable blocks with registered parameters and a readable forward path.
Success
Weights appear in parameters(); the module is callable and inspectable.
Failure
Plain tensors never register, or the forward path becomes opaque.
Boundary
Safe to delegate directly from the source contract.
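A minimal sketch of the contract, using a hand-rolled Linear as a stand-in:

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # nn.Parameter registers these tensors; a plain torch.randn attribute
        # would never appear in parameters() and would never train.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.bias

layer = Linear(784, 10)
print(sum(p.numel() for p in layer.parameters()))  # 7850: visible to optimizers
```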

Return logits + use cross_entropy

Purpose
Train on raw class scores, not probabilities baked into the model.
Success
F.cross_entropy consumes batch-by-class logits directly.
Failure
Softmax appears inside the model and blurs the training object.
Boundary
Softmax is for interpretation later, not for the model head.
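A short sketch of the training object, with random tensors standing in for a real model and batch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10)            # stand-in for model(x): batch x classes
labels = torch.randint(0, 10, (32,))

loss = F.cross_entropy(logits, labels)  # log-softmax + NLL happen inside
probs = logits.softmax(dim=-1)          # softmax only when interpreting outputs
```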

Reason about Conv2d shapes

Purpose
Track kernel, stride, padding, channels, and output shape as one object.
Success
Predict the output map and explain what each channel detects.
Failure
Channels and spatial dimensions get mixed, or conv becomes a black box.
Boundary
Simplify formulas visually, but keep the source conventions intact.
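Per spatial dimension the rule is out = floor((in + 2·padding − kernel) / stride) + 1; a quick check with illustrative numbers:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=2, padding=2)
x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)  # torch.Size([1, 16, 14, 14]): (28 + 4 - 5) // 2 + 1 = 14
```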

Build a ResidualBlock

Purpose
Compose a learned left branch with an identity-or-projection right branch.
Success
Identity for matching shapes, 1x1 conv + BatchNorm when shape changes.
Failure
Skip path omitted, projection misunderstood, or left/right ordering drifts.
Boundary
Keep left-branch-first declaration because pretrained copy depends on it.
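A minimal sketch in the lesson's spirit; the attribute names left and right are illustrative, and the exercise's exact naming and declaration order are what a later weight copy depends on.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Left branch declared first: conv -> bn -> relu -> conv -> bn.
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Right branch: identity when shape is preserved, else 1x1 conv + bn.
        if stride == 1 and in_ch == out_ch:
            self.right = nn.Identity()
        else:
            self.right = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.left(x) + self.right(x))
```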

Use BatchNorm2d correctly

Purpose
Normalize per-channel activations while tracking running stats for eval.
Success
Train mode uses batch stats; eval mode uses running stats; affine terms still apply.
Failure
Train and eval are treated as identical, or buffers are mistaken for parameters.
Boundary
Visualize both branches, but keep the buffer semantics explicit.
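A quick mode check, with random activations standing in for real feature maps:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
x = torch.randn(8, 4, 5, 5) * 3 + 1

bn.train()
out_train = bn(x)  # normalizes with batch stats, updates running buffers
bn.eval()
out_eval = bn(x)   # normalizes with running_mean / running_var, no update
print(torch.allclose(out_train, out_eval))  # False: mode changes the output
# running_mean and running_var are buffers: in state_dict(), not parameters().
```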

Freeze backbone for feature extraction

Purpose
Reuse pretrained visual features and train only a replacement classifier head.
Success
All params freeze except the new linear head and the optimizer targets only it.
Failure
The whole network keeps training, turning transfer into training from scratch.
Boundary
Compress loop detail if needed, but preserve the frozen-backbone contract.
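A minimal sketch of the contract, assuming a recent torchvision weights API; the training loop itself is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.requires_grad_(False)                     # freeze every copied weight
model.fc = nn.Linear(model.fc.in_features, 10)  # new head trains by default
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# Still call model.train() / model.eval() around the loop: frozen BatchNorm
# layers switch between batch statistics and running statistics either way.
```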
06 · human-agent orchestration protocol

Keep the protocol light, but concrete

The human decides what understanding matters. The agent helps turn architecture facts, visuals, and checks into reusable teaching structure.

Human
Decides what counts as understanding: locality, residual optimization, and pretrained reuse are the must-keep ideas.
Agent
Converts source contracts into visuals, evidence panels, and diagnostics without changing architecture facts.
Loop
Predict what should be true, falsify loose stories, intervene on the right variable, then update the explanation.
Predict
What should a shifted digit do under an MLP? Under a conv stack?
Falsify
What observation would show the explanation is too vague or too strong?
Intervene
Switch train/eval mode, remove residual identity, or change the branch shape rule.
Update
Tighten the story until the architecture and the expected behavior line up cleanly.
Verify
Stem, [3,4,6,3] group counts, projection-branch rule, and avgpool → linear head all match the source.
Inputs
cnn fill spec plus the canonical ResNet34 source section.
Output
One short QA note with pass/fail bullets and a conflict callout if any architecture wording drifts.
Fail signatures
“Skip path always identity”, wrong group counts, vague downsampling, or a missing projection condition.
Verify
Weights copy first, the backbone freezes, the final linear head is replaced, and only that head stays trainable.
Inputs
cnn fill spec plus the feature-extraction source section.
Output
One short QA note confirming frozen-backbone transfer and the BatchNorm train/eval caveat.
Fail signatures
The page implies whole-model finetuning, drops mode switching, or treats transfer as generic reuse.
07 · diagnostic gate + remediation

Can the learner retell the mechanism from memory?

The gate stays focused on transfer-ready understanding. If something breaks, the repair path points back to the exact board or card that fixes it.

1
Why does a shifted digit create a problem for a flattened MLP but not for a convolutional filter?
2
Why should a classifier trained with cross-entropy return logits rather than probabilities from the final module?
3
Given a conv layer with stride > 1, what changes in the output and why?
4
When can the right branch of a residual block be the identity, and when must it be a learned projection?
5
What optimization problem do residual connections address in deep nets?
6
What changes between BatchNorm2d train mode and eval mode?
7
Why does parameter declaration order matter when copying torchvision weights into a custom ResNet34?
8
In feature extraction, which weights should remain trainable and why?
Pass = at least 6 correct · Must-pass = items 4, 6, 7, 8
Branch 1 · locality confusion
Miss items 1 or 3, then revisit toy-problem frames 1 to 3 and the conv shape card until locality and reuse come back in one sentence.
Branch 2 · optimization confusion
Miss items 4 or 5, then revisit the residual misconception panel and the stable-depth column until the identity path reads like an optimization route, not decoration.
Branch 3 · mode or fidelity confusion
Miss items 6 or 7, then revisit the BatchNorm card and the weight-copy notes until train vs eval and ordering sensitivity are crisp.
Branch 4 · transfer confusion
Miss item 8, then revisit the transfer payoff and primitive 6 until the frozen backbone and replacement head can be named cleanly.
08 · rubric snapshot + release decision

Final rubric snapshot: 86 / 100, SHIP

This page now reflects the final ship-gate outcome rather than a pending-audit placeholder. The last blocker was Section 8 itself, and that blocker is now closed.

Release decision
SHIP Final ship-gate audit result: 86 / 100.
Rubric read
Mastery contract, concept structure, evidence rigor, mechanism design, diagnostics, visual architecture, and governance readiness all clear the release bar at the final gate.
Why it ships
The page is now a frozen canonical artifact instead of a page that says it is merely ready to be judged.
Actual scored snapshot shown
Resolved Section 8 now displays the live score and final release decision.
Old pending gate removed
Resolved The obsolete ready-for-audit framing and stale gate wording are gone.
Canonical release state
Resolved The page now ends with the final audited state required by spec.
09 · supporting / optional layers

Keep the core path clean. Put extra depth here.

These layers help after the main architecture story has already clicked.

Low-level stride appendix

Use this only after the core path is done. Content: as_strided, low-level conv1d, conv2d, and maxpool2d. Framing: deep understanding appendix, not prerequisite for the main ResNet story.
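As a taste of that appendix, a minimal as_strided sketch of 1D convolution; conv1d_minimal is an illustrative helper, and like torch's Conv1d it uses the cross-correlation convention (no kernel flip).

```python
import torch

def conv1d_minimal(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (width,) contiguous, w: (kernel,). Output width = width - kernel + 1.
    k = w.shape[0]
    out_w = x.shape[0] - k + 1
    # Each row of `windows` is a sliding view x[i : i + k]; no copy is made.
    windows = x.as_strided(size=(out_w, k), stride=(x.stride(0), x.stride(0)))
    return windows @ w

x = torch.arange(8.0)
w = torch.tensor([1.0, 0.0, -1.0])
print(conv1d_minimal(x, w))  # tensor([-2., -2., -2., -2., -2., -2.])
```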

Debugging note

Forward hooks can localize NaNs inside a custom ResNet implementation. Keep it short: the goal is to show a practical debugging handle, not to crowd the lesson arc.
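A minimal hook sketch; install_nan_hooks is an illustrative helper, not a library function.

```python
import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module):
    # Report every submodule whose output contains NaN; the first print in a
    # forward pass localizes where NaNs enter the network.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of: {name} ({module.__class__.__name__})")
        return hook
    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if n]

# handles = install_nan_hooks(my_resnet)   # run a forward pass, read the prints
# for h in handles: h.remove()             # then clean up
```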

Datasets ladder

MNIST
Toy problem and baseline MLP mismatch.
ImageNet parity
Faithful ResNet34 copy with pretrained agreement.
CIFAR-10 transfer
Frozen backbone plus new linear head.
Bonus implementation notes that matter in practice

Train mode uses the current batch mean and variance, then updates running buffers. Eval mode uses those running buffers. That is why transfer workflows must respect model.train() and model.eval(), especially when the backbone is frozen.

A rough ResNet lookalike is not enough. The declaration order of left-branch modules and projection modules must line up with torchvision so state_dict keys land on the right tensors and prediction parity can succeed.
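A sketch in the chapter's spirit: pair the two state_dicts by position, which only succeeds when declaration order matches. Here my_resnet is assumed to be the custom ResNet34 from this lesson.

```python
from torchvision import models

def copy_weights(my_resnet, pretrained) -> None:
    # Zip the state_dicts together by position: this is exactly why module,
    # parameter, and buffer declaration order matters. BatchNorm running-stat
    # buffers come along with the weights.
    mine, theirs = my_resnet.state_dict(), pretrained.state_dict()
    assert len(mine) == len(theirs), "module/parameter/buffer count mismatch"
    state = {
        my_key: their_tensor
        for (my_key, _), (_, their_tensor) in zip(mine.items(), theirs.items())
    }
    my_resnet.load_state_dict(state)

# pretrained = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
# copy_weights(my_resnet, pretrained)  # then compare predictions for parity
```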