ARENA Fundamentals • Vision models • CNNs & ResNets

From flat pixels to trainable feature hierarchies

Why image models need convolution, why deep stacks need residual paths, and how those ideas become a faithful PyTorch ResNet34 you can actually reuse.

Core move
Replace a flattened image classifier with shared local detectors, then stack those detectors inside residual blocks that stay trainable at depth.
Core question
What architectural constraints make image models both data-efficient and optimizable?
Core mechanism
Convolution supplies locality and parameter sharing, while residual addition and batch norm keep deeper feature hierarchies trainable.
Core payoff
You can assemble ResNet34 from custom PyTorch modules, copy torchvision weights, and repurpose the backbone for CIFAR-10 feature extraction.
Locality > flattening · Shared filters · Residual gradient path · BatchNorm train/eval split · Pretrained transfer
Explain the vision mismatch
I can explain why flattening a shifted image makes a plain MLP relearn the same pattern in many positions.
Reason about conv layers
I can describe what a Conv2d filter detects, how stride and padding change output shape, and why shared weights matter.
Track the training object
I can explain why the model returns logits, how cross_entropy uses them, and why softmax is not baked into the classifier.
Explain residual optimization
I can explain how a residual branch plus identity path helps deep networks keep gradients usable.
Reconstruct ResNet34
I can sketch the stem, block groups, [3,4,6,3] layout, projection branch rule, and classifier head.
Explain transfer and fidelity
I can explain why BatchNorm2d must respect train/eval mode, why parameter order matters for state_dict copy, and why only the new head should remain trainable in feature extraction.
Pass when 6/8 diagnostic items are correct. Must-pass: residual-path intuition, BatchNorm train/eval behavior, ResNet34 group structure, pretrained-weight-copy constraint.
01 · thesis + persistent toy problem

The same digit, shifted a few pixels

Start with a handwritten digit that moves slightly inside the image frame. To a flattened MLP, that shift looks like a very different input vector. To a convolutional model, it is still the same local stroke pattern appearing in a new position. The whole lesson builds from that one contrast: first local detection, then feature hierarchies, then stable deep training, then transfer.

One visual thesis
Read this as one mechanism surface, not four unrelated cards: flattening destroys spatial bias, convolution restores reusable local detection, and residualized depth turns those local detectors into transferable features.
Flattened view
The digit is turned into one long vector. Nearby pixels in the image no longer have a privileged relationship in the representation.
digit A
digit B
Shifted image, scrambled coordinates
A small translation changes many vector positions at once, so a dense layer does not automatically reuse what it learned about this pattern elsewhere.
pixel 19 · pixel 20 · pixel 21 · pixel 22 → pixel 143 · pixel 144 · pixel 145 · pixel 146
same stroke, different vector slots
Shared local detector
One filter can slide over the image and fire on the same stroke pattern wherever it appears.
3x3 filter: 1 0 1 / 0 1 0 / 0 1 0
fires at position A · fires at position B
Hierarchy + stable depth
Early local detectors combine into larger features, then residual paths and batch norm keep the deeper stack trainable enough to reach a full ResNet.
edge → stroke → part → digit → logits
feature hierarchy gets wider and deeper
Transition 1
Flattening loses spatial bias.
Transition 2
Convolution reuses the same detector across positions.
Transition 3
Deep composition needs optimization scaffolding, not just more layers.
Claim
The lesson is not just “CNNs classify images.” It is “the right inductive bias plus the right optimization scaffold turns local visual patterns into reusable deep features.”
Unlock
The toy problem keeps the entire page grounded. It explains MLP mismatch, conv reuse, pooling compression, residual depth, and transfer in one line.
Contrast
MLP: same digit, many different coordinates. CNN: same filter, same local pattern, new location.
Mechanism
SimpleMLP → Conv2d → pooling → residual block → ResNet34 → frozen backbone + linear head
Warning
Stride internals are useful, but they are appendix material. The core path is about locality, stable depth, and faithful transfer.
02 · prerequisite and concept graph

Build the CNN story as one dependency chain

The page starts with trainable modules and logits, then sharpens the one question that matters for vision: what structure does the model preserve when the image shifts?

flowchart TD
  A[nn.Module + nn.Parameter]
  B[Logits + cross-entropy]
  C[SimpleMLP baseline]
  D[Image locality mismatch]
  E[Conv2d: local shared filters]
  F[Pooling + shape compression]
  G[Deep stacks degrade]
  H[Residual path + BatchNorm]
  I[ResNet34 assembly]
  J[Pretrained parity → feature extraction]

  A --> C
  B --> C
  C --> D
  D --> E
  E --> F
  F --> G
  G --> H
  H --> I
  I --> J
              
Roots
nn.Module + nn.Parameter, logits + cross_entropy
Unlock path
SimpleMLP baseline → mismatch → conv → residualized deep stack
Evidence path
residual intuition → train/eval behavior → weight-copy fidelity → feature extraction payoff
Payoff
Custom-build understanding becomes pretrained-model reuse.
Bigger MLPs solve the same problem anyway.
Wrong because images have translation structure that dense weights do not encode.
Residual blocks are just extra depth.
Wrong because the identity path is the optimization story.
BatchNorm is a tiny implementation detail.
Wrong because train/eval behavior changes outputs and matters for transfer.
A visually similar ResNet is close enough.
Wrong because parameter order and branch declarations affect weight-copy compatibility.
03 · core mechanism model

Follow the chapter’s actual build path, not a generic CNN slogan

The source chapter climbs in a specific order: build a trainable MLP, notice why flattened pixels are the wrong bias, swap in local shared filters and pooling, then add residual blocks and BatchNorm so a faithful ResNet34 can inherit pretrained weights and become a feature extractor.

same digit
same local stroke
same reusable feature path
1 · mismatch

Flattening breaks coordinate reuse

A small shift changes lots of vector slots, so the MLP has to relearn the same stroke in many places.
position A
position B
SimpleMLP can optimize logits while still using the wrong image bias
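A minimal sketch makes the scramble concrete, assuming a synthetic stroke rather than a real MNIST digit:

```python
import torch

# Illustrative tensor, not MNIST: a short vertical stroke in a 28x28 frame.
img = torch.zeros(28, 28)
img[10:18, 5:7] = 1.0

# Same stroke, shifted two pixels to the right.
shifted = torch.roll(img, shifts=2, dims=1)

flat, flat_shifted = img.flatten(), shifted.flatten()
changed = (flat != flat_shifted).sum().item()
print(f"{changed} of {flat.numel()} flattened slots changed")  # 32 of 784
```

A dense layer treats those changed slots as unrelated coordinates; nothing in its weights ties the two positions together.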
2 · local reuse

One shared filter follows the pattern

Convolution asks the same question at every location, then max pooling keeps the strongest local evidence.
3x3 filter: 1 0 1 / 0 1 0 / 0 1 0
fires at position A · fires at position B
Conv2d
Same detector, new position.
Feature map
Records where the stroke appears.
MaxPool2d
Compresses space, keeps the strongest hit.
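The reuse claim can be checked directly with a hand-built kernel; the sketch below uses F.conv2d with an illustrative vertical-stroke detector, not a learned filter.

```python
import torch
import torch.nn.functional as F

# Illustrative 3x3 vertical-stroke detector (ones down the center column).
kernel = torch.tensor([[0., 1., 0.],
                       [0., 1., 0.],
                       [0., 1., 0.]]).reshape(1, 1, 3, 3)

img = torch.zeros(1, 1, 9, 9)
img[0, 0, 3:6, 2] = 1.0                      # stroke at column 2
shifted = torch.roll(img, shifts=4, dims=3)  # same stroke at column 6

resp_a, resp_b = F.conv2d(img, kernel), F.conv2d(shifted, kernel)
print(resp_a.max().item(), resp_b.max().item())        # 3.0 3.0: same strength
print(resp_a.argmax().item(), resp_b.argmax().item())  # peak moves with the stroke
```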
3 · stable depth

Residual blocks turn depth into usable depth

The block learns a correction on the left while the right branch preserves an additive route for signal and gradients.
Left branch
conv → bn → relu → conv → bn
learn the update
Right branch
identity
or 1x1 conv + bn
add, then ReLU
Why it matters
Residual addition keeps the optimization path open when the stack gets deep.
BatchNorm rule
Train mode uses batch stats, eval mode uses running stats.
4 · faithful payoff

ResNet34 becomes transferable only if the build is exact

The chapter payoff is not just a deeper CNN. It is a faithful architecture that can inherit weights and freeze its backbone.
Stem
7x7 conv → bn → relu → maxpool
Body
[3,4,6,3] groups; the first block of each group after the first downsamples with stride 2.
Head
avgpool → linear, then swap the classifier for transfer.
Fidelity rule
Module, parameter, and buffer order must match for state_dict parity.
Transfer rule
Freeze the backbone, train only the replacement head.
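One way to pin this layout down is to inspect torchvision's reference model directly; the sketch below assumes a recent torchvision and uses only its public layer attributes.

```python
from torchvision import models

# Reference check against torchvision's ResNet34 (no weights needed here).
resnet = models.resnet34(weights=None)

# Body layout: [3, 4, 6, 3] blocks per group.
print([len(g) for g in (resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)])

# Projection rule: only the first block of groups 2-4 changes shape, so only
# it carries a 1x1 conv + bn on its skip path; layer1 blocks keep identity skips.
print(resnet.layer1[0].downsample)  # None -> identity skip
print(resnet.layer2[0].downsample)  # Sequential(1x1 Conv2d stride 2, BatchNorm2d)
```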
Build checkpoints

The chapter only works if these four moves happen in order

Baseline
SimpleMLP plus logits and F.cross_entropy proves the training scaffold first.
Bias fix
Conv2d restores locality and shared reuse for translated images.
Optimization fix
ResidualBlock plus BatchNorm2d keeps the hierarchy trainable.
Do not blur these constraints
Projection rule
The skip path stays identity only when stride and channel count preserve shape.
Average pool role
Late global compression is not the same operation as early max pooling.
Weight copy rule
Declaration order is part of the mechanism, because it controls pretrained compatibility.
Wrong intuition
ResNet34 is just a longer CNN.
Correction
ResNet34 is a residualized CNN whose identity paths, normalization, staged downsampling, and exact implementation order are what make the extra depth trainable and the pretrained copy faithful.
04 · claim → evidence matrix

What each central claim needs before you trust it

This matrix now follows the chapter’s real claims and build checks: logits and cross-entropy, translation mismatch, the different roles of max pooling and average pooling, residual optimization, BatchNorm mode behavior, weight-copy parity, and frozen-backbone transfer.

Scale the evidence to the kind of claim

This larger board makes the middle of the page feel less like stacked admin cards and more like one decision: what kind of evidence is actually strong enough for this claim?

T1 observational · Visible architecture effects
Shifted digits, shared-filter reuse, max-pool compression, and the late average-pool collapse are things the learner can inspect directly.
T2 interventional · Change the contract
Residual optimization and BatchNorm2d behavior only become clear when you reason about what changes after an intervention.
T3 structural · Exact fidelity or it fails
Weight-copy parity is structural evidence, because approximate resemblance does not make state_dict loading succeed.
Use T1 for
logits in the baseline build, locality, shared filters, max pooling, and average pooling.
Use T2 for
residual optimization, BatchNorm train/eval behavior, and frozen-backbone transfer.
Use T3 for
parameter order, buffer order, successful state_dict load, and prediction parity.
Claim · Minimum tier · Method · Expected signature
Classification training should use logits, not probabilities baked into the model. · T1 observational · Inspect the baseline training loop where SimpleMLP returns logits and F.cross_entropy consumes them directly. · The classifier emits raw scores for 10 classes, and softmax appears only when probabilities are needed for interpretation.
Flattened MLPs are poorly matched to translated images. · T1 observational · Contrast two shifted digit inputs and show how flattening scrambles coordinate reuse. · Same object, different vector positions, no built-in locality.
Convolutions encode locality and parameter sharing. · T1 observational · Shared-filter storyboard over two shifted images. · The same filter fires on the same stroke pattern in both positions.
Max pooling compresses space while preserving the strongest local evidence. · T1 observational · Before/after feature-map patch panel. · Reduced spatial map with the strongest activation retained.
Average pooling plays a different late-stage role from max pooling. · T1 observational · Compare early MaxPool2d compression to the final AveragePool step before the linear classifier. · Max pooling preserves the strongest local evidence in windows, while average pooling collapses each final channel map to one global feature value.
Residual paths solve an optimization problem, not just an architecture-aesthetics problem. · T2 interventional · Contrast plain deep-stack intuition with the residual identity-path explanation from the lesson arc. · Gradients have an additive route around the nonlinear branch.
BatchNorm2d changes behavior between train and eval mode and is part of stable deep training. · T2 interventional · Show the batch-stat vs running-stat branch logic. · Train uses current batch stats, eval uses running stats, and outputs depend on mode.
Faithful ResNet34 replication requires exact module and parameter order. · T3 structural · Use the chapter's weight-copy procedure plus the left-branch-first declaration note in ResidualBlock. · Successful state_dict transfer and prediction parity depend on matching parameter and buffer ordering.
Pretrained visual features transfer to new tasks when the backbone is frozen and the head is replaced. · T2 interventional · Feature-extraction procedure on CIFAR-10 with requires_grad_(False), a replaced final linear layer, and an optimizer targeting only that head. · Only the new classifier remains trainable, while the copied backbone stays fixed and accuracy rises above chance quickly.
05 · primitive SOP cards

The operations you should already be able to name before coding

Each card captures what the primitive is for, what success looks like, and what failure usually means.

Subclass nn.Module

Purpose
Define trainable blocks with registered parameters and a readable forward path.
Success
Weights appear in parameters(); the module is callable and inspectable.
Failure
Plain tensors never register, or the forward path becomes opaque.
Boundary
Safe to delegate directly from the source contract.
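A minimal sketch of the contract, using a hand-rolled Linear as a stand-in:

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # nn.Parameter registers these tensors; a plain torch.randn attribute
        # would never appear in parameters() and would never train.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.bias

layer = Linear(784, 10)
print(sum(p.numel() for p in layer.parameters()))  # 7850: visible to optimizers
```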

Return logits + use cross_entropy

Purpose
Train on raw class scores, not probabilities baked into the model.
Success
F.cross_entropy consumes batch-by-class logits directly.
Failure
Softmax appears inside the model and blurs the training object.
Boundary
Softmax is for interpretation later, not for the model head.
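A short sketch of the training object, with random tensors standing in for a real model and batch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10)            # stand-in for model(x): batch x classes
labels = torch.randint(0, 10, (32,))

loss = F.cross_entropy(logits, labels)  # log-softmax + NLL happen inside
probs = logits.softmax(dim=-1)          # softmax only when interpreting outputs
```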

Reason about Conv2d shapes

Purpose
Track kernel, stride, padding, channels, and output shape as one object.
Success
Predict the output map and explain what each channel detects.
Failure
Channels and spatial dimensions get mixed, or conv becomes a black box.
Boundary
Simplify formulas visually, but keep the source conventions intact.
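Per spatial dimension the rule is out = floor((in + 2·padding − kernel) / stride) + 1; a quick check with illustrative numbers:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=2, padding=2)
x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)  # torch.Size([1, 16, 14, 14]): (28 + 4 - 5) // 2 + 1 = 14
```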

Build a ResidualBlock

Purpose
Compose a learned left branch with an identity-or-projection right branch.
Success
Identity for matching shapes, 1x1 conv + BatchNorm when shape changes.
Failure
Skip path omitted, projection misunderstood, or left/right ordering drifts.
Boundary
Keep left-branch-first declaration because pretrained copy depends on it.
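A minimal sketch in the lesson's spirit; the attribute names left and right are illustrative, and the exercise's exact naming and declaration order are what a later weight copy depends on.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Left branch declared first: conv -> bn -> relu -> conv -> bn.
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Right branch: identity when shape is preserved, else 1x1 conv + bn.
        if stride == 1 and in_ch == out_ch:
            self.right = nn.Identity()
        else:
            self.right = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.left(x) + self.right(x))
```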

Use BatchNorm2d correctly

Purpose
Normalize per-channel activations while tracking running stats for eval.
Success
Train mode uses batch stats; eval mode uses running stats; affine terms still apply.
Failure
Train and eval are treated as identical, or buffers are mistaken for parameters.
Boundary
Visualize both branches, but keep the buffer semantics explicit.
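A quick mode check, with random activations standing in for real feature maps:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
x = torch.randn(8, 4, 5, 5) * 3 + 1

bn.train()
out_train = bn(x)  # normalizes with batch stats, updates running buffers
bn.eval()
out_eval = bn(x)   # normalizes with running_mean / running_var, no update
print(torch.allclose(out_train, out_eval))  # False: mode changes the output
# running_mean and running_var are buffers: in state_dict(), not parameters().
```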

Freeze backbone for feature extraction

Purpose
Reuse pretrained visual features and train only a replacement classifier head.
Success
All params freeze except the new linear head and the optimizer targets only it.
Failure
The whole network keeps training, turning transfer into training from scratch.
Boundary
Compress loop detail if needed, but preserve the frozen-backbone contract.
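A minimal sketch of the contract, assuming a recent torchvision weights API; the training loop itself is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.requires_grad_(False)                     # freeze every copied weight
model.fc = nn.Linear(model.fc.in_features, 10)  # new head trains by default
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# Still call model.train() / model.eval() around the loop: frozen BatchNorm
# layers switch between batch statistics and running statistics either way.
```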
06 · human-agent orchestration protocol

Keep the protocol light, but concrete

The human decides what understanding matters. The agent helps turn architecture facts, visuals, and checks into reusable teaching structure.

Human
Decides what counts as understanding: locality, residual optimization, and pretrained reuse are the must-keep ideas.
Agent
Converts source contracts into visuals, evidence panels, and diagnostics without changing architecture facts.
Loop
Predict what should be true, falsify loose stories, intervene on the right variable, then update the explanation.
Predict
What should a shifted digit do under an MLP? Under a conv stack?
Falsify
What observation would show the explanation is too vague or too strong?
Intervene
Switch train/eval mode, remove residual identity, or change the branch shape rule.
Update
Tighten the story until the architecture and the expected behavior line up cleanly.
Verify
Stem, [3,4,6,3] group counts, projection-branch rule, and avgpool → linear head all match the source.
Inputs
cnn fill spec plus the canonical ResNet34 source section.
Output
One short QA note with pass/fail bullets and a conflict callout if any architecture wording drifts.
Fail signatures
“Skip path always identity”, wrong group counts, vague downsampling, or a missing projection condition.
Verify
Weights copy first, the backbone freezes, the final linear head is replaced, and only that head stays trainable.
Inputs
cnn fill spec plus the feature-extraction source section.
Output
One short QA note confirming frozen-backbone transfer and the BatchNorm train/eval caveat.
Fail signatures
The page implies whole-model finetuning, drops mode switching, or treats transfer as generic reuse.
07 · diagnostic gate + remediation

Can the learner retell the mechanism from memory?

The gate stays focused on transfer-ready understanding. If something breaks, the repair path points back to the exact board or card that fixes it.

1
Why does a shifted digit create a problem for a flattened MLP but not for a convolutional filter?
2
Why should a classifier trained with cross-entropy return logits rather than probabilities from the final module?
3
Given a conv layer with stride > 1, what changes in the output and why?
4
When can the right branch of a residual block be the identity, and when must it be a learned projection?
5
What optimization problem do residual connections address in deep nets?
6
What changes between BatchNorm2d train mode and eval mode?
7
Why does parameter declaration order matter when copying torchvision weights into a custom ResNet34?
8
In feature extraction, which weights should remain trainable and why?
Pass = at least 6 correct · Must-pass = items 4, 6, 7, 8
Branch 1 · locality confusion
Miss items 1 or 3, then revisit toy-problem frames 1 to 3 and the conv shape card until locality and reuse come back in one sentence.
Branch 2 · optimization confusion
Miss items 4 or 5, then revisit the residual misconception panel and the stable-depth column until the identity path reads like an optimization route, not decoration.
Branch 3 · mode or fidelity confusion
Miss items 6 or 7, then revisit the BatchNorm card and the weight-copy notes until train vs eval and ordering sensitivity are crisp.
Branch 4 · transfer confusion
Miss item 8, then revisit the transfer payoff and primitive 6 until the frozen backbone and replacement head can be named cleanly.
08 · rubric snapshot + release decision

Final rubric snapshot: 86 / 100, SHIP

This page now reflects the final ship-gate outcome rather than a pending-audit placeholder. The last blocker was Section 8 itself, and that blocker is now closed.

Release decision
SHIP Final ship-gate audit result: 86 / 100.
Rubric read
Mastery contract, concept structure, evidence rigor, mechanism design, diagnostics, visual architecture, and governance readiness all clear the release bar at the final gate.
Why it ships
The page is now a frozen canonical artifact instead of a page that says it is merely ready to be judged.
Actual scored snapshot shown
Resolved Section 8 now displays the live score and final release decision.
Old pending gate removed
Resolved The obsolete ready-for-audit framing and stale gate wording are gone.
Canonical release state
Resolved The page now ends with the final audited state required by spec.
09 · supporting / optional layers

Keep the core path clean. Put extra depth here.

These layers help after the main architecture story has already clicked.

Low-level stride appendix

Use this only after the core path is done. Content: as_strided, low-level conv1d, conv2d, and maxpool2d. Framing: deep understanding appendix, not prerequisite for the main ResNet story.
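As a taste of that appendix, a minimal as_strided sketch of 1D convolution; conv1d_minimal is an illustrative helper, and like torch's Conv1d it uses the cross-correlation convention (no kernel flip).

```python
import torch

def conv1d_minimal(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (width,) contiguous, w: (kernel,). Output width = width - kernel + 1.
    k = w.shape[0]
    out_w = x.shape[0] - k + 1
    # Each row of `windows` is a sliding view x[i : i + k]; no copy is made.
    windows = x.as_strided(size=(out_w, k), stride=(x.stride(0), x.stride(0)))
    return windows @ w

x = torch.arange(8.0)
w = torch.tensor([1.0, 0.0, -1.0])
print(conv1d_minimal(x, w))  # tensor([-2., -2., -2., -2., -2., -2.])
```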

Debugging note

Forward hooks can localize NaNs inside a custom ResNet implementation. Keep it short: the goal is to show a practical debugging handle, not to crowd the lesson arc.
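A minimal hook sketch; install_nan_hooks is an illustrative helper, not a library function.

```python
import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module):
    # Report every submodule whose output contains NaN; the first print in a
    # forward pass localizes where NaNs enter the network.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of: {name} ({module.__class__.__name__})")
        return hook
    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if n]

# handles = install_nan_hooks(my_resnet)   # run a forward pass, read the prints
# for h in handles: h.remove()             # then clean up
```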

Datasets ladder

MNIST
Toy problem and baseline MLP mismatch.
ImageNet parity
Faithful ResNet34 copy with pretrained agreement.
CIFAR-10 transfer
Frozen backbone plus new linear head.
Bonus implementation notes that matter in practice

Train mode uses the current batch mean and variance, then updates running buffers. Eval mode uses those running buffers. That is why transfer workflows must respect model.train() and model.eval(), especially when the backbone is frozen.

A rough ResNet lookalike is not enough. The declaration order of left-branch modules and projection modules must line up with torchvision so state_dict keys land on the right tensors and prediction parity can succeed.
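A sketch in the chapter's spirit: pair the two state_dicts by position, which only succeeds when declaration order matches. Here my_resnet is assumed to be the custom ResNet34 from this lesson.

```python
from torchvision import models

def copy_weights(my_resnet, pretrained) -> None:
    # Zip the state_dicts together by position: this is exactly why module,
    # parameter, and buffer declaration order matters. BatchNorm running-stat
    # buffers come along with the weights.
    mine, theirs = my_resnet.state_dict(), pretrained.state_dict()
    assert len(mine) == len(theirs), "module/parameter/buffer count mismatch"
    state = {
        my_key: their_tensor
        for (my_key, _), (_, their_tensor) in zip(mine.items(), theirs.items())
    }
    my_resnet.load_state_dict(state)

# pretrained = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
# copy_weights(my_resnet, pretrained)  # then compare predictions for parity
```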