From flat pixels to trainable feature hierarchies
Why image models need convolution, why deep stacks need residual paths, and how those ideas become a faithful PyTorch ResNet34 you can actually reuse.
By the end you should be able to explain:
- What a Conv2d filter detects, how stride and padding change output shape, and why shared weights matter.
- What logits are, how F.cross_entropy uses them, and why softmax is not baked into the classifier.
- The ResNet34 [3,4,6,3] layout, the projection-branch rule, and the classifier head.
- Why BatchNorm2d must respect train/eval mode, why parameter order matters for the state_dict copy, and why only the new head should remain trainable in feature extraction.

The same digit, shifted a few pixels
Start with a handwritten digit that moves slightly inside the image frame. To a flattened MLP, that shift looks like a very different input vector. To a convolutional model, it is still the same local stroke pattern appearing in a new position. The whole lesson builds from that one contrast: first local detection, then feature hierarchies, then stable deep training, then transfer.
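A quick way to see the contrast in code (a minimal sketch, not from the chapter; the 12x12 stroke image and 3x3 averaging filter are arbitrary): the flattened vectors of the original and the shifted image disagree at many coordinates, while one shared conv filter gives the same peak response in a translated position.

```python
import torch
import torch.nn.functional as F

# A tiny 1-channel "image" with a 3x3 bright stroke, and a copy shifted 2 px right.
img = torch.zeros(1, 1, 12, 12)
img[..., 4:7, 2:5] = 1.0
shifted = torch.roll(img, shifts=2, dims=-1)

# Flattened view: the same stroke now lives at different vector coordinates.
flat_a, flat_b = img.flatten(), shifted.flatten()
print("flattened coordinates that changed:", int((flat_a != flat_b).sum()))

# One shared 3x3 filter: the response pattern is identical, just translated.
kernel = torch.ones(1, 1, 3, 3) / 9.0
resp_a = F.conv2d(img, kernel, padding=1)
resp_b = F.conv2d(shifted, kernel, padding=1)
peak_a = resp_a.flatten().argmax()
peak_b = resp_b.flatten().argmax()
print("peak moved by", int(peak_b - peak_a), "positions; peak value unchanged:",
      torch.isclose(resp_a.max(), resp_b.max()).item())
```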
SimpleMLP → Conv2d → pooling → residual block → ResNet34 → frozen backbone + linear head

Build the CNN story as one dependency chain
The page starts with trainable modules and logits, then sharpens the one question that matters for vision: what structure does the model preserve when the image shifts?
```mermaid
flowchart TD
    A[nn.Module + nn.Parameter]
    B[Logits + cross-entropy]
    C[SimpleMLP baseline]
    D[Image locality mismatch]
    E[Conv2d: local shared filters]
    F[Pooling + shape compression]
    G[Deep stacks degrade]
    H[Residual path + BatchNorm]
    I[ResNet34 assembly]
    J[Pretrained parity → feature extraction]
    A --> C
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
```
- nn.Module + nn.Parameter, logits + cross_entropy
- SimpleMLP baseline → mismatch → conv → residualized deep stack
- residual intuition → train/eval behavior → weight-copy fidelity → feature extraction payoff

Follow the chapter’s actual build path, not a generic CNN slogan
The source chapter climbs in a specific order: build a trainable MLP, notice why flattened pixels are the wrong bias, swap in local shared filters and pooling, then add residual blocks and BatchNorm so a faithful ResNet34 can inherit pretrained weights and become a feature extractor.
Flattening breaks coordinate reuse
One shared filter follows the pattern
Residual blocks turn depth into usable depth
The left branch runs conv → bn → relu → conv → bn and learns the update; the skip path is identity, or a 1x1 conv + bn projection when the shape changes.
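A minimal sketch of that block, assuming the left-branch-first declaration order the chapter relies on; the attribute names (conv1, bn1, downsample) mirror torchvision's BasicBlock but are otherwise illustrative.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """conv-bn-relu-conv-bn left branch plus an identity or projection skip."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Left branch declared first, so its parameters come first in state_dict order.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection branch only when the shape changes; otherwise identity.
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.downsample = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)
```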
ResNet34 becomes transferable only if the build is exact
- Stem: 7x7 conv → bn → relu → maxpool.
- [3,4,6,3] groups, with downsampling at new group starts.
- avgpool → linear, then swap the classifier for transfer.
- state_dict parity against the pretrained model.
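A skeleton of how those pieces could stack, reusing the ResidualBlock sketch above; the _make_group helper name and the default class count are illustrative, not the chapter's exact code.

```python
import torch
from torch import nn

class ResNet34(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Stem: 7x7 conv -> bn -> relu -> maxpool.
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
        # [3, 4, 6, 3] groups; every group after the first downsamples at its start.
        self.layer1 = self._make_group(64, 64, 3, stride=1)
        self.layer2 = self._make_group(64, 128, 4, stride=2)
        self.layer3 = self._make_group(128, 256, 6, stride=2)
        self.layer4 = self._make_group(256, 512, 3, stride=2)
        # Global average pool, then the linear classifier head.
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    @staticmethod
    def _make_group(in_ch: int, out_ch: int, blocks: int, stride: int) -> nn.Sequential:
        # Reuses ResidualBlock from the sketch above.
        layers = [ResidualBlock(in_ch, out_ch, stride=stride)]
        layers += [ResidualBlock(out_ch, out_ch) for _ in range(blocks - 1)]
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.avgpool(x).flatten(1)
        return self.fc(x)  # logits, no softmax inside the model
```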
The chapter only works if these four moves happen in order
- SimpleMLP plus logits and F.cross_entropy proves the training scaffold first.
- Conv2d restores locality and shared reuse for translated images.
- ResidualBlock plus BatchNorm2d keeps the hierarchy trainable.
- A faithful ResNet34 assembly inherits the pretrained weights and becomes a frozen-backbone feature extractor.

What each central claim needs before you trust it
This matrix now follows the chapter’s real claims and build checks: logits and cross-entropy, translation mismatch, the different roles of max pooling and average pooling, residual optimization, BatchNorm mode behavior, weight-copy parity, and frozen-backbone transfer.
Scale the evidence to the kind of claim
This larger board makes the middle of the page feel less like stacked admin cards and more like one decision: what kind of evidence is actually strong enough for this claim?
Interventional claims, like residual optimization and BatchNorm2d behavior, only become clear when you reason about what changes after an intervention. Structural claims set the bar higher: exact ordering is what makes state_dict loading succeed, and replication is judged by the weight copy, the state_dict load, and prediction parity.
| Claim | Minimum tier | Method | Expected signature |
|---|---|---|---|
| Classification training should use logits, not probabilities baked into the model. | T1 observational | Inspect the baseline training loop where SimpleMLP returns logits and F.cross_entropy consumes them directly (see the sketch just after this table). | The classifier emits raw scores for 10 classes, and softmax appears only when probabilities are needed for interpretation. |
| Flattened MLPs are poorly matched to translated images. | T1 observational | Contrast two shifted digit inputs and show how flattening scrambles coordinate reuse. | Same object, different vector positions, no built-in locality. |
| Convolutions encode locality and parameter sharing. | T1 observational | Shared-filter storyboard over two shifted images. | The same filter fires on the same stroke pattern in both positions. |
| Max pooling compresses space while preserving strongest local evidence. | T1 observational | Before/after feature-map patch panel. | Reduced spatial map with strongest activation retained. |
| Average pooling plays a different late-stage role from max pooling. | T1 observational | Compare early MaxPool2d compression to the final average-pool step before the linear classifier. | Max pooling preserves strongest local evidence in windows, while average pooling collapses each final channel map to one global feature value. |
| Residual paths solve an optimization problem, not just an architecture-aesthetics problem. | T2 interventional | Contrast plain deep-stack intuition with the residual identity-path explanation from the lesson arc. | Gradients have an additive route around the nonlinear branch. |
| BatchNorm2d changes behavior between train and eval mode and is part of stable deep training. | T2 interventional | Show batch-stat vs running-stat branch logic. | Train uses current batch stats, eval uses running stats, outputs depend on mode. |
| Faithful ResNet34 replication requires exact module and parameter order. | T3 structural | Use the chapter’s weight-copy procedure plus the left-branch-first declaration note in ResidualBlock. | Successful state_dict transfer and prediction parity depend on matching parameter and buffer ordering. |
| Pretrained visual features transfer to new tasks when the backbone is frozen and the head is replaced. | T2 interventional | Feature extraction procedure on CIFAR-10 with requires_grad_(False), a replaced final linear layer, and an optimizer targeting only that head. | Only the new classifier remains trainable, while the copied backbone stays fixed and accuracy rises above chance quickly. |
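For the first row, a minimal training-step sketch (the stand-in model, batch, and optimizer are placeholders, not the chapter's SimpleMLP code): the model returns raw logits and F.cross_entropy handles the log-softmax internally.

```python
import torch
import torch.nn.functional as F
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier: emits 10 logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(32, 1, 28, 28)           # fake batch
labels = torch.randint(0, 10, (32,))

logits = model(images)                         # raw scores, shape (32, 10), no softmax
loss = F.cross_entropy(logits, labels)         # log-softmax + NLL happen inside the loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

probs = logits.softmax(dim=1)                  # softmax only when you want probabilities to inspect
print(loss.item(), probs[0].sum().item())
```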
The operations you should already be able to name before coding
Each card captures what the primitive is for, what success looks like, and what failure usually means.
Subclass nn.Module
Parameters registered on the module show up in parameters(); the module is callable and inspectable.

Return logits + use cross_entropy
F.cross_entropy consumes batch-by-class logits directly.

Reason about Conv2d shapes
Kernel size, stride, and padding determine the output height and width; shared weights mean one filter scans every position.
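A quick shape check you can run before building anything (the sizes are arbitrary examples): output height and width follow floor((in + 2*padding - kernel) / stride) + 1.

```python
import torch
from torch import nn

x = torch.randn(1, 3, 32, 32)                        # N, C, H, W

same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
half = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

print(same(x).shape)   # torch.Size([1, 16, 32, 32]) : (32 + 2*1 - 3)//1 + 1 = 32
print(half(x).shape)   # torch.Size([1, 16, 16, 16]) : (32 + 2*1 - 3)//2 + 1 = 16
```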
Build a ResidualBlock
Left branch plus identity skip by default, 1x1 conv + BatchNorm when the shape changes.

Use BatchNorm2d correctly
Train mode uses the current batch statistics and updates the running buffers; eval mode uses those buffers.
Freeze backbone for feature extraction
requires_grad_(False) on the copied backbone, a new linear head, and an optimizer that targets only that head.
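A minimal sketch of that frozen-backbone workflow, using torchvision's pretrained ResNet34 and CIFAR-10's 10 classes as stand-ins; variable names are illustrative.

```python
import torch
from torch import nn
from torchvision import models

backbone = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
backbone.requires_grad_(False)                          # freeze every copied parameter
backbone.fc = nn.Linear(backbone.fc.in_features, 10)    # new head is trainable by default

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)  # optimize only the head

backbone.train()  # mode still matters: BatchNorm switches statistics even when frozen
x = torch.randn(8, 3, 224, 224)
logits = backbone(x)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss.backward()
optimizer.step()

trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
print(trainable)  # expect only ['fc.weight', 'fc.bias']
```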
Keep the protocol light, but concrete
The human decides what understanding matters. The agent helps turn architecture facts, visuals, and checks into reusable teaching structure.
- [3,4,6,3] group counts, projection-branch rule, and avgpool → linear head all match the source.
- cnn fill spec plus the canonical ResNet34 source section.
- cnn fill spec plus the feature-extraction source section.

Can the learner retell the mechanism from memory?
The gate stays focused on transfer-ready understanding. If something breaks, the repair path points back to the exact board or card that fixes it.
Can the learner say what changes between BatchNorm2d train mode and eval mode?

Final rubric snapshot: 86 / 100, SHIP
This page now reflects the final ship-gate outcome rather than a pending-audit placeholder. The last blocker was Section 8 itself, and that blocker is now closed.
Keep the core path clean. Put extra depth here.
These layers help after the main architecture story has already clicked.
Low-level stride appendix
Use this only after the core path is done. Content: as_strided, low-level conv1d, conv2d, and maxpool2d. Framing: deep understanding appendix, not prerequisite for the main ResNet story.
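If you do dip into that appendix, the core trick previews in a few lines: as_strided exposes overlapping windows without copying, and a conv or max pool is then a reduction over those windows. A minimal 1-D sketch with arbitrary sizes, not the appendix's code:

```python
import torch

x = torch.arange(8.0)                  # [0, 1, 2, ..., 7]
kernel = torch.tensor([1.0, 0.0, -1.0])

# Overlapping length-3 windows, step 1: shape (6, 3), no data copied.
windows = x.as_strided(size=(6, 3), stride=(1, 1))

conv1d_out = windows @ kernel          # valid cross-correlation with the kernel
maxpool_out = windows.max(dim=1).values

print(windows)
print(conv1d_out)      # tensor([-2., -2., -2., -2., -2., -2.])
print(maxpool_out)     # tensor([2., 3., 4., 5., 6., 7.])
```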
Debugging note
Forward hooks can localize NaNs inside a custom ResNet implementation. Keep it short: the goal is to show a practical debugging handle, not to crowd the lesson arc.
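A compact version of that handle (hook and layer names are illustrative): register a forward hook on every submodule and print each module whose output contains a NaN; the first line printed is where the NaN entered, since hooks fire in forward order.

```python
import torch
from torch import nn

def add_nan_hooks(model: nn.Module):
    handles = []
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of: {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle when done

# Tiny demo: a layer that injects a NaN so the hooks have something to catch.
class BadLayer(nn.Module):
    def forward(self, x):
        x = x.clone()
        x[0, 0] = float("nan")
        return x

model = nn.Sequential(nn.Linear(4, 4), BadLayer(), nn.Linear(4, 2))
handles = add_nan_hooks(model)
model(torch.randn(2, 4))   # prints BadLayer first, then the layers the NaN propagates through
for h in handles:
    h.remove()
```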
Datasets ladder
Bonus implementation notes that matter in practice
Train mode uses the current batch mean and variance, then updates running buffers. Eval mode uses those running buffers. That is why transfer workflows must respect model.train() and model.eval(), especially when the backbone is frozen.
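A minimal check of that behavior (the input statistics are arbitrary): in train mode the output uses batch statistics and the running buffers move; after eval(), the same input is normalized with the stored running statistics instead, so the output changes.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 4, 4) * 5 + 10        # deliberately far from zero mean / unit variance

print(bn.running_mean)                       # starts at zeros
bn.train()
out_train = bn(x)                            # normalized with this batch's stats; buffers updated
print(bn.running_mean)                       # moved toward the batch mean (momentum 0.1 by default)

bn.eval()
out_eval = bn(x)                             # normalized with the running buffers instead
print(torch.allclose(out_train, out_eval))   # False: same input, different mode, different output
```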
A rough ResNet lookalike is not enough. The declaration order of left-branch modules and projection modules must line up with torchvision so state_dict keys land on the right tensors and prediction parity can succeed.
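One way such an order-dependent copy can look, reusing the ResNet34 sketch above with torchvision's pretrained model as the reference; the zip-by-position transfer below is illustrative, not necessarily the chapter's exact procedure.

```python
import torch
from torchvision import models

# Assumes the custom ResNet34 defined earlier. Values are zipped position-by-position,
# so if any module were declared out of order, weights would land on the wrong tensors.
mine = ResNet34(num_classes=1000)
reference = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)

my_keys = list(mine.state_dict().keys())
ref_values = list(reference.state_dict().values())
assert len(my_keys) == len(ref_values)
mine.load_state_dict({k: v for k, v in zip(my_keys, ref_values)})

mine.eval()
reference.eval()                             # eval mode so BatchNorm uses running stats
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    print("prediction parity:", torch.allclose(mine(x), reference(x), atol=1e-5))
```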