Lab 03: Deep Learning with PyTorch
Demo Code
Download the demo code here to use as the starter code for NN regression.
Background: CNN Dimension Calculations
When building a CNN, the trickiest part is getting the dimensions right — especially at the boundary between convolutional layers and fully connected layers. This section explains how to compute output dimensions step by step.
Conv2d Parameters
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0) has five key parameters:
| Parameter | What it controls |
|---|---|
| in_channels | Number of input feature maps (e.g., 3 for RGB images) |
| out_channels | Number of filters = number of output feature maps |
| kernel_size | Size of each filter (e.g., 3 means 3$\times$3) |
| stride | How far the filter moves each step (default 1) |
| padding | Zeros added around the border (default 0) |
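To see these parameters in action, here is a minimal sketch (the batch size and input shape are chosen for illustration, not taken from the lab's starter code):

```python
import torch
import torch.nn as nn

# One conv layer: 3 input channels (RGB), 16 filters, 5x5 kernel,
# default stride=1 and padding=0.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(8, 3, 32, 32)  # (batch, channels, height, width)
y = conv(x)
print(y.shape)                 # torch.Size([8, 16, 28, 28])
```

Note that only the channel and spatial dimensions change; the batch dimension passes through untouched.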
Output Size Formula
For a 2D convolution, the spatial output size is:
\[H_{out} = \left\lfloor \frac{H_{in} + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1\]
The same formula applies to the width. The number of output channels is simply out_channels.
So the full output shape is: (batch, out_channels, H_out, W_out).
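The formula translates directly into a one-line helper (a sketch for checking sizes by hand, not part of the starter code):

```python
def conv_out(h_in, kernel_size, stride=1, padding=0):
    """Spatial output size of a Conv2d layer (same formula for height and width)."""
    return (h_in + 2 * padding - kernel_size) // stride + 1

# Examples used later in this section:
print(conv_out(32, kernel_size=5))              # 28  (no padding shrinks the map)
print(conv_out(14, kernel_size=3, padding=1))   # 14  (padding=1 preserves size)
print(conv_out(32, kernel_size=3, stride=2, padding=1))  # 16  (stride=2 halves)
```

Floor division (`//`) implements the \(\lfloor\cdot\rfloor\) in the formula.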
Visual Example: CIFAR-10 (32\(\times\)32 RGB)
Let’s trace the dimensions through a concrete network to see how each parameter affects the output:
Input image: (batch, 3, 32, 32)
│
▼
┌──────────────────────────────┐
│ Conv2d(3, 16, kernel_size=5,│ in_channels=3 (RGB)
│        stride=1, padding=0) │ out_channels=16 (16 filters)
│ │ H_out = (32 + 0 - 5)/1 + 1 = 28
└──────────────────────────────┘
Output: (batch, 16, 28, 28)
│
▼
┌──────────────────────────────┐
│ MaxPool2d(2, 2) │ Halves spatial dimensions
└──────────────────────────────┘
Output: (batch, 16, 14, 14)
│
▼
┌──────────────────────────────┐
│ Conv2d(16, 32, kernel_size=3,│ in_channels=16 (from prev layer)
│        padding=1)           │ out_channels=32
│ │ H_out = (14 + 2 - 3)/1 + 1 = 14
└──────────────────────────────┘ padding=1 preserves spatial size!
Output: (batch, 32, 14, 14)
│
▼
┌──────────────────────────────┐
│ MaxPool2d(2, 2) │ Halves again
└──────────────────────────────┘
Output: (batch, 32, 7, 7)
│
▼
┌──────────────────────────────┐
│ Flatten │ 32 × 7 × 7 = 1568
└──────────────────────────────┘
Output: (batch, 1568)
│
▼
┌──────────────────────────────┐
│ Linear(1568, 10) │ 1568 must match flattened size!
└──────────────────────────────┘
Output: (batch, 10)
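The whole trace above can be replayed numerically with the output-size formula, which is a handy sanity check before writing any PyTorch code (a sketch, mirroring the diagram):

```python
def conv_out(h, kernel_size, stride=1, padding=0):
    return (h + 2 * padding - kernel_size) // stride + 1

def pool_out(h, kernel_size, stride):
    return (h - kernel_size) // stride + 1  # same formula with padding=0

h = 32                            # CIFAR-10 input: 32x32
h = conv_out(h, kernel_size=5)    # Conv2d(3, 16, 5)            -> 28
h = pool_out(h, 2, 2)             # MaxPool2d(2, 2)             -> 14
h = conv_out(h, 3, padding=1)     # Conv2d(16, 32, 3, padding=1)-> 14
h = pool_out(h, 2, 2)             # MaxPool2d(2, 2)             -> 7

flattened = 32 * h * h            # out_channels * H * W
print(h, flattened)               # 7 1568
```

The final value, 1568, is exactly the input size the Linear layer must declare.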
Effect of Each Parameter
kernel_size — How much spatial size shrinks
With no padding and stride=1, each conv layer shrinks the spatial size by kernel_size - 1:
Input: 32×32
kernel_size=3 kernel_size=5
───────────── ─────────────
Output: 30×30 (lost 2) 28×28 (lost 4)
padding — Preserving spatial size
Adding padding = kernel_size // 2 compensates for the shrinkage:
Input: 32×32, kernel_size=3
padding=0 padding=1
───────── ─────────
Output: 30×30 32×32 (preserved!)
Input: 32×32, kernel_size=5
padding=0 padding=2
───────── ─────────
Output: 28×28 32×32 (preserved!)
Rule of thumb: padding = kernel_size // 2 keeps the spatial size unchanged (for stride=1).
stride — Downsampling in the conv layer itself
Stride > 1 reduces spatial size more aggressively (sometimes used instead of pooling):
Input: 32×32, kernel_size=3, padding=1
stride=1 stride=2
──────── ────────
Output: 32×32 16×16 (halved!)
in_channels / out_channels — Depth of feature maps
These control how many features the layer reads and produces. They don’t affect spatial size:
Input: (batch, 3, 32, 32)
Conv2d(3, 16, 3, padding=1) → (batch, 16, 32, 32) 16 filters
Conv2d(3, 64, 3, padding=1) → (batch, 64, 32, 32) 64 filters
Conv2d(3, 128, 3, padding=1) → (batch, 128, 32, 32) 128 filters
^^^
only the channel dim changes
Key rule: The in_channels of each layer must match the out_channels of the previous layer.
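This chaining rule is easy to verify with a small stack of layers (a sketch with arbitrary channel counts, not the lab's required architecture):

```python
import torch
import torch.nn as nn

# Each layer's in_channels must equal the previous layer's out_channels.
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),   # 3  -> 16 channels
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1),  # 16 -> 32 channels
)

x = torch.randn(1, 3, 32, 32)
print(features(x).shape)  # torch.Size([1, 32, 32, 32])
```

Writing `nn.Conv2d(8, 32, ...)` for the second layer instead would raise a runtime shape error, since the first layer produces 16 channels.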
The Critical Connection: Conv \(\rightarrow\) Linear
The most common bug is getting the Linear input size wrong. After your last conv/pool layer, you must flatten the tensor and compute the total number of features:
flattened_size = out_channels × H_out × W_out
For example, if the last pool outputs (batch, 32, 7, 7):
- x = x.view(-1, 32 * 7 * 7) reshapes to (batch, 1568)
- nn.Linear(32 * 7 * 7, num_classes) takes this as input
Debugging tip: If you’re unsure about the size, add a print statement in forward():
```python
def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    print(x.shape)  # ← check this, then remove
    ...
```
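Putting the pieces together, here is a minimal end-to-end sketch of the conv → flatten → Linear connection; the layer sizes match the CIFAR-10 trace in this section, but they are illustrative, not the architecture the tutorial requires:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 5)              # 32 -> 28
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # 14 -> 14
        self.pool = nn.MaxPool2d(2, 2)                # halves spatial size
        self.fc = nn.Linear(32 * 7 * 7, num_classes)  # must match flattened size

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # (batch, 16, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))  # (batch, 32, 7, 7)
        x = x.view(-1, 32 * 7 * 7)            # (batch, 1568)
        return self.fc(x)

net = Net()
print(net(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 10])
```

If the `view` size and the `Linear` input size disagree, PyTorch raises a shape-mismatch error at the first forward pass, which is exactly where the print-statement tip above helps.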
Quick Reference: MaxPool2d
MaxPool2d(kernel_size, stride) uses the same spatial formula. The most common usage is MaxPool2d(2, 2), which halves both height and width.
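Since pooling uses the same formula with padding=0 by default, a quick check (a sketch for hand-verifying pool sizes):

```python
def pool_out(h_in, kernel_size, stride):
    """Spatial output size of MaxPool2d (padding=0, the default)."""
    return (h_in - kernel_size) // stride + 1

print(pool_out(28, 2, 2))  # 14
print(pool_out(15, 2, 2))  # 7  (floor division drops the leftover row/column)
```

The second example shows why odd input sizes quietly lose a row and column under MaxPool2d(2, 2); account for this when computing the flattened size.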
Task
Follow the PyTorch tutorial TRAINING A CLASSIFIER to write the image classifier. Feel free to copy the code into VS Code block by block as you go through the tutorial. By the end, you should have a runnable program.
To improve performance, tune the hyperparameters. The tunable hyperparameters include, but are not limited to (listed roughly in order of importance):
- Network Architecture
- CNN output channels (more channels = more expressive power)
- Number of layers (deeper networks can learn more complex features)
- Kernel size (3\(\times\)3 is standard; 5\(\times\)5 captures more context per layer)
- Batch size
- Optimizer (try Adam instead of SGD)
- Learning rate
- Epoch number
- Activation functions
Hint: The primary reason for low accuracy in the baseline setup is the weak expressive capacity of the model (i.e., the model is too simple). Focus on increasing out_channels and adding more conv layers first.
Write Report
In addition to the program output, you need to include the following items in your report:
- The final result (in program output)
- What changes you made
- Lessons learned from tuning this model:
- Which choices were important, and which were not, in deep learning?
Deliverables and Rubrics
Overall, you need to complete the environment installation and be able to run the demo code. You need to submit:
- (50 pts) A PDF from running your code in Jupyter notebook with accuracy reported in the program output.
- (50 pts) The remaining 50 pts are determined by your model's performance:
Criterion
The goal is to raise the test accuracy above the 54% baseline.
| Accuracy on test (%) | Grade |
|---|---|
| <= 54 | 10 |
| 54~59 | 20 |
| 59~64 | 30 |
| 64~69 | 40 |
| >69 | 50 |