Demo Code

Download the demo code for NN regression here as the starter code.

Background: CNN Dimension Calculations

When building a CNN, the trickiest part is getting the dimensions right — especially at the boundary between convolutional layers and fully connected layers. This section explains how to compute output dimensions step by step.

Conv2d Parameters

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0) has five key parameters:

Parameter      What it controls
in_channels    Number of input feature maps (e.g., 3 for RGB images)
out_channels   Number of filters = number of output feature maps
kernel_size    Size of each filter (e.g., 3 means 3\(\times\)3)
stride         How far the filter moves each step (default 1)
padding        Zeros added around the border (default 0)

Output Size Formula

For a 2D convolution, the spatial output size is:

\[H_{out} = \left\lfloor \frac{H_{in} + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1\]

The same formula applies to width. The number of output channels is simply out_channels.

So the full output shape is: (batch, out_channels, H_out, W_out).
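The formula is easy to encode as a small helper function for sanity-checking layer dimensions by hand. This is a sketch (the function name is illustrative) that implements exactly the floor-division formula above:

```python
def conv2d_out_size(h_in, kernel_size, stride=1, padding=0):
    """Spatial output size of a 2D convolution (same formula applies to width)."""
    return (h_in + 2 * padding - kernel_size) // stride + 1

print(conv2d_out_size(32, kernel_size=5))                        # 28
print(conv2d_out_size(14, kernel_size=3, padding=1))             # 14
print(conv2d_out_size(32, kernel_size=3, stride=2, padding=1))   # 16
```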

Visual Example: CIFAR-10 (32\(\times\)32 RGB)

Let’s trace the dimensions through a concrete network to see how each parameter affects the output:

Input image: (batch, 3, 32, 32)
                    │
                    ▼
     ┌──────────────────────────────┐
     │  Conv2d(3, 16, kernel_size=5,│   in_channels=3 (RGB)
     │  stride=1, padding=0)        │   out_channels=16 (16 filters)
     │                              │   H_out = (32 + 0 - 5)/1 + 1 = 28
     └──────────────────────────────┘
            Output: (batch, 16, 28, 28)
                    │
                    ▼
     ┌──────────────────────────────┐
     │  MaxPool2d(2, 2)             │   Halves spatial dimensions
     └──────────────────────────────┘
            Output: (batch, 16, 14, 14)
                    │
                    ▼
     ┌───────────────────────────────┐
     │  Conv2d(16, 32, kernel_size=3,│   in_channels=16 (from prev layer)
     │         padding=1)            │   out_channels=32
     │                               │   H_out = (14 + 2 - 3)/1 + 1 = 14
     └───────────────────────────────┘   padding=1 preserves spatial size!
            Output: (batch, 32, 14, 14)
                    │
                    ▼
     ┌──────────────────────────────┐
     │  MaxPool2d(2, 2)             │   Halves again
     └──────────────────────────────┘
            Output: (batch, 32, 7, 7)
                    │
                    ▼
     ┌──────────────────────────────┐
     │  Flatten                     │   32 × 7 × 7 = 1568
     └──────────────────────────────┘
            Output: (batch, 1568)
                    │
                    ▼
     ┌──────────────────────────────┐
     │  Linear(1568, 10)            │   1568 must match flattened size!
     └──────────────────────────────┘
            Output: (batch, 10)
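The whole trace above fits in a small nn.Module. This is a sketch (the class and attribute names are illustrative), with the expected shape after each step noted in comments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5)              # 32 -> 28
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # 14 -> 14
        self.pool = nn.MaxPool2d(2, 2)                            # halves H and W
        self.fc = nn.Linear(32 * 7 * 7, num_classes)              # 1568 -> 10

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # (batch, 16, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))  # (batch, 32, 7, 7)
        x = x.view(x.size(0), -1)             # (batch, 1568)
        return self.fc(x)                     # (batch, 10)

out = Net()(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```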

Effect of Each Parameter

kernel_size — How much spatial size shrinks

With no padding and stride=1, each conv layer shrinks the spatial size by kernel_size - 1:

Input: 32×32
                kernel_size=3               kernel_size=5
                ─────────────               ─────────────
Output:         30×30 (lost 2)              28×28 (lost 4)

padding — Preserving spatial size

Adding padding = kernel_size // 2 compensates for the shrinkage:

Input: 32×32, kernel_size=3
                padding=0                   padding=1
                ─────────                   ─────────
Output:         30×30                       32×32 (preserved!)

Input: 32×32, kernel_size=5
                padding=0                   padding=2
                ─────────                   ─────────
Output:         28×28                       32×32 (preserved!)

Rule of thumb: padding = kernel_size // 2 keeps the spatial size unchanged (for stride=1).

stride — Downsampling in the conv layer itself

Stride > 1 reduces spatial size more aggressively (sometimes used instead of pooling):

Input: 32×32, kernel_size=3, padding=1

                stride=1                    stride=2
                ────────                    ────────
Output:         32×32                       16×16 (halved!)
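The same comparison in code (an illustrative check, not part of the assignment network):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
h1 = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=1)(x).shape[-1]
h2 = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)(x).shape[-1]
print(h1, h2)  # 32 16 — stride=2 halves the spatial size
```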

in_channels / out_channels — Depth of feature maps

These control how many features the layer reads and produces. They don’t affect spatial size:

Input: (batch, 3, 32, 32)

Conv2d(3, 16, 3, padding=1)  → (batch, 16, 32, 32)   16 filters
Conv2d(3, 64, 3, padding=1)  → (batch, 64, 32, 32)   64 filters
Conv2d(3, 128, 3, padding=1) → (batch, 128, 32, 32)  128 filters
                                       ^^^
                          only the channel dim changes

Key rule: The in_channels of each layer must match the out_channels of the previous layer.
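This rule is enforced at runtime: a mismatched in_channels raises a RuntimeError when you feed the tensor through. A quick illustration (layer names are made up for this example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
conv1 = nn.Conv2d(3, 16, 3, padding=1)
conv2 = nn.Conv2d(16, 32, 3, padding=1)   # in_channels=16 matches conv1's out_channels
y = conv2(conv1(x))
print(tuple(y.shape))  # (1, 32, 32, 32)

bad = nn.Conv2d(8, 32, 3, padding=1)      # in_channels=8 does NOT match
try:
    bad(conv1(x))
except RuntimeError as e:
    print("channel mismatch:", type(e).__name__)
```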

The Critical Connection: Conv \(\rightarrow\) Linear

The most common bug is getting the Linear input size wrong. After your last conv/pool layer, you must flatten the tensor and compute the total number of features:

flattened_size = out_channels × H_out × W_out

For example, if the last pool outputs (batch, 32, 7, 7):

  • x = x.view(-1, 32 * 7 * 7) reshapes to (batch, 1568)
  • nn.Linear(32 * 7 * 7, num_classes) takes this as input
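The two bullets above can be checked in isolation, using a dummy tensor in place of the real pooling output:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 7, 7)        # stands in for the last pooling layer's output
flat = x.view(x.size(0), -1)        # (4, 1568); nn.Flatten() does the same thing
print(flat.shape)                   # torch.Size([4, 1568])

fc = nn.Linear(32 * 7 * 7, 10)      # in_features must equal the flattened size
logits = fc(flat)
print(logits.shape)                 # torch.Size([4, 10])
```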

Debugging tip: If you’re unsure about the size, add a print statement in forward():

def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    print(x.shape)  # ← check this, then remove
    ...

Quick Reference: MaxPool2d

MaxPool2d(kernel_size, stride) uses the same spatial formula. The most common usage is MaxPool2d(2, 2), which halves both height and width.
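One detail worth remembering: the floor in the formula means odd sizes lose a row and column when pooled. A quick check:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, 2)
even = pool(torch.randn(1, 16, 28, 28)).shape[-1]
odd = pool(torch.randn(1, 16, 7, 7)).shape[-1]
print(even, odd)  # 28 -> 14, but 7 -> 3 (floor of 7/2)
```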

Task

Follow the PyTorch tutorial TRAINING A CLASSIFIER to write the image classifier. Feel free to copy the code into VS Code block by block as you work through the tutorial. At the end, you should have a runnable program.

To improve performance, tune the hyperparameters. The tunable hyperparameters include, but are not limited to (in order of importance):

  • Network Architecture
    • CNN output channels (more channels = more expressive power)
    • Number of layers (deeper networks can learn more complex features)
    • Kernel size (3\(\times\)3 is standard; 5\(\times\)5 captures more context per layer)
  • Batch size
  • Optimizer (try Adam instead of SGD)
  • Learning rate
  • Epoch number
  • Activation functions

Hint: The primary reason for low accuracy in the baseline setup is the weak expressive capacity of the model (i.e., the model is too simple). Focus on increasing out_channels and adding more conv layers first.
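Following that hint, here is one possible starting point: a wider, deeper variant of the baseline. The channel counts and layer count here are illustrative, not prescriptive; you should experiment with your own values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WiderNet(nn.Module):
    """Sketch: more out_channels and one extra conv layer than the baseline."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)     # (B,  64, 32, 32)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)   # (B, 128, 16, 16)
        self.conv3 = nn.Conv2d(128, 256, 3, padding=1)  # (B, 256,  8,  8)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(256 * 4 * 4, num_classes)   # 4096 -> num_classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # (B,  64, 16, 16)
        x = self.pool(F.relu(self.conv2(x)))  # (B, 128,  8,  8)
        x = self.pool(F.relu(self.conv3(x)))  # (B, 256,  4,  4)
        x = x.view(x.size(0), -1)
        return self.fc(x)

logits = WiderNet()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Pairing a wider model with torch.optim.Adam and a tuned learning rate is a reasonable next step, per the hyperparameter list above.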

Write Report

In addition to the program output, you need to include the following items in your report:

  • The final result (in program output)
  • What changes you made
  • Lessons learned from tuning this model:
    • Which factors are important vs. unimportant in deep learning?

Deliverables and Rubrics

Overall, you need to complete the environment installation and be able to run the demo code. You need to submit:

  • (50 pts) A PDF from running your code in a Jupyter notebook, with accuracy reported in the program output.
  • (50 pts) The remaining 50 pts are determined by your model's performance, according to the rubric below:

Criterion

The goal is to increase the accuracy above the baseline 54%.

Accuracy on test (%)   Grade
<= 54                  10
54~59                  20
59~64                  30
64~69                  40
> 69                   50