Convolutional Neural Networks Explained…with American Ninja Warrior

Lauren Holzbauer · Published in Insight · Dec 30, 2019

Let’s use our ninja skills to figure out what CNNs are really doing.

Lauren Holzbauer was an Insight Fellow in Summer 2018.

By this time, many people know that the convolutional neural network (CNN) is a go-to tool for computer vision. But why exactly are CNNs so well-suited for computer vision tasks, such as facial recognition and object detection?

“Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos.” — Wikipedia

In our last blog, we walked through a super fast introduction of “vanilla” flavored neural networks. We talked about how a vanilla neural net is made up of one or more “hidden” layers. Each neuron in a layer is connected to each neuron that lives in the next layer. In turn, each neuron in that layer is connected to each neuron in the following layer, and so on. In this way, all of the layers in a vanilla neural net are referred to as “fully connected.” As we will see, the CNN incorporates both “fully connected” and “partially connected” layers. These partially connected layers are what make the CNN such an excellent tool for working with images.

Let’s dig in!

The CNN: A fundamental shift in how we approach computer vision

CNNs are an exciting flavor of neural net. They were originally developed by Yann LeCun et al. in 1989 to recognize handwritten digits. The network architecture LeCun’s team perfected in 1998 is called “LeNet-5” and was later widely deployed as a commercial document recognition system by major U.S. banks for automatically reading handwritten and machine-written checks.

At that time, traditional computer vision techniques incorporated what is called “hand-crafted feature extraction.” Researchers applied custom designed algorithms to images of interest in order to identify certain types of features in those images. For example, the Harris corner detector (used in robots!) returns the pixel locations of all of the corners it detects in the input image. In Figure 1, you can see the results of the Harris corner detector applied to an image of Jessie Graff competing in the 2017 American Ninja Warrior Finals:

Figure 1 — Harris corner detection applied to an image of Jessie Graff competing in the 2017 American Ninja Warrior Finals. Original image is on the left. Image with detected corners highlighted in red is on the right.
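If you want to reproduce something like Figure 1 yourself, here is a rough sketch using OpenCV’s built-in Harris detector (the file names are just placeholders for your own images):

```python
import cv2
import numpy as np

# "ninja_course.jpg" is a placeholder for whatever image you want to analyze.
img = cv2.imread("ninja_course.jpg")
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Harris corner response: neighborhood size 2, Sobel aperture 3, Harris parameter k=0.04
corners = cv2.cornerHarris(gray, 2, 3, 0.04)

# Mark the strongest corners in red on the original image
img[corners > 0.01 * corners.max()] = [0, 0, 255]
cv2.imwrite("corners.jpg", img)
```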

These feature-extraction algorithms rely on hand-designed heuristics. For example, an algorithm that detects edges will pay attention to areas in the image where the intensity changes very suddenly. Once these features have been extracted from the image, they can be fed into any machine learning algorithm (e.g. support vector machine) for object detection or another computer vision task.

This approach can work reasonably well, but the big problem is that a custom feature extraction algorithm needs to be designed for each new task at hand. For example, the features relevant to dog breed classification will probably be different from the features you will need to detect human faces. A major breakthrough with CNNs was that they allowed for automatic pattern recognition; they could be implemented without any time-consuming pre-processing like feature extraction.

In this way, CNNs fundamentally altered the way we approach computer vision problems. When we were using feature-extraction methods to classify images of dogs and cats, for example, we would tell the computer: “I want you to detect dogs and cats and this is how you’re going to do it.” With CNNs, we instead say: “Computer, here are many examples of a dog and many examples of a cat. I want you to detect dogs and cats but I don’t care how you do it.” As we will see, this fundamental shift in methodology led to a new generation of computer vision models with unprecedented ability.

LeNet-5: A classic CNN architecture

Since LeNet-5, there have been many different CNN architectures developed over the years, such as AlexNet (2012), GoogLeNet (2014), VGGNet (2014), and ResNet (2015). When researchers use the term “architecture,” they just mean things like the total number of layers in the network, the number of each type of layer, how the layers are ordered, and other details. The architecture of a network can affect its accuracy, training time, run time, and memory requirements. Check out this article that goes into a little more detail about all of these different architectures.

Let’s take a look at what the LeNet-5 architecture looks like in Figure 2:

Figure 2 — The LeNet-5 architecture, from “Gradient-Based Learning Applied to Document Recognition” by Yann LeCun et al., Proc. of the IEEE, 1998.

Today, LeNet-5 is considered a “classic” CNN architecture. If you want to implement your own CNN, drawing inspiration from the LeNet-5 formula is a great place to start. As you can see in Figure 2, LeNet-5 takes a 32x32 pixel input image. It has seven layers: 3 convolutional layers, 2 subsampling (“pooling”) layers, and 2 fully connected layers. Next, we will explain how each layer works, why they are ordered this way, and how everything comes together to form such a powerful model. Let’s start with the convolutional layer.

The Convolutional Layer

First, a smidge of theoretical background

When you first saw the term “convolution,” you might have recognized it as the mathematical operation commonly used in signal processing. Suppose you have a system (e.g. a circuit) and you measure the output signal, f(t), given an exceedingly narrow input pulse (approximated as a Dirac delta function). This is great; however, you really want to be able to predict the output signal, y(t), given any input pulse, g(t), you choose. To compute y(t), just convolve the two signals, f*g. What you’re really doing here is sweeping g(t) across f(t) to generate y(t). This operation can be summarized in equation form:
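$$y(t) = (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$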

Check out this YouTube video, Signals and Systems: Convolution Theory, by UConn for a more in-depth example. The main thing to remember is that one function is swept over another function to generate some kind of output. Take a look at this animation from Wikipedia to visualize what is happening:

Figure 3 — animation from Wikipedia showing the convolution (black curve) of a box-shaped pulse (red box) with an exponential signal (blue curve).

In neural networks, the mechanics of a convolutional layer are not identical to the mathematical operation, but the general idea is the same: something called a “kernel” gets swept over an input array and generates an output array.

A warped wall detector: a qualitative look at kernels

If you visualize your input image as a 2-D array of pixels, imagine a much smaller 2-D array called a “kernel” sweeping across it, kind of like this:

Figure 4 — Visualization of a kernel (blue box) sweeping over all of the pixels in an image.

Imagine that you’re a competitive ninja and you’re trying to figure out which obstacles you should train on the most in order to boost your chances of success. You have lots of photos of different American Ninja Warrior course configurations that you have collected over the years, like this one:

Figure 5 — A typical American Ninja Warrior course.

You decide that the first thing you want to do is count up all of the course configurations that include a warped wall. You don’t want to sit around counting warped walls by hand, so you decide to train a CNN to perform this “object detection” task automatically for you.

Let’s think conceptually about how this CNN might work within the blueprint of what we just talked about (remember the kernels sweeping across pixel arrays?). Imagine a kernel that has somehow learned to detect warped walls. As it sweeps across the input image, it looks down at all the pixels within each one of its “receptive fields” and detects whether or not a warped wall is there (it only looks at one receptive field at a time). If a warped wall is present, the kernel sends a “yes” to the output array. If not, the kernel sends a “no.” Then it sweeps to its next receptive field and looks for more warped walls. It keeps doing this until it has swept over the entire input image.

In short, the convolution of the kernel and the input image generates an output array of yes’s and no’s. Take a look at Figure 6 to visualize what is going on:

Figure 6 — a “warped wall detector.” A kernel detects warped walls by sweeping across an input image. The convolution of the kernel and input image generates an output array of yes’s and no’s.

The output array then becomes the input array for the next layer of the network. Of course, this is a highly simplified conceptualization of what’s really happening. You may be asking: How does the kernel know what types of features it’s looking for? What exactly is the kernel? Where does the kernel come from?

The kernel and where it comes from

The kernel is just a 2-D array of WEIGHTS. This bears repeating:

A kernel is a 2-D array of weights.

Remember how linear regression trains its weights using gradient descent? A CNN does the same thing: it trains all of its weights during backpropagation. The weights associated with the convolutional layers in a CNN are what make up the kernels (remember that not every layer in a CNN is a convolutional layer). Until the weights are trained, none of the kernels know which “features” they should detect.

So if each kernel is just an array of weights, how do these weights operate on the input image during convolution? The network simply performs an element-wise multiplication between the kernel and the input pixels within its receptive field, then sums everything up and sends that value to the output array. In Figure 7, you can see how the first element in the output array is calculated:

Figure 7 — element-wise multiplication.

Once the first element of the output array has been filled in, the kernel sweeps over to its next stop. The next element of the output array is calculated, then the next, and so on until the kernel has swept over the entire input image and the entire output array has been filled. Take a look at Figure 8 to help visualize this process. The tiny numbers in the dark blue box sweeping over the input image correspond to the kernel weights. Notice how these weights never change as the kernel performs its full sweep:

Figure 8 — a kernel sweeps over an image, generating each element of the output array along the way.

In Figures 7 and 8, the kernel we used was [[2,1], [0,2]], but that was just an arbitrary example so that we could get a feel for how element-wise multiplication works. Remember that each element of a kernel is actually a WEIGHT that the network learns during backpropagation.
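To make that sweep concrete, here is a tiny NumPy sketch of the multiply-and-sum loop from Figures 7 and 8, using a made-up 4x4 input and the same arbitrary [[2,1], [0,2]] kernel:

```python
import numpy as np

# A made-up 4x4 "image" and the same arbitrary 2x2 kernel from Figures 7 and 8
image = np.array([[1, 0, 2, 1],
                  [3, 1, 0, 2],
                  [0, 2, 1, 3],
                  [1, 1, 0, 2]])
kernel = np.array([[2, 1],
                   [0, 2]])

kh, kw = kernel.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
output = np.zeros((out_h, out_w))

# Sweep the kernel over every receptive field: element-wise multiply, then sum
for i in range(out_h):
    for j in range(out_w):
        receptive_field = image[i:i + kh, j:j + kw]
        output[i, j] = np.sum(receptive_field * kernel)

print(output)
```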

The set of weights assigned to a kernel actually has rich visual meaning and encodes which “feature” that kernel will look for as it sweeps across an image.

A deeper look at how kernels encode visual meaning

Let’s go back to our earlier example of a dog and cat classifier. If we had to dictate to the computer which features it should use in order to discriminate between dogs and cats, we might focus on the ears (floppy vs. pointy), the nose (big vs. small), and the eyes (round pupils vs. vertical pupils).

CNNs train their weights automatically, so we have no control over which features the network chooses to use. However, we can come up with our own kernels to get a feel for how they can be used to detect different features. Take a look at four simple kernels in Figure 9:

Figure 9 — Kernels.

The kernels displayed in Figure 9 detect horizontal lines (top-left), vertical lines (top-right), 45 degree lines (bottom-left), and 135 degree lines (bottom-right). Each kernel is shown as an array of weights (left) and a pixel representation (right). Notice how the pixel representations all look kind of like filters. For example, the pixel representation of the “horizontal lines” kernel blocks everything behind it except for a horizontal strip running across the center. In fact, kernels can actually be represented as a small image the size of the receptive field!

Coming up with kernels that have interesting applications is pretty math-ish, so we won’t get into the details here (if you’d like to read more about kernels, read these computer science notes from Cornell). You can actually use kernels to apply many interesting effects to an image, such as sharpening, blurring, and embossing. This is actually how your favorite graphics editors work!

We can approximate a pretty good edge detection kernel by combining all of the kernels in Figure 9 (using an element-wise sum): [[-1,-1,-1], [-1,8,-1], [-1,-1,-1]]. We can easily convolve this kernel with an input image using the Python image processing library, OpenCV. Take a look at the code below and the result in Figure 10:
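A minimal sketch using cv2.filter2D (the image file name here is just a placeholder for your own image):

```python
import cv2
import numpy as np

# Edge detection kernel built by summing the four line-detection kernels in Figure 9
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# "jessie_graff.jpg" is a placeholder file name
img = cv2.imread("jessie_graff.jpg", cv2.IMREAD_GRAYSCALE)

# Convolve the image with the kernel (ddepth=-1 keeps the input's bit depth)
edges = cv2.filter2D(img, -1, kernel)
cv2.imwrite("edges.jpg", edges)
```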

Figure 10 — Original image (left) convolved with an edge detecting kernel (right).

In Figure 10 we display the convolution of our image of Jessie Graff with the edge detection kernel we just came up with. Wow! The output array is actually ANOTHER IMAGE…

Feature hierarchies

If the input and output arrays for all of the convolutional layers are images, then we can visualize a CNN as a stack of images:

Figure 11 — Hierarchical layers.

Figure 11 displays an input image (bottom panel) with two convolutional layers stacked on top. Each pixel in the first convolutional layer (middle panel) can “see” only the pixels contained within its receptive field in the input image. Now let’s take a look at the second convolutional layer (top panel). Each pixel can see all of the pixels contained within its receptive field in the first layer. In turn, each pixel within that receptive field can see all of the pixels in the input image that are contained within its receptive field. This means that as you travel towards the top-most layers of the network, each pixel has more and more information about the input image encoded in it. In this way, the structure of CNNs can be thought of as “hierarchical.”

A hierarchical structure makes so much sense when you’re working with images! Kernels in the lower convolutional layers focus on detecting small-scale features while kernels in the upper convolutional layers focus on detecting large-scale features.

The CNN works by learning feature hierarchies, where features at higher levels of the hierarchy are formed by composing lower-level features. For example, consider the feature hierarchy for a warped wall:

Figure 12 — feature hierarchy for a warped wall.

In Figure 12, we can see how large-scale features (middle tier) are amalgamations of many different small-scale features (bottom tier). Interestingly, research indicates that animal brains actually process images in a similar, hierarchical manner. The hierarchical structure of CNNs is why we can take a network that was originally trained for a specific task and reuse the bottom layers of that network for a completely different task with surprisingly excellent results. This is called using a “pre-trained model.”

Feature maps

Just one more thing before we move on. The image that results from the convolution of the input image and kernel is called a “feature map.” You can design a convolutional layer with as many feature maps as you want; each map will have its own kernel associated with it. One convolutional layer can have hundreds of feature maps!

Taking another look at the LeNet-5 architecture (Figure 13), you can see that the first convolutional layer has 6 feature maps and the second has 16. When you hear someone say that a CNN is “wide,” they mean that each convolutional layer has many feature maps.

Figure 13 — Revisiting the LeNet-5 architecture, from “Gradient-Based Learning Applied to Document Recognition” by Yann LeCun et al., Proc. of the IEEE, 1998.

Phew!

The convolutional layer is some serious business. That was a long discussion, but hopefully you are now much more comfortable with the inner workings of the convolutional layer and why it’s so amazing for images. Let’s quickly wrap up our discussion of the LeNet-5 architecture by touching on the pooling layer and the fully connected layer.

The Pooling Layer

Now that we’ve talked in depth about the convolutional layer, the pooling layer is simple! The “pooling” layer, sometimes called a “subsampling” layer, is similar to a convolutional layer in that it sweeps a “pooling kernel” across the entire input image. This pooling kernel does not have any weights associated with it; it simply applies an aggregation function (e.g. mean, max) to all of the pixel values contained within its receptive field and sends that value to the output array.

The most common type of pooling layer is probably the “max pooling layer.” At each “stop” along the way as the pooling kernel sweeps across the input image, it surveys all of the pixels contained within its receptive field. It selects the pixel with the maximum value, then sends that value to the output array. All of the other pixel values are eliminated, ninja-warrior style.
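To make this concrete, here is a toy NumPy sketch of 2x2 max pooling with a stride of 2, applied to a made-up feature map:

```python
import numpy as np

# A made-up 4x4 feature map
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 3, 2],
                        [2, 6, 1, 1]])

pool, stride = 2, 2
out = np.zeros((feature_map.shape[0] // stride, feature_map.shape[1] // stride))

# At each stop, keep only the maximum pixel in the receptive field
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        window = feature_map[i * stride:i * stride + pool,
                             j * stride:j * stride + pool]
        out[i, j] = window.max()

print(out)  # [[4. 5.]
            #  [6. 3.]]
```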

You might be thinking, as I was when I first learned about pooling layers: “This is crazy! We’re losing so much information this way!” Indeed we are. Not only are we discarding many, many pixel values but we’re also losing some positional information. When the pooling kernel sends each max pixel value to the output layer, the network no longer “knows” exactly where that pixel is positioned in the original image, just that it’s located somewhere within its corresponding receptive field. That being said, pooling layers totally work, and they work well.

Here are the major benefits to incorporating a pooling layer into your network:

  • Reduces computational load (which reduces both train and run times)
  • Reduces memory usage
  • Reduces the total number of weights that must be trained (which is especially helpful for images since they have so many pixels!)
  • Encourages “location invariance” (helps the network be more tolerant of small shifts in images)
  • Limits the risk of overfitting

The Fully Connected Layer

Convolutional and pooling layers are referred to as “partially connected” layers. This is because, unlike in a vanilla neural net, not every pixel in the input is connected by a weight to every neuron in the next layer.

In the first convolutional layer of LeNet-5, the 5x5 kernel has 25 weights associated with it. As this kernel sweeps across the entire 32x32 input image, the pixel values interact only with these 25 weights as information flows into the next layer. Since there are 6 feature maps, we have a total of 25x6 = 150 weights associated with the first layer (ignoring bias terms for now).

In a “fully connected” layer, each pixel in the input image is directly connected to each neuron in the next layer by a weight. The input image has 32x32 = 1,024 pixels. In the first convolutional layer of LeNet-5, the original image has been shrunk down to 28x28 = 784 pixels. If the input image were fully connected to that layer, each of the 1,024 input pixels would need its own weight to each of the 784 neurons. This would give us 1,024 pixels x 784 neurons = 802,816 total weights. Compare that to 150 if we use a convolutional layer instead!
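A quick sanity check of those numbers in Python:

```python
# Back-of-the-envelope weight counts for LeNet-5's first layer (bias terms ignored)
conv_weights = 5 * 5 * 6              # one 5x5 kernel per feature map, 6 maps
fc_weights = (32 * 32) * (28 * 28)    # every input pixel wired to every output neuron
print(conv_weights, fc_weights)       # 150 vs. 802816
```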

If you’re using a fully connected layer to detect features in an image, you need to train a huge number of weights. Let’s go back to our earlier example of the warped wall detector in Figure 6:

A “warped wall detector.”

The input image has one warped wall in the upper left corner and one in the lower right corner. With a fully connected layer, we would essentially be training one set of weights to detect the warped wall in the upper left and another set of weights to detect the warped wall in the lower right. Convolutional layers are efficient because you only train one set of weights (a kernel) to detect a warped wall in any position. This kernel is then swept across the entire input image, giving it the superpower of being able to detect a warped wall ANYWHERE.

This is why we reserve full connection for the final top-most layers in the network. Up to this point in the network architecture, all of the convolutional and pooling layers have essentially been performing an automatic feature extraction (compared to the manual feature selection we talked about earlier) by building many feature hierarchies. Once you get to the top-most convolutional layer, the features that are being detected are very high-level. Once these high-level features are passed on to the fully connected layers, you have crossed over into the… classification zone.

This “classification zone” is usually made up of just a few fully connected layers. This is where the network learns how to sort all of those high-level features into separate classes (e.g. “Dog vs Cat”). The final fully connected layer is your output. The number of neurons in this layer should be the same as the number of classes you have. For example, the LeNet-5 architecture was trained with the MNIST dataset to detect handwritten digits, so its final output layer should have 10 neurons. Finally, the output values for each neuron flow through a softmax activation layer, which converts these raw values into estimated class probabilities.
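To tie all of the layers together, here is a loose, modern Keras sketch of a LeNet-5-style network. It is an approximation rather than a faithful reproduction: the original’s trainable subsampling layers are swapped for plain average pooling, and the RBF output layer (more on that below) is swapped for a softmax layer.

```python
from tensorflow.keras import layers, models

# A LeNet-5-style network; layer names in the comments follow the original paper
model = models.Sequential([
    layers.Conv2D(6, 5, activation="tanh", input_shape=(32, 32, 1)),  # C1: 6 feature maps
    layers.AveragePooling2D(2),                                       # S2: 28x28 -> 14x14
    layers.Conv2D(16, 5, activation="tanh"),                          # C3: 16 feature maps
    layers.AveragePooling2D(2),                                       # S4: 10x10 -> 5x5
    layers.Conv2D(120, 5, activation="tanh"),                         # C5
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),                              # F6
    layers.Dense(10, activation="softmax"),                           # one neuron per digit class
])
model.summary()  # prints the number of trainable weights in each layer
```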

Wrapping up CNNs with a thought experiment

When I was first learning about CNNs, I found that the exercise of actually sitting down and calculating the total number of weights in a given architecture was extremely helpful. Let’s go ahead and do that for LeNet-5.

In Layer 1, a convolutional layer generates 6 feature maps by sweeping 6 different 5x5 kernels over the input image. Each kernel has 5x5 = 25 weights associated with it plus a bias term (just like linear regression). That means that each feature map has a total of 26 weights associated with it. With 6 maps, this gives us a total of 26x6 = 156 total weights for Layer 1.

The second layer is a pooling layer and has no weights associated with it, so we move on to Layer 3: the second convolutional layer. Calculating the weights for this layer is a little trickier than doing so for Layer 1. Layer 3 generates 16 feature maps by sweeping 16 different 5x5 kernels over the input. But what exactly is the input for this layer?

The input for Layer 1 was just the original image, so that was easy. Remember how Layer 1 generated 6 feature maps? Those 6 maps were then sent through Layer 2, a pooling layer. All the pooling layer did was decrease the size of each map from 28x28 to 14x14, so we still have 6 maps. Those 6 14x14 maps are the input for Layer 3. This means that in order to generate the 16 output maps in Layer 3, each one of its 16 kernels will sweep over all of the 6 input maps. Each kernel has 5x5 = 25 weights, and each one sweeps over 6 input maps which gives us 25x6 = 150 plus the bias term for a total of 151 weights. To get the total number of weights associated with Layer 3, multiply 151 weights per output map by 16 output maps to get 2,416 weights.
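Here is the same arithmetic as a quick Python check (keep in mind that the real LeNet-5 connects each Layer 3 map to only a subset of the 6 input maps, so the paper’s actual count for that layer, 1,516, is a bit smaller):

```python
# Reproducing the parameter counts from the text above
layer1 = 6 * (5 * 5 + 1)        # 6 kernels, 25 weights + 1 bias each -> 156
layer3 = 16 * (5 * 5 * 6 + 1)   # 16 kernels, each sweeping all 6 input maps -> 2,416
print(layer1, layer3)           # 156 2416
```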

We can continue adding up weights like this for each layer in the network. Hopefully adding up all of the parameters in this way was helpful in understanding how all of these layers work together to (1) achieve automatic feature extraction and (2) learn how to use the highest-order features to make classification decisions.

For the sake of full disclosure, LeNet-5 has a few minor characteristics that are not typical for modern CNN architectures. For example, in Figure 13 you can see that LeNet-5 uses a Radial Basis Function (RBF) for the final output layer instead of the more common strategy: the dot product of the inputs and weight vector with a Softmax activation function. Since CNNs are no longer built this way, we won’t go into any further detail here (but you can read about them here: "Key Deep Learning Architectures: LeNet-5" and of course in the original paper). Due to these minor atypicalities, our above calculation of the total number of weights is slightly off. However, the point of the exercise was really just to better understand how a CNN architecture operates.

Wrapping up our wrap-up of CNNs, here are two quick rules of thumb to keep in mind when building your own CNN:

  • Smaller kernels with more feature maps are generally considered better than larger kernels.
  • Deeper (more layers) is generally considered better than wider (more feature maps).

Summary

The CNN has become the go-to, state-of-the-art tool for computer vision tasks. CNNs differ from vanilla neural nets in that they incorporate partially connected layers (convolutional and pooling layers). A CNN can be thought of as two parts: (1) automatic feature extraction and (2) classification. Before CNNs were developed, researchers had to extract features from images by hand.

The convolutional and pooling layers perform automatic feature extraction while minimizing the number of weights the network has to train. The bottommost layers extract low-level features while the upper layers extract higher-level features; this generates what we call “feature hierarchies.” Features from higher levels of the hierarchy are compositions of lower level features. The highest order features are sent to the top-most, fully connected layers of the network for classification.

CNNs are especially suited for working with images because:

  • They preserve spatial information by accepting a 2-D array as input. In order to send an image through a vanilla NN, you would have to flatten it into a 1-D array.
  • They keep the number of weights from exploding via partially connected layers (convolutional and pooling layers).
  • They can detect the same feature anywhere in the image. A vanilla NN can only detect features that are in identical positions within the image.
  • They are structured hierarchically, which is similar to how animal brains process images. This structure allows the same network (i.e. a “pre-trained” model) to be recycled and reused for many different tasks.

In early image classification models, we had to extract features from images by hand. We effectively told the algorithm which features to use in order to classify images. These features were inspired by how we humans perceive the world. For example, the eyes, nose, forehead, and chin of a human face are important markers we use when we recognize friends and family. But the way humans and computers see the world is very different and these models didn’t work that well. Now, we allow the algorithm to choose which features it will use to discriminate between different classes of objects. This is the ninja superpower of convolutional neural networks!

I hope you found this overview of CNNs to be a valuable introduction to the underlying theory and mechanics of this powerful tool. Check out my next post, where we hit the gym with two real-world implementations of a CNN.

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
