Tianyu Ma, Alan Q. Wang, Adrian V. Dalca, Mert R. Sabuncu
Blog post written by: Markus Karmann

Introduction


Convolutional neural networks (CNNs) are a very popular neural network architecture in computer vision. They are quite successful, reaching state-of-the-art results in various tasks such as image classification [1]. While they perform astonishingly well on certain tasks, outperforming classical approaches, they still come with their own set of problems. They can be very sensitive to noise in the input, and if a task is hard to learn, they may struggle to generalize: a network might perform well on the training set but still perform poorly on the test set. The authors of this paper present a novel approach to tackle this problem by introducing a hyper-network into their CNNs. Instead of optimizing the weights of the convolutional layers directly during training, they train another neural network to predict those weights. In this blog post, we will take a look at this new hyper-convolution and how it works in detail, while checking some of its properties by running our own little experiment on the MNIST dataset [2]. The code of these experiments can be found here: https://github.com/mkarmann/hyper-conv

Normal Convolution


First, let us take a short look at the normal 2D convolutions that we are all familiar with. Excluding the bias, Figure 1 shows the weights of a convolutional layer with a kernel size of 5x5 and 8 input and 8 output channels, after training the entire CNN on the MNIST classification task.

Figure 1: The weights of a single convolutional layer of a CNN trained on MNIST classification, excluding the bias. Each 5x5 heatmap corresponds to the weights of one kernel with w=5 and h=5. This convolutional layer has 8 input and 8 output channels and therefore 8·8=64 kernels.
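
If you want to reproduce this kind of visualization yourself, here is a minimal sketch using PyTorch and matplotlib. Note that the freshly initialized Conv2d below is just a stand-in; in practice you would take the layer from your own trained model (the exact layer names in the linked repository may differ).

import torch
import matplotlib.pyplot as plt

# Stand-in for a trained 5x5 layer with 8 input and 8 output channels
conv = torch.nn.Conv2d(8, 8, kernel_size=5, bias=False)
weights = conv.weight.detach().cpu().numpy()  # shape: (out=8, in=8, 5, 5)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for o in range(8):
    for i in range(8):
        axes[o, i].imshow(weights[o, i], cmap="viridis")
        axes[o, i].axis("off")
plt.tight_layout()
plt.show()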


Looking at this, the weights don’t seem to follow any pattern or shape; they almost look like noise. One reason for this is that there is no restriction on weights that are close to each other: each weight is trained independently. We will see later how the hyper-convolution changes this by introducing spatial awareness to the kernels. But before we do that, we need to take a look at a common problem with convolutional layers. If we ignore the bias again, we can calculate the number of trainable parameters of a convolutional layer with the following product:

#parameters = h · w · Nin · Nout

Here we have the height h and width w of the kernel as well as the number of input channels Nin and output channels Nout. The problem with this is that the number of parameters increases quickly when increasing the width and height of the kernel. For example, a 3x3 kernel with 10 input and 10 output channels results in 3·3·10·10=900 parameters, while a 6x6 kernel already results in 6·6·10·10=3600. Doubling the kernel size results in 4x the number of parameters! This is a problem, because a larger number of parameters makes the network more prone to simply memorizing the training data instead of generalizing, which leads to poor performance on the test set.
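
These numbers are easy to verify, for example with PyTorch (a small sketch, again ignoring the bias):

import torch.nn as nn

# h · w · Nin · Nout, bias excluded
small = nn.Conv2d(10, 10, kernel_size=3, bias=False)
large = nn.Conv2d(10, 10, kernel_size=6, bias=False)

print(small.weight.numel())  # 3·3·10·10 = 900
print(large.weight.numel())  # 6·6·10·10 = 3600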

As mentioned before, I trained a convolutional neural network on the MNIST dataset for this blog post. The network’s task is to classify grayscale images of handwritten digits from 0 to 9. This task is much simpler than the medical imaging experiments from the original paper and we have a sufficient amount of training and test data. Therefore, generalization is not an issue and we achieve an accuracy of 99% on our test set. But the authors of the paper also investigated another issue that we can take a look at: how robust the network is against noise in the images. CNNs are usually very sensitive to noise; even single-pixel changes, so-called adversarial attacks, can change the entire outcome [3]. By artificially adding noise to the test set, we can check the network’s ability to handle noisy images. Our normal CNN’s performance on this can be seen in Figure 2.

Figure 2: The effect of noise on the CNN’s classification accuracy. A salt & pepper noise level of 0 means no pixel was altered, while 1 means the entire image is random, making it unpredictable.


The plot clearly shows that adding more noise results in less accurate predictions. Starting with the salt & pepper noise, the accuracy drops below 50% when about 30% of the pixels are altered, and falls to essentially random guessing when about 60% of the pixels are noise (10% accuracy is what you expect when guessing uniformly among 10 classes). For the Gaussian noise, we observe a similar decline in accuracy. Now that we have looked at the normal CNN, it will be interesting to see how the hyper-convolution compares and what improvements we can observe!
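
As a rough idea of how such a robustness test can be set up, here is a minimal sketch that corrupts a batch of images with salt & pepper noise at a given rate. The exact implementation used for these plots lives in the linked repository and may differ in its details.

import torch

def salt_and_pepper(images: torch.Tensor, rate: float) -> torch.Tensor:
    """Replace roughly a fraction `rate` of the pixels with random black/white values."""
    noisy = images.clone()
    corrupt = torch.rand_like(images) < rate        # which pixels to alter
    salt = (torch.rand_like(images) < 0.5).float()  # random 0/1 values
    noisy[corrupt] = salt[corrupt]
    return noisy

# Example usage (assumes images scaled to [0, 1] and a trained classifier `model`):
# noisy_batch = salt_and_pepper(test_batch, rate=0.3)
# predictions = model(noisy_batch).argmax(dim=1)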

Hyper Convolution


Introducing: the hyper-convolution. Instead of training the weights of a convolutional layer directly, we use another neural network to predict those weights. This concept is illustrated in Figure 3. The Hyper-Network is a simple fully connected neural network. It takes a position as input and returns, for each kernel, its weight at that position. So for a kernel with height 5 and width 5, this network runs 5·5=25 times to predict all the weights of the entire convolutional layer. With this trick, we can now increase the size of the kernel as much as we want without influencing the number of learnable parameters: if we have a bigger kernel, we just have to run the Hyper-Network on a wider range of positions and predict more weights. This way, the number of parameters is independent of the kernel size!

Figure 3: Illustration of the Hyper-Network concept. The Hyper-Network takes the position of a kernel cell as input and predicts, for each kernel, the weight at that position.
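
To make the idea a bit more concrete, below is a minimal PyTorch sketch of such a layer. This is not the authors’ reference implementation (see the original paper and the linked repository for that), just an illustration of the concept: a small MLP maps each kernel position (x, y) to one weight per input/output channel pair, and the resulting kernel is then used in a regular convolution. The class and argument names are made up for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv2d(nn.Module):
    """Sketch of a hyper-convolution: the kernel weights are predicted from positions."""

    def __init__(self, in_channels, out_channels, kernel_size, hidden=16, last_hidden=4):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size

        # Hyper-Network: maps a 2D position (x, y) to one weight per (out, in) channel pair
        self.hyper_net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, last_hidden), nn.ReLU(),  # small last hidden layer (NL)
            nn.Linear(last_hidden, out_channels * in_channels),
        )

        # Fixed grid of kernel positions in [-1, 1] x [-1, 1], shape (k*k, 2)
        coords = torch.linspace(-1, 1, kernel_size)
        ys, xs = torch.meshgrid(coords, coords, indexing="ij")
        self.register_buffer("positions", torch.stack([xs, ys], dim=-1).reshape(-1, 2))

    def forward(self, x):
        k = self.kernel_size
        # Predict the weights for all k*k positions (done here in one batched pass)
        w = self.hyper_net(self.positions)                        # (k*k, out*in)
        w = w.reshape(k, k, self.out_channels, self.in_channels)  # (k, k, out, in)
        w = w.permute(2, 3, 0, 1)                                 # (out, in, k, k)
        return F.conv2d(x, w, padding=k // 2)

# Example usage:
# layer = HyperConv2d(8, 8, kernel_size=5)
# out = layer(torch.randn(1, 8, 28, 28))  # output shape: (1, 8, 28, 28)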


But let us look in a bit more detail at the actual number of parameters required. As said before, the convolution itself no longer has any trainable parameters; all the parameters we care about now are the weights of the Hyper-Network. Excluding the bias again, these can be calculated with the following formula:

#parameters = Nin · Nout · NL + Σ (from i=1 to L−1) Ni · N(i+1) + 2 · N1

As expected, there is no longer an h or w for the height or width of the kernel in the calculation. But now we have some new variables from the Hyper-Network: Ni is the number of neurons in hidden layer i, and L is the number of hidden layers (the 2·N1 term comes from the two position inputs x and y). Our main interest here is the term to the left of the summation, Nin·Nout·NL, because in neural networks we often increase the number of input and output channels Nin and Nout the deeper we get. Assuming that we don’t want to change the number of channels, we can reduce the number of parameters by using fewer neurons NL in the last hidden layer. This last layer can be used to restrict the capability of the Hyper-Network.
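
To get a feeling for the numbers, here is a small sketch that evaluates both formulas for an example layer with 64 input and output channels. The hidden layer sizes are made up for illustration; the point is that the last hidden layer NL dominates the Nin·Nout·NL term.

def conv_params(h, w, n_in, n_out):
    # Standard convolution: h · w · Nin · Nout (bias ignored)
    return h * w * n_in * n_out

def hyper_params(n_in, n_out, hidden_sizes):
    # Hyper-Network MLP: 2 position inputs -> hidden layers -> Nin·Nout outputs (bias ignored)
    sizes = [2] + list(hidden_sizes) + [n_in * n_out]
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

print(conv_params(5, 5, 64, 64))          # 102400
print(hyper_params(64, 64, [16, 16, 4]))  # 2·16 + 16·16 + 16·4 + 4·4096 = 16736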

So you might be wondering: what does it mean to restrict the Hyper-Network’s capability, and can we actually see this restriction in the predicted weights themselves? The answer to that question is yes, indeed! Figure 4 shows these predicted weights after training the Hyper-Network.

Figure 4: The predicted weights of a single hyper-convolutional layer of a CNN trained on the MNIST classification task. Excluding the bias, each heatmap shows the predicted weights of a 5x5 kernel.


The weights in each kernel are much smoother. This happens because we feed positions as input to the Hyper-Network: kernel weights at nearby positions now take similar values, so we have introduced some sort of spatial awareness to the kernel weights. By restricting the number of neurons NL in the final hidden layer, we can force the Hyper-Network to predict simpler weight patterns, essentially making the Hyper-Network less powerful. This enables us to control how complex the kernels’ weight patterns can get.

Now let’s take a look at how well the hyper-convolution performs on the MNIST task. First, the test accuracy is similar to that of the normal CNN: 98% instead of 99%. This is still good, because it shows that we reach similar performance with fewer parameters. But more importantly, if we take a look at Figure 5, we can see that the hyper-convolution performs clearly better in the noisy cases.

Figure 5: Comparing the effects of noise on a normal CNN vs. a CNN using hyper-convolutions. Both network types were evaluated with the same noise seed in order to keep the comparison as fair as possible.


The hyper-CNN gets distracted far less by noisy images. While the normal CNN’s accuracy drops to about 50% when 30% of the pixels are altered, the hyper-CNN only reaches that point when about 40% of the pixels are altered. A similar result can be observed for the Gaussian noise. So while the hyper-network may lag 1% behind in accuracy on clean images, as soon as there is some noise present, it clearly outperforms the normal convolutional neural network. This is also what the authors of the paper observed, and they concluded that it is due to the smoother kernels. They also point to research analyzing the effect of smooth kernels on noisy inputs, which shows that smooth kernels in the first layers of a CNN make the network more robust against noise and adversarial attacks [4].

Conclusion


Hyper-convolutions are a great way to introduce spatial information into the kernels of convolutional neural networks while reducing the number of learnable parameters. Especially for small datasets, they seem to improve generalization quite well. And for easier tasks like MNIST classification, where generalization is not such an issue anymore, they still show more robustness against noise.

For more details on how the MNIST experiments were done and what differences there are compared to the original paper’s implementation, please refer to the README.md of the repository: https://github.com/mkarmann/hyper-conv

References


[1] Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., ... & Le, Q. V. (2023). Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675.

[2] LeCun, Y., Cortes, C. & Burges, C. (n.d.). THE MNIST DATABASE of handwritten digits. Retrieved June 15, 2023, http://yann.lecun.com/exdb/mnist/

[3] Balda, E. R., Behboodi, A., & Mathar, R. (2020). Adversarial examples in deep neural networks: An overview. Deep Learning: Algorithms and Applications, 31-65.

[4] Wang, H., Wu, X., Huang, Z., & Xing, E. P. (2020). High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8684-8694).


 
