This is a blog post about the paper 'Curriculum Loss: Robust Learning and Generalization against Label Corruption' by Yueming Lyu and Ivor W. Tsang published at ICLR 2020 [1].
Introduction
In this blog post the problem of noisy data in training sets for neural networks is addressed. Noisy data means that the dataset contains wrongly labeled data. This can happen for example because of website crawlers which search the internet for usable data and automatically labels them sometimes not right. Another reason is that data can have similar traits like looking a specific way that makes it hard for humans to label them correctly.
Figure 1: Which ones are dogs and which ones are towels? [2]
This example shows that there will be probably some towels labeled as dogs if someone is creating a towel vs. dog dataset because the images are really similar.
This raises the question if training with such noisy data will create problems with the results and if so if we have to stop using web crawlers or if there is a solution to this problem.
To address the first question: yes, it is not a good idea to train with noisy data when using ordinary training strategies. Usually someone would train a network with some training distribution and would later apply the finished model to some test distribution that is similar to the training distribution.
If the distributions were different it would be like learning for a biology test but at test time you would be told to talk about the biological processes on the moon. It is kind of the same but also it's not.
There is also the problem that a deep neural network is really good with memorizing data but that's the last thing that someone would want for a training with wrong labels. To put it short, it would result in bad generalization skills of your trained network.
The problem is that we don't want to stop using web crawlers just to avoid this issue, how else would we get big datasets? And if you have a different method of creating one there is still the problem of data with similar traits.
So let's start the search for a solution by looking at how people handled noisy data in the past. The used techniques can be sorted into three areas:
Transition based methods try to change the architecture of the network to counteract the issue [3], regularization based methods are using model regularization or adversarial training [4] and then there are sample selection based methods(SSBM) [5] (The authors of the paper saw SSBM as the most promising approach). It tries to select the correctly labeled samples for training and is based on the idea of curriculum learning where a model is trained with samples that are ordered in a meaningful way. To use the example of learning for a biology test again, it is like starting with studying a simple leaf before moving on the the whole tree and then the whole forrest. The topic slowly starts to get more and more difficult.
All of these methods are working but they are only heuristic based.
That's why the authors concluded that they need some kind of a loss function that is robust to outliers, to these wrong labels. That means that a model using this loss function should classify the samples with the right label correctly and misclassify the noisy samples because the correct answer is not known. Moreover the loss function should be easy and fast to use because like every loss function it will be used for mini batch updates.
The simplest loss function is possibly the 0-1 loss: J(y,\hat{y}) = \begin{cases} 1 & \text{if $y \neq \hat{y} $} \\ 0 & \text{if $y = \hat{y} $} \end{cases} with target label y and output label \hat{y}
Figure 2: Comparison of different loss functions with the 0-1 loss in black [6]
It treats each sample the same and doesn't give some of them more influence that others. As a result, it should handle a small number of outliers or wrongly labeled samples. This is especially true if you compare the 0-1 loss to the hinge loss (green doted line in the graph). The more different the real and outputed label are getting the more the loss function punishes the model. That's definitely not the right behavior because it's not the models fault that there is no logic behind the noisy labels. This all points to just using the 0-1 loss to solve our problem with noisy data but there are also downsides to using it. The loss function is difficult to optimize because it has zero gradient everywhere except the breaking point and it can only handle a small amount of wrong labels but that's not enough most of the time.
As a result, the authors concluded that they want a upper bound of the 0-1 loss because when you minimize an upper bound you are minimizing the base function as well. This upper bound should be easier to optimize and should handle a good amount of noise. And because they were not the only ones thinking of creating a 0-1 loss upper bound, they want to be tighter than every other upper bound.
Proposed Solution
The authors therefore proposed two losses: curriculum loss and noise pruned curriculum loss to tackle the problem of label corruption.
Curriculum Loss (CL)
They are claiming that their novel 'curriculum loss' is a tighter upper bound of 0-1 loss, that it can be used easy and fast in mini batch updates and that it somehow automatically selects samples for training (that's why they called it curriculum loss).
To understand the CL function we need to start again with the 0-1 loss. This is the 0-1 loss from the beginning with target label y and output label \hat{y}:
J(y,\hat{y}) = \begin{cases} 1 & \text{if $y \neq \hat{y} $} \\ 0 & \text{if $y = \hat{y} $} \end{cases} |
But in the paper they used a function that looks different (classification margin u = y\hat{y} and indicator function 1(u < 0) = \begin{cases} 1 & \text{if $u < 0$} \\ 0 & \text{otherwise} \end{cases}):
J(u) = 1(u < 0) |
To show you that they are the same take a look at this truth table:
y | \hat{y} | u | J(y,\hat{y}) | J(u) |
---|---|---|---|---|
1 | -1 | -1 | 1 | 1 |
1 | 1 | 1 | 0 | 0 |
-1 | -1 | 1 | 0 | 0 |
-1 | 1 | -1 | 1 | 1 |
As you can see they produce the same results.
Now that we have the 0-1 loss function used in the paper let's take a look at an upper bound of it. The easiest way to get an upper bound function of something is just to use a function that is always bigger. That's the idea behind this function (l(u)\geq 1(u<0)):
\widehat{J}(u) = \sum_{i=1}^{n} l(u_i) |
Because we have all the parts now we can continue to understanding the curriculum loss function:
Q(u) = \displaystyle \min_{v \in \{0,1\}^n } \max (\sum_{i=1}^{n} v_i l(u_i), n- \sum_{i=1}^{n} v_i+ \sum_{i=1}^{n} 1(u_i < 0)) |
It contains our 0-1 loss and the simple upper bound from before. Then only the maximum is used and because it is a loss function it gets minimized. This minimization where they select the samples that minimized the formula is exactly the reason that the function selects samples and is called curriculum loss. Furthermore it can be proven that the curriculum is an upper bound of the 0-1 loss but tighter than the simple upper bound function, therefore it is a tighter upper bound.
The last claim that we need to investigate is that CL is easy and fast. To do that let's take a look at how to use it:
Figure 3: The algorithm of curriculum loss
As you can see first the classification margin and the loss values of all samples are getting calculated and sorted in an ascending order. That means that the algorithm starts with looking at the sample with the smallest loss value first. Then the idea is that all the loss values will be summed up in L and L gets compared to a threshold to decide if a sample will get selected. To be more exact the algorithm will iterate through all samples, accumulate their loss values and then compare the sum with a threshold. If it is smaller or equal to the threshold the sample will get selected and if the sum is bigger the current sample won't get picked. Then it continues with the next sample.
To show you how this procedure looks like in reality look at this graph:
Figure 4: Shows the selection process with loss l1 \leq l2 \leq l3 \leq l4 and sample v_i [7]
On the x-axis you can see the different iteration steps and on the y-axis the (loss) values. The red line shows the threshold and the black and red crossed dot the samples that correspond to the loss values shown on the left. The first sample will get picked from the algorithm because the loss value is below the red threshold. The same reason holds for sample two and three. But for the forth sample the value is not below the threshold anymore (see red arrow) and that's why this sample won't get picked. This continues for all of the samples.
It can be proven that the run time of this algorithm is in O(nlogn) and as you saw it's not using any complex operation. It was just summing and comparing values and that's why you can say that CL is easy and fast.
Noise Pruned Curriculum Loss (NPCL)
The authors also proposed another loss function that is a generalized version of CL called 'noise pruned curriculum loss'. This loss function should be able to handle a larger amount of noise because it prunes the noisy samples during training.
To begin with let's take a look again at the CL function:
Q(u) = \displaystyle \min_{v \in \{0,1\}^n } \max (\sum_{i=1}^{n} v_i l(u_i), n- \sum_{i=1}^{n} v_i+ \sum_{i=1}^{n} 1(u_i < 0)) |
and that's the function of the noise pruned curriculum loss:
L(u) = \displaystyle \min_{v \in \{0,1\}^{(1- \epsilon)n} } \max (\sum_{i=1}^{(1- \epsilon)n} v_i l(u_i) , (1- \epsilon)n- \sum_{i=1}^{(1- \epsilon)n} v_i) |
They definitely look really similar but NPCL doesn't use the 0-1 loss as a part of the maximum and it only sums up until (1-\epsilon)n and not until n (the number of samples). The \epsilon resembles the amount of wrong labels in the dataset so called noise rate and (1-\epsilon)n is the number of correctly labeled samples respectively. Summing up only to (1-\epsilon)n then means that the loss function only looks at the clean samples and cuts or pruned the rest, the noisy ones. This behavior explains how NPCL is able to handle a larger amount of noise.
Empirical Study
After looking at the theory behind CL we will now move on to the experiments.
First of all, we need to choose a loss function because CL integrated \widehat{J}(u) = \sum_{i=1}^{n} l(u_i) that uses l(u) and its only prerequisite was to be larger that the indication function 1(u \leq 0).
The authors of the paper chose the hinge loss but it is possible to choose every other existing loss function as well.
They compared CL and NPCL as a baseline with the cross entropy loss, a generalized version of it called generalized cross entropy and three methods which are using one ore two trained models to select the clean samples. They are called Co-Teaching (two model select samples to train each other), Co-Teaching+ and MentorNet (a model learns from a mentor net that selects training samples) all of which can be sorted into the sample selection based methods from the beginning.
As for the architecture they used a network architecture proposed in the Co-Teaching paper, DenseNet, ResNet and VGG and trained them on MNIST, CIFAR10, CIFAR100 and the ImageNet.
To introduce noise into the datasets different methods were used: symmetry flipping where the wrong labels will be drawn uniformly from K-1 classes, pair flipping where the corrupted label is similar to the right one and semantic noise where a weak classifier assigns the labels which results in noise samples.
The authors measured the performances based on test accuracy (percentage of misclassification) and label precision. For label precision (LP) the number of clean samples were divided through the number of selected samples to measure selection accuracy. A model with high LP would update with less noise data which would result in a more robust performance.
Lastly an interesting note: they used a burn-in-period. This means that they did a full batch training before using their methods and that the network had time to look at all the samples beforehand.
The Results
Figure 5: Comparison of NPCL with the different baseline methods with 35% pair flipping as label corruption on MNIST
As you can see here NPCL in red performed the best and the standard and generalized cross entropy loss the worst in both performance measurements (test accuracy and label precision). What is interesting about this diagram is that it shows that the sample selection based methods Co-Teaching(+) and MentorNet got good results but not as good as NPCL despite having special models trained for selecting clean samples and not 'just' a different loss function like NPCL.
Figure 6: Comparing the effect of different networks on the different methods
This time the impact of different architectures were compared and CL was also used in this comparison. For CIFAR10 CL got better results and for CIFAR100 NPCL performed the best. The interesting fact here is that the results of the different methods don't differ much from each other compared to the big difference between them on the first diagram.
Figure 7: Comparison of the different methods on Tiny-ImageNet dataset with 20% and 50% symmetric noise
For this experiment they used the Tiny ImageNet dataset, two different noisy rate with a higher percentage on the right and plotted the variance as well. As you can see NPCL performed again the best and only slightly worse on the higher noise rate. Because the variance is plotted you can see how the results differ between the five runs especially for CL. If you compare the results on both diagrams you can see that the variance for CL increased with the higher noise rate but for NPCL there is only a slight difference. This can be seen as a hint that NPCL can perform better for larger noise rates.
Conclusion
To sum up this blog post we saw that curriculum loss is a tighter upper bound of the 0-1 loss and that both curriculum loss and noise pruned curriculum loss can be used for training with a dataset containing wrong labels called robust learning. Later when we looked at the results of some experiments they showed that CL and NPCL performed better than compared method on noisy datasets so they are more robust than these baselines. This was particularly interesting because the methods that performed the best out of the baseline methods were the ones using special models trained to find noisy samples. But they were still not as good as the network using CL or NPCL which are only loss functions.
Maybe you can take home from this that bigger or more models are not always the best strategy to solve problems.
Curriculum Loss in medical application
Because this blog post was written for the Seminar 'Deep Learning for Medical Applications' we should also take a look at how all of this could be applied to the medical field. Firstly, there is something called interobserver variability which means that different people that label data will not always agree which label is the correct one. This is especially a problem if the data has something to do with rare illnesses which are difficult to recognize. Furthermore, like in every field not a lot of data is available so the researchers have to make the most out of it and should be able to use noisy data as well. Something that is unique in medicine compared to other areas is that one decision can possibly decide over life or death like if cancer was correctly classified or overlooked. And medicine is definitely the last field were you want your model to memorize wrong labels. This shows that label corruption is also a problem in medicine and CL and NPCL could be helpful there as well.
Own View of the paper
Lastly a bit about what I was thinking while reading the paper 'Curriculum Loss: Robust Learning and Generalization against Label Corruption' and writing this blog post. I definitely think that this whole topic of label corruption is important to investigate and work on. Datasets are such a big part of the whole AI world and how to get better ones or use them in the best way possible deserves a lot more attention. This paper is definitely a good step in the right direction.
It was well structured and the topics were nicely explained. That's why I would recommend to you to read it yourself, the authors looked at a lot more than I could possibly put in a short blog post.
But despite all of this they were two small points I wish they did better in the Empirical Study section. While talking a lot about both curriculum loss and noise pruned curriculum loss most of the diagrams only compared NPCL with the baselines and I think it would also be interesting to investigate the difference between CL and NPCL in experiments a bit more. Secondly all of the diagrams except one displayed only the average over five runs and not the variance as well to show how different the results were.
But all in all it was definitely an interesting paper.
References
[1] Yueming Lyu and Ivor W. Tsang. "Curriculum Loss: Robust Learning and Generalization against Label Corruption". In: ICLR 2020.
[2] https://imgur.com/a/K4RWn 20.06.2020
[3] Jacob Goldberger and Ehud Ben-Reuven. “Training deep neural-networks using a noise adaptation layer”. In: ICLR 2017.
[4] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. “Adversarial training methods for semi-supervised text classification”. In: arXiv preprint arXiv:1605.07725 (2016).
[5] Lu Jiang et al. “Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels”. In: arXiv preprint arXiv:1712.05055 (2017).
[6] Peter Wittek. "Quantum Machine Learning: What Quantum Computing Means to Data Mining". 2014. isbn: 9780128009536.
[7] Yueming Lyu. "Curriculum Loss: Robust Learning and Generalization against Label Corruption". Presentation at ICLR 2020.