1. Introduction
The datasets used to train large neural networks are either crowd-sourced [1] or downloaded by web crawlers [2]. Both are prone to error, whether through human mistakes in annotation or through the wrong images being downloaded (for example, searching online for images of the jaguar animal may also return images of Jaguar cars). For a large dataset, collecting high-quality labels and then cleaning them is time-consuming and expensive. Large neural networks tend to memorize cleanly labeled training examples first and noisily labeled ones later, so if the training set contains noisy labels the network eventually memorizes wrong mappings, which deteriorates the model's performance and yields poor results on the test set [3].
2. Previous Work
1. Label Transition Matrix Estimation
F-correction: estimates a noise transition matrix T, a CxC matrix where C is the number of classes and T_{ij} is the probability of the true label i flipping into label j, and uses T to correct the loss. However, this technique fails under high noise (e.g., 60% asymmetric noise), and T is hard to estimate when the number of classes is large [4].
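As an illustration only (not the authors' code), a minimal PyTorch-style sketch of forward loss correction with an assumed known transition matrix T might look like this:

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, T):
    """Forward correction: mix the clean-class posteriors through T before the log-loss.

    logits:       (batch, C) raw network outputs
    noisy_labels: (batch,) observed (possibly wrong) labels
    T:            (C, C) row-stochastic matrix, T[i, j] = P(noisy label = j | true label = i)
    """
    p_clean = F.softmax(logits, dim=1)      # P(true class | x)
    p_noisy = p_clean @ T                   # P(observed label | x)
    return F.nll_loss(torch.log(p_noisy + 1e-12), noisy_labels)
```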
2. Sample Selection and Small-Loss Instances
I) Co-teaching: two networks are trained simultaneously; each network selects the small-loss instances in a mini-batch and feeds them to the other network for training, so that each network filters different errors. However, as the number of epochs increases the two networks gradually reach consensus, and co-teaching reduces to self-training MentorNet [5].
II) Co-teaching+: same as co-teaching but with a disagreement strategy. For each mini-batch, the data is passed through both networks and only the instances on which the two networks disagree are selected; co-teaching is then applied to these samples. The disagreement strategy keeps the two networks diverse and prevents them from converging to each other [6]. A sketch of the small-loss exchange used by co-teaching follows this list.
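As a rough sketch (assumed code, not taken from either paper), the small-loss exchange at the heart of co-teaching could be written as:

```python
import torch
import torch.nn.functional as F

def coteaching_step(net1, net2, opt1, opt2, x, y, keep_ratio):
    """One co-teaching update: each network trains on the peer's small-loss samples."""
    with torch.no_grad():                     # selection pass, no gradients needed
        loss1 = F.cross_entropy(net1(x), y, reduction="none")
        loss2 = F.cross_entropy(net2(x), y, reduction="none")
    k = int(keep_ratio * len(y))
    idx1 = torch.argsort(loss1)[:k]           # samples net1 believes are clean -> train net2
    idx2 = torch.argsort(loss2)[:k]           # samples net2 believes are clean -> train net1

    opt1.zero_grad()
    F.cross_entropy(net1(x[idx2]), y[idx2]).backward()
    opt1.step()

    opt2.zero_grad()
    F.cross_entropy(net2(x[idx1]), y[idx1]).backward()
    opt2.step()
```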
3. Correcting Noisy Labels to True Labels
I) Joint-Optimization: uses a backbone network trained with a high learning rate, exploiting the observation that a high learning rate suppresses the memorization ability of a deep neural network and prevents it from completely fitting the noisy labels. The model's weights and the dataset's labels are updated alternately [7].
II) PENCIL: same idea as Joint-Optimization but with a different loss function; each image's label is represented by a label probability distribution, which is updated by backpropagation alternately with the model's weights [8].
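As an assumed, simplified illustration (not the authors' implementation, and omitting PENCIL's extra compatibility and entropy terms), maintaining and updating a per-sample label distribution alongside the model might look like:

```python
import torch
import torch.nn.functional as F

def label_distribution_step(model, label_logits, idx, x, model_opt, label_lr=100.0):
    """Jointly refine model weights and the soft labels of the current batch.

    label_logits: (N, C) leaf tensor with requires_grad=True, one row per training
                  sample, initialized from the (possibly noisy) one-hot labels.
    idx:          indices of the current batch into label_logits.
    """
    soft_labels = F.softmax(label_logits[idx], dim=1)
    log_probs = F.log_softmax(model(x), dim=1)
    # cross-entropy between the current soft labels and the model's predictions
    loss = -(soft_labels * log_probs).sum(dim=1).mean()

    model_opt.zero_grad()
    label_logits.grad = None
    loss.backward()
    model_opt.step()                          # update the network weights

    with torch.no_grad():                     # update the label distribution itself
        label_logits -= label_lr * label_logits.grad
    return loss.item()
```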
4. Graph-Based Techniques
I) Graph Convolutional Label Noise Cleaner: corrects low-confidence anomaly scores using high-confidence ones. Two GCNs are used:
a) feature similarity: anomalous snippets tend to share similar characteristics;
b) temporal consistency: anomalous snippets tend to appear in temporal proximity to each other [9].
II) GCN for learning with few clean and many noisy labels: a GCN predicts a class-relevance score for each image based on its graph connections to clean images; the GCN acts as a binary classifier discriminating clean from noisy data, and its output probabilities serve as relevance scores used to reweight the samples [10].
III) Global-Local GCN (Face-Graph): large-scale label noise cleansing for face recognition; a graph-based node-prediction technique decides whether a node is signal (true identity) or noise (false identity) [11].
3. Methodology
The non-graph-based approaches discussed above discard too many instances and are unreliable, while the graph-based approaches are either limited to a specific domain or require a clean subset of the data. The proposed approach uses a GNN that needs neither a clean subset nor a specific domain: instead of weighting every input equally, it reweights each input and minimizes a weighted loss function:
\min_{\theta} ∑_{i=1}^{p} w_i f_i(\theta)
where p is the number of samples in the batch, f_i(\theta) is the i^{th} instance loss, and w_i is the weight of that sample (see figure 1: concept of iterative optimization).
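A minimal sketch of this weighted objective, assuming the per-instance weights come from the graph module (the helper name is mine):

```python
import torch
import torch.nn.functional as F

def weighted_batch_loss(logits, labels, weights):
    """Weighted loss over a batch: sum_i w_i * f_i(theta).

    logits:  (p, C) model outputs for the batch
    labels:  (p,)   (possibly noisy) labels
    weights: (p,)   per-instance weights, e.g. produced by the graph module
    """
    per_instance = F.cross_entropy(logits, labels, reduction="none")  # f_i(theta)
    return (weights * per_instance).sum()
```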
This technique captures the structural relationships among labels at two levels: the instance level and the distribution level.
It refines the instance-level relations using the instance similarity distribution obtained from the distribution-level relations, and it consists of two phases executed alternately (see figure 1):
I) reasoning phase: obtain the instance similarity distribution features and use them to refine the instance graph;
II) classification phase: the reconstructed instance graph is used to make reliable predictions.
Two graphs are used, and the procedure is executed for K iterations:
1) Instance Graph: at k = 0 the instance graph nodes are initialized with embeddings obtained by passing each image through a CNN and then through a fully connected network f_{emb}, which outputs an m-dimensional feature vector (see figure 2).
The instance similarity edge weights E_{ij}^{I(k)} are then computed: a dissimilarity vector between nodes i and j, which represents how far apart adjacent nodes are in Euclidean space, is passed through a network f_E^{I} that outputs a probability of similarity, and this probability is multiplied by the previous iteration's edge weight:
E_{ij}^{I(k)} = f_E^{I}(dis_{ij}^{I(k)}) · E_{ij}^{I(k-1)}
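A rough PyTorch sketch of this instance-graph edge update (the module name, layer sizes, and the absolute-difference dissimilarity are my assumptions):

```python
import torch
import torch.nn as nn

class InstanceEdgeUpdate(nn.Module):
    """Update instance-graph edge weights from pairwise node dissimilarities."""
    def __init__(self, m):
        super().__init__()
        # f_E^I: maps an m-dimensional dissimilarity vector to a similarity probability
        self.f_E = nn.Sequential(nn.Linear(m, m), nn.ReLU(),
                                 nn.Linear(m, 1), nn.Sigmoid())

    def forward(self, V, E_prev):
        """V: (p, m) node embeddings, E_prev: (p, p) previous edge weights."""
        dis = (V.unsqueeze(1) - V.unsqueeze(0)).abs()   # (p, p, m) pairwise dissimilarities
        sim = self.f_E(dis).squeeze(-1)                 # (p, p) similarity probabilities
        return sim * E_prev                             # E^{I(k)} = sim * E^{I(k-1)}
```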
2) Distribution Graph: each node is a p-dimensional vector V_i obtained by aggregating the edges of the instance graph; the j^{th} entry of the i^{th} vector encodes the relation between the i^{th} and j^{th} samples.
At k = 0, the node vector V_i^{D(0)} is obtained directly from the aggregation of the instance graph edges incident to node i.
For k > 0, two quantities are combined:
∑_{j=1}^{p} E_{ij}^{I(k)}: the aggregation of the instance graph edge weights connecting node i to the other nodes j;
∑_{j=1}^{p} V_j^{D(k-1)}: the aggregation of the previous iteration's distribution graph node vectors.
These two vectors are concatenated and passed through a network f_V^{D} to obtain the p-dimensional node vector V_i^{D(k)}.
Dissimilarity vectors dis^{D(k)} are then computed between the distribution features of the distribution graph nodes and passed through a network f_E^{D}, which outputs a probability of similarity from the perspective of the global distribution; this probability is multiplied by the previous iteration's edge weight:
E_{ij}^{D(k)} = f_E^{D}(dis_{ij}^{D(k)}) · E_{ij}^{D(k-1)}
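A similarly hedged sketch of the distribution-graph update, treating the i-th row of the instance edge matrix as node i's aggregated relation vector (an assumption on my part):

```python
import torch
import torch.nn as nn

class DistributionGraphUpdate(nn.Module):
    """Build distribution-graph node vectors and edge weights (dimensions assumed)."""
    def __init__(self, p):
        super().__init__()
        # f_V^D: fuses aggregated instance edges with previous distribution nodes
        self.f_V = nn.Sequential(nn.Linear(2 * p, p), nn.ReLU())
        # f_E^D: maps a p-dimensional dissimilarity vector to a similarity probability
        self.f_E = nn.Sequential(nn.Linear(p, 1), nn.Sigmoid())

    def forward(self, E_inst, V_dist_prev, E_dist_prev):
        """E_inst: (p, p) instance edges; V_dist_prev, E_dist_prev: previous-iteration state."""
        agg_edges = E_inst                                   # row i: relations of sample i
        fused = torch.cat([agg_edges, V_dist_prev], dim=1)   # (p, 2p)
        V_dist = self.f_V(fused)                             # (p, p) new node vectors
        dis = (V_dist.unsqueeze(1) - V_dist.unsqueeze(0)).abs()  # (p, p, p) dissimilarities
        sim = self.f_E(dis).squeeze(-1)                          # (p, p) similarity probs
        return V_dist, sim * E_dist_prev                         # E^{D(k)} = sim * E^{D(k-1)}
```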
Refining Instance Graph
The distribution graph is used to refine the instance graph nodes.
∑_{j=1}^{p} ( E_{ij}^{D(k)} || V_j^{I(k-1)} ): the distribution graph edge weights E_{ij}^{D(k)} are concatenated with the previous iteration's instance graph node vectors V_j^{I(k-1)} and then summed over the batch.
This summed vector and the previous iteration's instance graph node vector V_i^{I(k-1)} are passed through a network f_V^{I} to obtain the m-dimensional refined node vector V_i^{I(k)}.
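A hedged sketch of this refinement step; for simplicity it aggregates the previous instance node vectors weighted by the distribution edge weights rather than concatenating and summing, so it is an approximation of the step described above:

```python
import torch
import torch.nn as nn

class InstanceNodeRefine(nn.Module):
    """Refine instance-graph node vectors using distribution-graph edge weights."""
    def __init__(self, m):
        super().__init__()
        # f_V^I: fuses the aggregated neighborhood message with the node's previous state
        self.f_V = nn.Sequential(nn.Linear(2 * m, m), nn.ReLU())

    def forward(self, E_dist, V_inst_prev):
        """E_dist: (p, p) distribution edges; V_inst_prev: (p, m) previous instance nodes."""
        # sum over j of E_ij^{D(k)}-weighted previous node vectors V_j^{I(k-1)}
        message = E_dist @ V_inst_prev                      # (p, m)
        fused = torch.cat([message, V_inst_prev], dim=1)    # (p, 2m)
        return self.f_V(fused)                              # refined V^{I(k)}, (p, m)
```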
Loss Function
The class prediction of each node is obtained by multiplying the one-hot encodings of the neighboring samples' labels \hat{y}_j (the label of the j^{th} example) by the corresponding edge weights and passing the result through a softmax:
P_i^{I(k)} = softmax( ∑_{j≠i} E_{ij}^{I(k)} · onehot(\hat{y}_j) )
Instance loss (cross-entropy with the ground-truth label y_i):
L_I^{(k)} = − ∑_{i=1}^{p} log P_i^{I(k)}(y_i)
Distribution loss (defined analogously from the distribution graph edges E_{ij}^{D(k)}):
L_D^{(k)} = − ∑_{i=1}^{p} log P_i^{D(k)}(y_i)
Total loss (summed over the K iterations):
L = ∑_{k=1}^{K} ( \lambda_I L_I^{(k)} + \lambda_D L_D^{(k)} )
where \lambda_I and \lambda_D are hyperparameters.
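A minimal sketch of these graph-based predictions and losses for a single iteration (the cross-entropy form is my assumption for the unstated loss expressions):

```python
import torch
import torch.nn.functional as F

def graph_losses(E_inst, E_dist, labels, num_classes, lam_I=1.0, lam_D=0.1):
    """Edge-weighted neighbor voting followed by cross-entropy, for each graph.

    E_inst, E_dist: (p, p) edge-weight matrices of the two graphs
    labels:         (p,)   labels of the batch
    """
    one_hot = F.one_hot(labels, num_classes).float()       # (p, C)
    # exclude each node's own vote before aggregating its neighbors
    mask = 1.0 - torch.eye(len(labels), device=labels.device)
    logits_I = (E_inst * mask) @ one_hot                   # (p, C)
    logits_D = (E_dist * mask) @ one_hot
    loss_I = F.cross_entropy(logits_I, labels)             # instance loss
    loss_D = F.cross_entropy(logits_D, labels)             # distribution loss
    return lam_I * loss_I + lam_D * loss_D
```

In the full method this quantity would be accumulated over the K iterations before backpropagation.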
4. Experiments and Results
The method was evaluated on the CIFAR-10, CIFAR-100 [12] and Clothing-1M [13] datasets.
Clothing-1M was extracted from tagged clothing pictures uploaded online and therefore probably contains labeling errors [13], while CIFAR-10 and CIFAR-100 were artificially corrupted using:
I) Symmetric flipping: the labels of a given percentage of samples are randomly replaced with one of the other possible labels [14];
II) Asymmetric flipping: mistakes occur only between very similar classes, e.g., Truck → Automobile or Bird → Airplane [15]. A noise-injection sketch follows this list.
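For illustration (assumed code, not the authors'), injecting symmetric and asymmetric label noise into a label array could look like:

```python
import numpy as np

def symmetric_flip(labels, noise_rate, num_classes, rng=np.random.default_rng(0)):
    """Replace a fraction of labels with a uniformly chosen different class."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    offsets = rng.integers(1, num_classes, size=flip.sum())
    noisy[flip] = (noisy[flip] + offsets) % num_classes   # never maps back to the same class
    return noisy

def asymmetric_flip(labels, noise_rate, pair_map, rng=np.random.default_rng(0)):
    """Flip labels only within confusable pairs, e.g. {truck: automobile, bird: airplane}."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for src, dst in pair_map.items():
        noisy[flip & (labels == src)] = dst               # masks use the original labels
    return noisy
```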
The Clothing-1M dataset was used in three settings:
I) the noisy dataset alone, without any extra clean data;
II) the verification labels are not used to train the model directly but to initialize the GNN, so that the graph has a stable initial topology;
III) both the noisy dataset and the 50K clean labels: the model is first trained on the noisy data and then fine-tuned on the clean data.
For CIFAR-10 and CIFAR-100 they used a ResNet-12 backbone, and for Clothing-1M a ResNet-34 pretrained on ImageNet, with the Adam optimizer, learning rate 1e-3, 1e5 epochs, \lambda_I = 1.0 and \lambda_D = 0.1.
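For reference, the reported training settings collected into a hypothetical configuration dictionary (the keys are my own names; the values are from the text above):

```python
# Reported training configuration (key names are assumptions)
config = {
    "backbone_cifar": "ResNet-12",
    "backbone_clothing1m": "ResNet-34 (ImageNet-pretrained)",
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "epochs": int(1e5),
    "lambda_I": 1.0,
    "lambda_D": 0.1,
}
```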
Results
* At 20% symmetric noise, which is close to a natural error rate, all models performed well, but Dual-Graph performed significantly better.
* At 50% symmetric and 40% asymmetric noise, Co-teaching begins to fail while Dual-Graph still performs significantly better.
* 80% symmetric noise is the hardest setting, yet Dual-Graph still achieved the best result (see table 1).
* Dual-Graph also beats the other models by significant margins under asymmetric flipping (see table 2).
* Overall accuracy on CIFAR-100 is lower than on CIFAR-10 because there are 100 classes, but Dual-Graph still performed best in all cases (see table 3).
Overall, the memorization effect [16] is mitigated: in other models the test accuracy first reaches a high level and then gradually decreases because of the corrupted labels, whereas Dual-Graph stops this decrease and consistently maintains higher test accuracy.
Also, as the number of iterations K increases the model becomes more robust to corrupted labels, but only up to a certain point, after which additional iterations add no new information due to the over-smoothing effect (see figure 5). If \lambda_D = 0 in the loss function, the model's performance deteriorates, showing the importance of the distribution graph in correcting label noise.
5. Conclusion
This method exploits a graph neural network to capture the structural relationships among labels at two different levels and uses an iterative optimization technique to refine the instance graph node embeddings via the distribution feature embeddings captured by the distribution graph. Since the distribution-level relation is robust to label noise, the network propagates distribution-level relations as supervision signals to refine the instance-level similarity. Combining these two levels of relations gives an end-to-end training paradigm that counteracts noisy labels while producing reliable predictions.
6. Student Review
Strengths
This is an effective method for reducing label noise: generalization strategies such as L2 regularization, early stopping and dropout help, but they work by preventing the network from fully reducing the training loss and therefore do not guarantee good optimization. This method does not prevent the model from reducing the loss; instead it reweights each instance's contribution to the loss. It is also not domain-specific, since it can be trained on any type of images and can therefore correct any type of image label noise. Finally, it does not require a clean subset for pretraining, which would be a tedious and error-prone task.
Weakness
The paper does not provide any code to verify the approach. The authors did not try any model besides ResNet, and they only corrected image label noise; they could have tested the method on images from a specific domain such as medical images. They also did not try the method on a different modality such as text data, where the instance graph nodes could be initialized with text embeddings obtained from a transformer network in order to correct text labels.
7. References
[1] Yan Yan, Romer Rosales, Glenn Fung, Ramanathan Subramanian, and Jennifer Dy. Learning from multiple annotators with varying expertise. Machine Learning, 95(3):291–327, 2014.
[2] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A multi-view embedding space for modelling internet images, tags, and their semantics. International journal of computer vision, 106(2):210–233, 2014.
[3] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations (ICLR), 2019.
[4] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1944–1952, 2017.
[5] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems (NIPS), pages 8527– 8537, 2018.
[6] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In International Conference on Machine Learning (ICML), 2019.
[7] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5552–5560, 2018.
[8] Kun Yi and Jianxin Wu. Probabilistic end-to-end noise correction for learning with noisy labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7017–7025, 2019.
[9] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1237–1246, 2019.
[10] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum, and Cordelia Schmid. Graph convolutional networks for learning with few clean and many noisy labels. In European Conference on Computer Vision (ECCV), 2020.
[11] Yaobin Zhang, Weihong Deng, Mei Wang, Jiani Hu, Xian Li, Dongyue Zhao, and Dongchao Wen. Global-Local GCN: Large-scale label noise cleansing for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[12] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012.
[13] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2691–2699, 2015.
[14] Brendan van Rooyen, Aditya Krishna Menon, and Robert C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In International Conference on Neural Information Processing Systems (NIPS), page 1018, 2015.
[15] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1944–1952, 2017.
[16] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In International Conference on Machine Learning (ICML), volume 70, pages 233–242, 2017.