I. Introduction
Deep Neural Networks (DNNs) have achieved remarkable success on many computer vision tasks, largely due to large-scale datasets with reliable and clean annotations [chen2018encoder] [girshick2015fast] [he2016deep]. However, collecting such precisely annotated datasets is expensive and time-consuming. Two alternative solutions can alleviate this issue: crowdsourcing from non-experts and online queries through search engines. Unfortunately, the obtained annotations inevitably contain noisy labels. Since DNNs have the capacity to memorize all training samples, they eventually overfit the noisy labels, leading to poor generalization performance [tanaka2018joint] [zhang2016understanding].
To reduce the effect of noisy labels, an effective strategy is to select clean samples out of the noisy ones and then use only these clean samples to update the parameters of the network. Following this line of research, many studies have shown impressive progress, among which Decoupling [malach2017decoupling], Co-teaching [han2018co] and Co-teaching+ [yu2019does] are three representative methods. All of these methods train two networks simultaneously, but they differ in the selection criterion: Decoupling and Co-teaching+ select training samples based on the disagreement between the two networks, while in Co-teaching each network treats its small-loss instances as clean ones and teaches them to its peer network. Recently, the state-of-the-art method JoCoR [wei2020combating] achieved excellent performance. JoCoR applies an agreement maximization principle that trains two networks with a joint loss (including the conventional supervised loss and a co-regularization loss) and uses this joint loss to select small-loss instances.
In this paper, different from the above methods that require training two networks simultaneously, we propose a simple and effective method to distinguish clean samples using only a single network. We find that the prediction consistency of one network under different image transforms (such as scaling, rotation and flipping) is beneficial for selecting clean samples. As shown in Figure 1, we feed the original and transformed (horizontally flipped) images into a single network and observe the Kullback-Leibler (KL) divergence of clean and noisy samples on the CIFAR-10 dataset with 50% symmetric noise and 40% asymmetric noise. From Figure 1, we can see that the KL divergence of clean samples is much smaller than that of noisy samples under both noise levels. This suggests that during training the network can reach an agreement on the original and transformed images of clean samples, while for noisy samples the KL divergence remains large. This observation motivates us to distinguish clean and noisy samples by the KL divergence between the predictions on the original images and the transformed images.
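This observation is straightforward to check in practice. The following minimal PyTorch sketch computes the per-sample symmetric KL divergence between the predictions on the original and horizontally flipped images (the function name and the flipping choice are our illustrative assumptions, not part of a released implementation):

import torch
import torch.nn.functional as F

def transform_kl(model, images):
    # Per-sample symmetric KL divergence between the predictions on the
    # original images and on their horizontally flipped versions (NCHW).
    model.eval()
    with torch.no_grad():
        p1 = F.softmax(model(images), dim=1)
        p2 = F.softmax(model(torch.flip(images, dims=[3])), dim=1)
        eps = 1e-8  # numerical guard, our addition
        kl = (p1 * ((p1 + eps) / (p2 + eps)).log()).sum(dim=1) \
           + (p2 * ((p2 + eps) / (p1 + eps)).log()).sum(dim=1)
    return kl  # tends to be small for clean samples, large for noisy ones

Samples whose divergence remains large during training are likely to be mislabeled.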
Based on the above observation, we propose a novel approach to identify mislabeled samples. Specifically, we follow the agreement maximization principle to train a network with a joint loss, consisting of the classification losses and the KL loss between the two inputs (the original images and the transformed images). Moreover, we adopt a self-ensembling method to construct a teacher model, whose predictions are used as online soft labels. We then combine offline hard labels with online soft labels to provide robust supervisory information for training and further alleviate the effects of noise.
Intuitively, our proposed method can be regarded as adding perturbations to the training samples. We observe that such perturbations have a stronger effect on noisy samples than on clean ones, so clean and noisy samples can be distinguished more easily. Furthermore, our method only requires training a single network, which makes it more suitable for real applications.
In summary, our contributions in this paper are threefold:

We propose a simple and effective approach to learning from noisy labels, which exploits the prediction consistency of images under different spatial transforms (such as scaling, rotation and flipping) within a single network to select clean samples efficiently.

We design a classification loss that combines offline hard labels and online soft labels to provide reliable supervision for network training.

We conduct comprehensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets and achieve state-of-the-art performance. In most cases, such as Symmetry-50%, Symmetry-80% and Asymmetry-40% on CIFAR-10, and Symmetry-20%, Symmetry-50% and Symmetry-80% on CIFAR-100, our method exceeds the second best method by 3.70% to 12.78%.
II. Related Works
II-A Learning with Noisy Labels
The techniques for alleviating the effect of noisy labels fall into two categories: (1) detecting noisy labels and then cleansing them or reducing their impact; (2) directly training noise-robust models with noisy labels.
One solution for detecting and cleansing noisy labels is to select clean samples out of the noisy ones, and then use them to update the network. For example, Decoupling [malach2017decoupling], Co-teaching [han2018co], Co-teaching+ [yu2019does] and the recent state-of-the-art JoCoR [wei2020combating] train two networks and introduce "agreement" or "disagreement" strategies to select clean samples for training. O2U-Net [huang2019o2u] adjusts the hyper-parameters of the network and ranks the normalized average loss of every sample to detect noisy samples. MentorNet [jiang2018mentornet] learns a data-driven curriculum to supervise the training of a student network. A second line of attack is based on re-weighting or re-labeling methods. Han et al. [han2019deep] propose a deep self-learning framework that re-labels noisy samples by extracting multiple prototypes for each category. Ren et al. [ren2018learning] weight each sample in the loss function based on its gradient direction compared to those on a clean set. Arazo et al. [arazo2019unsupervised] model the per-sample loss distribution with a beta mixture model and correct the loss by relying on the network prediction. In a similar spirit, Li et al. [li2020dividemix] use two networks to perform sample selection via a Gaussian mixture model and further apply a semi-supervised learning technique, MixMatch [berthelot2019mixmatch], to handle noisy labels.
For the second category, Xiao et al. [xiao2015learning] model the relationships between images, class labels and label noise with a probabilistic graphical model and integrate it into an end-to-end deep learning model. Guo et al. [guo2018curriculumnet] propose CurriculumNet, where the training data are divided into several subsets ranked by complexity via distribution density, and the subsets then form a curriculum that teaches the model to understand noisy labels gradually. Li et al. [li2019learning] propose a meta-learning based method that avoids overfitting to specific noise patterns by generating synthetic noisy labels.
Motivated by the above studies, we propose a simple and effective method that selects clean samples out of the noisy ones by observing the category distribution and the visual attention map of the original and transformed images within a single network. This also yields an aggregated inference that combines the predictions from different spatial transforms to improve the classification accuracy.
II-B Image Transformation
Human visual perception shows good consistency under certain spatial transforms, such as scaling, rotation and flipping, which has motivated the data augmentation strategies widely used in DNNs. Gidaris et al. [gidaris2018unsupervised] learn image representations by training convolutional networks to recognize the geometric transformation applied to the input image. Guo et al. [guo2019visual] propose a two-branch network that takes original images and their transformed versions as inputs, and introduce an attention consistency loss that measures the consistency of attention heatmaps between the two branches, achieving a new state of the art for multi-label image classification. Lee et al. [lee20_sla] augment the original labels via self-supervision of input transformations to learn a single unified task with respect to the joint distribution of the original and self-supervised labels.
In this paper, we present the first use of transform consistency for noisy-labeled image classification. Specifically, we regard the transformation applied to an image as an added disturbance. For clean samples, the network's predictions are not affected by this disturbance, resulting in consistent predictions for the two inputs; for noisy samples, the network's predictions may be inconsistent or oscillate strongly, which allows us to distinguish the clean samples from the noisy ones.
II-C Self-ensemble Learning
Self-ensembling has been widely studied in semi-supervised learning. These methods form a consensus prediction for the unknown labels using the outputs of the network at different epochs during training. For example, temporal ensembling [laine2016temporal] maintains an exponential moving average (EMA) of the label predictions on each training example, and penalizes predictions that are inconsistent with this target. Mean Teacher [tarvainen2017mean] averages model weights over training steps to form a target-generating teacher model. Recently, the self-ensembling strategy has been used for noisy-labeled image classification. MLNT [li2019learning] designs a noise-tolerant algorithm that constructs a teacher model to give more reliable predictions unaffected by the synthetic noise. SELF [nguyen2019self] identifies clean labels progressively using self-forming ensembles of models and predictions.
To effectively mitigate the negative influence of noisy labels, we introduce a self-ensembling method that constructs a teacher model as the exponential moving average (EMA) of the model snapshots at each training iteration. The predictions of the teacher model are then used as online soft labels, which provide a more stable supervisory signal than the noisy hard labels to guide the model's training. Note that this teacher model requires no additional training.
III. Proposed Method
In this section, we present the details of our proposed method. Our approach aims at identifying potentially noisy labels during training and preventing the network from being supervised by them. The overall framework is illustrated in Figure 2. Specifically, an original image and its transformed version are both fed into a single network, which is trained with a joint loss, i.e., two classification losses and one KL loss. Based on our observation that the model keeps consistent predictions on the two inputs for clean samples but inconsistent predictions for noisy ones, we select the samples with small joint loss to update the model's parameters. To further alleviate the negative influence of label noise, apart from the offline hard labels, our framework incorporates online soft pseudo labels into the training process via the self-ensembling strategy.
Problem statement. Image classification can be formulated with a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an image and $y_i \in \{0,1\}^{C}$ is the one-hot label over $C$ classes, which may contain noise. Let $p(y \mid x, \Theta)$ denote the model's output softmax probability, parameterized by $\Theta$.
Classification loss. In our framework, the two inputs of the network are $x_i$ and $T(x_i)$, where $T(\cdot)$ represents a transformation operation such as scaling, rotation or flipping. Their label confidences can be predicted as $p(y \mid x_i, \Theta)$ and $p(y \mid T(x_i), \Theta)$. In general, the objective function is the empirical risk of the cross-entropy loss, which is formulated as:

$\mathcal{L}_{C_1} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_i^{c} \log p(y^{c} \mid x_i, \Theta)$  (1)

$\mathcal{L}_{C_2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_i^{c} \log p(y^{c} \mid T(x_i), \Theta)$  (2)

where $N$ is the total number of samples. However, since $y_i$ contains noise, the model would overfit the noisy labels, resulting in poor classification performance.
To address this problem, we construct a teacher model parameterized by $\Theta'$ following the self-ensembling method, which generates reliable soft pseudo labels for supervising the model's training. Specifically, at the current iteration $t$, we update $\Theta'_{t}$ with

$\Theta'_{t} = \alpha\, \Theta'_{t-1} + (1-\alpha)\, \Theta_{t}$  (3)

where $\alpha$ is a smoothing coefficient hyper-parameter within the range $[0,1)$, and $\Theta'_{t-1}$ indicates the parameters of the teacher model at the previous iteration $t-1$. Then the soft classification losses for optimizing $\Theta$ with the soft pseudo labels generated by the teacher model can be formulated as:

$\mathcal{L}_{S_1} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} p(y^{c} \mid x_i, \Theta') \log p(y^{c} \mid x_i, \Theta)$  (4)

$\mathcal{L}_{S_2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} p(y^{c} \mid T(x_i), \Theta') \log p(y^{c} \mid T(x_i), \Theta)$  (5)
Therefore, the classification losses for $x_i$ and $T(x_i)$ are given by:

$\mathcal{L}_{cls}^{1} = (1-\beta)\, \mathcal{L}_{C_1} + \beta\, \mathcal{L}_{S_1}$  (6)

$\mathcal{L}_{cls}^{2} = (1-\beta)\, \mathcal{L}_{C_2} + \beta\, \mathcal{L}_{S_2}$  (7)

where $\beta \in [0,1]$ balances the hard-label and soft-label terms.
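As a concrete illustration, Eqs. (3)-(7) might be implemented as in the following PyTorch sketch (the helper names and the weighting parameter beta follow our notation; this is a sketch under stated assumptions, not a definitive implementation):

import torch
import torch.nn.functional as F

def update_teacher(student, teacher, alpha=0.999):
    # Eq. (3): exponential moving average of the student's parameters.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(alpha).add_(s_p.data, alpha=1.0 - alpha)

def classification_loss(logits, teacher_logits, targets, beta=0.5):
    # Eqs. (1)/(2), (4)/(5) and (6)/(7): per-sample mixture of the hard
    # cross-entropy and the soft cross-entropy against the teacher.
    log_p = F.log_softmax(logits, dim=1)
    hard = F.nll_loss(log_p, targets, reduction="none")
    soft = -(F.softmax(teacher_logits.detach(), dim=1) * log_p).sum(dim=1)
    return (1.0 - beta) * hard + beta * soft

The teacher can be initialized as a parameter copy of the student (e.g., with copy.deepcopy) and is never updated by gradient descent, only through Eq. (3).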
KL loss. Based on the observation that the model makes consistent predictions on the original and transformed images of clean samples, and inconsistent predictions on most noisy samples, we use the Kullback-Leibler (KL) divergence as a KL loss to measure the discrepancy between the probability distributions the network predicts for the two inputs. In practice, the KL loss can be expressed as

$\mathcal{L}_{KL} = D_{KL}(p_1 \,\|\, p_2) + D_{KL}(p_2 \,\|\, p_1)$  (8)

where

$D_{KL}(p_1 \,\|\, p_2) = \sum_{c=1}^{C} p_1^{c} \log \frac{p_1^{c}}{p_2^{c}}$  (9)

$D_{KL}(p_2 \,\|\, p_1) = \sum_{c=1}^{C} p_2^{c} \log \frac{p_2^{c}}{p_1^{c}}$  (10)

Here $C$ is the number of categories, and $p_1$ and $p_2$ represent the predicted probability distributions, i.e., $p(y \mid x_i, \Theta)$ and $p(y \mid T(x_i), \Theta)$, respectively.
Overall loss. We integrate the above-mentioned losses. The total loss for a sample $x_i$ can be formulated as:

$\mathcal{L}(x_i) = (1-\lambda)\big(\mathcal{L}_{cls}^{1}(x_i) + \mathcal{L}_{cls}^{2}(x_i)\big) + \lambda\, \mathcal{L}_{KL}(x_i)$  (11)

where $\lambda$ is the parameter weighting the classification losses against the KL loss, and $\mathcal{L}_{cls}^{1}(x_i)$ and $\mathcal{L}_{cls}^{2}(x_i)$ denote the per-sample terms of Eqs. (6) and (7).
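Continuing the sketch above, the per-sample losses of Eqs. (8)-(11) can be written as follows (the eps guard against division by zero is our addition):

def symmetric_kl(p1, p2, eps=1e-8):
    # Eqs. (8)-(10): symmetric KL divergence between two softmax outputs.
    kl_12 = (p1 * ((p1 + eps) / (p2 + eps)).log()).sum(dim=1)
    kl_21 = (p2 * ((p2 + eps) / (p1 + eps)).log()).sum(dim=1)
    return kl_12 + kl_21

def joint_loss(cls_1, cls_2, p1, p2, lam):
    # Eq. (11): weighted sum of the two per-sample classification losses
    # and the per-sample KL loss.
    return (1.0 - lam) * (cls_1 + cls_2) + lam * symmetric_kl(p1, p2)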
Small-loss selection. We use the total loss to select small-loss instances. Specifically, the network first feeds forward a mini-batch of data $\mathcal{D}_b$ drawn from $\mathcal{D}$, and we then select the proportion of instances $\mathcal{D}'$ in $\mathcal{D}_b$ that have the smallest total loss:

$\mathcal{D}' = \underset{\tilde{\mathcal{D}} \subset \mathcal{D}_b:\, |\tilde{\mathcal{D}}| \geq R(t)\, |\mathcal{D}_b|}{\arg\min} \sum_{x_i \in \tilde{\mathcal{D}}} \mathcal{L}(x_i)$  (12)

where $R(t)$ is a parameter that controls how many small-loss instances are selected. For $R(t)$, we follow the same update strategy as [han2018co] [yu2019does] [wei2020combating]. Specifically, since DNNs learn from clean samples in the initial phase and gradually adapt to noisy ones during training, we apply a large $R(t)$ at the beginning to train the model with more small-loss instances. As the epochs increase and the DNNs gradually begin to overfit the noisy data, we decrease $R(t)$ so that fewer small-loss instances are selected in each mini-batch. The update strategy for $R(t)$ is given by:

$R(t) = 1 - \min\left( \frac{t}{t_k}\, \tau,\; \tau \right)$  (13)

where $t$ is the current epoch, $t_k$ denotes the number of epochs for the linear drop rate, and $\tau$ is the actual noise rate of the whole dataset, which is closely related to the noise ratio within the noisy classes. If $\tau$ is not known in advance, it can be inferred using a validation set [liu2015classification] [yu2018efficient].
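The schedule of Eq. (13) and the selection of Eq. (12) then take only a few lines (again a sketch in our notation):

import torch

def keep_rate(epoch, t_k, tau):
    # Eq. (13): fraction R(t) of instances kept at epoch t.
    return 1.0 - min(epoch / t_k * tau, tau)

def select_small_loss(per_sample_loss, epoch, t_k, tau):
    # Eq. (12): keep the R(t) fraction of the mini-batch with the smallest
    # joint loss; Eq. (14) then averages the losses of the kept instances.
    num_keep = int(keep_rate(epoch, t_k, tau) * per_sample_loss.numel())
    idx = torch.argsort(per_sample_loss)[:num_keep]
    return per_sample_loss[idx].mean()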
Finally, we calculate the average loss of the selected instances and back-propagate it to update the network's parameters:

$\mathcal{L}_{final} = \frac{1}{|\mathcal{D}'|} \sum_{x_i \in \mathcal{D}'} \mathcal{L}(x_i)$  (14)
Algorithm 1 delineates the overall training process.
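Putting the pieces together, one training step might look as follows (a sketch that reuses the helpers defined above; the hyper-parameter names are our assumptions):

def train_step(student, teacher, optimizer, images, labels,
               epoch, lam, beta, t_k, tau):
    # Forward both views through the single student network.
    flipped = torch.flip(images, dims=[3])
    logits_1, logits_2 = student(images), student(flipped)
    with torch.no_grad():
        t_logits_1, t_logits_2 = teacher(images), teacher(flipped)
    p1 = F.softmax(logits_1, dim=1)
    p2 = F.softmax(logits_2, dim=1)
    cls_1 = classification_loss(logits_1, t_logits_1, labels, beta)  # Eq. (6)
    cls_2 = classification_loss(logits_2, t_logits_2, labels, beta)  # Eq. (7)
    per_sample = joint_loss(cls_1, cls_2, p1, p2, lam)               # Eq. (11)
    loss = select_small_loss(per_sample, epoch, t_k, tau)            # Eqs. (12)-(14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)                                 # Eq. (3)
    return loss.item()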
IV. Experiments
IV-A Datasets and implementation details
Datasets. We extensively evaluate our approach on the CIFAR-10, CIFAR-100 [krizhevsky2009learning] and Clothing1M [xiao2015learning] datasets. CIFAR-10 and CIFAR-100 each contain 50K training images and 10K test images of size 32 × 32, with 10 and 100 classes, respectively. Clothing1M contains 1 million training images with real-world noisy labels, together with 50k, 14k and 10k clean images for training, validation and testing, respectively. Human annotators marked a set of 25k labels as a clean set, but we do not use these clean labels in our experiments. Note that the overall label accuracy of Clothing1M is 61.54%.
Implementation. For all experiments, we follow the same settings as JoCoR [wei2020combating]. Specifically, for CIFAR-10 and CIFAR-100 we use a 7-layer CNN architecture, and for Clothing1M we use an 18-layer ResNet.
We use the Adam optimizer (momentum 0.9) for all experiments. For CIFAR-10 and CIFAR-100, the network is trained for 200 epochs with an initial learning rate of 0.001 and a batch size of 128. For Clothing1M, the network is trained for 15 epochs with a constant learning rate and a batch size of 64. In all experiments, we take horizontal flipping as the transformation operation, and the ensemble momentum $\alpha$ in Eq. (3) is set to 0.999. The hyper-parameters, i.e., the total loss weight $\lambda$ and the linear drop rate $t_k$, for all experiments are given in Table I.
Dataset  Flipping-Rate  $\lambda$  $t_k$
CIFAR-10  Symmetry-20%  0.8  10
  Symmetry-50%  0.5  20
  Symmetry-80%  0.3  40
  Asymmetry-40%  0.7  50
CIFAR-100  Symmetry-20%  0.8  40
  Symmetry-50%  0.8  30
  Symmetry-80%  0.2  30
  Asymmetry-40%  0.9  30
Clothing1M  -  0.6  5
Following previous works [han2018co] [yu2019does] [wei2020combating], we verify our method with two types of label noise: symmetric and asymmetric. Symmetric label noise is generated by replacing the ground-truth label of a sample with a random one-hot vector (a sketch of this procedure is given after the next paragraph). Asymmetric label noise is designed to mimic the structure of real-world label noise, such as CAT → DOG or BIRD → AIRPLANE. In each experiment, we follow the same setting and assume the noise rate $\epsilon$ is known, so that $\tau$ in Eq. (13) can be obtained. Specifically, for symmetric label noise, $\tau$ equals the noise rate (i.e., $\tau = \epsilon$). For asymmetric label noise, since the actual noise rate of the whole dataset is half of the noise rate within the noisy classes, we set $\tau = \epsilon / 2$.
Measurements. We use the test accuracy and label precision to measure performance, i.e., test accuracy = (# of correct predictions) / (# of test samples) and label precision = (# of clean labels) / (# of all selected labels). Intuitively, an algorithm with higher label precision is more robust to label noise. All experiments are repeated five times, and the error bar (STD) in each figure is shown as a shaded region.
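For reference, symmetric label noise as described above might be injected as follows (a NumPy sketch; replacing labels with uniformly random ones follows the convention of [han2018co], and the function name is ours):

import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    # Replace a noise_rate fraction of ground-truth labels with labels
    # drawn uniformly at random over all classes.
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    flip = rng.random(len(labels)) < noise_rate
    labels[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return labels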
Baseline. We use conventional training with the cross-entropy loss on the noisy datasets (abbreviated as Standard) as our baseline, and compare our method with the following state-of-the-art approaches:

Decoupling [malach2017decoupling], which trains two networks simultaneously and updates the models using only the instances on which the two networks' predictions disagree.

Co-teaching [han2018co], which trains two networks simultaneously, where each network uses its small-loss instances to update its peer network's parameters.

Co-teaching+ [yu2019does], which trains two networks simultaneously using a disagreement-update step (data update) and a cross-update step (parameter update).

JoCoR [wei2020combating], which trains two networks in a pseudo-siamese paradigm and updates their parameters simultaneously with a joint loss.
Flipping-Rate  Standard  Decoupling  Co-teaching  Co-teaching+  JoCoR  Ours
Symmetry-20%
Symmetry-50%
Symmetry-80%
Asymmetry-40%
IV-B Comparison with state-of-the-art methods
IV-B1 Results on CIFAR-10 dataset
The test accuracy on the CIFAR-10 dataset is shown in Table II. From the comparison, we can clearly see that our method performs best in all four cases. Specifically, in the easiest Symmetry-20% case, all methods work well, but our method still achieves an improvement of 1.85% over the best baseline method JoCoR (89.85% vs. 88.00%). When the noise rate rises to 50%, the test accuracy of Decoupling drops below 80%, while the other three methods remain above 80%, with Co-teaching performing better than JoCoR and Co-teaching+. Our method exceeds Co-teaching by 3.70% (86.85% vs. 83.15%). In the Symmetry-80% case, where the network trains with extremely noisy labels, the methods based on the "disagreement" strategy, i.e., Decoupling and Co-teaching+, do not work well. Co-teaching and JoCoR only reach 63.01% and 60.59%, respectively, while our method exceeds them by large margins of 10.83% and 13.25%, respectively. In the hardest Asymmetry-40% case, Decoupling and Co-teaching+ also fail, with Decoupling performing much worse than the Standard method. In contrast, Co-teaching, JoCoR and our method perform better; among them, our method is still the best and exceeds the second best method Co-teaching by 6.64% (85.87% vs. 79.23%).
The top of Figure 3 shows test accuracy vs. number of epochs, which clearly reveals the memorization effect of DNNs. For example, the Standard method first learns from clean samples in the initial stage and then gradually adapts to noisy ones; its test accuracy therefore first reaches a high level and then decreases gradually. For Decoupling, the test accuracy rises from epoch 0 to 100 and then gradually decreases in two cases (Symmetry-50% and Symmetry-80%); in the other two cases (Symmetry-20% and Asymmetry-40%), it gradually increases and finally stabilizes. The test accuracy curves of the other four methods follow similar trends, i.e., they first increase and then level off, though Co-teaching+ drops slightly in the Asymmetry-40% case.
To further explain this good performance, we plot label precision vs. number of epochs at the bottom of Figure 3. Only Decoupling, Co-teaching, Co-teaching+, JoCoR and our method are considered here, because only these algorithms include a clean-instance selection step. We can see that Decoupling and Co-teaching+ fail to select clean instances in all four cases: the label precision curve of Co-teaching+ first rises to a high level and then gradually decreases, while that of Decoupling always stays at a low level. These curves show that the "disagreement" strategy alone cannot handle noisy labels, because it does not exploit the memorization effect of DNNs during training. In contrast, the label precision curves of the other three methods increase during training and then remain at a high level, which means they successfully pick up clean instances. Among them, our method achieves higher label precision after only a few epochs of training and consistently provides more accurate supervision for the subsequent training process, resulting in the best classification performance.
IV-B2 Results on CIFAR-100 dataset
Table III shows the test accuracy on the CIFAR-100 dataset. Unlike CIFAR-10, CIFAR-100 is more difficult for noisy-labeled image classification because it contains 100 classes. Nevertheless, our method again achieves excellent performance. Specifically, in the easiest Symmetry-20% case, all methods work well, but our method exceeds the second best method JoCoR by 4.85% (62.00% vs. 57.15%). When the noise rate rises to 50%, our method is still the best and outperforms JoCoR by 6.59% (55.54% vs. 48.95%). In the extreme Symmetry-80% case, Decoupling, Co-teaching+ and JoCoR fail to achieve good performance, all remaining below 20%. Co-teaching only achieves 22.08%, while our method reaches 34.86%, exceeding Co-teaching by 12.78%. In the hardest case (i.e., Asymmetry-40%), Decoupling, Co-teaching and JoCoR stay below 35%. Co-teaching+ achieves the best performance (45.19%), and our method is comparable (43.56%); meanwhile, our method achieves better label precision than Co-teaching+ in this case.
Flipping-Rate  Standard  Decoupling  Co-teaching  Co-teaching+  JoCoR  Ours
Symmetry-20%
Symmetry-50%
Symmetry-80%
Asymmetry-40%
Figure 4 shows the test accuracy and label precision vs. epochs, where we can again observe the memorization effect of DNNs under different methods. Similar to the results on CIFAR-10, the Standard method first rises to a high level and then decreases gradually. Decoupling and Co-teaching+ are always worse than the other three approaches in the Symmetry-20%, 50% and 80% cases, and their label precision curves confirm that they cannot cope with the noisy labels. In contrast, Co-teaching, JoCoR and our method mitigate the memorization issue: they successfully select clean instances out of the noisy ones, resulting in good performance. Only in the Asymmetry-40% case is our method slightly below Co-teaching+, while still achieving better label precision. Moreover, by observing the convergence of the models trained with the five methods, we can see that our method stabilizes and achieves better classification performance after only about 100 epochs.
IV-B3 Results on Clothing1M dataset
We demonstrate the effectiveness of our proposed method on real-world noisy labels using the Clothing1M dataset. The results are shown in Table IV, where we report the best test accuracy across all epochs and the last test accuracy at the end of training. Our method outperforms the state-of-the-art methods in both best and last test accuracy, achieving a significant improvement in last accuracy of 3.14% over the Standard method and of 0.24% over the second best method JoCoR.
Methods  Best  Last
Standard  67.44  66.99
Decoupling  68.48  67.32
Co-teaching  69.21  68.51
Co-teaching+  59.32  58.79
JoCoR  70.67  69.89
Ours  70.77  70.13
IV-C Ablation study
In this section, we evaluate the two components of our method, i.e., the transform consistency (KL loss) and the soft classification loss, by conducting ablation studies on the CIFAR-10 and CIFAR-100 datasets. Results are shown in Table V. We create a baseline model that uses only the offline hard labels for training (the first row).
Dataset  CIFAR-10  CIFAR-100
Noise type  Symmetry  Asymmetry  Symmetry  Asymmetry
Methods/Noise ratio  20%  50%  80%  40%  20%  50%  80%  40%
w/o $\mathcal{L}_{KL}$ & w/o $\mathcal{L}_{S}$
w/o $\mathcal{L}_{KL}$
w/o $\mathcal{L}_{S}$
Ours
Dataset  CIFAR-10  CIFAR-100
Noise type  Symmetry  Asymmetry  Symmetry  Asymmetry
Transform/Noise ratio  20%  50%  80%  40%  20%  50%  80%  40%
Scaling
Rotation
Rotation
Rotation
Rotation
Vertical flipping
Horizontal flipping
Effectiveness of transform consistency. To observe the effectiveness of the KL loss, we set $\lambda = 0$ in Eq. (11). The results are shown in the second row of Table V (denoted as w/o $\mathcal{L}_{KL}$). The test accuracy in the four cases drops by 0.39% to 5.62% on CIFAR-10 and by 6.42% to 8.87% on CIFAR-100. These experiments demonstrate that clean and noisy samples can be distinguished more easily by adding the KL loss, and further verify the effectiveness of transform consistency.
Effectiveness of soft classification loss. To investigate the effectiveness of the soft classification loss, we perform an experiment that uses only the offline hard labels as supervisory information (i.e., $\beta = 0$ in Eqs. (6) and (7)). The performance is presented in the third row of Table V (denoted as w/o $\mathcal{L}_{S}$). The test accuracy in the four cases drops by 0.62% to 1.50% on CIFAR-10 and by 0.38% to 1.71% on CIFAR-100. These results suggest that the online soft labels are more reliable, since the temporally averaged teacher decouples its predictions from the current model's errors; this effectively mitigates the negative influence of noisy hard labels and avoids bias amplification even when the network produces many erroneous outputs in the early training epochs.
IV-D Effectiveness of different transforms
We conduct experiments on the CIFAR-10 and CIFAR-100 datasets to investigate the prediction consistency under different image transforms. Specifically, we focus on a subset of frequently used transforms: scaling, rotation and flipping. For scaling, the input images are resized to 48 × 48 and then cropped to 32 × 32; rotation covers four rotation angles; and flipping includes vertical flipping and horizontal flipping. The results are shown in Table VI. Among the transforms, horizontal flipping attains the best test accuracy on both datasets, while scaling is the worst on both datasets except for two cases on CIFAR-10, i.e., Symmetry-50% and Symmetry-80%. Rotation and vertical flipping give comparable results across the four cases on both datasets.
V. Conclusion
In this paper, we propose a simple and effective approach that utilizes transform consistency to identify mislabeled samples. Specifically, we train a single network with a joint loss over two inputs (the original image and its transformed version), comprising two classification losses and one KL loss. Furthermore, we design a classification loss that combines offline hard labels with online soft labels to provide more reliable supervision for training a robust model. We conduct comprehensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets and achieve state-of-the-art performance.