Self-training with Noisy Student improves ImageNet classification (CVPR 2020). Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le. Code: https://github.com/google-research/noisystudent

We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Labeling data by hand is expensive and must be done with great care. A related line of semi-supervised work is consistency training: these works constrain model predictions to be invariant to noise injected into the input, hidden states, or model parameters. However, the additional hyperparameters introduced by the ramping-up schedule and the entropy minimization make them more difficult to use at scale.

Amongst other components, Noisy Student Training implements self-training in the context of semi-supervised learning. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment so that the student generalizes better than the teacher. Secondly, to enable the student to learn a more powerful model, we also make the student model larger than the teacher model. The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student; for instance, we further increased the student model size to EfficientNet-L2, with EfficientNet-L1 as the teacher. Our experiments showed that self-training with Noisy Student and EfficientNet can achieve an accuracy of 87.4%, which is 1.9% higher than without Noisy Student.

Several ablations support these design choices. We use EfficientNet-B0 as both the teacher model and the student model to compare Noisy Student with soft pseudo labels and hard pseudo labels. To study the effect of the amount of unlabeled data, we start with the 130M unlabeled images and gradually reduce the number of images; in one setting, we sample 1.3M images in confidence intervals. We use a resolution of 800x800 in this experiment. For classes that have fewer than 130K images, we duplicate some images at random so that each class has 130K images. In another ablation we use the standard augmentation instead of RandAugment, and Noisy Student still improves the accuracy, by 1.6%; the comparison is shown in Table 9.

We also evaluate robustness, where small changes in the input image can cause large changes to a model's predictions. The top-1 and top-5 accuracy on ImageNet-A are measured on the 200 classes that ImageNet-A includes. The significant gains in robustness on ImageNet-C and ImageNet-P are surprising because our models were not deliberately optimized for robustness (e.g., via data augmentation). Our study shows that using unlabeled data improves accuracy and general robustness.
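To make the student-side noise concrete, here is a minimal PyTorch sketch, not the official implementation: RandAugment supplies the input noise and dropout supplies the model noise, and both are active only while the student trains. The tiny architecture and every hyperparameter value here are illustrative placeholders rather than the paper's settings, and RandAugment requires a torchvision version that ships it.

```python
# Hedged sketch of student-side noise: RandAugment (input noise) + dropout (model noise).
import torch
import torch.nn as nn
from torchvision import transforms

# Input noise: RandAugment applied to the student's training images.
student_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=9),  # illustrative settings
    transforms.ToTensor(),
])

# Model noise: dropout before the classifier head of a toy student network.
class ToyStudent(nn.Module):
    def __init__(self, num_classes=1000, dropout_rate=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.dropout = nn.Dropout(dropout_rate)  # active in train(), disabled in eval()
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.dropout(self.features(x)))
```

The teacher, by contrast, would be run in eval() mode and without this augmentation when producing pseudo labels.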
Unlabeled images are plentiful and can be collected with ease, whereas training robust supervised learning models still requires the expensive labeling step.

ImageNet-A, which we use for robustness evaluation, comes from work that introduces two challenging datasets that reliably cause machine learning model performance to substantially degrade, and that also curates an adversarial out-of-distribution detection dataset called ImageNet-O, the first out-of-distribution detection dataset created for ImageNet models. Figure 1(a) shows example images from ImageNet-A and the predictions of our models. On the architecture side, related work on neural architecture search proposes to search for an architectural building block on a small dataset and then transfer the block to a larger dataset, and introduces a regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models.

On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. The learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs.

During the generation of the pseudo labels, the teacher is not noised; this way, the pseudo labels are as good as possible, and the noised student is forced to learn harder from the pseudo labels. We then select images that have a confidence of the label higher than 0.3. In our experiments, we observe that soft pseudo labels are usually more stable and lead to faster convergence, especially when the teacher model has low accuracy; hence we use soft pseudo labels for our experiments unless otherwise specified.
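Below is a hedged sketch of that pseudo-labeling step, assuming a trained `teacher` network and an `unlabeled_loader` that yields batches of image tensors (both placeholders, not names from the released code). It filters at the 0.3 confidence threshold mentioned above and can emit either soft or hard pseudo labels.

```python
# Hedged sketch of pseudo-label generation with confidence filtering.
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, threshold=0.3, soft=True):
    """Return (images, pseudo_labels) for images the teacher is confident about."""
    teacher.eval()  # the teacher is not noised when producing pseudo labels
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=-1)
        confidence, hard_labels = probs.max(dim=-1)
        keep = confidence > threshold          # discard low-confidence images
        kept_images.append(images[keep])
        kept_labels.append(probs[keep] if soft else hard_labels[keep])
    return torch.cat(kept_images), torch.cat(kept_labels)
```

With soft=True each kept image carries the teacher's full probability vector; with soft=False it carries only the argmax class.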
With out-of-domain unlabeled images, hard pseudo labels can hurt performance while soft pseudo labels lead to robust performance; hence, whether soft pseudo labels or hard pseudo labels work better might need to be determined on a case-by-case basis.

For the unlabeled data, we use a much larger corpus of images (the JFT dataset), where some images may not belong to any category in ImageNet. To filter this corpus, we first run an EfficientNet-B0 trained on ImageNet[69] over it.

Also related to our work is Data Distillation[52], which ensembled predictions for an image under different transformations to teach a student network. [2] show that self-training is superior to pre-training with ImageNet supervised learning on a few computer vision tasks. Our finding is also consistent with similar arguments that using unlabeled data can improve adversarial robustness[8, 64, 46, 80].

We use our best model, Noisy Student with EfficientNet-L2, to teach student models with sizes ranging from EfficientNet-B0 to EfficientNet-B7. In these experiments we skip iterative training, which was used to optimize the accuracy of EfficientNet-L2, as it is difficult to use iterative training for many experiments. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores. The performance consistently drops when the noise functions are removed: for example, with all noise removed, the accuracy drops from 84.9% to 84.3% in the case with 130M unlabeled images, and from 83.9% to 83.2% in the case with 1.3M unlabeled images.

To noise the student, we use dropout[63], data augmentation[14] and stochastic depth[29] during its training. Consistency-training methods also inject noise, but there the invariance constraint reduces the degrees of freedom in the model. Noisy Student Training is based on the self-training framework and is trained with a few simple steps, described in detail below. Rather than first training on pseudo-labeled data and then finetuning on labeled data, in Noisy Student we combine these two steps into one because it simplifies the algorithm and leads to better performance in our preliminary experiments.
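A single combined training step might then look like the following sketch, again an assumption-laden illustration rather than the released code: one batch of labeled images and one batch of pseudo-labeled images contribute to a single loss, with standard cross entropy for the ground-truth labels and a soft-target cross entropy for the teacher's probability vectors. `student`, `optimizer`, and the two batches are placeholders supplied by the caller.

```python
# Hedged sketch of joint training on labeled + pseudo-labeled data with a noised student.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    """Cross entropy against a soft (probability-vector) target."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def train_step(student, optimizer, labeled_batch, pseudo_batch):
    student.train()  # dropout / stochastic depth / augmentation noise active
    (x_l, y_l), (x_u, q_u) = labeled_batch, pseudo_batch
    loss = F.cross_entropy(student(x_l), y_l) + soft_cross_entropy(student(x_u), q_u)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```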
We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale; the amount of data available on the internet is vast. Using self-training with Noisy Student, together with 300M unlabeled images, we improve EfficientNet's[69] ImageNet top-1 accuracy to 87.4%. The method, named self-training with Noisy Student, also benefits from the large capacity of the EfficientNet family.

As shown in Tables 3, 4 and 5, when compared with the previous state-of-the-art model ResNeXt-101 WSL[44, 48] trained on 3.5B weakly labeled images, Noisy Student yields substantial gains on robustness datasets. Figure 1(b) shows images from ImageNet-C and the corresponding predictions. Probably because the models were never optimized for adversarial robustness, EfficientNet-L2 achieves an accuracy of only 1.1% at ε=16 under a stronger attack, PGD with 10 iterations[43], which is far from the SOTA results.

A question that naturally arises is why the student can outperform the teacher with soft pseudo labels. Although noise may appear to be limited and uninteresting, when it is applied to unlabeled data it has a compound benefit of enforcing local smoothness in the decision function on both labeled and unlabeled data. As we use soft targets, our work is also related to methods in knowledge distillation[7, 3, 26, 16]. Whether the model benefits from more unlabeled data depends on the capacity of the model, since a small model can easily saturate while a larger model can benefit from more data.

We train our model using the self-training framework[59], which has three main steps: 1) train a teacher model on labeled images, 2) use the teacher to generate pseudo labels on unlabeled images, and 3) train a student model on the combination of labeled images and pseudo-labeled images. We iterate this process by putting back the student as the teacher. Compared with plain self-training, Noisy Student makes two changes: first, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset; second, it noises the student during its training.
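Putting the pieces together, the overall recipe (train a teacher, pseudo-label the unlabeled set, train an equal-or-larger noised student, then iterate with the student as the new teacher) could be organized like the sketch below. Here `build_model` and `train_model` are hypothetical callables, not functions from the official repository, and the data arguments are placeholders.

```python
# Hedged sketch of the outer Noisy Student loop; all callables and data are supplied by the caller.
def noisy_student_loop(build_model, train_model, generate_pseudo_labels,
                       labeled_data, unlabeled_data, sizes, rounds=3):
    # Step 1: train the initial teacher on labeled images only.
    teacher = train_model(build_model(sizes[0]), labeled_data, noised=False)
    for r in range(rounds):
        # Step 2: pseudo-label the unlabeled images with the un-noised teacher.
        pseudo_data = generate_pseudo_labels(teacher, unlabeled_data)
        # Step 3: train an equal-or-larger, noised student on labeled + pseudo-labeled data.
        size = sizes[min(r + 1, len(sizes) - 1)]
        student = train_model(build_model(size), labeled_data,
                              pseudo_data=pseudo_data, noised=True)
        teacher = student  # iterate: the student becomes the next teacher
    return teacher
```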
Since we use soft pseudo labels generated from the teacher model, if the student were trained to be exactly the same as the teacher model, the cross entropy loss on unlabeled data would already be minimized and the training signal would vanish; the noise injected into the student prevents this trivial solution.

Further, Noisy Student outperforms the state-of-the-art accuracy of 86.4% set by FixRes ResNeXt-101 WSL[44, 71], which requires 3.5 billion Instagram images labeled with tags. This result is also a new state of the art and 1% better than the previous best method, which used an order of magnitude more weakly labeled data[44, 71]. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A[25] top-1 accuracy from 16.6% to 74.2% (the biggest gain), ImageNet-C[24] mean corruption error (mCE) from 45.7 to 31.2, and ImageNet-P[24] mean flip rate (mFR) from 27.8 to 16.1. For example, without Noisy Student, the model predicts bullfrog for the image shown on the left of the second row, which might result from the black lotus leaf on the water.

In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L0, L1 and L2. We also study the effects of using different amounts of unlabeled data; larger models benefit from more unlabeled data, probably because it is harder to overfit the large unlabeled dataset. Specifically, we train the student model for 350 epochs for models larger than EfficientNet-B4, including EfficientNet-L0, L1 and L2, and for 700 epochs for smaller models. Similar to [71], we fix the shallow layers during finetuning.

Stochastic depth, one of the noise sources, is a simple yet ingenious idea: it adds noise to the model by randomly bypassing a block's transformations through its skip connection.
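As an illustration, here is a self-contained sketch of a residual block with stochastic depth, assuming a generic convolutional block rather than EfficientNet's actual blocks; the survival probability is an arbitrary example value. During training the block's transformation is randomly skipped so that only the identity path remains, and at test time the residual branch is kept but scaled by the survival probability, following the original stochastic depth formulation.

```python
# Hedged sketch of stochastic depth: randomly bypass a block's transform via its skip connection.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        if self.training:
            if torch.rand(()) > self.survival_prob:
                return x                          # block dropped: identity only
            return x + self.transform(x)          # block kept
        # At test time, keep the branch and scale it by the survival probability.
        return x + self.survival_prob * self.transform(x)
```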
Stochastic depth was originally proposed as a training procedure that enables the seemingly contradictory setup of training short networks while using deep networks at test time; it reduces training time substantially and improves the test error significantly on almost all datasets used for evaluation. Apart from self-training, another important line of work in semi-supervised learning[9, 85] is based on consistency training[6, 4, 53, 36, 70, 45, 41, 51, 10, 12, 49, 2, 38, 72, 74, 5, 81]. Our backbone, EfficientNet, was built with a scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient, whose effectiveness was demonstrated by scaling up MobileNets and ResNet. Other work finds that training and scaling strategies may matter more than architectural changes, and that the resulting ResNets match recent state-of-the-art models.

By showing the models only labeled images, we limit ourselves from making use of unlabeled images, which are available in much larger quantities, to improve the accuracy and robustness of state-of-the-art models. In all previous experiments, the student's capacity is as large as or larger than the capacity of the teacher model. Noisy Student Training is a semi-supervised learning method which achieves 88.4% top-1 accuracy on ImageNet (state of the art) and surprising gains on robustness and adversarial benchmarks. On robustness test sets, this model improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. The top-1 accuracy reported in this paper for ImageNet-P is the average accuracy over all images it includes.

We thank the Google Brain team, Zihang Dai, Jeff Dean, Hieu Pham, Colin Raffel, Ilya Sutskever and Mingxing Tan for insightful discussions, Cihang Xie for robustness evaluation, Guokun Lai, Jiquan Ngiam, Jiateng Xie and Adams Wei Yu for feedback on the draft, Yanping Huang and Sameer Kumar for improving the TPU implementation, Ekin Dogus Cubuk and Barret Zoph for help with RandAugment, Yanan Bao, Zheyun Feng and Daiyi Peng for help with the JFT dataset, and Olga Wichrowska and Ola Spyra for help with infrastructure.